Machine Learning Semester Paper


Ques : What is machine learning ? Applications & limitations ?

Ans : It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. Large amounts of data can be used to create much more accurate Machine Learning algorithms that are actually viable in the tech industry. And so, Machine Learning is now a buzzword in the industry despite having existed for a long time.

The machine learning process starts with feeding the machines good-quality data and then training them by building various machine learning models using the data and different algorithms. The choice of algorithm depends on what type of data we have and what kind of task we are trying to automate.

As for the formal definition of Machine Learning, we can say that a Machine Learning algorithm learns from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

For example, if a Machine Learning algorithm is used to play chess, then the experience E is playing many games of chess, the task T is playing chess against many players, and the performance measure P is the probability that the algorithm will win a game of chess.

Machine Learning comes into action in many real-world applications. Some of the most common examples are:

• Image Recognition
• Speech Recognition
• Recommender Systems
• Fraud Detection
• Self Driving Cars
• Medical Diagnosis
• Stock Market Trading
• Virtual Try On

Image Recognition
Image recognition is one of the main drivers behind the boom in Deep Learning. The task, which started with classifying images of cats and dogs, has now evolved to Face Recognition and real-world use cases built on it, such as employee attendance tracking.

Speech Recognition
Most of us have come across speech-recognition-based smart systems like Alexa and Siri and used them to communicate. In the backend, these systems are built on Speech Recognition, and they are designed to convert voice instructions into text.
Recommender Systems
As the world becomes more and more digital, almost every tech giant tries to provide customized services to its users. This is possible because of recommender systems, which can analyze a user's preferences and search history and, based on that, recommend content or services to them.

Fraud Detection
In today’s world, most things have been digitized, from buying a toothbrush to making transactions of millions of dollars; everything is accessible and easy to use. But with this process of digitization, cases of fraudulent transactions and fraudulent activities have increased. Identifying them is not easy, but machine learning systems are very efficient at these tasks.

Self Driving Cars


We would once have assumed that some ghost was driving a car if we ever saw one being driven without a driver, but thanks to machine learning and deep learning, in today’s world this is possible and not a story from some fictional book. Even though the algorithms and tech stack behind these technologies are highly advanced, at the core it is machine learning that has made these applications possible.

Medical Diagnosis
If you are a machine learning practitioner, or even a student, then you must have heard about projects like breast cancer classification, Parkinson’s disease classification, pneumonia detection, and many more health-related tasks which are performed by machine learning models with more than 90% accuracy.

Stock Market Trading


The stock market has remained a hot topic among working professionals and even students, because if you have sufficient knowledge of the markets and the forces which drive them, then you can make a fortune in this domain. Attempts have been made to create intelligent systems which can predict future price trends and market value as well.
LIMITATIONS OF ML :

Ethical concerns
There are, of course, many advantages to trusting algorithms. Humanity has benefited from
relying on computer algorithms to automate processes, analyze large amounts of data, and
make complex decisions. However, trusting algorithms has its drawbacks. Algorithms can be
subject to bias at any level of development. And since algorithms are developed and trained
by humans, it’s nearly impossible to eliminate bias.

Many ethical questions still remain unanswered. For example, who is to blame if something
goes wrong? Let’s take the most obvious example — self-driving cars.

Deterministic problems
ML is a powerful technology well suited for many domains, including weather forecasting and climate and atmospheric research. ML models can be used to help calibrate and correct the sensors that measure environmental indicators like temperature, pressure, and humidity.

Models can be programmed, for example, to simulate weather and emissions into the
atmosphere to forecast pollution. Depending on the amount of data and the complexity of the
model, this can be computationally intensive and take up to a month.

However, neural networks do not understand the physics of a weather system, nor do they understand its laws. For example, ML can make predictions, but the calculated values of intermediate fields such as density can be negative, which is impossible under the laws of physics. AI does not recognize cause-and-effect relationships: the neural network finds a connection between input and output data but cannot explain why they are connected.
Lack of Data
Neural networks are complex architectures and require enormous amounts of training data to
produce viable results. As the size of a neural network’s architecture grows, so does its data
requirement. In such cases, some may decide to reuse the data, but this will never bring good
results.

Another problem is related to the lack of quality data. This is not the same as simply not
having data. Let’s say your neural network requires more data, and you give it a sufficient
quantity, but you give it poor quality data. This can significantly reduce the model’s
accuracy.

Lack of interpretability
One significant problem with deep learning algorithms is interpretability. Let’s say you work
for a financial firm, and you need to build a model to detect fraudulent transactions. In this
case, your model should be able to justify how it classifies transactions. A deep learning
algorithm may have good accuracy and responsiveness for this task but may not validate its
solutions.

Or maybe you work for an AI consulting firm. You want to offer your services to a client that
uses only traditional statistical methods. AI models can be powerless if they cannot be
interpreted, and the process of human interpretation involves nuances that go far beyond
technical skill.

Lack of reproducibility
Lack of reproducibility in ML is a complex and growing issue exacerbated by a lack of code
transparency and model testing methodologies. Research labs develop new models that can
be quickly deployed in real-world applications. However, even if the models are developed to
take into account the latest research advances, they may not work in real cases.

Reproducibility can help different industries and professionals implement the same model
and discover solutions to problems faster.

Ques : Types of Machine Learning ?

Ans : Types of Machine Learning


There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Semi-Supervised Machine Learning

4. Reinforcement Learning

1. Supervised Machine Learning


Supervised learning is when a model is trained on a “Labelled Dataset”. Labelled datasets have both input and output parameters. In supervised learning, algorithms learn to map inputs to the correct outputs, and both the training and validation datasets are labelled.

Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed a dataset of labelled dog and cat images to the algorithm, the machine will learn to classify a dog or a cat from these labelled images. When we input new dog or cat images that it has never seen before, it will use what it has learned and predict whether the image is of a dog or a cat. This is how supervised learning works, and this particular task is image classification.

There are two main categories of supervised learning that are mentioned below:

• Classification

• Regression

Classification

Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting whether
a patient has a high risk of heart disease.

Here are some classification algorithms:

• Logistic Regression

• Support Vector Machine

• Random Forest

• Decision Tree

• K-Nearest Neighbors (KNN)

• Naive Bayes
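
As a concrete illustration, here is a minimal classification sketch (only a rough sketch, assuming scikit-learn is installed; the breast-cancer dataset and logistic regression pipeline are just one possible choice):

```python
# Minimal supervised classification sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Labelled dataset: inputs X and discrete targets y (0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train on the labelled training data, then predict classes for unseen inputs.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```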

Regression

Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.

Here are some regression algorithms:

• Linear Regression

• Polynomial Regression

• Ridge Regression

• Lasso Regression

• Decision tree

• Random Forest
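
A matching regression sketch (again only a sketch, assuming scikit-learn; the diabetes dataset and plain linear regression are arbitrary choices):

```python
# Minimal supervised regression sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Labelled dataset: inputs X and a continuous target y (disease progression score).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```
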
Advantages of Supervised Machine Learning

• Supervised Learning models can have high accuracy as they are trained on labelled data.

• The process of decision-making in supervised learning models is often interpretable.

• Pre-trained supervised models can often be reused, which saves time and resources compared with developing new models from scratch.

Disadvantages of Supervised Machine Learning

• It has limitations in recognizing patterns and may struggle with unseen or unexpected patterns that are not present in the training data.

• It can be time-consuming and costly as it relies on labeled data only.

• It may generalize poorly to new data.

Applications of Supervised Learning

Supervised learning is used in a wide variety of applications, including:

• Image classification: Identify objects, faces, and other features in images.

• Natural language processing: Extract information from text, such as sentiment, entities, and
relationships.

• Speech recognition: Convert spoken language into text.

• Recommendation systems: Make personalized recommendations to users.

2. Unsupervised Machine Learning


Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn’t involve providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to discover hidden patterns, similarities, or clusters within the data, which can then be used for various purposes, such as data exploration, visualization, dimensionality reduction, and more.

Consider a dataset that contains information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behavior, revealing potential customer segments without predefined labels. This type of information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:

• Clustering

• Association

Clustering

Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled examples.

Here are some clustering and related unsupervised algorithms (the last two are dimensionality reduction techniques):

• K-Means Clustering algorithm

• Mean-shift algorithm

• DBSCAN Algorithm

• Principal Component Analysis

• Independent Component Analysis
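
A minimal clustering sketch (assuming scikit-learn; the synthetic blobs and k = 3 are arbitrary choices):

```python
# Minimal clustering sketch on unlabeled data (assumes scikit-learn is installed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: only X is used, never the true generating labels.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Group the points into 3 clusters based on Euclidean similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

print("Cluster sizes  :", [int((cluster_ids == c).sum()) for c in range(3)])
print("Cluster centres:\n", kmeans.cluster_centers_)
```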

Association

Association rule learning is a technique for discovering relationships between items in a dataset. It identifies rules that indicate the presence of one item implies the presence of another item with a specific probability.

Here are some association rule learning algorithms:

• Apriori Algorithm

• Eclat

• FP-growth Algorithm

Advantages of Unsupervised Machine Learning

• It helps to discover hidden patterns and various relationships between the data.

• Used for tasks such as customer segmentation, anomaly detection, and data exploration.

• It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning

• Without using labels, it may be difficult to predict the quality of the model’s output.

• Clusters may not be clearly interpretable and may not have meaningful real-world interpretations.

• Additional techniques such as autoencoders and dimensionality reduction are often needed to extract meaningful features from raw data, which adds complexity.

Applications of Unsupervised Learning

Here are some common applications of unsupervised learning:

• Clustering: Group similar data points into clusters.

• Anomaly detection: Identify outliers or anomalies in data.

• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.

3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning, so it uses both labelled and unlabelled data. It’s particularly useful when obtaining labelled data is costly, time-consuming, or resource-intensive, and it is chosen when labelling the data requires skills and relevant resources in order to train or learn from it.

We use these techniques when only a small portion of the data is labelled and the remaining, larger portion is unlabelled. We can use unsupervised techniques to predict labels and then feed these labels to supervised techniques. This technique is mostly applicable to image datasets, where usually not all images are labelled.

Example: Consider that we are building a language translation model; having labelled translations for every sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn from both labelled and unlabelled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine translation services.
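
A minimal self-training sketch of this idea (assuming scikit-learn; the digits dataset, the 10% labelling rate, and the SVM base classifier are arbitrary illustrative choices, not a translation model):

```python
# Semi-supervised sketch: self-training with mostly unlabelled data
# (assumes scikit-learn is installed).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)

# Pretend ~90% of the labels are unavailable: unlabelled points are marked with -1.
rng = np.random.RandomState(42)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1

# The base classifier is first fit on the labelled points, then iteratively
# pseudo-labels confident unlabelled points and refits on the enlarged set.
model = SelfTrainingClassifier(SVC(probability=True, gamma="scale"))
model.fit(X, y_partial)
print("Accuracy against all true labels:", round(model.score(X, y), 3))
```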

Advantages of Semi-Supervised Machine Learning

• It leads to better generalization as compared to supervised learning, as it takes both labeled and unlabeled data.

• Can be applied to a wide range of data.


Disadvantages of Semi-Supervised Machine Learning

• Semi-supervised methods can be more complex to implement compared to other approaches.

• It still requires some labeled data that might not always be available or easy to obtain.

• Poor-quality or unrepresentative unlabeled data can negatively impact model performance.

4. Reinforcement Machine Learning


Reinforcement learning is a learning method in which an agent interacts with the environment by producing actions and discovering errors. Trial, error, and delayed reward are its most relevant characteristics. In this technique, the model keeps improving its performance using reward feedback to learn the desired behaviour or pattern. These algorithms are often built for a particular problem, e.g. the Google self-driving car, or AlphaGo, where a bot competes with humans and even with itself to become a better and better Go player. Each time the agent acts, it learns and adds the experience to its knowledge, which serves as training data; so the more it learns, the better trained and more experienced it gets.

Applications of Reinforcement Machine Learning

Here are some applications of reinforcement learning:

• Game Playing: RL can teach agents to play games, even complex ones.

• Robotics: RL can teach robots to perform tasks autonomously.

• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.

• Recommendation Systems: RL can enhance recommendation algorithms by learning user preferences.

• Healthcare: RL can be used to optimize treatment plans and drug discovery.

• Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.

• Finance and Trading: RL can be used for algorithmic trading.


Ques : Online vs offline ML ?

Ans :

Online machine learning is a type of machine learning where data is acquired sequentially
and is utilized to update the best predictor for future data at each step.
In other words, online machine learning means that learning takes place as data becomes
available. With online learning, the learning algorithm’s parameters are updated after
learning from each individual training instance. In online learning, each learning step is quick
and cheap, and the model can learn from new information in real-time as it arrives.

Online learning is ideal for machine learning systems that receive data as a continuous flow
and need to be able to adapt to rapidly changing conditions.

An example of one of these systems might be one that predicts the weather or analyses stock
prices. This type of machine learning is also an ideal option if computing resources are a
factor – when an online model has learned from new data instances, it no longer needs to use
them and can therefore discard them. This can save a huge amount of storage space.

Offline :

While online learning does have its uses, traditional machine learning is performed offline
using the batch learning method.

In batch learning, data is accumulated over a period of time. The machine learning model is
then trained with this accumulated data from time to time in batches. It is the direct opposite
of online learning because the model is unable to learn incrementally from a stream of live
data. In batch learning, the machine learning algorithm updates its parameters only after
consuming batches of new data.
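
A brief sketch contrasting the two modes (assuming scikit-learn; the synthetic dataset, the chunking into 100 mini-batches, and the specific estimators are illustrative choices):

```python
# Batch (offline) vs. incremental (online) learning sketch (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
classes = np.unique(y)

# Offline / batch learning: the whole accumulated dataset is consumed at once.
batch_model = LogisticRegression(max_iter=1000).fit(X, y)

# Online learning: parameters are updated chunk by chunk as data "arrives";
# each chunk could be discarded after the update ("log_loss" assumes scikit-learn >= 1.1).
online_model = SGDClassifier(loss="log_loss", random_state=0)
for X_chunk, y_chunk in zip(np.array_split(X, 100), np.array_split(y, 100)):
    online_model.partial_fit(X_chunk, y_chunk, classes=classes)

print("Batch model accuracy :", round(batch_model.score(X, y), 3))
print("Online model accuracy:", round(online_model.score(X, y), 3))
```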

Differences :

1. Training and complexity


In an offline machine learning model, the weights and parameters of the model are updated
while simultaneously attempting to lower the global cost function using the data used to train
the model. The machine learning model is trained and updated continuously until it is in a
state of readiness for deployment or the use case that it’s designed for.

In an online machine learning process, however, the changes of weights and parameters that
occur at a given step are dependent on the example that’s being shown. If the model has
already been deployed, the model’s current state might also be a factor.
2. Training time required
Offline batch learning is generally a lot faster than online machine learning because offline learning consumes the dataset in a single training run to modify the weights and parameters.

That said, the sheer size of modern big data streams means that it can be a time-consuming
and sometimes impossible task to feed all available data into an offline model. In this
situation, engineers can either opt for online machine learning or feed the model with data
incrementally.

Offline machine learning is often cheaper than online machine learning, too. This is because
in online machine learning, the model obtains and tunes its parameters as new data becomes
available in real-time.

3. Computational power needed

Online machine learning is an ongoing, continuous process that requires a constant input of
data. This is because model refinement and improvement can only be carried out when the
model is being fed this data.

The computational power required for online machine learning is therefore higher than
offline batch learning which in contrast requires fewer computations.

4. Use in production

Online machine learning models are a lot harder to manage in a production environment. This
is because online learning models churn through large amounts of data in real-time and learn
from them. This has an immediate impact on the machine learning model and the solution it
powers and can affect the overall performance of it, a problem known as concept drift.

While this can be controlled, i.e., by filtering out “bad data” as we mentioned earlier, it
requires a larger resource input and leads to higher costs.

With batch learning, changes to the model are only reflected when updated models that have
been trained with new data are manually pushed to production.

5. Limits in scalability

A purely online model can be difficult to deploy in a way that’s scalable. This is because when a model is purely online, its health and performance must be constantly monitored, as must the health of the system that’s sending data to the model. It is somewhat telling that even large multinational companies that have the resources to do this often choose not to.

It’s also more difficult to get an algorithm to behave in the desired way on a purely online,
automatic basis.
Ques : Random data sampling vs Stratified data sampling ?
Ans : The method of collecting data from a population by selecting a sample, i.e. a group of items, and examining it to draw conclusions is known as the Sampling Method. This method is even used in people’s day-to-day lives. For example, a cook takes a spoonful of pulses to check whether the whole pot is evenly cooked. The sampling method of collecting data is suitable for a large population and for cases when the investigator does not require a high level of accuracy. It is also preferred by investigators when they do not need an intensive examination of the items.

Methods of Sampling

1. Random Sampling
2. Purposive or Deliberate Sampling
3. Stratified or Mixed Sampling
4. Systematic Sampling

1. Random Sampling
As the name suggests, in this method of sampling, the data is collected at random. It means
that every item of the universe has an equal chance of getting selected for the investigation
purpose. In other words, each item has an equal probability of being in the sample, which
makes the method impartial. As there is no control of the investigator in selecting the sample,
the random sampling method is used for homogeneous items. As there is no tool or a large
number of people required for collecting data through random sampling, this method is
economical. There are two ways of collecting data through the random sampling method.
These are the Lottery Method and Tables of Random Numbers.

3. Stratified or Mixed Sampling


A sampling method which is suitable at times when the population has different groups with
different characteristics and an investigation is to be performed on them is known as
Stratified or Mixed Sampling. In other words, Stratified or Mixed Sampling is a method in
which the population is divided into different groups, also known as strata with different
characteristics, and from those strata some of the items are selected to represent the
population. The investigator, while forming strata, has to ensure that each stratum is represented in the correct proportion.

For example, there are 60 students in Class 10th. Out of these 60 students, 10 opted for Arts
and Humanities, 30 opted for Commerce, and 20 opted for Science in Class 11th. It means
that the population of 60 students is divided into three strata; viz., Arts and Humanities,
Commerce, and Science, containing 10, 30, and 20 students, respectively. Now, for investigation purposes, some items will be selected proportionately from each of the strata so that the items forming the sample represent the entire population. Besides, an investigator can even select items from different strata disproportionately.
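
A small sketch of the difference (assuming scikit-learn; the 60-student example above is reused as the population):

```python
# Random vs. stratified sampling sketch (assumes scikit-learn is installed).
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Population: 60 students split 10 / 30 / 20 across three streams (strata).
streams = np.array(["Arts"] * 10 + ["Commerce"] * 30 + ["Science"] * 20)

# Simple random sampling: every student has an equal chance of selection.
random_sample, _ = train_test_split(streams, train_size=12, random_state=1)
print("Random sample    :", Counter(random_sample))

# Stratified sampling: each stratum keeps its true proportion (2 : 6 : 4 out of 12).
stratified_sample, _ = train_test_split(
    streams, train_size=12, stratify=streams, random_state=1
)
print("Stratified sample:", Counter(stratified_sample))
```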

Ques : Mean vs Median vs Mode ?

Ans : In Machine Learning (and in mathematics) there are often three values that interest us:

• Mean - The average value


• Median - The mid point value
• Mode - The most common value

Mean
The mean value is the average value.

To calculate the mean, find the sum of all values, and divide the sum by the number of
values:

(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77

Median
The median value is the value in the middle, after you have sorted all the values:

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111

It is important that the numbers are sorted before you can find the median.

If there are two numbers in the middle, divide the sum of those numbers by two. For example, with an even count of twelve values:

77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103

(86 + 87) / 2 = 86.5

Mode
The Mode value is the value that appears the most number of times:

99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86
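
The same three statistics can be computed in Python with the standard library (a small sketch using the 13 values above):

```python
# Mean, median, and mode of the values above (Python standard library only).
import statistics

speeds = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

print("Mean  :", round(statistics.mean(speeds), 2))  # 89.77
print("Median:", statistics.median(speeds))          # 87 (middle of the 13 sorted values)
print("Mode  :", statistics.mode(speeds))            # 86 (appears three times)
```
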
Ques : ANI VS ASI VS AGI ?
Ans : Three Types (Stages) of AI - Based on Capabilities

There are various ways to create AI, depending on what we want to achieve with it and how
we will measure its success. It ranges from extremely rare and complex systems, such as self-
driving cars and robotics, to parts of our everyday lives, such as facial recognition, machine
translation, and email categorization. The path you choose will depend on what your AI goals
are and how well you understand the intricacies and feasibility of various approaches.

AI technologies are categorized according to their ability to mimic human traits, the
techniques they use to do so, their real-world applications, and theory of mind. Using these
characteristics as a reference, all AI systems — real and hypothetical — fall into one of three
categories:

• Narrow artificial intelligence (ANI), with a narrow range of capabilities;


• Artificial General Intelligence (AGI) comparable to human capabilities; or
• Artificial Superintelligence (ASI), more capable than humans.

Today, we have three different variants of AI technology; ANI, AGI, and ASI. These are the
three stages in which AI can evolve. We have only achieved narrow AI so far.

Artificial Narrow Intelligence (ANI)

In contrast to strong AI, which can learn to perform any task humans do, weak AI (or narrow
AI) is limited to one or a few specific tasks. This is the kind of artificial intelligence we
currently have. In fact, deep learning, named after the human brain (and often compared to
it), has very limited capabilities and is nowhere near what a human child's brain can do. This
is not a bad thing.

In fact, narrow AI can focus on specific tasks and do them better than humans. For example, feed a deep learning algorithm enough pictures of skin cancer and it will be better at spotting skin cancer than an experienced doctor. This does not mean that deep learning will replace doctors: intuition, abstract thinking, and other skills are needed to decide what is best for patients.

Artificial General Intelligence (AGI)

General AI (AGI) is only theoretical at this point. This is the AI that writers have made up for
years in sci-fi stories. Ultimately, when we achieve AGI, machines will have consciousness
and decision-making capabilities - full human cognitive abilities. These machines do not
require human input to be programmed to function. For all intents and purposes, this will be a
time when machines act, feel, respond and think like humans. We can say that a powerful AI
has a mind of its own and is capable of doing whatever it wants to do like any human being.
Unlike narrow AI, which classifies data and finds patterns, general AI uses clustering and
association when processing data. AGI will also be self-aware. However, like a child, AI
must learn through experience, improving knowledge and skills over time.
But while all of this talent is focused on finding a way to create a powerful AI that can
compete with the human brain, we are missing a lot of opportunities and failing to address the
threats posed by current weak AI technologies.

Artificial Super Intelligence (ASI)

ASI is a futuristic concept and idea of artificial intelligence replacing human intelligence
capabilities. For ASI to become a reality, computational programs must surpass human
intelligence in all parameters and environments. ASI will only become a reality when AI
becomes smarter than humans.

ASI, with its futuristic halo, still seems far removed from the present stage of human and technological evolution; this so-called sentient variant of AI shows the conceptual limits of AI technology and makes promises it cannot yet deliver, and it may only become a reality decades from now.

If ASI becomes possible and becomes a reality, the role of humans in decision-making, the
arts and humanities, and emotional understanding of all aspects of life could be at odds with
the rise of machines.

Ques : Imbalanced dataset in classification and ways to handle it ?

Ans : A classification data set with skewed class proportions is called imbalanced. Classes
that make up a large proportion of the data set are called majority classes. Those that make
up a smaller proportion are minority classes.

Balanced vs Imbalanced Dataset :

• Balanced Dataset: In a Balanced dataset, there is approximately equal distribution of


classes in the target column.
• Imbalanced Dataset: In an Imbalanced dataset, there is a highly unequal distribution
of classes in the target column.

Let’s understand this with the help of an example :


Example : Suppose there is a Binary Classification problem with the following training data:

• Total Observations : 1000


• Target variable class is either ‘Yes’ or ‘No’.

Case 1:
If there are 900 ‘Yes’ and 100 ‘No’, then it represents an Imbalanced dataset, as there is a highly unequal distribution of the two classes.

Case 2:
If there are 550 ‘Yes’ and 450 ‘No’ then it represents a Balanced dataset as there is
approximately equal distribution of the two classes.

Hence, there is a significant amount of difference between the sample sizes of the two classes in an Imbalanced Dataset.

Problem with Imbalanced dataset:

• Algorithms may get biased towards the majority class and thus tend to predict output
as the majority class.
• Minority class observations look like noise to the model and are ignored by the
model.
• An imbalanced dataset gives a misleading accuracy score.

Techniques to deal with Imbalanced dataset :

• Under Sampling :
In this technique, we reduce the sample size of Majority class and try to match it with
the sample size of Minority Class.

Example :
Let’s take an imbalanced training dataset with 1000 records.

Before Under Sampling :

o Target class ‘Yes’ = 900 records
o Target class ‘No’ = 100 records

After Under Sampling :

o Target class ‘Yes’ = 100 records
o Target class ‘No’ = 100 records

Now, both classes have the same sample size.

Pros :

• Low computation power needed.

Cons :

• Some important patterns might get lost due to dropping of records.


• Only beneficial for huge datasets with millions of records.

Note : Under Sampling should only be done when we have huge number of records.

• Over Sampling :
In this technique, we increase the sample size of Minority class by replication and try to
match it with the sample size of Majority Class.

Example :
Let’s take the same imbalanced training dataset with 1000 records.

Before Over Sampling :

o Target class ‘Yes’ = 900 records
o Target class ‘No’ = 100 records

After Over Sampling :

o Target class ‘Yes’ = 900 records
o Target class ‘No’ = 900 records

Pros :

• Patterns are not lost which enhances the model performance.

Cons :

• Replication of the data can lead to overfitting.


• High computation power needed.
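
A minimal sketch of both techniques using plain resampling (assuming scikit-learn; libraries such as imbalanced-learn offer dedicated samplers, but sklearn.utils.resample is enough to show the idea on the 900/100 example above):

```python
# Under- and over-sampling sketch for a 900 'Yes' / 100 'No' dataset
# (assumes scikit-learn; the feature values are random placeholders).
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = np.array([1] * 900 + [0] * 100)            # 1 = 'Yes' (majority), 0 = 'No' (minority)

X_major, y_major = X[y == 1], y[y == 1]
X_minor, y_minor = X[y == 0], y[y == 0]

# Under-sampling: shrink the majority class to the minority size (100 vs 100).
X_major_down, y_major_down = resample(
    X_major, y_major, replace=False, n_samples=len(y_minor), random_state=0
)

# Over-sampling: replicate the minority class up to the majority size (900 vs 900).
X_minor_up, y_minor_up = resample(
    X_minor, y_minor, replace=True, n_samples=len(y_major), random_state=0
)

print("After under-sampling:", len(y_major_down), "'Yes' vs", len(y_minor), "'No'")
print("After over-sampling :", len(y_major), "'Yes' vs", len(y_minor_up), "'No'")
```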

Ques : Reinforcement learning ? Model-based vs model-free learning ?
Ans : RL algorithms can be either Model-free (MF) or Model-based (MB). If the agent can
learn by making predictions about the consequences of its actions, then it is MB. If it
can only learn through experience then it is MF.

RL algorithms can be mainly divided into two categories – model-based and model-free.

Model-based, as it sounds, has an agent trying to understand its environment and creating a
model for it based on its interactions with this environment. In such a system, preferences
take priority over the consequences of the actions i.e. the greedy agent will always try to
perform an action that will get the maximum reward irrespective of what that action may
cause.

On the other hand, model-free algorithms seek to learn the consequences of their actions
through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words,
such an algorithm will carry out an action multiple times and will adjust the policy (the
strategy behind its actions) for optimal rewards, based on the outcomes.

Think of it this way, if the agent can predict the reward for some action before actually
performing it thereby planning what it should do, the algorithm is model-based. While if it
actually needs to carry out the action to see what happens and learn from it, it is model-free.

This results in different applications for these two classes, for e.g. a model-based approach
may be the perfect fit for playing chess or for a robotic arm in the assembly line of a product,
where the environment is static and getting the task done most efficiently is our main
concern. However, in the case of real-world applications such as self-driving cars, a model-
based approach might prompt the car to run over a pedestrian to reach its destination in less
time (maximum reward), but a model-free approach would make the car wait till the road is
clear (optimal way out).

To better understand this, we will explain everything with an example. In the example, we’ll build model-free and model-based RL for tennis games. To build the model, we need an environment in which the policy can be implemented. However, we won’t build the environment in this article; we’ll import one to use for our program.
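
Since the tennis environment itself is not included here, the following is only a rough sketch of the model-free side (tabular Q-learning) on a tiny, made-up corridor environment written inline; the environment, reward scheme, and hyperparameters are all illustrative assumptions:

```python
# Model-free sketch: tabular Q-learning on a toy 5-state corridor (NumPy only).
# The agent learns purely from experienced transitions, with no model of the
# environment's dynamics.
import numpy as np

N_STATES, ACTIONS = 5, (0, 1)          # actions: 0 = move left, 1 = move right
GOAL = N_STATES - 1                    # reward is given only at the right end

def step(state, action):
    """Hypothetical toy environment: move left/right, +1 reward at the goal."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.RandomState(0)
Q = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore sometimes, otherwise exploit (ties broken at random).
        if rng.rand() < epsilon:
            action = int(rng.choice(ACTIONS))
        else:
            action = int(rng.choice(np.flatnonzero(Q[state] == Q[state].max())))
        next_state, reward, done = step(state, action)
        # Model-free update: adjust the value estimate using only the observed outcome.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned Q-table (rows = states, columns = left/right):\n", np.round(Q, 2))
```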

Ques : Structured vs Unstructured data ?


Ans : Data is the backbone of technological progress and business growth. Considering the
huge volume of data companies generate daily, conventional tools aren’t sufficient to process
or leverage data analytics to extract meaningful insights.

As it happens, analyzing and understanding data is a prerequisite for data processing. This is
particularly important because data comes in two different forms: structured and
unstructured. Each data type is accumulated, processed, sorted, and analyzed to derive
valuable information and improve overall decision-making. Both structured and unstructured
data are stored in different databases.

Structured data is well-organized, easy to quantify, well defined, simple to search and analyze
with software in data analytics. Structured data is usually located in a specific field within
files or records. It is easy to place structured data into a standard pattern of set rows, tables,
and columns.

A good example of handling structured data is accessing a hotel database, where all the relevant details of the guests, like name, contact number, address, etc., can be accessed with ease. Such data is structured.

Structured data is encased in RDBMS (relational databases). Any information stored in the
database can be updated by person or machines and accessed with ease by algorithms or
manual search. Structured Query Language (SQL) is the standard tool used to handle
structured data, be it locating, adding & deleting, or updating.

Pros of Structured Data


1. Easy applicability to machine learning algorithms

The well-organized and quantitative nature of structured data makes it easy to update, modify, and search, and easy to feed into machine learning algorithms.

2. Easy to use for business people

Anyone with basic knowledge of data and its related applications can use structured data.
Structured data facilitates the self-service mode of data access to the user. So, it is not
necessary to have in-depth knowledge of data types and their relationships.

3. More tool options

As structured data has been in use for a long time, most tools have been tested for their
efficiency in data analysis.

4. Seamless integrations

Simple and streamlined programs like Excel can be used to store and organize structured
data. Furthermore, several other analytical tools can be linked to Excel for further data
analysis as required.

5. Suitability

Structured data is highly suitable for basic organization and quantitative analysis.

Cons of Structured Data


1. Limited use

Structured data lacks versatility. It can be used only with a set vision and cannot deviate from
that as it has a predefined structure.

2. Restricted data storage

Structured data is stored in data warehouses with a rigid storage schema. Any change in requirements means updating all of the existing structured data to accommodate them, which is expensive and time-consuming.

3. Not suitable for detailed analysis

Structured data can offer limited insight as it works on pre-set parameters. It does not provide
the details of how and why the data analytics is carried out.

Unstructured data :
Unstructured data refers to information that is not organized and cannot be accommodated in
a set or defined framework. It can be stored only in its original form until put to use. This
feature is known as schema on read.

The majority of the data we come across is unstructured. Nearly 80% of the enterprise data is
unstructured; this percentage appears to be constantly growing. Unstructured data comes in
various formats like emails, posts on social media platforms, chats, presentations, images,
satellite feeds, and data from IoT sensors.

Naturally, companies that invest time and money in deciphering unstructured data get access
to vital and valuable business intelligence to increase their profits. It can also help them
connect to their customers more efficiently and in a personalized fashion, thereby
contributing to increased profits.
Advantages of Unstructured Data
1. Liberty to stay in the natural form

As unstructured data is accumulated in its original form (native form), it is not defined until
used. This results in a larger reserve pool as the unstructured data can adapt to any data
requirement. It also facilitates data analysts and data scientists to process and analyze only
the required information.

2. Easy and faster data gathering

Unstructured data has an impressive accumulation rate. As it does not require pre-set
parameters, it can be gathered easily and quickly.

3. Massive data storage

Cloud data lakes store unstructured data due to their impressive storage capacity. Cloud data
lakes charge on a pay-for-what-you-use basis and are highly cost-effective, flexible, and
scalable.

Disadvantages of Unstructured Data


1. Need for data science expertise

As we mentioned before, you require data science expertise to leverage unstructured data for
useful processing and analysis. So, a regular business person or user can not possibly extract
any meaningful information from unstructured data in its crude native form. Processing
unstructured data requires the knowledge of the topic related to the data and the knowledge of
linking the data to make it resourceful. Even more disadvantageous is that there is a shortage
of data science professionals despite the continually growing demand across industries.

2. Limited choice of tools

Unstructured data requires specialized tools for manipulation besides data science expertise.
Standard data analytics tools are useful and compatible with structured data, and data
engineers only have a limited choice of tools to analyze unstructured data.

Ques : Curse of Dimensionality in machine learning ?


Ans : Machine learning can effectively analyze data with several dimensions. However, it
becomes complex to develop relevant models as the number of dimensions significantly
increases. You will get abnormal results when you try to analyze data in high-dimensional
spaces. This situation refers to the curse of dimensionality in machine learning. It depicts
the need for more computational efforts to process and analyze a machine-learning model.

Dimensions are features that may be dependent or independent. The concept of dimensions in
context to the curse of dimensionality becomes easier to understand with the help of an
example. Suppose there is a dataset with 100 features. Now let’s assume you intend to build
various separate machine learning models from this dataset. The models can be model-1,
model-2, …. model-100. The difference between these models is the number of features.

With the increase in the number of features, the model’s accuracy initially increases. However, after a specific threshold, the model’s accuracy stops increasing even though the number of features grows, because the model is fed so much information that it cannot train effectively on the genuinely relevant information.

The phenomenon when a machine learning model’s accuracy decreases, although increasing
the number of features after a certain threshold, is called the curse of dimensionality.

The curse of dimensionality in machine learning is defined as follows,

As the number of dimensions or features increases, the amount of data needed to generalize
the machine learning model accurately increases exponentially. The increase in dimensions
makes the data sparse, and it increases the difficulty of generalizing the model. More training
data is needed to generalize that model better.

The higher dimensions lead to equidistant separation between points. The higher the
dimensions, the more difficult it will be to sample from because the sampling loses its
randomness.

It becomes harder to cover the feature space with observations when there are many features. In high dimensions, all observations in the dataset tend to be roughly equidistant from all other observations. Clustering typically uses Euclidean distance to measure the similarity between observations, and meaningful clusters cannot be formed if all the distances are nearly equal.
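
A small numerical illustration of this distance concentration effect (NumPy only; the point counts and dimensions are arbitrary):

```python
# Distance concentration demo: as dimensionality grows, the nearest and farthest
# neighbours of a point become almost equally far away (NumPy only).
import numpy as np

rng = np.random.RandomState(0)
for d in (2, 10, 100, 1000):
    X = rng.rand(500, d)                          # 500 random points in the unit cube [0, 1]^d
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point to the rest
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}   relative spread (max - min) / min = {spread:.3f}")
```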

Ques : Underfitting vs Overfitting in machine learning ? How does linear regression find the best-fit line using the gradient descent method ?
Ans : Underfitting in Machine Learning
A statistical model or a machine learning algorithm is said to underfit when it is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and the testing data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address the underfitting problem, we need to use more complex models, with enhanced feature representation and less regularization.

Note: The underfitting model has High bias and low variance.

Reasons for Underfitting

1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.

3. The size of the training dataset used is not enough.

Techniques to Reduce Underfitting

1. Increase model complexity.

2. Increase the number of features, performing feature engineering.

3. Remove noise from the data.

Overfitting in Machine Learning


A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model gets trained on so much detail, it starts learning from the noise and inaccurate data entries in our data set, and testing on test data then results in high variance. The model does not categorize the data correctly, because of too many details and noise. Overfitting is often caused by non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and can therefore build unrealistic models. A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters like the maximal depth if we are using decision trees.

In a nutshell, Overfitting is a problem where the evaluation of machine learning algorithms on training data is different from unseen data.

Reasons for Overfitting:

1. High variance and low bias.

2. The model is too complex.

3. The size of the training data is insufficient.

Techniques to Reduce Overfitting

1. Increase training data.

2. Reduce model complexity.

3. Early stopping during the training phase (keep an eye on the loss during training and stop as soon as the loss begins to increase).
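
A quick sketch of how under- and over-fitting show up in practice, by comparing train and test scores of decision trees of increasing depth (assuming scikit-learn; the synthetic data and depth values are arbitrary):

```python
# Diagnosing under- vs over-fitting by comparing train and test accuracy
# (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):  # too simple, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}  "
          f"test={tree.score(X_test, y_test):.2f}")

# A low train score suggests underfitting (high bias); a near-perfect train score
# with a clearly lower test score suggests overfitting (high variance).
```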

Gradient descent :

Linear regression is about finding the line of best fit for a dataset. This line can then be used
to make predictions.
Gradient descent is a tool to arrive at the line of best fit

Gradient descent way of computing line of best fit:

In gradient descent, you start with a random line. Then you change the parameters of the line
(i.e. slope and y-intercept) little by little to arrive at the line of best fit.

How do you know when you arrived at the line of best fit?

For every line you try — line A, line B, line C, etc — you calculate the sum of squares of the
errors. If line B has a smaller value than line A, then line B is a better fit, etc.

Error is your actual value minus your predicted value. The line of best fit minimizes the sum of the squares of all the errors. In linear regression, the line of best fit also happens to be the line with the least squared error. That’s why the regression line is called the LEAST SQUARES REGRESSION LINE.
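
A minimal gradient descent sketch for simple linear regression (NumPy only; the synthetic data, learning rate, and iteration count are arbitrary choices):

```python
# Gradient descent for simple linear regression (NumPy only): start from a random
# line y = m*x + b and repeatedly nudge m and b to reduce the mean squared error.
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 3.0 * x + 2.0 + 0.1 * rng.randn(100)   # noisy samples around the line y = 3x + 2

m, b = rng.randn(), rng.randn()            # random starting line
lr = 0.1                                   # learning rate (size of each step)

for _ in range(2000):
    error = (m * x + b) - y                # predicted minus actual
    grad_m = 2 * np.mean(error * x)        # gradient of the MSE w.r.t. the slope
    grad_b = 2 * np.mean(error)            # gradient of the MSE w.r.t. the intercept
    m -= lr * grad_m                       # step "downhill" a little
    b -= lr * grad_b

print(f"Fitted line: y = {m:.2f}x + {b:.2f}   (true line: y = 3.00x + 2.00)")
```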

Ques : ML process flow diagram ? pre-processing stage ?


Ans : Machine learning workflow refers to the series of stages or steps involved in the
process of building a successful machine learning system.
The various stages involved in the machine learning workflow are-

• Data Collection
• Data Preparation
• Choosing Learning Algorithm

• Training Model

• Evaluating Model

• Predictions

1. Data Collection-

In this stage,

• Data is collected from different sources.


• The type of data collected depends upon the type of desired project.
• Data may be collected from various sources such as files, databases etc.
• The quality and quantity of gathered data directly affects the accuracy of the desired system.

2. Data Preparation-

In this stage,

• Data preparation is done to clean the raw data.


• Data collected from the real world is transformed to a clean dataset.
• Raw data may contain missing values, inconsistent values, duplicate instances etc.
• So, raw data cannot be directly used for building a model.

Different methods of cleaning the dataset are-

• Ignoring the missing values


• Removing instances having missing values from the dataset.
• Estimating the missing values of instances using mean, median or mode.
• Removing duplicate instances from the dataset.
• Normalizing the data in the dataset.
This is the most time consuming stage in machine learning workflow.

3. Choosing Learning Algorithm-

In this stage,

• The best performing learning algorithm is researched.


• It depends upon the type of problem that needs to be solved and the type of data we have.
• If the problem is to classify and the data is labeled, classification algorithms are used.
• If the problem is to perform a regression task and the data is labeled, regression algorithms
are used.
• If the problem is to create clusters and the data is unlabeled, clustering algorithms are used.

4. Training Model-

In this stage,

• The model is trained to improve its ability.


• The dataset is divided into training dataset and testing dataset.
• The training and testing split is typically of the order of 80/20 or 70/30.
• It also depends upon the size of the dataset.
• Training dataset is used for training purpose.
• Testing dataset is used for the testing purpose.

5. Evaluating Model-

In this stage,

• The model is evaluated to test if the model is any good.


• The model is evaluated using the kept-aside testing dataset.
• It allows us to test the model against data that has never been used before for training.
• Metrics such as accuracy, precision, and recall are used to test the performance, as in the short sketch after this list.
• If the model does not perform well, the model is re-built using different hyperparameters.
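
A minimal evaluation sketch (assuming scikit-learn; the synthetic data and the random forest are arbitrary choices):

```python
# Evaluating a trained model on the kept-aside test set (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)             # data the model has never seen during training

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
```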

6. Predictions-

In this stage,

• The built system is finally used to do something useful in the real world.
• Here, the true value of machine learning is realized.
Pre-processing :

Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

When creating a machine learning project, it is not always the case that we come across clean and formatted data.

Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be used directly for machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

It involves below steps:

• Getting the dataset


• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling

1) Get the Dataset


To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The data collected for a particular problem in a proper format is known as the dataset.

2) Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data preprocessing, which are:

NumPy, Matplotlib, and Pandas.

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the directory that contains it as the working directory in the IDE (for example, Spyder).
4) Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

5) Encoding Categorical data:


Categorical data is data that has categories; in our dataset, there are two categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.

6) Splitting the Dataset into the Training set and Test set
In machine learning data preprocessing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data preprocessing as by doing this, we can enhance the
performance of our machine learning model.

7) Feature Scaling
Feature scaling is the final step of data preprocessing in machine learning. It is a technique to
standardize the independent variables of the dataset in a specific range.
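
A compact sketch that strings the main preprocessing steps together (assuming pandas and scikit-learn; the small Country/Age/Salary/Purchased table is a made-up illustration of the dataset described above):

```python
# Preprocessing sketch: missing values, categorical encoding, train/test split,
# and feature scaling (assumes pandas and scikit-learn are installed).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw dataset with missing values and categorical columns.
df = pd.DataFrame({
    "Country":   ["France", "Spain", "Germany", "Spain", "France", "Germany"],
    "Age":       [44, 27, np.nan, 38, 35, np.nan],
    "Salary":    [72000, 48000, 54000, 61000, np.nan, 58000],
    "Purchased": ["No", "Yes", "No", "Yes", "Yes", "No"],
})

# 4) Handle missing data: fill numeric gaps with the column mean.
numeric = SimpleImputer(strategy="mean").fit_transform(df[["Age", "Salary"]])

# 5) Encode categorical data: one-hot encode Country, map the target to 0/1.
country = OneHotEncoder().fit_transform(df[["Country"]]).toarray()
X = np.hstack([country, numeric])
y = (df["Purchased"] == "Yes").astype(int).to_numpy()

# 6) Split into training and test sets (an 80/20 split here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 7) Feature scaling: fit the scaler on the training set only, then apply to both.
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)
```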

Ques : Hyperparameter tuning ? Grid search and random search ?
Ans : Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a machine learning
model’s hyperparameters. Hyperparameters are settings that control the learning process of
the model, such as the learning rate, the number of neurons in a neural network, or the kernel
size in a support vector machine. The goal of hyperparameter tuning is to find the values that
lead to the best performance on a given task.

What are Hyperparameters?

In the context of machine learning, hyperparameters are configuration variables that are set
before the training process of a model begins. They control the learning process itself, rather
than being learned from the data. Hyperparameters are often used to tune the performance of
a model, and they can have a significant impact on the model’s accuracy, generalization, and
other metrics.
Hyperparameter Tuning techniques

Models can have many hyperparameters, and finding the best combination of values can be treated as a search problem. The main strategies for hyperparameter tuning are:

1. GridSearchCV

2. RandomizedSearchCV

3. Bayesian Optimization

1. GridSearchCV
Grid search can be considered as a “brute force” approach to hyperparameter optimization.
We fit the model using all possible combinations after creating a grid of potential discrete
hyperparameter values. We log each set’s model performance and then choose the
combination that produces the best results. This approach is called GridSearchCV, because it
searches for the best set of hyperparameters from a grid of hyperparameters values.

An exhaustive approach that can identify the ideal hyperparameter combination is grid
search. But the slowness is a disadvantage. It often takes a lot of processing power and time
to fit the model with every potential combination, which might not be available.
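
A minimal GridSearchCV sketch (assuming scikit-learn; the iris dataset, the SVM, and the grid values are arbitrary examples):

```python
# Exhaustive hyperparameter search with GridSearchCV (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "C":     [0.1, 1, 10],     # regularization strength
    "gamma": [0.01, 0.1, 1],   # RBF kernel width
}

# Every combination in the grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy    :", round(search.best_score_, 3))
```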

2. RandomizedSearchCV
As the name suggests, the random search method selects values at random as opposed to the
grid search method’s use of a predetermined set of numbers. Every iteration, random search
attempts a different set of hyperparameters and logs the model’s performance. It returns the
combination that provided the best outcome after several iterations. This approach reduces
unnecessary computation.

RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through only a fixed number of hyperparameter settings. It moves within the grid in a random fashion to find the best set of hyperparameters. The advantage is that, in most cases, a random search will produce a comparable result faster than a grid search.

3. Bayesian Optimization
Grid search and random search are often inefficient because they evaluate many unsuitable
hyperparameter combinations without considering the previous iterations’ results. Bayesian
optimization, on the other hand, treats the search for optimal hyperparameters as an
optimization problem. It considers the previous evaluation results when selecting the next
hyperparameter combination and applies a probabilistic function to choose the combination
that will likely yield the best results. This method discovers a good hyperparameter
combination in relatively few iterations.
Ques : Regression vs Classification ?
Ans : Regression and Classification algorithms are Supervised Learning algorithms. Both the
algorithms are used for prediction in Machine learning and work with the labeled datasets.
But the difference between both is how they are used for different machine learning
problems.

The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., whereas Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.

Classification:
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.

The task of the classification algorithm is to find the mapping function to map the input(x) to
the discrete output(y).

Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:


• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Naïve Bayes

Regression:
Regression is a process of finding the correlations between dependent and independent
variables. It helps in predicting the continuous variables such as prediction of Market
Trends, prediction of House prices, etc.

The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

• Simple Linear Regression


• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
