
ML Basics Theory

The document provides an overview of machine learning, its types (supervised, unsupervised, and reinforcement learning), and key concepts such as bias, variance, and the curse of dimensionality. It explains the differences between labeled and unlabeled data, features and labels in supervised learning, and highlights various applications of machine learning across industries. Additionally, it discusses the importance of training, validation, and test data in model development and evaluation.

ML QUESTIONS

Question 1 ) What is machine learning, and how does it differ from traditional programming?
Answer )

Machine Learning | Traditional Programming | Artificial Intelligence
Machine learning is a subset of artificial intelligence (AI) that focuses on learning from data to develop an algorithm that can be used to make predictions. | In traditional programming, rule-based code is written by developers according to the problem statement. | Artificial intelligence involves making a machine capable enough to perform tasks that typically require human intelligence.
Machine learning uses a data-driven approach. A model is typically trained on historical data and then used to make predictions on new data. | Traditional programming is typically rule-based and deterministic. It has no self-learning features like machine learning and AI. | AI can involve many different techniques, including machine learning and deep learning, as well as traditional rule-based programming.
ML can find patterns and insights in large datasets that might be difficult for humans to discover. | Traditional programming depends entirely on the intelligence of the developers, so its capability is limited to the rules they write. | AI sometimes combines data with pre-defined rules, which gives it an edge in solving complex tasks, with good accuracy, that seem impossible for humans.
Machine learning is a subset of AI and is now used in various AI-based tasks such as chatbot question answering, self-driving cars, etc. | Traditional programming is often used to build applications and software systems that have specific functionality. | AI is a broad field that includes many different applications, including natural language processing, computer vision, and robotics.
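
The contrast in the table can be sketched in a few lines of Python. This is only an illustration: the toy data, the hand-written rule, and the helper `learn_threshold` are all made up for this example and do not come from any library.

```python
# Toy data (made up for illustration): (hours_studied, passed_exam)
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]


def rule_based(hours):
    """Traditional programming: the developer hard-codes the rule."""
    return 1 if hours >= 5 else 0


def learn_threshold(examples):
    """Machine learning (very simplified): pick the threshold that
    makes the fewest mistakes on the historical data."""
    best_t, best_errors = None, len(examples) + 1
    for t in sorted({x for x, _ in examples}):
        errors = sum((1 if x >= t else 0) != y for x, y in examples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


threshold = learn_threshold(data)  # learned from data, not written by hand


def learned(hours):
    return 1 if hours >= threshold else 0


rule_errors = sum(rule_based(x) != y for x, y in data)
learned_errors = sum(learned(x) != y for x, y in data)
print("learned threshold:", threshold)
print("errors (rule-based, learned):", rule_errors, learned_errors)
```

The hand-written rule stays fixed unless a developer edits it, while the learned threshold changes automatically if the data changes.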
Question 2 ) Can you explain the three main types of machine learning: supervised, unsupervised,
and reinforcement learning?
Answer )

* Supervised :
This type of machine learning gets its name because the machine is “supervised” while learning: you feed the algorithm information that helps it learn.
The outcome you provide to the machine is the label, and the rest of the information you give is used as input features.

In supervised learning, the machine is trained on a set of labeled data, which means that each input is paired with the desired output. The machine then learns to predict the output for new input data.
Supervised learning is often used for tasks such as classification, regression, and object detection.
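
As a minimal sketch of supervised classification, the snippet below uses a 1-nearest-neighbor rule on a tiny labeled dataset. The fruit measurements and the `predict` helper are hypothetical and chosen only to show how labeled inputs lead to predictions for new inputs.

```python
import math

# Labeled training data (hypothetical values):
# features = (weight in grams, diameter in cm), label = fruit name
train = [
    ((150, 7.0), "apple"),
    ((170, 7.5), "apple"),
    ((8, 2.0), "cherry"),
    ((10, 2.2), "cherry"),
]


def predict(features):
    """Return the label of the closest training example (1-nearest-neighbor)."""
    nearest = min(train, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


print(predict((160, 7.2)))
print(predict((9, 2.1)))
```

The model never sees a rule such as "heavy means apple"; it only sees inputs paired with the desired outputs.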

* Unsupervised :
In unsupervised learning, the machine is trained on a set of unlabeled data, which means that the input
data is not paired with the desired output.
The machine then learns to find patterns and relationships in the data. Unsupervised learning is often
used for tasks such as clustering, dimensionality reduction, and anomaly detection.

Key Points

- Unsupervised learning allows the model to discover patterns and relationships in unlabeled data.
- Clustering algorithms group similar data points together based on their inherent characteristics.
- Feature extraction captures essential information from the data, enabling the model to make
meaningful distinctions.
- Label association assigns categories to the clusters based on the extracted patterns and characteristics.
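
Clustering can be illustrated with a very small k-means loop on one-dimensional data. This is a sketch under simplifying assumptions (fixed starting centers, a fixed number of iterations, made-up points), not a production implementation.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny k-means for 1-D data: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled data (made up): no category is given for any point.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

The algorithm discovers the two groups on its own; naming the groups afterwards is the "label association" step mentioned in the key points.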

Supervised vs. Unsupervised Machine Learning :

Parameters | Supervised machine learning | Unsupervised machine learning
Input data | Algorithms are trained using labeled data. | Algorithms are run on data that is not labeled.
Computational complexity | Usually the simpler method. | Often more computationally complex.
Accuracy | Usually more accurate, and accuracy can be measured against known labels. | Usually less accurate, and results are harder to verify.
No. of classes | The number of classes is known. | The number of classes is not known.
Data analysis | Typically uses offline analysis. | Can use real-time analysis of data.
Algorithms used | Linear and logistic regression, KNN, random forest, multi-class classification, decision tree, support vector machine, neural network, etc. | K-means clustering, hierarchical clustering, Apriori algorithm, etc.
Output | The desired output is given. | The desired output is not given.
Training data | Labeled training data is used to infer the model. | No labeled training data is used.
Complex model | Model size is limited by how much labeled data is available. | Larger and more complex models can be learned, because unlabeled data is easier to obtain.
Model testing | The model can be tested against known labels. | The model cannot be tested against known labels.
Typical task | Classification (and regression). | Clustering.
Example | Optical character recognition. | Grouping similar images without category labels.
Supervision | Needs supervision (labels) to train the model. | Does not need any supervision to train the model.
Question 3 ) What are the key differences between labeled and unlabeled data?

Feature | Labeled Data | Unlabeled Data
Definition | Data with both input features and corresponding output labels | Data with only input features and no output labels
Usage | Primarily used in supervised learning | Primarily used in unsupervised learning
Application | Train models to predict or classify based on input data | Find patterns, groupings, or structures without predefined labels
Annotation (the process of labeling data to show the outcome you want your machine learning model to predict) | Annotated with correct answers | No annotations or labels
Example | Images labeled with categories like "cat," "dog" | Images without any category labels
Cost and Effort | More expensive and time-consuming due to the need for manual annotation | Easier and cheaper to collect
Supervised Learning | Essential for training models | Not used directly in training models
Unsupervised Learning | Not applicable | Essential for discovering patterns and structures
Importance | Helps in learning the relationship between input and output | Helps in uncovering hidden patterns and relationships
Question 4 ) Features and Labels in Supervised Learning: A Practical Approach.
Answer )
The terms "feature" and "label" are fundamental concepts that form the backbone of supervised learning models.

What is a Feature?
A feature in machine learning refers to an individual measurable property or characteristic of a phenomenon being observed. Features are the input variables that the model uses to make predictions.
They are also known as independent variables, predictors, or attributes.

Characteristics of Features
1. Measurable: Features are quantifiable properties that can be measured and recorded.
2. Independent: Each feature should ideally be independent of the others, providing unique information to the model.
3. Varied Types: Features can be numerical (e.g., age, height), categorical (e.g., gender, color), or even text-based (e.g., reviews, comments).

What is a Label?
A label, also known as the target variable or dependent variable, is the output that the model is trained to predict. In supervised learning, labels are the known outcomes that the model learns to associate with the input features during training.

Characteristics of Labels
1. Dependent: Labels depend on the input features and are what the model is trained to predict.
2. Categorical or Numerical: Labels can be categorical (e.g., spam or not spam) or numerical (e.g., price of a house).
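
In code, separating features from labels usually means splitting each record into the inputs and the one column to be predicted. The records below, the column names, and the choice of "price" as the label are hypothetical, picked to match the house-price example.

```python
# Hypothetical housing records: "price" is the label, everything else is a feature.
records = [
    {"area_sqft": 900, "bedrooms": 2, "furnished": "yes", "price": 55000},
    {"area_sqft": 1400, "bedrooms": 3, "furnished": "no", "price": 82000},
    {"area_sqft": 2000, "bedrooms": 4, "furnished": "yes", "price": 120000},
]

LABEL = "price"

# X holds the input features (numerical and categorical), y holds the labels.
X = [{k: v for k, v in row.items() if k != LABEL} for row in records]
y = [row[LABEL] for row in records]

print(X[0])
print(y)
```

Here `area_sqft` and `bedrooms` are numerical features, `furnished` is a categorical feature, and `price` is a numerical label.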

Now consider a case where no labels are available. Suppose there is a basket filled with fresh fruits, and the task is to arrange fruits of the same type together. This time there is no information about the fruits beforehand; it is the first time they are being seen. So how can similar fruits be grouped without any prior knowledge about them? First, a physical characteristic of the fruit is selected, say color. Then the fruits are arranged based on color.
The groups will look something like this:

RED COLOR GROUP: apples & cherries.
GREEN COLOR GROUP: bananas & grapes.

Now take another physical characteristic, say size. The groups will then look like this:
RED COLOR AND BIG SIZE: apples.
RED COLOR AND SMALL SIZE: cherries.
GREEN COLOR AND BIG SIZE: bananas.
GREEN COLOR AND SMALL SIZE: grapes.
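
The fruit grouping can be written as a short sketch that groups items by one feature and then by two. The `group_by` helper is made up for this example; the fruit attributes are taken from the groups listed above.

```python
from collections import defaultdict

fruits = [
    {"name": "apple", "color": "red", "size": "big"},
    {"name": "cherry", "color": "red", "size": "small"},
    {"name": "banana", "color": "green", "size": "big"},
    {"name": "grape", "color": "green", "size": "small"},
]


def group_by(items, *keys):
    """Group fruit names by the values of the chosen features."""
    groups = defaultdict(list)
    for item in items:
        groups[tuple(item[k] for k in keys)].append(item["name"])
    return dict(groups)


by_color = group_by(fruits, "color")
by_color_and_size = group_by(fruits, "color", "size")

print(by_color)
print(by_color_and_size)
```

No fruit names are used to form the groups; only the features (color, size) decide which items end up together.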
Parameters | Supervised Learning | Unsupervised Learning
Input data | Uses known and labeled data as input | Uses unlabeled data as input
Computational complexity | Usually less computationally complex | Often more computationally complex
Real-time | Typically uses offline analysis | Can use real-time analysis of data
Number of classes | The number of classes is known | The number of classes is not known
Accuracy of results | Usually accurate and reliable results | Usually moderately accurate and reliable results
Output data | The desired output is given | The desired output is not given
Model | Model size is limited by how much labeled data is available | Larger and more complex models can be learned, because unlabeled data is easier to obtain
Training data | Labeled training data is used to infer the model | No labeled training data is used
Typical task | Classification (and regression) | Clustering
Test of model | The model can be tested against known labels | The model cannot be tested against known labels
Example | Optical character recognition | Grouping similar images without category labels



Question 5) What are some common use cases and applications of machine learning in various
industries?
Answer )

1. Image Recognition : Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, and so on in digital images. A popular use case of image recognition and face detection is automatic friend tagging suggestion.

Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with their names. The technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in the picture.

2. Speech Recognition : Speech recognition is the process of converting voice instructions into text. It is also known as "speech to text" or "computer speech recognition."
At present, machine learning algorithms are widely used in speech recognition applications.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction: It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, using two sources:

- Real-time location of the vehicle from the Google Maps app and sensors
- Average time taken on past days at the same time of day

4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment companies such as Amazon and Netflix to recommend products to the user.
Whenever we search for a product on Amazon, we start seeing advertisements for the same product while browsing the internet in the same browser. This is because of machine learning.

Similarly, when we use Netflix, we find recommendations for series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars: Machine learning plays a significant role in self-driving cars. Tesla, a well-known car manufacturer, is working on self-driving cars. It uses machine learning to train models that detect people and objects while driving.

6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam.
Important mail arrives in our inbox marked with the important symbol, and spam emails go to our spam box.
Below are some spam filters used by Gmail:

- Content filter
- Header filter
- General blacklists filter

7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions.
These assistants can help us in various ways through voice instructions alone, such as playing music, calling someone, opening an email, or scheduling an appointment.

8. Online Fraud Detection:
Machine learning makes our online transactions safer and more secure by detecting fraudulent transactions.
Whenever we perform an online transaction, fraud can take place in various ways, such as fake accounts, fake IDs, and money stolen in the middle of a transaction. To detect this, a model such as a feed-forward neural network checks whether a transaction is genuine or fraudulent.

Question 7 ) What are bias and variance?


Answer )

Bias is the error that arises because the model cannot fully capture the true relationship in the data. It appears as a difference between the model's predicted value and the actual value.

These differences between the actual (expected) values and the predicted values are known as bias error, or error due to bias.

Let Y be the true value of a parameter, and let Ŷ be an estimator of Y based on a sample of data.
Then, the bias of the estimator Ŷ is given by:
Bias(Ŷ) = E(Ŷ) - Y

* Low Bias: A low bias value means fewer assumptions are made to build the target function. In this case, the model will closely match the training dataset.
* High Bias: A high bias value means more assumptions are made to build the target function. In this case, the model will not closely match the training dataset.
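
The definition of bias as E(estimator) minus the true value can be checked on a small worked example. The sketch below assumes a made-up population (the six faces of a die) and compares two variance estimators over every possible sample of size 2; the helper names are illustrative only.

```python
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # assumed population: faces of a fair die


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs, ddof):
    """Variance with divisor len(xs) - ddof."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)


# Y: the true value we want to estimate (population variance).
true_var = variance(population, ddof=0)

# Every equally likely sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# E(estimator): average of the estimator over all possible samples.
expected_div_n = mean([variance(s, ddof=0) for s in samples])
expected_div_n_minus_1 = mean([variance(s, ddof=1) for s in samples])

bias_div_n = expected_div_n - true_var
bias_div_n_minus_1 = expected_div_n_minus_1 - true_var

print("bias when dividing by n:    ", bias_div_n)
print("bias when dividing by n - 1:", bias_div_n_minus_1)
```

Dividing by n systematically underestimates the variance (negative bias), while dividing by n - 1 gives an estimator whose bias is zero up to floating-point rounding.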

** When the training error is high, the problem is under-fitting [high bias]: the model has not been fitted properly to the training data. **
As a rule of thumb, bias shows up as error on the training data.
Variance shows up as error on the test data: a model with high variance changes a lot when the training data changes.
Fig. White dots: training data; red dots: test data.

Diag1 : High bias [the model is not properly fitted to the training data]. This is an underfitting problem. [Underfitting error]
Two cases are then possible: (A) high variance, (B) low variance.
A) High variance: test error is high; the red dots are far from the best-fit line.
B) Low variance: test error is low; the red dots are close to the best-fit line.

Diag2 : Low bias and low variance: training error is low and test error is also low. [Generalized model]

Diag3 : Low bias and high variance: training error is low, tending to zero, because the model has fitted the training data too closely, so the test error becomes high. [Overfitting]
Question 8 ) What is the "curse of dimensionality," and why is it a challenge in machine learning?
Answer )

Dimensions refer to the features or attributes of the data. If we have 4 features, the data is 4-dimensional (4D); each feature we add increases the dimensionality by one.

The curse of dimensionality refers to the phenomenon where the efficiency and effectiveness of algorithms deteriorate as the dimensionality of the data increases.

In high-dimensional spaces, data points become sparse, making it challenging to discern meaningful patterns or relationships because of the vast amount of data required to adequately sample the space.

- As we add more dimensions to our dataset, the volume of the space increases exponentially. This means that the data becomes sparse.

What problems does it cause?

(1) Data sparsity. As mentioned, data becomes sparse, meaning that most of the high-dimensional space
is empty. This makes clustering and classification tasks challenging.

(2) Increased computation. More dimensions mean more computational resources and time to process
the data.

(3) Overfitting. With higher dimensions, models can become overly complex, fitting to the noise rather
than the underlying pattern. This reduces the model's ability to generalize to new data.

(4) Distances lose meaning. In high dimensions, the difference in distances between data points tends to
become negligible, making measures like Euclidean distance less meaningful.

(5) Performance degradation. Algorithms, especially those relying on distance measurements like k-
nearest neighbors, can see a drop in performance.

(6) Visualization challenges. High-dimensional data is hard to visualize, making exploratory data
analysis more difficult.
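
Point (4), distances losing meaning, can be observed with random data. The sketch below assumes points drawn uniformly from the unit cube and measures the relative contrast between the farthest and nearest neighbor of a query point; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative contrast (max - min) / min of Euclidean distances from one
    random query point to n_points random points in the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


contrast_2d = distance_contrast(2)
contrast_500d = distance_contrast(500)

print("contrast in 2 dimensions:  ", contrast_2d)
print("contrast in 500 dimensions:", contrast_500d)
```

In low dimensions the nearest point is much closer than the farthest one; in high dimensions the two distances are nearly the same, which is why distance-based methods such as k-nearest neighbors struggle.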

How to Overcome the Curse of Dimensionality?


To overcome the curse of dimensionality, you can consider the following strategies:

1. Dimensionality Reduction Techniques:

Feature Selection: Identify and select the most relevant features from the original dataset while
discarding irrelevant or redundant ones. This reduces the dimensionality of the data, simplifying the
model and improving its efficiency.
Feature Extraction: Transform the original high-dimensional data into a lower-dimensional space by creating new features that capture the essential information. Techniques such as Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are commonly used, with t-SNE applied mostly for visualization.

2. Data Preprocessing:

Normalization: Scale the features to a similar range to prevent certain features from dominating others,
especially in distance-based algorithms.
Handling Missing Values: Address missing data appropriately through imputation or deletion to ensure
robustness in the model training process.
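
Normalization can be sketched as min-max scaling, which maps each feature to the range 0 to 1. The feature values and the helper `min_max_scale` are made up for this example; other scaling schemes (such as standardization) are equally common.

```python
def min_max_scale(column):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Two features on very different scales (hypothetical values).
ages = [20, 30, 40]
incomes = [20000, 50000, 80000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

print(scaled_ages)
print(scaled_incomes)
```

Before scaling, any distance between two people is dominated by income; after scaling, both features contribute on the same footing.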

Question 9 ) Explain the terms "training data," "validation data," and "test data" in the context
of model development and evaluation ?
Answer )

Training data. : This is the data used to build the machine learning model. The data scientist feeds the algorithm input data paired with the expected output. The model processes the data repeatedly to learn its behavior and then adjusts itself to serve its intended purpose.

“Training data are collections of examples or samples that are used to 'teach' or 'train' the machine learning model.”

Validation data. : During training, validation data introduces data that the model has not been trained on. Validation data provides the first test against unseen data, allowing data scientists to evaluate how well the model makes predictions on new data. Not all data scientists use validation data, but it provides helpful information for optimizing hyperparameters, which influence how the model learns from data.

“Validation datasets contain different samples to evaluate trained ML models. It is still possible to tune and control the model at this stage. Validation data is used to assess model performance and fine-tune the model's hyperparameters.
A validation dataset tells us how well the model is learning and adapting, allowing for adjustments and optimizations to be made to the model's parameters or hyperparameters before it is finally put to the test.”

Test data. : After the model is built, test data once again checks that it can make accurate predictions. The model must never see the test labels during training or tuning; they are held back and used only to score its predictions. Test data provides a final, real-world check on an unseen dataset to confirm that the ML algorithm was trained effectively.

“A test dataset is a separate sample, an unseen dataset, that provides an unbiased final evaluation of a model fit. The inputs in the test data are similar to those in the previous stages but are not the same data.”
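
A three-way split can be sketched as follows. The 70/15/15 proportions, the seed, and the `split_dataset` helper are assumptions made for this example; real projects pick proportions to suit the amount of data available.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # shuffle to avoid ordering effects
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]        # whatever remains is the test set
    return train, val, test


data = list(range(100))  # stand-in for 100 labeled examples
train, val, test = split_dataset(data)

print(len(train), len(val), len(test))
```

Each example lands in exactly one of the three sets, so the validation and test sets really are unseen by the training step.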
Question 10 ) What is the difference between bias and variance in machine learning, and how do
they relate to model performance?

Answer )

High Bias, Low Variance: A model with high bias and low variance is said to be underfitting.

High Variance, Low Bias: A model with high variance and low bias is said to be overfitting.

High-Bias, High-Variance: A model has both high bias and high variance, which means that the model
is not able to capture the underlying patterns in the data (high bias) and is also too sensitive to changes
in the training data (high variance). As a result, the model will produce inconsistent and inaccurate
predictions on average.

Low Bias, Low Variance: A model that has low bias and low variance is able to capture the underlying patterns in the data (low bias) and is not too sensitive to changes in the training data (low variance). This is the ideal scenario for a machine learning model, as it is able to generalize well to new, unseen data and produce consistent and accurate predictions. In practice this is hard to achieve fully, because reducing bias tends to increase variance and vice versa (the bias-variance trade-off).
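
The cases above can be made concrete with three toy models fitted to the same data. The data (a noisy line with fixed, made-up noise values) and the three model builders are assumptions for this sketch: a constant predictor stands in for underfitting, a least-squares line for a balanced model, and a memorizing nearest-neighbor lookup for overfitting.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_mean(xs, ys):
    """High bias: always predict the average label."""
    avg = sum(ys) / len(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Balanced: ordinary least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """High variance: return the label of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Made-up data: y = 2x plus fixed noise values.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_noise = [0.5, -0.4, 0.3, -0.6, 0.4, -0.3, 0.6, -0.5]
train_y = [2 * x + n for x, n in zip(train_x, train_noise)]

test_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
test_noise = [-0.3, 0.5, -0.5, 0.2, -0.4, 0.6, -0.2]
test_y = [2 * x + n for x, n in zip(test_x, test_noise)]

mean_model = fit_mean(train_x, train_y)
line_model = fit_line(train_x, train_y)
memo_model = fit_memorizer(train_x, train_y)

mean_train, mean_test = mse(mean_model, train_x, train_y), mse(mean_model, test_x, test_y)
line_train, line_test = mse(line_model, train_x, train_y), mse(line_model, test_x, test_y)
memo_train, memo_test = mse(memo_model, train_x, train_y), mse(memo_model, test_x, test_y)

print("constant   train/test MSE:", mean_train, mean_test)
print("line       train/test MSE:", line_train, line_test)
print("memorizer  train/test MSE:", memo_train, memo_test)
```

On this data the constant model has high error everywhere (underfitting), the memorizer has zero training error but a clearly worse test error than the line (overfitting), and the line has low error on both sets.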

Question 11 ) What are some common evaluation metrics used to assess the performance of
machine learning models, and when should each be used?
Answer )

Classification Accuracy
Logarithmic loss
Area under Curve
F1 score
Precision, Recall
Confusion Matrix

Classification Accuracy :
Classification accuracy is a fundamental metric for evaluating the performance of a classification
model, providing a quick snapshot of how well the model is performing in terms of correct predictions.

This is calculated as the ratio of correct predictions to the total number of input samples.
Accuracy = (Number of correct predictions) / (Total number of input samples)
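
The accuracy formula, together with the confusion matrix, precision, recall, and F1 score from the list above, can be computed by hand for a binary problem. The labels and predictions below are made up; the formulas are the standard definitions, with class 1 treated as the positive class.

```python
# Made-up ground truth and predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Confusion matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)       # correct predictions / all samples
precision = tp / (tp + fp)               # of predicted positives, how many are right
recall = tp / (tp + fn)                  # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("confusion matrix [[TN, FP], [FN, TP]]:", [[tn, fp], [fn, tp]])
print("accuracy, precision, recall, F1:", accuracy, precision, recall, f1)
```

Accuracy is a reasonable summary when the classes are balanced; precision, recall, and F1 are usually more informative when one class is rare.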
Logarithmic Loss :
It is also known as log loss. It works by penalizing false classifications: the more confident a wrong prediction is, the larger the penalty.
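
For a binary classifier, log loss is the average of -[y log(p) + (1 - y) log(1 - p)] over all samples, where p is the predicted probability of class 1. The sketch below uses made-up probabilities to show how the penalty grows with confident mistakes; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1]

confident_right = log_loss(y_true, [0.9, 0.1, 0.9])  # confident and correct
unsure = log_loss(y_true, [0.5, 0.5, 0.5])           # no commitment either way
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1])  # confident and wrong

print(confident_right, unsure, confident_wrong)
```

Unlike accuracy, log loss uses the predicted probabilities, so it suits models whose confidence matters and not just their final class decision.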
Question 12 ) Can you provide an overview of the steps involved in a typical machine learning
workflow, from data preparation to model deployment?
Answer )

Understanding the Machine Learning Workflow :


Machine learning workflows define the steps initiated during a particular machine learning
implementation. Machine learning workflows vary by project, but four basic phases are typically
included.

(1) Gathering machine learning data : Gathering data is one of the most important stages of machine learning workflows. During data collection, the quality of the data you collect defines the potential usefulness and accuracy of your project.
To collect data, you need to identify your sources and aggregate data from those sources into a single dataset. This could mean streaming data from Internet of Things sensors, downloading open source datasets, or constructing a data lake from assorted files, logs, or media.

(2) Data pre-processing :
Once your data is collected, you need to pre-process it. Pre-processing involves cleaning, verifying, and formatting data into a usable dataset. If you are collecting data from a single source, this may be a relatively straightforward process. However, if you are aggregating several sources, you need to make sure that data formats match, that the data is equally reliable, and that any potential duplicates are removed.

Building datasets
This phase involves breaking processed data into three datasets: training, validation, and test.
A) Training set: used to initially train the algorithm and teach it how to process information. This set defines model classifications through parameters.
B) Validation set: used to estimate the accuracy of the model. This dataset is used to fine-tune model parameters.
C) Test set: used to assess the accuracy and performance of the model. This set is meant to expose any issues or mistraining in the model.

(3) Training and refinement :


Once you have datasets, you are ready to train your model. This involves feeding your training set to
your algorithm so that it can learn appropriate parameters and features used in classification.
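
What "learning appropriate parameters" means can be shown with the simplest possible training loop: gradient descent on a straight-line model. The data, learning rate, and number of epochs below are assumptions for this sketch, and real workflows would normally rely on an ML library instead of a hand-written loop.

```python
def train_linear(xs, ys, lr=0.05, epochs=1000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # start from arbitrary parameters
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move the parameters a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up training set that follows y = 2x + 1 exactly.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

w, b = train_linear(xs, ys)
print("learned parameters:", w, b)
```

The loop never sees the rule "multiply by 2 and add 1"; it recovers values close to those parameters purely from the training examples.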
