ML Basics Theory
Question 1 ) What is machine learning, and how does it differ from traditional programming?
Answer )
Machine Learning:
- Machine Learning uses a data-driven approach. It is typically trained on historical data and then used to make predictions on new data.
- ML can find patterns and insights in large datasets that might be difficult for humans to discover.
- Machine Learning is a subset of AI. It is now used in various AI-based tasks like chatbot question answering, self-driving cars, etc.
Traditional Programming:
- Traditional programming is typically rule-based and deterministic. It has no self-learning features like Machine Learning and AI.
- Traditional programming is totally dependent on the intelligence of developers, so it has very limited capability.
- Traditional programming is often used to build applications and software systems that have specific functionality.
AI:
- AI can involve many different techniques, including Machine Learning and Deep Learning, as well as traditional rule-based programming.
- Sometimes AI uses a combination of both data and pre-defined rules, which gives it a great edge in solving complex tasks with good accuracy that seem impossible to humans.
- AI is a broad field that includes many different applications, including natural language processing, computer vision, and robotics.
Question 2 ) Can you explain the three main types of machine learning: supervised, unsupervised,
and reinforcement learning?
Answer )
* Supervised:
This machine learning type got its name because the machine is “supervised” while learning, which
means you’re feeding the algorithm information to help it learn.
The outcome you provide the machine is labelled data, and the rest of the information you give is used
as input features.
In supervised learning, the machine is trained on a set of labeled data, which means that the input data is
paired with the desired output. The machine then learns to predict the output for new input data.
Supervised learning is often used for tasks such as classification, regression, and object detection.
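As a minimal sketch of supervised learning (assuming scikit-learn; its built-in iris dataset is just an illustrative choice), a classifier is fit on labeled data and then scored on data it has not seen:

```python
# Supervised learning sketch: train on labeled data, predict on new data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)           # X: input features, y: labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # "supervised": learns from labels
print("test accuracy:", model.score(X_test, y_test))
```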
* Unsupervised:
In unsupervised learning, the machine is trained on a set of unlabeled data, which means that the input
data is not paired with the desired output.
The machine then learns to find patterns and relationships in the data. Unsupervised learning is often
used for tasks such as clustering, dimensionality reduction, and anomaly detection.
Key Points
- Unsupervised learning allows the model to discover patterns and relationships in unlabeled data.
- Clustering algorithms group similar data points together based on their inherent characteristics.
- Feature extraction captures essential information from the data, enabling the model to make
meaningful distinctions.
- Label association assigns categories to the clusters based on the extracted patterns and characteristics.
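As a minimal sketch of unsupervised learning (assuming scikit-learn; the blob data is synthetic and purely illustrative), k-means clustering groups unlabeled points without ever seeing a label:

```python
# Unsupervised learning sketch: k-means clustering on unlabeled data.
# The true labels returned by make_blobs are deliberately discarded.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])         # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)     # learned group centres
```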
Supervised vs. unsupervised learning:
- Input data: supervised algorithms are trained using labeled data; unsupervised algorithms are used against data that is not labeled.
- Complex models: it is not possible to learn models as large and complex with supervised learning as with unsupervised learning; unsupervised learning makes larger and more complex models possible.
- Testing the model: we can test a supervised model against its known labels; we cannot test an unsupervised model in the same way.
What is a Feature?
A feature in machine learning refers to an individual measurable property or characteristic of a
phenomenon being observed. Features are the input variables that the model uses to make predictions.
They are also known as independent variables, predictors, or attributes.
1. Measurable: Features are quantifiable properties that can be measured and recorded.
2. Independent: Each feature should ideally be independent of the others, providing unique information
to the model.
3. Varied Types: Features can be numerical (e.g., age, height), categorical (e.g., gender, color), or even text-based (e.g., reviews, comments).
What is a Label?
A label, also known as the target variable or dependent variable, is the output that the model is trained
to predict. In supervised learning, labels are the known outcomes that the model learns to associate with
the input features during training
Characteristics of Labels
1. Dependent: Labels depend on the input features and are the result of the model's prediction.
2. Categorical or Numerical: Labels can be categorical (e.g., spam or not spam) or numerical (e.g., price
of a house).
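To make the distinction concrete, here is a minimal sketch (the house-price numbers are hypothetical, and scikit-learn is assumed) in which the measurable properties are the features X and the price is the label y:

```python
# Features vs. label sketch with hypothetical house-price data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [area_sqft, bedrooms, age_years] -- the independent variables.
X = np.array([[1400, 3, 10],
              [1600, 3, 5],
              [2400, 4, 2]])

# Label: price -- the dependent variable the model learns to predict.
y = np.array([245_000, 312_000, 489_000])

model = LinearRegression().fit(X, y)
print(model.predict([[1800, 3, 7]]))   # predicted price for new features
```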
Suppose there is a basket filled with fresh fruits, and the task is to arrange the same type of fruits in one place. This time there is no information about those fruits beforehand; it is the first time the fruits are being seen or discovered. So how do we group similar fruits without any prior knowledge about them? First, some physical characteristic of the fruit is selected, say colour. Then the fruits are arranged based on colour.
The fruits end up grouped by colour, for example a red group, a green group, and a yellow group.
Some popular real-world applications of machine learning:
1. Image Recognition: Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend-tagging suggestions.
Whenever we upload a photo with our Facebook friends, then we automatically get a tagging suggestion
with name, and the technology behind this is machine learning's face detection and recognition
algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face recognition and
person identification in the picture.
2 . Speech Recognition : Speech recognition is a process of converting voice instructions into text, and
it is also known as "Speech to text", or "Computer speech recognition."
At present, machine learning algorithms are widely used by various applications of speech recognition.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction: It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two inputs:
- Real-time location of the vehicle from the Google Maps app and sensors.
- Average time taken on past days at the same time.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such as Amazon,
Netflix, etc., for product recommendation to the user.
Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
Similarly, when we use Netflix, we get recommendations for series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars: Machine learning plays a significant role in self-driving cars. Tesla, the most popular car-manufacturing company, is working on self-driving cars, using machine learning methods to train car models to detect people and objects while driving.
6. Email spam filtering: Machine learning is also used to classify incoming email as spam or not spam, using filters such as:
- Content filter
- Header filter
- General blacklists filter
Bias is simply defined as the inability of the model to capture the true relationship in the data, because of which there is some difference (error) between the model's predicted value and the actual value.
These differences between actual or expected values and the predicted values are known as bias error, or error due to bias.
Let Y be the true value of a parameter, and let Ŷ be an estimator of Y based on a sample of data.
Then, the bias of the estimator Ŷ is given by:
Bias(Ŷ) = E[Ŷ] − Y
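As a small numeric check of this definition (an assumed example: the "divide by n" variance estimator, which is known to be biased), the bias can be approximated empirically by averaging the estimator over many samples:

```python
# Estimating Bias(Y_hat) = E[Y_hat] - Y empirically.
# Assumed example: the biased (divide-by-n) variance estimator.
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                     # Y: the true parameter value
n, trials = 10, 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
estimates = samples.var(axis=1, ddof=0)   # Y_hat: divides by n, hence biased

print("empirical bias:  ", estimates.mean() - true_var)
print("theoretical bias:", -true_var / n)  # E[Y_hat] - Y = -sigma^2 / n
```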
* Low Bias: A low bias value means fewer assumptions are made to build the target function. In this case, the model will closely match the training dataset.
* High Bias: A high bias value means more assumptions are made to build the target function. In this case, the model will not closely match the training dataset.
**When training error is high, the problem is underfitting (high bias): the model has not been fitted properly on the training data.**
Bias is measured on the training data; variance is measured on the test data.
[Figure: white dots are training data, red dots are test data.]
Diagram 1: High bias (the model is not fitted properly on the training data), which means we have an underfitting problem. Two cases are then possible: (A) high variance or (B) low variance.
A) High variance: test error is high; the red dots lie far from the best-fitted line.
B) Low variance: test error is low; the red dots lie close to the best-fitted line.
Diagram 2: Low bias: training error is low, so the test error also tends to be low (a generalized model).
Diagram 3: Training error tends to zero (the model fits the training data too closely), which is why the test error becomes high (overfitting).
Question 8 ) What is the "curse of dimensionality," and why is it a challenge in machine learning?
Answer )
Dimensions refer to the features or attributes of data. For example, if we have 4 features, then our dimensionality is 4; if we add more features, the dimensionality increases further.
The Curse of Dimensionality refers to the phenomenon where the efficiency and effectiveness of algorithms deteriorate as the dimensionality of the data increases.
In high-dimensional spaces, data points become sparse, making it challenging to discern meaningful
patterns or relationships due to the vast amount of data required to adequately sample the space.
- As we add more dimensions to our dataset, the volume of the space increases exponentially. This means that the data becomes sparse.
(1) Data sparsity. As mentioned, data becomes sparse, meaning that most of the high-dimensional space
is empty. This makes clustering and classification tasks challenging.
(2) Increased computation. More dimensions mean more computational resources and time to process
the data.
(3) Overfitting. With higher dimensions, models can become overly complex, fitting to the noise rather
than the underlying pattern. This reduces the model's ability to generalize to new data.
(4) Distances lose meaning. In high dimensions, the difference in distances between data points tends to
become negligible, making measures like Euclidean distance less meaningful.
(5) Performance degradation. Algorithms, especially those relying on distance measurements like k-
nearest neighbors, can see a drop in performance.
(6) Visualization challenges. High-dimensional data is hard to visualize, making exploratory data
analysis more difficult.
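Point (4) can be seen directly in a minimal sketch (assuming NumPy; the uniform random points are purely illustrative): as the number of dimensions grows, the nearest and farthest distances from a point become nearly equal:

```python
# Distance concentration sketch: in high dimensions, the ratio between
# the smallest and largest pairwise distance approaches 1.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((200, d))                       # 200 points in [0, 1]^d
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # skip self-distance
    print(f"d={d:5d}  min/max distance ratio: {dists.min() / dists.max():.3f}")
```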
How can the curse of dimensionality be mitigated?
1. Dimensionality Reduction:
Feature Selection: Identify and select the most relevant features from the original dataset while discarding irrelevant or redundant ones. This reduces the dimensionality of the data, simplifying the model and improving its efficiency.
Feature Extraction: Transform the original high-dimensional data into a lower-dimensional space by
creating new features that capture the essential information. Techniques such as Principal Component
Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are commonly used for
feature extraction.
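As a minimal sketch of feature extraction (assuming scikit-learn; the random 10-dimensional data is purely illustrative), PCA projects the data onto a lower-dimensional space:

```python
# Feature extraction sketch: PCA reduces 10 features to 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 10))                 # 100 samples, 10 features

pca = PCA(n_components=2)                 # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # variance captured per component
```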
2. Data Preprocessing:
Normalization: Scale the features to a similar range to prevent certain features from dominating others,
especially in distance-based algorithms.
Handling Missing Values: Address missing data appropriately through imputation or deletion to ensure
robustness in the model training process.
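Both preprocessing steps can be sketched minimally (assuming scikit-learn; the toy matrix with one missing value is illustrative):

```python
# Preprocessing sketch: impute missing values, then normalize features.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],               # a missing value to impute
              [3.0, 600.0]])

X_filled = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_filled)   # zero mean, unit variance
print(X_scaled)
```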
Question 9 ) Explain the terms "training data," "validation data," and "test data" in the context of model development and evaluation?
Answer )
Training data: This type of data builds up the machine learning algorithm. The data scientist feeds the
algorithm input data, which corresponds to an expected output. The model evaluates the data repeatedly
to learn more about the data’s behavior and then adjusts itself to serve its intended purpose.
“Training data are collections of examples or samples that are used to 'teach' or 'train' the machine learning model.”
Validation data: During training, validation data introduces new data into the model that it hasn't evaluated before. Validation data provides the first test against unseen data, allowing data scientists to evaluate how well the model makes predictions based on the new data. Not all data scientists use validation data, but it can provide some helpful information to optimize hyperparameters, which influence how the model assesses data.
“ validation datasets contain different samples to evaluate trained ML models. It is still possible to tune
and control the model at this stage. Working on validation data is used to assess the model performance
and fine-tune the parameters of the model
A validation dataset tells us how well the model is learning and adapting, allowing for adjustments
and optimizations to be made to the model's parameters or hyperparameters before it's finally put to the
test. “
Test data: After the model is built, testing data once again validates that it can make accurate predictions. While training and validation data include labels to monitor performance metrics of the model, the labels of the test data are withheld from the model during prediction and used only to score the final result. Test data provides a final, real-world check against an unseen dataset to confirm that the ML algorithm was trained effectively.
“a test data set is a separate sample, an unseen data set, to provide an unbiased final evaluation of a
model fit. The inputs in the test data are similar to the previous stages but not the same data”
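A minimal sketch of producing all three sets (assuming scikit-learn and an arbitrary 60/20/20 split; the array contents are placeholders):

```python
# Splitting data into training, validation, and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve out 20% as the final, untouched test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then split the rest 75/25 into train and validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 30 10 10
```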
Question 10 ) What is the difference between bias and variance in machine learning, and how do
they relate to model performance?
Answer )
High Bias, Low Variance: A model with high bias and low variance is said to be underfitting.
High Variance, Low Bias: A model with high variance and low bias is said to be overfitting.
High-Bias, High-Variance: A model has both high bias and high variance, which means that the model
is not able to capture the underlying patterns in the data (high bias) and is also too sensitive to changes
in the training data (high variance). As a result, the model will produce inconsistent and inaccurate
predictions on average.
Low Bias, Low Variance: A model that has low bias and low variance means that the model is able to
capture the underlying patterns in the data (low bias) and is not too sensitive to changes in the training
data (low variance). This is the ideal scenario for a machine learning model, as it is able to generalize
well to new, unseen data and produce consistent and accurate predictions. In practice, however, it is rarely possible to achieve both perfectly, since reducing bias tends to increase variance and vice versa.
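These regimes can be observed in a minimal sketch (assuming scikit-learn and synthetic data; the decision-tree depths are arbitrary choices): a large gap between training and validation scores suggests high variance, while two low scores suggest high bias:

```python
# Diagnosing bias/variance by comparing train vs. validation accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 5, None]:                 # shallow -> unlimited depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={model.score(X_tr, y_tr):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")
```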
Question 11 ) What are some common evaluation metrics used to assess the performance of
machine learning models, and when should each be used?
Answer )
- Classification Accuracy
- Logarithmic Loss
- Area Under Curve (AUC)
- F1 Score
- Precision, Recall
- Confusion Matrix
Classification Accuracy :
Classification accuracy is a fundamental metric for evaluating the performance of a classification
model, providing a quick snapshot of how well the model is performing in terms of correct predictions.
This is calculated as the ratio of correct predictions to the total number of input samples:
Accuracy = (number of correct predictions) / (total number of input samples)
Logarithmic Loss:
It is also known as Log Loss. It works by penalizing incorrect classifications: the more confident the model is in a wrong prediction, the heavier the penalty.
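Several of these metrics can be computed in one minimal sketch (assuming scikit-learn; the toy labels and probabilities are made up for illustration):

```python
# Computing common classification metrics with scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]                 # hard class predictions
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8, 0.7]     # predicted probability of class 1

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
print("log loss:", log_loss(y_true, y_prob))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```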
Question 12 ) Can you provide an overview of the steps involved in a typical machine learning
workflow, from data preparation to model deployment?
Answer )
(1) Gathering machine learning data : Gathering data is one of the most important stages of machine
learning workflows. During data collection, you are defining the potential usefulness and accuracy of
your project with the quality of the data you collect.
To collect data, you need to identify your sources and aggregate data from those sources into a single dataset. This could mean streaming data from Internet of Things sensors, downloading open source data sets, or constructing a data lake from assorted files, logs, or media.
(2) Building datasets: This phase involves breaking processed data into three datasets—training, validation, and testing:
A) Training set—used to initially train the algorithm and teach it how to process information. This set
defines model classifications through parameters.
B ) Validation set— used to estimate the accuracy of the model. This dataset is used to finetune model
parameters.
C ) Test set— used to assess the accuracy and performance of the model. This set is meant to expose any issues or mistraining in the model.
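Putting the workflow together, here is a minimal end-to-end sketch (assuming scikit-learn; the iris dataset and logistic regression are arbitrary stand-ins): build the three datasets, train, validate, then test:

```python
# End-to-end workflow sketch: split data, train, validate, test.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Build the three datasets: train / validation / test.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # train
print("validation accuracy:", model.score(X_val, y_val))          # tune here
print("test accuracy:", model.score(X_test, y_test))              # final check
```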