Aiml Ut2 QB Solution
1. Why is feature engineering important in model building? List out some of the
techniques used for feature engineering.
Answer:
i. Imputation: Missing values are a typical problem in machine learning data sets and affect how
machine learning algorithms behave. Imputation is the process of replacing missing data with
statistical estimates of the missing values, producing a complete data set that can be used to train
machine learning models.
ii. One-hot encoding: A process by which categorical data is converted into a form that the
machine learning algorithm understands so it can make better predictions.
iii. Bag of words: A counting algorithm that calculates how many times a word is repeated in a
document. It can be used to determine similarities and differences in documents for such
applications as search and document classification.
iv. Automated feature engineering: This technique pulls out useful and meaningful features
using a framework that can be applied to any problem. Automated feature engineering
enables data scientists to be more productive by allowing them to spend more time on other
components of machine learning. This technique also allows citizen data scientists to do
feature engineering using a framework-based approach.
v. Binning: Binning, or grouping data, is key to preparing numerical data for machine learning.
This technique can be used to replace a column of numbers with categorical values
representing specific ranges.
vi. N-grams: Sequences of n consecutive items (such as words) used as features; they help predict
the next item in a sequence. In sentiment analysis, n-gram features help capture the sentiment of
a text or document.
vii. Feature crosses: A way to combine two or more categorical features into one. This
technique is particularly useful when certain features together denote a property better than
they do by themselves.
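For illustration, here is a minimal sketch of a few of these techniques (imputation, one-hot encoding, binning, and a feature cross) using pandas; the column names and values are made up:

```python
import pandas as pd

# Toy data with a missing value, a categorical column and a numeric column
# (all names and values are hypothetical)
df = pd.DataFrame({
    "age":    [25.0, None, 47.0, 35.0],
    "city":   ["Pune", "Mumbai", "Pune", "Delhi"],
    "income": [30000, 52000, 61000, 45000],
})

# i. Imputation: replace the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# ii. One-hot encoding: turn the categorical 'city' column into 0/1 columns
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

# v. Binning: replace the raw income with three categorical ranges
df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# vii. Feature cross: combine two categorical features into a single feature
df["city_income"] = df["city"] + "_" + df["income_bin"].astype(str)

print(df)
```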
2. Why is an exploratory data analysis important? What are the components of EDA?
Answer:
EDA makes it simple to understand the structure of a dataset, which makes data modelling easier.
The primary goal of EDA is to make the data ‘clean’, that is, free of redundancies. It helps
identify incorrect data points so that they can be readily removed and the data cleaned.
Furthermore, it helps us understand the relationships between the variables, giving us a broader
view of the data and allowing us to build on it by leveraging those relationships. It also aids in
evaluating the dataset’s statistical measures.
Outliers or abnormal occurrences in a dataset can have an impact on the accuracy of machine
learning models. The dataset might also contain some missing or duplicate values. EDA may be
used to eliminate or resolve all of the dataset’s undesirable qualities.
i. Data Collection: Data collection is an essential part of exploratory data analysis. It refers
to the process of finding and loading data into our system. Good, reliable data can be
found on various public sites or bought from private organizations. Some reliable sites for
data collection are Kaggle, Github, Machine Learning Repository, etc.
ii. Data Cleaning: Data cleaning refers to the process of removing unwanted variables and
values from your dataset and getting rid of any irregularities in it. Such anomalies can
disproportionately skew the data and hence adversely affect the results. Some steps that
can be done to clean data are:
Removing missing values, outliers, and unnecessary rows/ columns.
Re-indexing and reformatting our data.
iii. Univariate Analysis: In Univariate Analysis, you analyze data of just one variable. A
variable in your dataset refers to a single feature/ column. You can do this either with
graphical or non-graphical means by finding specific mathematical values in the data.
Some visual methods include:
Histograms: The frequency of data is represented with rectangle bars.
Box-plots: Here the information is represented in the form of boxes.
iv. Bivariate Analysis: Here, you use two variables and compare them. This way, you can
find how one feature affects the other. It is done with scatter plots, which plot individual
data points, or correlation matrices, which show the correlation between pairs of variables
as colour hues. You can also use boxplots.
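A minimal sketch of these components in Python (the file name "housing.csv" and the column names "price" and "area" are assumptions for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# i. Data collection: load a dataset (file name and columns are hypothetical)
df = pd.read_csv("housing.csv")

# ii. Data cleaning: drop duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# iii. Univariate analysis: summary statistics, histogram and box plot of one column
print(df["price"].describe())
df["price"].plot(kind="hist")
plt.show()
df["price"].plot(kind="box")
plt.show()

# iv. Bivariate analysis: scatter plot and correlation between two columns
df.plot(kind="scatter", x="area", y="price")
plt.show()
print(df[["area", "price"]].corr())
```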
3. What are the various methods to plot the dataset?
Answer:
i. Bar Graph: A bar graph is a graph that presents categorical data with rectangle-shaped
bars. The heights or lengths of these bars are proportional to the values that they
represent. The bars can be vertical or horizontal. A vertical bar graph is sometimes called
a column graph.
ii. Line Graph: It displays a sequence of data points as markers. The points are ordered
typically by their x-axis value. These points are joined with straight line segments. A line
graph is used to visualize a trend in data over intervals of time.
iii. Pie Chart: A pie chart is a circular statistical graphic. To illustrate numerical proportion,
it is divided into slices. In a pie chart, the arc length of each slice is proportional to the
quantity it represents; the central angle and area of each slice are proportional as well. It is
named after a sliced pie.
iv. Histogram: A histogram is an approximate representation of the distribution of
numerical data. The data is divided into non-overlapping intervals called bins or
buckets. A rectangle is erected over each bin whose height is proportional to the number of
data points in the bin. Histograms give a feel of the density of the distribution of the
underlying data.
v. Area Chart: It is represented by the area between the lines and the axis. The area is
proportional to the amount it represents.
vi. Dot Graph: A dot graph consists of data points plotted as dots on a graph. There are two
types of these:
The Wilkinson Dot Graph: In this dot graph, the local displacement is used to
prevent the dots on the plot from overlapping.
Cleveland Dot Graph: This is a scatterplot-like chart that displays data
vertically in a single dimension.
vii. Scatter Plot: It is a type of plot using Cartesian coordinates to display values for two
variables for a set of data. It is displayed as a collection of points. Their position on the
horizontal axis determines the value of one variable. The position on the vertical axis
determines the value of the other variable.
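A short matplotlib sketch of several of these plot types on made-up data:

```python
import matplotlib.pyplot as plt

# Made-up data for illustration
categories = ["A", "B", "C"]
counts = [10, 24, 17]
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

fig, axes = plt.subplots(2, 3, figsize=(12, 6))

axes[0, 0].bar(categories, counts)            # i.   bar graph
axes[0, 1].plot(x, y, marker="o")             # ii.  line graph
axes[0, 2].pie(counts, labels=categories)     # iii. pie chart
axes[1, 0].hist(y, bins=3)                    # iv.  histogram
axes[1, 1].fill_between(x, y)                 # v.   area chart
axes[1, 2].scatter(x, y)                      # vii. scatter plot

plt.tight_layout()
plt.show()
```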
4. Explain the steps involved in cleaning and preparing the data.
Answer:
6. Explain the bracketing methods.
Answer:
Bracketing methods determine successively smaller intervals (brackets) that contain a root.
When the interval is small enough, then a root has been found. They generally use the
intermediate value theorem, which asserts that if a continuous function has values of opposite
signs at the end points of an interval, then the function has at least one root in the interval.
Therefore, they require starting with an interval such that the function takes opposite signs at the
end points of the interval.
i. Bisection method: The simplest root-finding algorithm is the bisection method. Let f be
a continuous function, for which one knows an interval [a, b] such that f(a) and f(b) have
opposite signs (a bracket). Let c = (a + b)/2 be the middle of the interval (the midpoint, or
the point that bisects the interval). Then either f(a) and f(c), or f(c) and f(b), have opposite
signs, and the size of the interval has been halved. Although the bisection method
is robust, it gains one and only one bit of accuracy with each iteration. Other methods,
under appropriate conditions, can gain accuracy faster.
ii. False position (regula falsi): The false position method, also called the regula falsi
method, is similar to the bisection method, but instead of using bisection search's middle
of the interval it uses the x-intercept of the line that connects the plotted function values
at the endpoints of the interval, that is:
c = (a·f(b) − b·f(a)) / (f(b) − f(a))
False position is similar to the secant method, except that, instead of retaining the last two
points, it makes sure to keep one point on either side of the root. The false position
method can be faster than the bisection method and, unlike the secant method, will never
diverge.
iii. ITP method: The ITP method is the only known method that brackets the root with the
same worst-case guarantees as the bisection method while also guaranteeing superlinear
convergence to the root of smooth functions, as the secant method does. It is also the only
known method guaranteed to outperform the bisection method on average for any
continuous distribution on the location of the root. It does so by keeping track of both the
bracketing interval and the minmax interval, within which any point converges as fast as
the bisection method. The construction of the queried point c follows three steps:
interpolation, truncation, and projection onto the minmax interval.
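A minimal sketch of the false-position bracketing update in Python (the tolerance, iteration cap, and the example function are assumptions):

```python
def false_position(f, a, b, tol=1e-8, max_iter=100):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    c = a
    for _ in range(max_iter):
        # x-intercept of the line through (a, f(a)) and (b, f(b))
        c = (a * f(b) - b * f(a)) / (f(b) - f(a))
        if abs(f(c)) < tol:
            break
        # Keep one endpoint on either side of the root
        if f(a) * f(c) < 0:
            b = c
        else:
            a = c
    return c

# Example: root of x^3 - x - 2 between 1 and 2 (approximately 1.5214)
print(false_position(lambda x: x**3 - x - 2, 1, 2))
```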
7. Explain the bisection method.
Answer:
In the bisection method, if f(a)·f(b) < 0, an estimate for the root of the equation f(x) = 0 can
be found as the average of a and b:
x_i = (a + b)/2
Upon evaluating f(x_i), the next iteration would be to set either a = x_i or b = x_i such that for
the next iteration the root x_{i+1} is between a and b. The following describes an algorithm for the
bisection method given a < b, f(x), ε_s, and a maximum number of iterations:
Step 1: Evaluate f(a) and f(b) to ensure that f(a)·f(b) < 0. Otherwise, exit with an error.
Step 2: Calculate the value of the root in iteration i as x_i = (a + b)/2, then check which of the following
applies:
i. If f(x_i) = 0, then the root has been found and the error ε_r = 0. Exit.
ii. If f(x_i)·f(a_i) < 0, then for the next iteration, x_{i+1} is bracketed between a_i and x_i. The
value of ε_r = |(x_{i+1} − x_i)/x_{i+1}|.
iii. If f(x_i)·f(b_i) < 0, then for the next iteration, x_{i+1} is bracketed between x_i and b_i. The
value of ε_r = |(x_{i+1} − x_i)/x_{i+1}|.
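A minimal sketch of this algorithm in Python (the stopping tolerance ε_s, iteration cap, and example function are assumptions):

```python
def bisection(f, a, b, eps_s=1e-6, max_iter=100):
    """Bisection root finding following the steps above (eps_s is the relative error target)."""
    # Step 1: make sure the root is bracketed
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    x_prev = a
    for _ in range(max_iter):
        # Step 2: estimate the root as the midpoint of the current bracket
        x = (a + b) / 2
        if f(x) == 0:
            return x                      # case i: exact root found, eps_r = 0
        if f(x) * f(a) < 0:
            b = x                         # case ii: root bracketed between a and x
        else:
            a = x                         # case iii: root bracketed between x and b
        eps_r = abs((x - x_prev) / x)     # relative approximate error
        if eps_r < eps_s:
            return x
        x_prev = x
    return x

# Example: root of x^2 - 2 between 1 and 2 (sqrt(2) ≈ 1.41421)
print(bisection(lambda x: x**2 - 2, 1, 2))
```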
8. Explain the method of steepest descent (gradient descent).
Answer:
An algorithm for finding the nearest local minimum of a function which presupposes that the
gradient of the function can be computed. The method of steepest descent, also called the
gradient descent method, starts at a point 𝑃0 and, as many times as needed, moves from 𝑃𝑖 to
𝑃𝑖+1 by minimizing along the line extending from 𝑃𝑖 in the direction of −∇𝑓(𝑃𝑖 ), the local
downhill gradient.
When applied to a 1-dimensional function f(x), the method takes the form of iterating
x_i = x_{i−1} − ε f′(x_{i−1})
from a starting point x_0, for some small ε > 0, until a fixed point is reached. For example, for the
function f(x) = x³ − 2x² + 2 with ε = 0.1, the iteration converges to the local minimum at x = 4/3
from either of the starting points x_0 = 2 and x_0 = 0.01.
This method has the severe drawback of requiring a great many iterations for functions which
have long, narrow valley structures.
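A minimal sketch of this 1-D iteration in Python, applied to the example function above (the tolerance and iteration cap are assumptions):

```python
def steepest_descent_1d(f_prime, x0, eps=0.1, tol=1e-8, max_iter=10000):
    """Iterate x_i = x_{i-1} - eps * f'(x_{i-1}) until an (approximate) fixed point."""
    x = x0
    for _ in range(max_iter):
        x_new = x - eps * f_prime(x)
        if abs(x_new - x) < tol:          # fixed point reached
            return x_new
        x = x_new
    return x

# f(x) = x^3 - 2x^2 + 2, so f'(x) = 3x^2 - 4x; the local minimum is at x = 4/3
f_prime = lambda x: 3 * x**2 - 4 * x
print(steepest_descent_1d(f_prime, x0=2))     # converges to ~1.3333
print(steepest_descent_1d(f_prime, x0=0.01))  # converges to ~1.3333
```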
9. List the algorithms used for non-linear dimensionality reduction.
Answer:
i. Kernel PCA: Kernel PCA is a non-linear dimensionality reduction technique that uses
kernels. It can also be considered as the non-linear form of normal PCA. Kernel PCA
works well with non-linear datasets where normal PCA cannot be used efficiently.
ii. t-distributed Stochastic Neighbor Embedding (t-SNE): This is also a non-linear
dimensionality reduction method mostly used for data visualization. In addition to that, it
is widely used in image processing and NLP. The Scikit-learn documentation
recommends using PCA or Truncated SVD before t-SNE if the number of features in the
dataset is more than 50.
iii. Multidimensional Scaling (MDS): MDS is another non-linear dimensionality reduction
technique that tries to preserve the distances between instances while reducing the
dimensionality of non-linear data. There are two types of MDS algorithms: metric and
non-metric. The MDS() class in Scikit-learn implements both, with the metric
hyperparameter set to True (for the metric type) or False (for the non-metric type).
iv. Isometric mapping (Isomap): This method performs non-linear dimensionality
reduction through isometric mapping and can be seen as an extension of MDS or Kernel
PCA. It connects each instance to its nearest neighbors, computes the curved (geodesic)
distances along these connections, and reduces dimensionality while preserving them. The
number of neighbors to consider for each point can be specified through the n_neighbors
hyperparameter of the Isomap() class, which implements the Isomap algorithm in Scikit-learn.
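A short Scikit-learn sketch of the four techniques on a toy non-linear dataset (the S-curve data and the hyperparameter values are assumptions):

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE, MDS, Isomap

# Toy non-linear 3-D dataset, reduced to 2 dimensions by each technique
X, _ = make_s_curve(n_samples=500, random_state=0)

X_kpca   = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)           # i.   Kernel PCA
X_tsne   = TSNE(n_components=2, random_state=0).fit_transform(X)              # ii.  t-SNE
X_mds    = MDS(n_components=2, metric=True, random_state=0).fit_transform(X)  # iii. MDS (metric)
X_isomap = Isomap(n_components=2, n_neighbors=10).fit_transform(X)            # iv.  Isomap

for name, Z in [("Kernel PCA", X_kpca), ("t-SNE", X_tsne), ("MDS", X_mds), ("Isomap", X_isomap)]:
    print(name, Z.shape)  # each result is (500, 2)
```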