Assignment - DADS303 - MBA 3 - Set 1 and 2
ASSIGNMENT SET - 1
Q.1. What do you mean by Machine Learning? Discuss the relevance of Machine
Learning in Business.
Ans. Machine Learning (ML), a subfield of Artificial Intelligence (AI), focuses on
learning the structure of data and representing it through models. These models can
then be used for analysis and prediction, allowing users to derive the desired outcomes.
The term "Machine Learning" was coined in 1959 by Arthur Samuel, who defined it
as the study of algorithms that allow computers to learn without explicit programming.
Professor Tom Mitchell of Carnegie Mellon University offered a more recent
definition, describing it as a computer program's ability to improve its performance
on a task, as measured by some performance metric, through experience.
ML is a field built upon decades of research, integrating principles and techniques
from diverse disciplines. Probability, statistics, and computer science form the
theoretical foundation of modern machine learning. Furthermore, it draws inspiration
from biology, genetics, clinical trials, and various social sciences.
ML tasks are categorized based on the learning approach, specifically how the system
learns from existing data or makes predictions using feedback datasets. The following
are the commonly used classifications:
Supervised Learning: The algorithm learns from labeled data, where a "response"
column acts as a teacher. It learns from provided examples and applies that
knowledge to new, unseen data.
Unsupervised Learning: The algorithm learns without labeled data or a "teacher." It
analyzes data features to identify patterns independently.
Reinforcement Learning: Inspired by behavioral psychology, this approach presents
data points sequentially. The algorithm receives rewards for correct actions and
penalties for incorrect ones, learning through trial and error.
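To make the distinction between the first two categories concrete, the following
Python sketch (assuming scikit-learn is available; the data and parameter values are
purely illustrative) fits a supervised classifier on labelled data and an unsupervised
clusterer on the same features without labels:
```python
# A minimal sketch of supervised vs. unsupervised learning using
# scikit-learn on synthetic data (all names and numbers are illustrative).
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: X holds the features, y the "response" column (labels).
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

# Supervised learning: the labels y act as the "teacher".
clf = LogisticRegression().fit(X, y)
print("Supervised predictions:", clf.predict(X[:5]))

# Unsupervised learning: only the features X are given; the algorithm
# must discover the two groups on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Unsupervised cluster labels:", km.labels_[:5])
```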
The applications of machine learning are vast, particularly given the widespread use
of smartphones. Modern smartphones embed many ML-driven features, and ML is
becoming increasingly popular in business through mobile applications.
Businesses are leveraging ML to enhance their operations, automate tasks, and gain
valuable insights. For example:
Uber uses ML algorithms to optimize pick-up and drop-off times.
Spotify employs ML to personalize marketing and music recommendations.
Dell utilizes ML to gather employee and customer feedback and improve its practices.
Beyond these examples, ML can benefit businesses in numerous other areas. To
effectively implement ML, a well-defined strategy and policy are essential. Here are
some potential applications:
Monitoring of Social Media Content
Customer Care Services
Image Processing
Virtual Assistance
Product Recommendation
Stock Market Trading and Investment
Clinical Decision Support Systems
Data Deduplication
Cyber Security Enhancement
Q.2. What is Support Vector Machine? What are the various steps in using
Support Vector Machine?
Ans. Support Vector Machines (SVMs) are supervised machine learning algorithms
used for both classification and regression. Their primary goal is to identify the
optimal hyperplane that effectively separates data points belonging to different classes
or accurately predicts continuous values. This hyperplane maximizes the margin
between the classes, and the data points closest to it are termed "support vectors."
SVMs are highly valued in machine learning and data analysis for their ability to
handle high-dimensional data and generalize well to unseen data. Their relative
simplicity in implementation and efficient training make them a popular choice in
diverse real-world applications such as computer vision, natural language processing,
and finance.
Consider the two parallel hyperplanes that bound the margin:
w·x + b = c1
w·x + b = c2
Applying the formula for the distance between two parallel lines gives the margin:
m = |c2 − c1| / ||w||
If C1 = 1 and C2 = −1, the L2 norm of w gives the algebraic expression for the
margin of the SVM:
Margin = m = 2 / ||w||
The goal is to maximize the margin m, but one must also ensure that the hyperplane
correctly classifies the two classes (the crosses and the circles). In other words, for
the optimal w vector that is found, the crosses must fall to the left of the red line and
the circles to the right of the blue line. Note, however, that the key quantity in the
objective function, the L2 norm of the w vector, appears in the denominator and
involves a square root. Both features complicate solving the optimization problem
mathematically.
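As an illustrative sketch of this idea (assuming scikit-learn; the synthetic data and the
large C value are arbitrary choices that approximate a hard margin), one can fit a
linear SVM and recover the margin width 2/||w|| from the learned weight vector:
```python
# A minimal sketch of the SVM margin on synthetic two-class data;
# scikit-learn's SVC with a linear kernel is assumed (illustrative only).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=7)

# A large C approximates a hard-margin SVM on separable data.
svm = SVC(kernel="linear", C=1e6).fit(X, y)

w = svm.coef_[0]                 # the learned weight vector w
margin = 2 / np.linalg.norm(w)   # Margin = 2 / ||w||, as derived above
print("Margin width:", margin)
print("Number of support vectors:", len(svm.support_vectors_))
```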
Linear regression rests on several key assumptions:
Linear Relationship: The model must exhibit linearity in its parameters, meaning
the beta coefficients, which are fundamental to linear regression, have a linear
nature. A linear relationship should exist between the independent variable (X)
and the mean of the dependent variable (Y). Moreover, due to the sensitivity of
linear regression to outliers, it is essential to identify and address them. Scatter
plots are an effective tool for assessing the linearity assumption.
Multivariate Normality: Linear regression models assume that the data represents
a random sample from the underlying population, with errors that are
uncorrelated and statistically independent. All variables should follow a
multivariate normal distribution. This assumption can be assessed using
histograms or Q-Q plots, and normality can be verified through goodness-of-fit
tests like the Kolmogorov-Smirnov test. If data is not normally distributed, non-
linear transformations (e.g., a log transformation) might be necessary.
Little to No Multicollinearity: A key assumption of linear regression is the
absence, or minimization, of multicollinearity. Multicollinearity occurs when
independent variables are highly correlated with each other.
No Autocorrelation: Linear regression analysis requires little to no
autocorrelation in the data. Autocorrelation occurs when residuals are not
independent, often observed in time series data where current values depend on
past values (e.g., stock prices). The Durbin-Watson test can detect autocorrelation,
and scatter plots can visually reveal it. The Durbin-Watson test assesses the null
hypothesis that residuals are not linearly auto-correlated. The test statistic 'd'
ranges from 0 to 4, with values near 2 indicating no autocorrelation. Generally,
values between 1.5 and 2.5 suggest no significant autocorrelation. Note that the
Durbin-Watson test primarily examines first-order effects and linear
autocorrelation between immediate neighbors.
Homoscedasticity: Homoscedasticity, meaning "same variance," is fundamental
to linear regression. It assumes that the error term (the random disturbance in the
relationship between independent and dependent variables) has the same variance
across all values of the independent variables. Scatter plots are a suitable method
for checking for homoscedasticity in the data.
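Several of these assumptions can be checked numerically. The sketch below
(assuming statsmodels and scipy are installed; the synthetic data and variable names
are illustrative) runs a Kolmogorov-Smirnov test on the residuals, the Durbin-Watson
test for autocorrelation, and variance inflation factors for multicollinearity:
```python
# A sketch of quick numeric checks for the regression assumptions above,
# using statsmodels and scipy on synthetic data (all values illustrative).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)

Xc = sm.add_constant(X)          # add the intercept column
model = sm.OLS(y, Xc).fit()
resid = model.resid

# Normality of residuals: Kolmogorov-Smirnov against a fitted normal.
print("K-S test:", stats.kstest(resid, "norm",
                                args=(resid.mean(), resid.std())))

# Autocorrelation: Durbin-Watson statistic, values near 2 mean none.
print("Durbin-Watson:", durbin_watson(resid))

# Multicollinearity: variance inflation factor for each predictor.
for i in range(1, Xc.shape[1]):
    print(f"VIF x{i}:", variance_inflation_factor(Xc, i))
```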
ASSIGNMENT SET - 2
The k-means clustering algorithm proceeds through the following steps:
1. Initialization: The algorithm begins by randomly selecting 'k' data points to serve as
the initial cluster centers. These points represent the centroids of the clusters.
2. Assignment: Each data point is then assigned to the cluster whose center is nearest,
according to a distance metric (e.g., Euclidean distance). This creates initial clusters.
3. Update: The algorithm then calculates new cluster centers by computing the mean
of all the data points assigned to each cluster. These new means become the updated
cluster centers.
4. Iteration: Steps 2 and 3 (assignment and update) are repeated iteratively. In each
iteration, data points are reassigned to the nearest cluster center, and the cluster
centers are recalculated.
5. Convergence: The iterative process continues until a stable solution is reached.
Stability is defined as the point where re-running the algorithm no longer results in
points changing cluster memberships (or minimal changes occur).
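These five steps can be mirrored almost line by line in code. The following NumPy
sketch is a toy illustration of the algorithm, not a production implementation (a
library such as scikit-learn would normally be used, and a robust version would also
handle clusters that become empty):
```python
# A toy NumPy implementation mirroring the five k-means steps above
# (illustrative sketch only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):                       # 4. Iteration
        # 2. Assignment: each point joins the nearest center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each center as the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])
        # 5. Convergence: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print("Final centers:\n", centers)
```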
Bringing data onto the same scale is critical. A common scaling method is Z-
standardization (or the Z-transform). This involves subtracting the variable's mean
from each value and dividing the result by the variable's standard deviation:
z = (x − μ) / σ
The resulting values are called Z-scores. Z-standardization transforms the data to have
a mean of zero and a standard deviation of one, allowing for valid comparisons across
different distributions.
Computation of Z-Scores
The figure above shows the z-transformation of the income and children data. As you
can observe, the values now lie in a common range, roughly −1 to +1.
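As a minimal sketch of this computation (assuming pandas; the income and children
values below are invented for illustration), the z-transformation can be applied
column by column:
```python
# A minimal sketch of Z-standardization on illustrative income and
# number-of-children columns (synthetic values, pandas assumed).
import pandas as pd

df = pd.DataFrame({"income": [25000, 48000, 61000, 90000],
                   "children": [0, 2, 1, 3]})

# z = (x - mean) / standard deviation, computed column by column.
z = (df - df.mean()) / df.std()
print(z)   # both columns now share a common scale around -1 to +1
```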
Q.5. Discuss various validation measures used for Machine Learning in detail.
Ans. Evaluating a model's performance on an unseen dataset using validation
measures is essential for assessing its generalization ability. These measures differ
depending on the type of problem, such as classification or regression, and offer
insights into the quality of the model's predictions.
For regression problems, validation measures like R² and adjusted R² are commonly
used: R² measures the proportion of variance in the dependent variable that the model
explains, while adjusted R² additionally penalizes predictors that add little
explanatory power. For classification problems, the following measures are common:
A Confusion Matrix offers a detailed breakdown of true positives, false positives, true
negatives, and false negatives, from which other metrics like accuracy, precision, and
recall are derived.
For binary classification with imbalanced data, the AUC-PR (area under the
Precision-Recall curve) is often more informative than ROC-AUC. The AUC-PR
summarizes the classifier's performance by combining precision and recall at
different thresholds, with 1 indicating perfect performance; note that while an
ROC-AUC of 0.5 suggests no better than random guessing, the random baseline for
AUC-PR equals the prevalence of the positive class.
Choosing the appropriate validation measure depends on the specific problem and
data characteristics. Measures like accuracy, precision, recall, and F1 score are useful
for classification, while R² and adjusted R² serve regression tasks. The AUC-ROC and
AUC-PR provide further insights, particularly in cases of imbalanced datasets.
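Most of these classification measures can be computed in a few lines. The sketch
below (assuming scikit-learn; the labels and probability scores are invented for
illustration, and average_precision_score is used as a common summary of the
PR curve) derives them from true and predicted labels:
```python
# A sketch of the classification measures discussed above, computed with
# scikit-learn on a small illustrative set of true and predicted labels.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.2, 0.7, 0.1]  # probabilities

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
print("AUC-PR   :", average_precision_score(y_true, y_score))
```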
Splitting Criteria:
The core of decision tree construction lies in the splitting criteria, which identify the
best attribute for dividing a node into purer sub-nodes. The goal is to maximize
information gain, resulting in sub-nodes that are more homogeneous with respect to
the target variable. Various splitting metrics exist, each with its own advantages and
disadvantages.
For classification, common metrics include Gini impurity and information gain (based
on entropy). Gini impurity measures the probability of misclassifying a randomly
selected element if it were labeled based on the class distribution of the node. Lower
Gini impurity signifies a more homogeneous node. Information gain, conversely,
quantifies the reduction in entropy (a measure of disorder) achieved by the split. The
attribute yielding the highest information gain is chosen for the split.
For regression, variance reduction is often used. This involves selecting the attribute
that minimizes the variance of the target variable within the resulting sub-nodes. By
reducing variance, more homogeneous groups are created, improving prediction
accuracy.
The choice of splitting criterion impacts the tree's structure and performance.
Information gain tends to favor attributes with many values, while Gini impurity is
computationally cheaper.
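The two classification metrics can be illustrated with a short sketch (NumPy assumed;
the class counts below are invented), computing the Gini impurity of a node and the
information gain of a candidate split:
```python
# A sketch computing Gini impurity and entropy for a node's class
# distribution, then the information gain of an illustrative split.
import numpy as np

def gini(counts):
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]            # ignore empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

parent = [5, 5]                 # 5 of class A, 5 of class B in the parent
left, right = [4, 1], [1, 4]    # class counts after a candidate split

n = sum(parent)
weighted = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
print("Gini (parent):", gini(parent))
print("Information gain:", entropy(parent) - weighted)
```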
Merging Criteria:
Merging criteria, though less frequently used, allow for combining nodes or branches
after initial tree growth. This technique, known as pruning, aims to simplify the tree
and prevent overfitting. After an initial tree is built, merging criteria assess whether
combining nodes improves performance. This involves comparing the performance of
the reduced tree with the original. The criterion often uses statistical tests to determine
if merging would significantly negatively impact accuracy. If not, the nodes are
combined.
Stopping Criteria:
Stopping criteria define the conditions under which tree growth is halted. Without
these, the tree could grow indefinitely, leading to overfitting and poor generalization.
Several common stopping criteria are used.
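In practice, stopping criteria are usually exposed as hyperparameters of the tree
learner. The sketch below (assuming scikit-learn's DecisionTreeClassifier; the
parameter values are illustrative, not recommendations) shows typical examples such
as a maximum depth, a minimum number of samples required to split a node, and a
minimum impurity decrease:
```python
# A sketch of typical stopping criteria expressed as hyperparameters of
# scikit-learn's DecisionTreeClassifier (values here are illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,                # stop once the tree reaches this depth
    min_samples_split=10,       # do not split nodes with fewer samples
    min_impurity_decrease=0.01  # require a minimum gain from each split
).fit(X, y)

print("Depth reached:", tree.get_depth())
print("Number of leaves:", tree.get_n_leaves())
```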