ML Mid1 Notes
The field of machine learning is concerned with the question of how to construct computer
programs that improve their performance at some task through experience. Machine learning is
about making computers modify or adapt their actions (whether the task is making predictions,
or controlling a robot) so that these actions get more accurate with experience, where accuracy is
measured by how well the chosen actions reflect the correct ones. Put more precisely [1],
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
In general, to have a well-defined learning problem, we must identify these three features:
• The learning task
• The measure of performance
• The task experience
The key concept that we will need to think about for our machines is learning from experience.
Important aspects of the ‘learning from experience’ behaviour of humans and other animals that
are embedded in machine learning are remembering, adapting, and generalizing.
• Remembering and Adapting: Recognizing that, last time in a similar situation, a certain action
was attempted and worked, and therefore should be tried again; or that the same action failed
in a similar situation last time, and so something different should be tried.
• Generalizing: This aspect concerns recognizing similarity between different situations. It is
what makes learning useful, because we can apply our knowledge to situations not seen earlier.
Given a new situation, we recognize its similarity to situations faced earlier and take a decision
for it; this generalizing capability is characteristic of animal learning.
Machine learning concerns getting computers to alter or adapt their actions in a way that those
actions improve in terms of accuracy, with experience. Machine learning, like animal learning,
relies heavily on the notion of similarity in its search for valuable knowledge in data. The
computer program is the ‘machine’ in our context. The computer program is designed employing
learning from the task experience. Equivalently, we say that the machine is trained using the task
experience, or that the machine learns from the task experience. The terms learning machine,
learning algorithm, and learned knowledge all refer to the computer program designed for the
assigned task. As with any software system, understanding the inputs and outputs is more
important than knowing exactly what takes place in between, and that is how we will describe
machine learning here. The input is defined by the learning task. Four different types of learning
tasks appear in real-world applications (details given later in Section 1.7). In classification
learning, the machine is expected to learn a technique for classifying examples of
measurements/observations. In association learning, any relation between observations is
sought, not merely an association capable of predicting a specific class value. In clustering,
groups of observations that belong together are sought. In regression, the output to be predicted
is not a discrete class but a continuous numeric quantity. The classification and regression tasks
are carried out through the process of directed/supervised learning. For the examples of
measurements/observations, the outcome is known ‘a priori’; for classification problems, the
outcome is the class to which the example belongs; and for regression problems, the outcome is
the numeric value on the approximating curve that fits the data. The other form of learning is
undirected/unsupervised, wherein the outcome is not known ‘a priori’; clustering and association
learning belong to this category, as we shall see in later chapters. The experience with which the
machine will be trained (from which the machine will learn) may be available in the form of data
collected in databases. Most of the information that surrounds us manifests itself in the form of
data, which can be as basic as a set of measurements or observations characterized by vectors
with numerical values, or may be in forms that are more difficult to characterize as numerical
vectors: sets of images, documents, audio clips, video clips, graphs, and so on. For different
forms of raw data (text, images, waveforms, and so forth), it is common to represent the data in a
standard fixed-length vector format with numerical values. Such abstractions typically involve
significant loss of information, yet they are essential for a well-defined learning problem.
Thus, even when the raw data is an agglomerated mass that cannot be fragmented accurately into
individual experience examples characterized by numerical vectors, it is still very useful for
learning many things.
Numerical form of data representation allows us to deal with patterns geometrically, and thus we
shall study learning algorithms using linear algebra and analytic geometry. Characterizing the
similarity of patterns in state space can be done through some form of metric (distance) measure:
distance between two vectors is a measure of similarity between two corresponding patterns.
Many measures of ‘distance’ have been proposed in the literature. In another class of machine
learning problems, the input (experience) is available in the form of nominal (or categorical)
data, described in linguistic form (not numerical). For nominal form of data, there is no natural
notion of similarity. Each learning algorithm based on nominal data employs some nonmetric
method of similarity. In an alternative learning option, there is no training dataset, but human
knowledge (experience, expertise, heuristics) is available in linguistic form. This form of human
knowledge, when properly structured as a set of IF-THEN rules, can be embedded into a
learning machine. Having described the input to the software system, let us now look at the
output description. The output of an algorithm represents the learned knowledge. This
knowledge is in the form of a model of the structural patterns in the data. The model is deployed
by the user for decision-making; it gives the prediction with respect to the assigned task for
measurements/observations not in the task experience; a good model will generalize well to
observations unseen by the machine during training. A block diagrammatic representation of a
learning machine is shown in Fig. 1.1.
2. Applications of Machine learning in diverse fields
Machine learning is a growing technology used to mine knowledge from data (an activity
popularly known as the field of data mining). Wherever data exists, things can be learned from it.
Wherever there is an excess of data, the mechanics of learning must be automatic. Machine
learning technology is meant for automatic learning from voluminous datasets.
Google is by far the most popular and extensively used of all search engines. It offers access to
information from billions of web pages, which have been indexed on its server.
While going through the results of our Google query, many different advertisements show up
relating to our query. Tailoring ads to match the interests of users is a strategy adopted by Google
and is one of the typical services that every Internet search provider tries to offer. Mining
information on the World Wide Web is an area that is fast growing, almost exploding.
Many organizations use data mining for customer relationship management (CRM), which
facilitates the provision of more customized and personal service, addressing individual
requirements of customers. It is possible for organizations to tailor ads and promotions to the
profiles of customers by closely studying the patterns of browsing and buying on web stores.
Banks were quick to embrace data mining technology to examine the issue of fickle customers,
that is, customers who are likely to defect. Since they had successfully used machine learning to
assess credit risk, it was possible to reduce customer attrition as well. Cellular phone companies
handle churn by identifying the behavioural patterns of customers who would benefit from new
services, and then promoting such services in order to retain their customer base.
Data mining has greatly impacted the ways in which people use computers. On getting on to the
Internet, for instance, let us say we feel like checking our email. Unknown to us, many irritating
emails have already been filtered out by spam filters that use machine learning to identify spam.
Computer network security is a continually rising issue. While protectors keep hardening
networks, operating systems, and applications, attackers keep discovering weak spots in all these
areas. Systems for detecting intrusions are able to detect unusual patterns of activity. Data
mining is being applied to this issue in an attempt to find out semantic connections among
attacker traces in computer network data. Privacy-preserving data mining assists in protecting
privacy-sensitive information, such as credit card transaction records, healthcare records,
personal financial records, biological features, criminal/justice investigations, and ethnicity.
Of late, huge data collection and storage technologies have altered the landscape of scientific
data analysis. Major examples include applications which involve natural resources, the
prediction of floods and droughts, meteorology, astronomy, geography, geology, biology, and
other scientific and engineering data. Machine learning/data mining is present in all of these
examples.
Machine Vision: It is a field where pattern recognition has been applied with major successes.
A machine vision system captures images through a camera and analyzes these to be able to
describe the image. A machine vision system is applicable in the manufacturing industry, for
automating the assembly line or for automated visual inspection.
Biometric Recognition: It has been made clear by decades of research in pattern recognition
that the level of visual understanding and recognition that humans exhibit cannot be matched by
computer algorithms. Certain problems, such as biometric recognition (fingerprint
identification, face and gesture recognition, etc.), are being handled with success, but general
purpose image-representation systems are still not visible on the horizon.
Handwriting Recognition: It is another area where pattern recognition can be applied, with
major consequences in automation and information handling. Take first the simpler problem of
printed character recognition. The commercially available Optical Character Recognition or
OCR system has a light source, a document transport, as well as a detector. At the point where
the light-sensitive detector gives output, light intensity variation is translated into ‘numbers’. On
this image array, image processing and pattern recognition methods are applied to identify the
characters—that is, to categorize each character into the correct ‘letter’, ‘number’, and
‘punctuation’ class.
Medical Diagnosis: It also uses pattern recognition. Doctors make use of it while making
diagnostic decisions. The ultimate diagnosis is, of course, made by the doctor. Computer-aided
diagnosis has been applied to, and is of interest for, a range of medical data—X-rays, computed
tomographic images, ultrasound images, electrocardiograms (ECGs), and electroencephalograms
(EEGs).
Alignment of Biological Sequences: Alignment of sequences is done on the basis of the
fact that all living organisms are related by evolution. This means, nucleotide (DNA, RNA) and
amino acid (proteins) sequences of species that have evolved close to each other, should display
more similarities. An alignment is the procedure of lining up sequences to obtain a maximum
identity level, which also expresses the level of similarity between sequences. Biological
sequence analysis is significant in bioinformatics and modern biology.
Drug Design: It is usually based on a long and expensive process involving complex chemical
experiments that check whether a particular chemical compound could be a good candidate
for a specific drug; a positive result leads to further clinical experiments. For
several years, a new scheme based on computational simulations has been emerging.
Speech Recognition: It is an area that has been well researched. Speech is the most natural
means by which humans share, convey and exchange information. Intelligent machines that
recognize spoken information can be used in numerous applications, for example, to help control
machines by talking to them, i.e., entering data into a computer via a microphone. Speech
recognition can also enhance communication for people with hearing or speech impairments.
Text Mining: It concerns identification of patterns in text. The procedure involves analysis of
text for extraction of useful information for specific purposes. The amount of information
available on the Web and on corporate intranets, digital libraries, and news wires, and the pace at
which it spreads, is overwhelming. Integration of this information into the decision-making
process, at a fast pace, is essential in order to help businesses stay competitive in today’s market.
Text mining has reached the industrial world and is helping to exploit knowledge that, due to its
sheer size, is often beyond human consumption.
Natural Language Processing: Ever since the computer age dawned, computer science research
has been attempting to understand human language. In 1950, soon after the invention of the
computer, Alan Turing, one of the greatest computer scientists of the twentieth century,
suggested a test for computer intelligence. In a paper titled “Computing Machinery and
Intelligence”, he introduced this test. Over sixty years later, computers can perform
extraordinary feats that Alan Turing probably never imagined would be possible. Language is
obviously a critical component of how people communicate and how information is stored in the
business world and beyond. The goal of Natural Language Processing (NLP) is to analyze,
understand, and generate languages that humans use naturally so that eventually a computer will
‘naturally’ be able to interpret what the other person is saying. Voice automation is just starting,
with robot vacuum cleaners that respond to cleaning orders, and telephones and household
appliances that obey voice commands.
Fault Diagnostics: Preventive upkeep of motors, generators, and other electromechanical
devices can prevent or delay malfunctions that may otherwise interrupt industrial processes. Typical
defects or flaws include misalignment of shaft, mechanical slackening, defective bearings, and
unbalanced pumps.
Load Forecasting: It is quite essential to establish future power demand in the electricity supply
industry. In fact, the earlier the demand is known, the better. Precise estimates can be made with
the help of machine learning methods for the maximum and minimum load for each hour, day,
month, season, and year.
Control and Automation: A quiet revolution is ongoing in the manufacturing world which is
changing the look of factories. Computers are controlling and monitoring manufacturing
processes with a high degree of automation, facilitated by machine learning techniques. The
computer control includes control of all types of processes such as Computerized Numerical
Control (CNC), welding, electrochemical machining, etc., and control of industrial robots. High
degree of automation is applied in today’s Flexible Manufacturing Systems (FMS) that can be
readily rearranged to handle new market requirements. Flexible manufacturing systems,
combined with automatic assembly and product inspection on one hand, and CAD/CAM system
on the other, are the basic components of the modern Computer Integrated Manufacturing
System.
Business Intelligence: It is essential for businesses to be able to comprehend the commercial
control of their organization well, in terms of customer base, market, supply and resources, and
competition. Business Intelligence (BI) technologies offer not only historical and current
information but also predictive views of business operations. Data mining is the fundamental
core of business intelligence. In the absence of data mining, many businesses may be unable to
effectively perform market analyses, compare customer feedback on similar products, find the
strengths and weaknesses of their competitors, retain extremely valuable customers, and arrive at
intelligent business decisions.
Robotics and Automation: A robot is a machine capable of carrying out a series of complex tasks
automatically, programmed by a computer, e.g., automated visual inspection.
3. Occam's Razor Principle
Occam’s Razor is a principle that favours simplicity. It says that the simplest adequate solution is usually
the best one. In machine learning, this means that if we have two models that work about equally well, we
should choose the simpler one.
1. Start with Simpler Models: Rather than starting with a complex model, start with a simpler one. You
could begin with a linear regression or decision tree before moving to more complex models like
random forests or neural networks. This gives you a baseline to compare against and helps you
understand if the additional complexity is justified.
2. Regularization: Regularization techniques such as L1 (Lasso) and L2 (Ridge) can help prevent
overfitting by adding a penalty term to the loss function that constrains the magnitude of the
parameters. This discourages the model from relying too heavily on any one feature and makes the
model simpler and more generalizable.
3. Pruning: Pruning techniques are used in decision trees and neural networks to remove unnecessary
complexity. In decision trees, pruning can remove unimportant branches. In neural networks, pruning
can remove unnecessary weights or neurons.
4. Cross-Validation: Cross-validation helps you understand how well your model generalizes to unseen
data. If a model performs well on the training data but poorly on the validation data, it’s likely
overfitting, which indicates that the model might be too complex.
5. Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) or t-SNE can
reduce the number of features in your data, simplifying the model and helping to prevent overfitting.
6. Feature Selection: By selecting only the most important features for your model, you can reduce
complexity and improve interpretability. Techniques for feature selection include mutual information,
correlation coefficients, and recursive feature elimination.
7. Hyperparameter Tuning: Many machine learning models have hyperparameters that control their
complexity. For example, the depth of a decision tree, or the penalty term in a regularized regression.
Tuning these hyperparameters can help you find the right balance between simplicity and accuracy.
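To make points 1, 2, and 4 above concrete, here is a minimal sketch that compares a plain linear regression
with an L2-regularized (Ridge) model using cross-validation; it assumes a feature matrix X and target y are
already loaded, and the alpha value is illustrative only:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Baseline: ordinary least squares, the simple starting point
baseline_scores = cross_val_score(LinearRegression(), X, y, cv=5)

# L2 (Ridge) regularization penalizes large coefficients, constraining model complexity
ridge_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)

# Prefer the constrained (simpler) model unless the alternative clearly wins on held-out folds
print(baseline_scores.mean(), ridge_scores.mean())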
The following are some of the benefits of understanding and applying Occam’s Razor for data scientists:
Enhancing Interpretability: Simpler models are often more interpretable, which means it’s easier to
understand how they’re making predictions. This can be important for trust, transparency, and even
legal reasons in certain industries. For example, in healthcare or finance, being able to explain why a
model made a certain prediction could be crucial.
Avoiding Overfitting: As mentioned before, complex models can often fit the training data very well,
but they can also capture the noise in the data, leading to overfitting. An overfitted model performs well
on the training data but poorly on unseen data, which is a problem because the goal of machine learning
is to make accurate predictions on new, unseen data. By keeping models simpler, data scientists can
reduce the risk of overfitting.
Improving Generalizability: Simpler models are more likely to generalize well to unseen data. This is
because they are less likely to fit the noise in the training data and more likely to capture the underlying
trend or relationship.
Reducing Computational Resources: Simpler models typically require less computational resources
to train and predict. This can be a significant advantage in real-world settings, where resources might be
limited or expensive.
4. Discuss various Supervised learning algorithms.
• Supervised learning is the machine learning task of learning a function that maps an input
to an output based on example input-output pairs.
• Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses. The most common unsupervised
learning method is cluster analysis, which is used for exploratory data analysis to find hidden
patterns or grouping in data.
• Classification
In machine learning and statistics, classification is the problem of identifying to
which of a set of categories (sub-populations) a new observation belongs, on the basis of a
training set of data containing observations (or instances) whose category membership is
known.
• Regression
Linear Regression is a machine learning algorithm based on supervised learning.
Linear regression performs the task of predicting a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds a linear relationship between x
(input) and y (output).
Predictive modeling or supervised learning aims at constructing models that can predict the
value of a target (dependent) variable from the known values of attributes (independent
variables).
Subgroup discovery is a data mining technique that discovers interesting associations
among different variables with respect to a property of interest.
Descriptive clustering
Clustering is an unsupervised machine learning approach, but it can also be used to
improve the accuracy of supervised machine learning algorithms, by clustering the
data points into similar groups and using these cluster labels as independent variables in the
supervised machine learning algorithm.
Associative rule discovery
Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases.
Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrence, in a database.
Support is an indication of how frequently the items appear in the data.
Confidence indicates the number of times the if-then statements are found true.
Precision and Recall
Precision can be seen as a measure of exactness or quality, whereas recall is a measure
of completeness or quantity.
Precision = (number of retrieved documents that are relevant) / (total number of documents retrieved)
Recall = (number of retrieved documents that are relevant) / (total number of relevant documents in the database)
Predictive clustering: unlabeled
Descriptive clustering
Inductive learning
Inductive learning takes the traditional sequence of a lesson and reverses things. Instead of
saying, “Here is the knowledge; now go practice it,” inductive learning says, “Here are some
objects, some data, some artifacts, some experiences… what knowledge can we gain from
them?”
Supervised learning algorithms
Linear regression (Ordinary Least Squares Regression or OLS Regression) is perhaps one of the most well-
known and best-understood algorithms in statistics and machine learning. Linear regression is a linear
model, e.g., a model that assumes a linear relationship between the input variables (x) and the single output
variable (y). The goal of linear regression is to train a linear model to predict a new y given a previously
unseen x with as little error as possible.
Implementation in Python
from sklearn.linear_model import LinearRegression
model = LinearRegression()   # X: feature matrix, Y: target vector (assumed to be defined)
model.fit(X, Y)
Logistic Regression
Logistic regression is one of the most widely used algorithms for classification. The logistic regression
model arises from the desire to model the probabilities of the output classes given a function that is linear
in x, at the same time ensuring that output probabilities sum up to one and remain between zero and one as
we would expect from probabilities.
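In the same style as the other snippets in this section, a logistic regression classifier can be constructed
with sklearn (a minimal sketch, assuming a feature matrix X and label vector Y are already defined):
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, Y)
probabilities = model.predict_proba(X)  # per-class probabilities between 0 and 1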
Support Vector Machine
The objective of the support vector machine (SVM) algorithm is to maximize the margin (shown as shaded
area in Figure 4-3), which is defined as the distance between the separating hyperplane (or decision
boundary) and the training samples that are closest to this hyperplane, the so-called support vectors. The
margin is calculated as the perpendicular distance from the line to only the closest points, as shown in Fig.
Hence, SVM calculates a maximum-margin boundary that leads to a homogeneous partition of all data
points.
The SVM regression and classification models can be constructed using the sklearn package of Python, as
shown in the following code snippets:
Regression
from sklearn.svm import SVR
model = SVR()
model.fit(X, Y)
Classification
from sklearn.svm import SVC
model = SVC()
model.fit(X, Y)
K-Nearest Neighbors
K-nearest neighbors (KNN) is considered a “lazy learner,” as there is no learning required in the model. For
a new data point, predictions are made by searching through the entire training set for the K most similar
instances (the neighbors) and summarizing the output variable for those K instances.
To determine which of the K instances in the training dataset are most similar to a new input, a distance
measure is used. The most popular distance measure is Euclidean distance, which is calculated as the square
root of the sum of the squared differences between a point a and a point b across all input attributes i:
d(a, b) = sqrt( Σ_i (a_i - b_i)² )
Euclidean distance is a good distance measure to use if the input variables are similar in type.
Another distance metric is Manhattan distance, in which the distance between point a and point b is the
sum of the absolute differences across all input attributes i:
d(a, b) = Σ_i |a_i - b_i|
Manhattan distance is a good measure to use if the input variables are not similar in type.
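As a quick sketch, both distance measures can be computed with NumPy for two example pattern vectors a
and b (the values are illustrative only):
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # square root of the sum of squared differences
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences
print(euclidean, manhattan)                # ~3.606 and 5.0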
KNN regression and classification models can be constructed using the sklearn package of Python, as shown
in the following code:
Classification
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, Y)
Regression
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
model.fit(X, Y)
Decision Trees
The model can be represented by a binary tree (or decision tree), where each node is an input variable x with
a split point and each leaf contains an output variable y for prediction.
Figure 4-4 shows an example of a simple classification tree to predict whether a person is a male or a
female based on two inputs of height (in centimeters) and weight (in kilograms).
Classification
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, Y)
Regression
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X, Y)
Random forest
Random forest is a tweaked version of bagged decision trees. In order to understand the random forest
algorithm, let us first understand the bagging algorithm. Assuming we have a dataset of one thousand
instances, the steps of bagging are:
1. Create many random subsamples of the dataset (with replacement).
2. Train a decision tree model on each subsample.
3. Given a new dataset, calculate the average prediction from each model and aggregate the prediction by each
tree to assign the final label by majority vote.
Random forest regression and classification models can be constructed using the sklearn package of Python,
as shown in the following code:
Classification
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, Y)
AdaBoost
Adaptive Boosting or AdaBoost is a boosting technique in which the basic idea is to try predictors
sequentially, and each subsequent model attempts to fix the errors of its predecessor. At each iteration, the
AdaBoost algorithm changes the sample distribution by modifying the weights attached to each of the
instances. It increases the weights of the wrongly predicted instances and decreases the ones of the correctly
predicted instances.
This process is repeated until the error function does not change, or until the maximum limit of the number
of estimators is reached.
Implementation in Python
AdaBoost regression and classification models can be constructed using the sklearn package of Python, as
shown in the following code snippet:
Classification
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor()
model.fit(X, Y)
Gradient Boosting Method
Gradient boosting method (GBM) is another boosting technique similar to AdaBoost, where the general idea
is to try predictors sequentially. Gradient boosting works by sequentially adding the previous underfitted
predictions to the ensemble, ensuring the errors made previously are corrected.
A model (which can be referred to as the first weak learner) is built on a subset of data. Using this model,
predictions are made on the whole dataset.
Errors are calculated by comparing the predictions and actual values, and the loss is calculated using the loss
function.
A new model is created using the errors of the previous step as the target variable. The objective is to find
the best split in the data to minimize the error. The predictions made by this new model are combined with
the predictions of the previous. New errors are calculated using this predicted value and actual value.
This process is repeated until the error function does not change or until the maximum limit of the number
of estimators is reached.
Contrary to AdaBoost, which tweaks the instance weights at every iteration, this method tries to fit the
new predictor to the residual errors made by the previous predictor.
Classification
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X, Y)
Regression
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(X, Y)
Evaluation Metrics for Classification
For simplicity, we will mostly discuss things in terms of a binary classification problem (i.e., only two
outcomes, such as true or false); some common terms are: true positive (a positive instance correctly
predicted as positive), true negative (a negative instance correctly predicted as negative), false positive (a
negative instance wrongly predicted as positive), and false negative (a positive instance wrongly predicted
as negative).
The difference between three commonly used evaluation metrics for classification, accuracy, precision, and
recall, is illustrated in Figure 4-8.
Accuracy
Accuracy is the number of correct predictions made as a ratio of all predictions made.
Precision
Precision is the percentage of positive instances out of the total predicted positive instances.
Recall
Recall (or sensitivity or true positive rate) is the percentage of positive instances out of the total actual
positive instances. Therefore, the denominator (true positive + false negative) is the actual number of
positive instances present in the dataset.
Area under ROC curve (AUC) is an evaluation metric for binary classification problems. ROC is a
probability curve, and AUC represents degree or measure of separability. It tells how much the model is
capable of distinguishing between classes.
ROC stands for Receiver Operating Characteristics, and the ROC curve is the graphical representation of the
effectiveness of the binary classification model. It plots the true positive rate (TPR) vs the false positive rate
(FPR) at different classification thresholds.
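These metrics can be computed with sklearn (a sketch, assuming y_true holds the actual labels, y_pred the
predicted labels, and y_score the predicted probabilities for the positive class):
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / all predictions
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)          # area under the ROC curve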
Naïve Bayes
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of
the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and
taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually
contributes to identifying it as an apple without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this
dataset, we need to decide whether we should play or not on a particular day according to the weather
conditions. To solve this problem, we need to follow the steps below:
Problem: If the weather is sunny, then the Player should play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather    Yes    No
Overcast   5      0
Rainy      2      2
Sunny      3      2
Total      10     4
Likelihood table of the weather conditions:
Weather    No     Yes
Overcast   0      5      5/14 = 0.35
Rainy      2      2      4/14 = 0.29
Sunny      2      3      5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
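The same calculation can be reproduced in a few lines of Python, using the counts from the frequency table
above (a sketch; the exact values differ slightly from the hand-worked numbers because those round
intermediate results):
p_yes, p_no = 10/14, 4/14          # 10 "Yes" days and 4 "No" days out of 14
p_sunny = 5/14                     # 5 Sunny days out of 14
p_sunny_given_yes = 3/10           # 3 of the 10 "Yes" days are Sunny
p_sunny_given_no = 2/4             # 2 of the 4 "No" days are Sunny
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # = 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # = 0.40
print(p_yes_given_sunny > p_no_given_sunny)               # True: play on a sunny day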
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions as compared to other algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, i.e., predicting which category a
particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the predictor
variables are the independent Booleans variables. Such as if a particular word is present or not in a
document. This model is also famous for document classification tasks.
Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can easily
compare the Naive Bayes model with the other models.
Steps to implement:
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is similar to
what we did in data pre-processing. The code for this is given below:
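A minimal sketch of the data-preparation step is given below; it assumes a user_data.csv file with feature
columns Age and EstimatedSalary and a target column Purchased (the file name and column names are
assumptions for illustration):
# Importing the dataset
import pandas as pd
dataset = pd.read_csv('user_data.csv')
x = dataset[['Age', 'EstimatedSalary']].values   # feature matrix
y = dataset['Purchased'].values                  # target vector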
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the code
for it:
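A minimal sketch of the fitting step, assuming the preprocessed x_train and y_train from above:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)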
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can also use
other classifiers as per our requirement.
Output:
Now we will predict the test set result. For this, we will create a new predictor variable y_pred, and will use
the predict function to make the predictions.
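A sketch of the prediction step, assuming the fitted classifier from the previous snippet:
# Predicting the Test set results
y_pred = classifier.predict(x_test)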
The above output shows the result for the prediction vector y_pred and the real vector y_test. We can see
that some predictions are different from the real values; these are the incorrect predictions.
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is the code
for it:
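A sketch using sklearn's confusion_matrix, assuming y_test and y_pred from above:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)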
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and 65+25=90
correct predictions.
Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:
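One way to produce such a plot is sketched below; it assumes exactly two scaled features (labelled Age and
Estimated Salary here, following the user_data example) and the classifier fitted above:
# Visualising the Training set results
import numpy as np
import matplotlib.pyplot as plt
x1_min, x1_max = x_train[:, 0].min() - 1, x_train[:, 0].max() + 1
x2_min, x2_max = x_train[:, 1].min() - 1, x_train[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.01),
                       np.arange(x2_min, x2_max, 0.01))
# Colour each grid point by its predicted class, then overlay the training points
plt.contourf(xx1, xx2,
             classifier.predict(np.c_[xx1.ravel(), xx2.ravel()]).reshape(xx1.shape),
             alpha=0.3)
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, edgecolors='k')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.title('Naive Bayes (Training set)')
plt.show()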
Output:
The above output shows the result for the training set data. As we can see, the classifier has created a
Gaussian-shaped decision boundary to divide the "purchased" and "not purchased" classes. There are some
wrong predictions, which we have counted in the confusion matrix, but it is still a pretty good classifier.
9. Give a brief note on structured and unstructured data analysis in Machine learning.
Structured data
Structured data is particularly useful when you’re dealing with discrete, numeric data. Examples of this type
of data include financial operations, sales and marketing figures, and scientific modeling. You can also use
structured data in any case where records with multiple, short-entry text, numeric, and enumerated fields are
required, such as HR records, inventory listings, and housing data.
Unstructured data
Unstructured data is used when a record is required and the data won’t fit into a structured data format.
Examples include video monitoring, company documents, and social media posts. You can also use
unstructured data where it isn’t efficient to store the data in a structured format, such as Internet of Things
(IoT) sensor data, computer system logs, and chat transcripts.
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a set of
algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the
basis of training, they build the model & perform a specific task.
These ML algorithms help to solve different business problems like Regression, Classification, Forecasting,
Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into mainly four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
In this topic, we will provide a detailed description of the types of machine learning along with their
respective algorithms.
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. It means that in the supervised
learning technique, we train the machines using the "labelled" dataset, and based on the training, the
machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to
the output. More precisely, we can say: first, we train the machine with the input and corresponding output,
and then we ask the machine to predict the output for the test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog
images. First, we train the machine to understand the images, using features such as the shape and size of
the tail of a cat and a dog, the shape of the eyes, colour, and height (dogs are taller, cats are smaller). After
completion of training, we input the picture of a cat and ask the machine to identify the object and predict
the output. Now, the machine is well trained, so it will check all the features of the object, such as height,
shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the
process of how the machine identifies the objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud Detection,
Spam filtering, etc.
Supervised machine learning can be classified into two types of problems, which are given below:
o Classification
o Regression
a) Classification
Classification algorithms are used to solve the classification problems in which the output variable is
categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification algorithms predict
the categories present in the dataset. Some real-world examples of classification algorithms are Spam
Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems in which there is a linear relationship between
input and output variables. These are used to predict continuous output variables, such as market trends,
weather prediction, etc.
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
o These algorithms may not be able to solve complex tasks, and they may predict the wrong output if
the test data differs from the training data.
o Training requires a labelled dataset, which can be expensive and time-consuming to obtain.
Applications of Supervised Learning:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image classification
is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. It is done by using
medical images and past labelled data with labels for disease conditions. With such a process, the
machine can identify a disease for the new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying fraud
transactions, fraud customers, etc. It is done by using historic data to identify the patterns that can
lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These algorithms
classify an email as spam or not spam. The spam emails are sent to the spam folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The
algorithm is trained with voice data, and various identifications can be done using the same, such as
voice-activated passwords, voice commands, etc.
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no
need for supervision. It means, in unsupervised machine learning, the machine is trained using the unlabeled
dataset, and the machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor labelled, and the
model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset
according to the similarities, patterns, and differences. Machines are instructed to find the hidden
patterns from the input dataset.
Let's take an example to understand it more precisely; suppose there is a basket of fruit images, and we
input it into the machine learning model. The images are totally unknown to the model, and the task of the
machine is to find the patterns and categories of the objects.
So, now the machine will discover its patterns and differences, such as colour difference, shape difference,
and predict the output when it is tested with the test dataset.
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a way to group
the objects into a cluster such that the objects with the most similarities remain in one group and have fewer
or no similarities with the objects of other groups. An example of the clustering algorithm is grouping the
customers by their purchasing behaviour.
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting relations among
variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one
data item on another data item and map those variables accordingly so that it can generate maximum profit.
This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous production,
etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
Advantages:
o These algorithms can be used for more complicated tasks than the supervised ones, because they
work on unlabeled data.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is easier as
compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate as the dataset is not labelled, and
algorithms are not trained with the exact output in prior.
o Working with Unsupervised learning is more difficult as it works with the unlabelled dataset that
does not map with the output.
Applications of Unsupervised Learning:
o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright in
document network analysis of text data for scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised learning techniques
for building recommendation applications for different web applications and e-commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which
can identify unusual data points within the dataset. It is used to discover fraudulent transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract particular
information from the database. For example, extracting information of each user located at a
particular location.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning. It represents the intermediate ground between Supervised (With Labelled
training data) and Unsupervised learning (with no labelled training data) algorithms and uses the
combination of labelled and unlabeled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and
operates on data that has a few labels, it mostly consists of unlabeled data. Labels are costly, but for
practical purposes a small number of labelled examples may be available. This setting differs from
supervised and unsupervised learning, which are based on the presence and absence of labels, respectively.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the concept
of Semi-supervised learning is introduced. The main aim of semi-supervised learning is to effectively use
all the available data, rather than only labelled data like in supervised learning. Initially, similar data is
clustered along with an unsupervised learning algorithm, and further, it helps to label the unlabeled data into
labelled data. It is because labelled data is a comparatively more expensive acquisition than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is where a student is under the
supervision of an instructor at home and college. Further, if that student is self-analysing the same concept
without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning,
the student has to revise himself after analyzing the same concept under the guidance of an instructor at
college.
Advantages:
o It is simple and easy to understand.
o It makes effective use of all the available data, combining a small amount of expensive labelled data
with plentiful unlabeled data.
Disadvantages:
o Iteration results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
In reinforcement learning, there is no labelled data as in supervised learning; the agent learns only from its
own experiences.
The reinforcement learning process is similar to the way a human being learns; for example, a child learns
various things through experiences in his day-to-day life. An example of reinforcement learning is playing a
game, where the game is the environment, the moves of the agent at each step define states, and the goal of
the agent is to get a high score. The agent receives feedback in terms of rewards and punishments.
Due to its way of working, reinforcement learning is employed in different fields such as game theory,
operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized using Markov Decision Process(MDP). In MDP, the
agent constantly interacts with the environment and performs actions; at each action, the environment
responds and generates a new state.
Applications of Reinforcement Learning:
o Video Games:
RL algorithms are much popular in gaming applications. It is used to gain super-human performance.
Some popular games that use RL algorithms are AlphaGO and AlphaGO Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how RL can be used
in computers to automatically learn to allocate and schedule resources for waiting jobs in order to
minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning. There are
different industries that have their vision of building intelligent robots using AI and Machine
learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help of
Reinforcement Learning by Salesforce company.
Advantages
o It helps in solving complex real-world problems which are difficult to be solved by general
techniques.
o The learning model of RL is similar to the learning of human beings; hence most accurate results can
be found.
o Helps in achieving long term results.
Disadvantage
The curse of dimensionality limits reinforcement learning for real physical systems.
Descriptive statistics
After collecting data, one of the first things to do is to graph the data, calculate the mean and get an
overview of the distributions of the data. This is the task of descriptive statistics.
Thus, the goal of descriptive statistics is to gain an overview of the distribution of data sets. Descriptive
statistics helps to describe and illustrate data sets.
Definition
The term descriptive statistics covers statistical methods for describing data using statistical characteristics,
charts, graphics or tables.
It is important here that only the properties of the respective sample are described and evaluated. However,
no conclusions are drawn about other points in time or the population. This is the task of inferential statistics
or concluding statistics.
The first group of descriptive statistics are location parameters (measures of central tendency) such as the
mean, median, and mode. They are used to express the central tendency of a data set; they describe where
the center of a sample lies or where a large part of the sample is located.
The second group are measures of dispersion. They provide information about how much the values of a
variable in a sample differ from each other. Measures of dispersion can therefore describe how strongly the
values of a variable deviate from the mean value: Are the values rather close together, i.e. are they similar,
or are they far apart and thus differ greatly? A classic example is the standard deviation.
Which measures of location or dispersion are suitable for describing the data depends on the
respective scales of measurement of the variable. Here, a distinction can be made
between metric, ordinal and nominal scales of measurement.
Finally, a large area of descriptive statistics is diagrams such as the bar chart, the pie chart, or the histogram.
A random sample of 10 male basketball players is drawn, and their height is measured in meters.
Player    Body height (m)
1         1.62
2         1.72
3         1.55
4         1.70
5         1.78
6         1.65
7         1.64
8         1.64
9         1.66
10        1.74
From this sample, a table of descriptive statistics (the relevant location and dispersion measures) on the
height of the players can be computed.
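The same location and dispersion measures can be computed directly in Python (a sketch using the ten
heights above and the standard library's statistics module):
import statistics
heights = [1.62, 1.72, 1.55, 1.70, 1.78, 1.65, 1.64, 1.64, 1.66, 1.74]
print("mean:", statistics.mean(heights))       # central tendency
print("median:", statistics.median(heights))   # central tendency, robust to outliers
print("std dev:", statistics.stdev(heights))   # sample dispersion around the mean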
Inferential statistics
What is inferential statistics? In contrast to descriptive statistics, inferential statistics aims to make a
statement about the population. However, since it is almost impossible in most cases to survey the entire
population, a sample is used, i.e. a small data set originating from the population. With this sample a
statement about the population can be made. An example would be if a sample of 1,000 citizens is taken
from the population of all Canadian citizens.
Depending on which statement is to be made about the population or which question is to be answered about
the population, different statistical methods or hypothesis tests are used. The best known are the hypothesis
tests with which a group difference can be tested, such as the t-test, the chi-square test or the analysis of
variance. Then there are the hypothesis tests with which a correlation of variables can be tested, such as
correlation analysis and regression.
Inferential statistics is a branch of statistics that uses various analytical tools to draw conclusions about the
population from sample data. For a given hypothesis about the population, inferential statistics uses a sample
and gives an indication of the validity of the hypothesis based on the sample collected.
In the example above, a sample of 10 basketball players was drawn and then exactly this sample was
described, this is the task of descriptive statistics. If you want to make a statement about the population you
need the inferential statistics. For example, it could be of interest if basketball players are larger than the
average male population. To test this hypothesis, a one-sample t-test is calculated; the t-test compares the
sample mean with the mean of the population.
Furthermore, the question could arise whether basketball players are larger than football players. For this
purpose, a sample of football players is drawn, and then the mean value of the basketball players can be
compared with the mean value of the football players using an independent t-test. Now a statement can be
made, for example, whether basketball players are larger than football players in the population or not.
Since this statement is only made based on the samples and it can also be pure coincidence that the
basketball players are larger in exactly this sample, the statement can only be confirmed or rejected
with a certain probability.
Inferential statistics helps to develop a good understanding of the population data by analyzing the samples
obtained from it. It helps in making generalizations about the population by using various analytical tests
and tools. In order to pick out random samples that will represent the population accurately many sampling
techniques are used. Some of the important methods are simple random sampling, stratified sampling,
cluster sampling, and systematic sampling techniques.
Inferential statistics can be defined as a field of statistics that uses analytical tools for drawing conclusions
about a population by examining random samples. The goal of inferential statistics is to make
generalizations about a population. In inferential statistics, a statistic is taken from the sample data (e.g., the
sample mean) that is used to make inferences about the population parameter (e.g., the population mean).
Inferential statistics can be classified into hypothesis testing and regression analysis. Hypothesis testing also
includes the use of confidence intervals to test the parameters of a population. Given below are the different
types of inferential statistics.
Hypothesis Testing
Hypothesis testing is a type of inferential statistics that is used to test assumptions and draw conclusions
about the population from the available sample data. It involves setting up a null hypothesis and an
alternative hypothesis followed by conducting a statistical test of significance. A conclusion is drawn based
on the value of the test statistic, the critical value, and the confidence intervals. A hypothesis test can be
left-tailed, right-tailed, or two-tailed. Given below are certain important hypothesis tests that are used in
inferential statistics.
Z Test: A z test is used on data that follows a normal distribution and has a sample size greater than or equal
to 30. It is used to test if the means of the sample and population are equal when the population variance is
known. The right-tailed hypothesis can be set up as follows:
H0: μ = μ0, H1: μ > μ0, with test statistic z = (x̄ - μ0) / (σ / √n)
T Test: A t test is used when the data follows a Student's t distribution and the sample size is less than 30. It
is used to compare the sample and population mean when the population variance is unknown. The
hypothesis is set up in the same way, with test statistic:
t = (x̄ - μ0) / (s / √n), where s is the sample standard deviation
F Test: An f test is used to check if there is a difference between the variances of two samples or
populations. The right-tailed f hypothesis test can be set up as follows:
H0: σ1² = σ2², H1: σ1² > σ2², with test statistic f = s1² / s2² (the ratio of the sample variances)
Confidence Interval: A confidence interval helps in estimating the parameters of a population. For
example, a 95% confidence interval indicates that if a test is conducted 100 times with new samples under
the same conditions then the estimate can be expected to lie within the given interval 95 times. Furthermore,
a confidence interval is also useful in calculating the critical value in hypothesis testing.
Apart from these tests, other tests used in inferential statistics are the ANOVA test, Wilcoxon signed-rank
test, Mann-Whitney U test, Kruskal-Wallis H test, etc.
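As a sketch of the one-sample t-test described above, the ten basketball heights from the
descriptive-statistics example can be compared against a population mean height (the value 1.75 m used
here is an assumption for illustration):
from scipy import stats
heights = [1.62, 1.72, 1.55, 1.70, 1.78, 1.65, 1.64, 1.64, 1.66, 1.74]
t_statistic, p_value = stats.ttest_1samp(heights, popmean=1.75)
print(t_statistic, p_value)   # a small p-value suggests the sample mean differs from 1.75 m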
Regression Analysis
Regression analysis is used to quantify how one variable will change with respect to another variable. There
are many types of regressions available such as simple linear, multiple linear, nominal, logistic, and ordinal
regression. The most commonly used regression in inferential statistics is linear regression. Linear
regression checks the effect of a unit change of the independent variable in the dependent variable. Some
important formulas used in inferential statistics for simple linear regression are as follows:
Regression equation: ŷ = α + βx
Regression coefficients: β = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)², and α = ȳ - βx̄
Inferential Statistics Examples
Inferential statistics is very useful and cost-effective as it can make inferences about the population without
collecting the complete data. Some inferential statistics examples are given below:
Suppose the mean marks of a sample of 100 students from a particular country are known. Using this sample
information, the mean marks of all students in the country can be approximated using inferential statistics.
Suppose a coach wants to find out how many average cartwheels sophomores at his college can do
without stopping. A sample of a few students will be asked to perform cartwheels and the average will
be calculated. Inferential statistics will use this data to draw a conclusion about how many cartwheels
sophomores can perform on average.
Inferential Statistics vs Descriptive Statistics
Descriptive and inferential statistics are used to describe data and make generalizations about the population
from samples. The table given below lists the differences between inferential statistics and descriptive
statistics.
Descriptive statistics: measures of central tendency and measures of dispersion are the important tools used.
Inferential statistics: hypothesis testing and regression analysis are the analytical tools used.