Unit 3
Reinforcement learning
Reinforcement learning is a machine learning approach similar to supervised learning, except that the algorithm isn’t trained on sample data. Instead, the model learns as it goes, through trial and error. A sequence of successful outcomes is reinforced to develop the best recommendation or policy for a given problem.
● Neural networks: Neural networks simulate the way the human brain works, with a
huge number of linked processing nodes. Neural networks are good at recognizing
patterns and play an important role in applications including natural language translation,
image recognition, speech recognition, and image creation.
● Linear regression: This algorithm is used to predict numerical values, based on a
linear relationship between different values. For example, the technique could be used to
predict house prices based on historical data for the area.
● Logistic regression: This supervised learning algorithm makes predictions for
categorical response variables, such as “yes/no” answers to questions. It can be used for
applications such as classifying spam and quality control on a production line.
● Clustering: Using unsupervised learning, clustering algorithms can identify patterns in
data so that it can be grouped. Computers can help data scientists by identifying
differences between data items that humans have overlooked.
● Decision trees: Decision trees can be used for both predicting numerical values
(regression) and classifying data into categories. Decision trees use a branching sequence
of linked decisions that can be represented with a tree diagram. One of the advantages of
decision trees is that they are easy to validate and audit, unlike the black box of the neural
network.
● Random forests: In a random forest, the machine learning algorithm predicts a value or
category by combining the results from a number of decision trees.
Decision trees, for example, have the advantage of being easier to validate and audit than the black box of a neural network. The trade-off is that they can be less stable than other predictors: small changes in the training data can produce a noticeably different tree.
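To make the contrast between a single decision tree and a random forest concrete, here is a minimal sketch in Python using scikit-learn; the library choice, the synthetic dataset, and all parameter values are assumptions made for illustration, not part of the original text.

```python
# A single decision tree vs. a random forest on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic classification dataset (assumed purely for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One decision tree: easy to inspect and audit, but sensitive to the data.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A random forest: combines the votes of many trees to reduce that instability.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Decision tree accuracy:", round(tree.score(X_test, y_test), 3))
print("Random forest accuracy:", round(forest.score(X_test, y_test), 3))
```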
Overall, there are many advantages to machine learning that businesses can leverage for new
efficiencies. These include machine learning identifying patterns and trends in massive volumes
of data that humans might not spot at all. And this analysis requires little human intervention:
just feed in the dataset of interest and let the machine learning system assemble and refine its
own algorithms, which will continually improve with more data input over time. Customers and users can also enjoy a more personalized experience, as the model learns more with every interaction with that person.
On the downside, machine learning requires large training datasets that are accurate and
unbiased. GIGO is the operative factor: garbage in / garbage out. Gathering sufficient data and
having a system robust enough to run it might also be a drain on resources. Machine learning can
also be prone to error, depending on the input. With too small a sample, the system could
produce a perfectly logical algorithm that is completely wrong or misleading. To avoid wasting
budget or displeasing customers, organizations should act on the answers only when there is high
confidence in the output.
Technological singularity
While this topic garners a lot of public attention, many researchers are not concerned with the
idea of AI surpassing human intelligence in the near future. Technological singularity is also
referred to as strong AI or superintelligence. Philosopher Nick Bostrom defines superintelligence
as “any intellect that vastly outperforms the best human brains in practically every field,
including scientific creativity, general wisdom, and social skills.”
Despite the fact that superintelligence is not imminent in society, the idea of it raises some
interesting questions as we consider the use of autonomous systems, like self-driving cars. It’s
unrealistic to think that a driverless car would never have an accident, but who is responsible and
liable under those circumstances? Should we still develop autonomous vehicles, or do we limit
this technology to semi-autonomous vehicles which help people drive safely? The jury is still out
on this, but these are the types of ethical debates that are occurring as new, innovative AI
technology develops.
AI impact on jobs
While a lot of public perception of artificial intelligence centers around job losses, this concern
should probably be reframed. With every disruptive, new technology, we see that the market
demand for specific job roles shifts. For example, when we look at the automotive industry,
many manufacturers, like GM, are shifting to focus on electric vehicle production to align with
green initiatives. The energy industry isn’t going away, but the source of energy is shifting from
a fuel-based economy to an electric one.
In a similar way, artificial intelligence will shift the demand for jobs to other areas. There will
need to be individuals to help manage AI systems. There will still need to be people to address
more complex problems within the industries that are most likely to be affected by job demand
shifts, such as customer service. The biggest challenge with artificial intelligence and its effect
on the job market will be helping people to transition to new roles that are in demand.
Privacy
Privacy tends to be discussed in the context of data privacy, data protection, and data security.
These concerns have prompted policymakers to make more strides in recent years. For example, in
2016, GDPR legislation was created to protect the personal data of people in the European Union
and European Economic Area, giving individuals more control of their data. In the United States,
individual states are developing policies, such as the California Consumer Privacy Act (CCPA),
which was introduced in 2018 and requires businesses to inform consumers about the collection
of their data. Legislation such as this has forced companies to rethink how they store and use
personally identifiable information (PII). As a result, investments in security have become an
increasing priority for businesses as they seek to eliminate any vulnerabilities and opportunities
for surveillance, hacking, and cyberattacks.
Instances of bias and discrimination across a number of machine learning systems have raised
many ethical questions regarding the use of artificial intelligence. How can we safeguard against
bias and discrimination when the training data itself may be generated by biased human
processes? Bias and discrimination aren’t limited to the human resources function either; they
can be found in a number of applications from facial recognition software to social media
algorithms.
As businesses become more aware of the risks with AI, they’ve also become more active in this
discussion around AI ethics and values. For example, IBM has sunset its general purpose facial
recognition and analysis products. IBM CEO Arvind Krishna wrote: “IBM firmly opposes and
will not condone uses of any technology, including facial recognition technology offered by
other vendors, for mass surveillance, racial profiling, violations of basic human rights and
freedoms, or any purpose which is not consistent with our values and Principles of Trust and
Transparency.”
Accountability
Since there isn’t significant legislation to regulate AI practices, there is no real enforcement
mechanism to ensure that ethical AI is practiced. At present, the main incentive for companies to act ethically is the threat of the negative repercussions an unethical AI system can have on the bottom line. To fill the
gap, ethical frameworks have emerged as part of a collaboration between ethicists and
researchers to govern the construction and distribution of AI models within society. However, at
the moment, these only serve to guide. Some research shows that the combination of distributed responsibility and a lack of foresight into potential consequences isn’t conducive to preventing harm to society.
How to choose the right AI platform for machine learning
Selecting a platform can be a challenging process, as the wrong system can drive up costs, or
limit the use of other valuable tools or technologies. When reviewing multiple vendors to select
an AI platform, there is often a tendency to think that more features = a better system. Maybe so,
but reviewers should start by thinking through what the AI platform will be doing for their
organization. What machine learning capabilities need to be delivered and what features are
important to accomplish them? One missing feature might doom the usefulness of an entire
system. Here are some features to consider. For example, does the organization need:
● a content generator that can generate text, images and other content based on the data it
was trained on?
● automated classification to read and classify written input, such as evaluating and sorting
customer complaints or reviewing customer feedback sentiment?
● a summary generator that can transform dense text into a high-quality summary, capture
key points from financial reports, and generate meeting transcriptions?
● a data extraction capability to sort through complex details and quickly pull the necessary
information from large documents?
Regression analysis
Suppose you have collected data on people's highest education level, weekly working hours, and age. You could then investigate whether these three variables have an influence on a person's salary. If so, you can predict a person's salary using their highest education level, weekly working hours, and age.
Thus, in this example, salary is the dependent variable, and highest educational attainment, weekly hours worked, and age are the independent variables.
Regression analysis thus provides information about how the value of the dependent variable changes when one of the independent variables is changed.
Linear Regression
What is a linear regression analysis?
Linear regression analysis is used to create a model that describes the relationship
between a dependent variable and one or more independent variables. Depending on
whether there are one or more independent variables, a distinction is made between
simple and multiple linear regression analysis.
In the case of a simple linear regression, the aim is to examine the influence of an
independent variable on one dependent variable. In the second case, a multiple linear
regression, the influence of several independent variables on one dependent variable is
analyzed. Example: Does the height have an influence on the weight of a person?
The goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable. The stronger the linear relationship between the independent variable and the dependent variable, the more accurate the prediction. Put differently, the greater the proportion of the dependent variable's variance that can be explained by the independent variable, the more accurate the prediction. Visually, the relationship between the variables can be shown in a scatter plot: the stronger the linear relationship between the dependent and independent variables, the more closely the data points lie on a straight line.
The task of simple linear regression is to determine the straight line that best describes the linear relationship between the dependent and independent variable. In linear regression analysis, this straight line is drawn in the scatter plot; to determine it, linear regression uses the method of least squares.
The resulting regression line can be written as ŷ = a + b · x, where a is the intercept and b is the slope. ŷ is the respective estimate of the y-value: for each x-value, the corresponding y-value is estimated. In our example, this means that the height of people is used to estimate their weight.
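As an illustrative sketch only, the least-squares line for the height/weight example could be computed as follows in Python; the height and weight values are invented, and NumPy's polyfit is just one of several ways to obtain the least-squares estimates.

```python
# Simple linear regression by least squares: estimate weight (y) from height (x).
import numpy as np

height = np.array([160, 165, 170, 175, 180, 185])  # x values in cm (assumed data)
weight = np.array([55, 62, 66, 72, 77, 83])         # y values in kg (assumed data)

# polyfit with degree 1 returns the least-squares slope b and intercept a.
b, a = np.polyfit(height, weight, deg=1)

# y-hat: the estimated weight for each observed height.
weight_hat = a + b * height
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print("estimated weights:", np.round(weight_hat, 1))
```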
Marketing example:
For a video streaming service, you are asked to predict how many times a month a person streams videos. For this, you receive a data set of user data (age, income, gender, ...).
Medical example:
You want to find out which factors have an influence on the cholesterol level of patients.
For this purpose, you analyze a patient data set with cholesterol level, age, hours of sport
per week and so on.
The equation necessary for the calculation of a multiple regression, with k independent variables, is:
ŷ = a + b1 · x1 + b2 · x2 + ... + bk · xk
The coefficients can now be interpreted similarly to the simple linear regression equation. If all independent variables are 0, the resulting value is a. If an independent variable changes by one unit, the associated coefficient indicates by how much the dependent variable changes. So if the independent variable xi increases by one unit, the dependent variable y changes by bi.
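As a hedged sketch of how such a multiple regression could be fitted and its coefficients read off, the example below reuses the salary scenario from earlier (education level, weekly working hours, age); the data values and the use of scikit-learn are assumptions made for illustration.

```python
# Multiple linear regression: salary predicted from three independent variables.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: years of education, weekly working hours, age (assumed data).
X = np.array([
    [12, 38, 25],
    [16, 40, 31],
    [18, 45, 40],
    [12, 35, 52],
    [16, 42, 36],
    [18, 50, 48],
])
salary = np.array([32000, 45000, 61000, 38000, 50000, 72000])  # assumed values

model = LinearRegression().fit(X, salary)

# a: predicted salary when all independent variables are 0.
print("intercept a:", round(model.intercept_, 1))
# bi: change in salary when variable xi increases by one unit.
print("coefficients b1..b3:", np.round(model.coef_, 1))
```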
Logistic Regression
Logistic regression is a special case of regression analysis and is used when the
dependent variable is nominally scaled. This is the case, for example, with the variable
purchase decision with the two values buys a product and does not buy a product.
Logistic regression analysis is thus the counterpart of linear regression, in which the dependent variable of the regression model must be at least interval-scaled.
With logistic regression, it is now possible to explain the dependent variable or estimate
the probability of occurrence of the categories of the variable.
Business example:
For an online retailer, you need to predict which product a particular customer is most
likely to buy. For this, you receive a data set with past visitors and their purchases from
the online retailer.
Medical example:
You want to investigate whether a person is susceptible to a certain disease or not. For
this purpose, you receive a data set with diseased and non-diseased persons as well as
other medical parameters.
Political example:
Would a person vote for party A if there were elections next weekend?
In medicine, for example, a frequent application is to find out which variables have an
influence on a disease. In this case, 0 could stand for not diseased and 1 for diseased.
Subsequently, the influence of age, gender and smoking status (smoker or not) on this
particular disease could be examined.
Logistic regression and probabilities
In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific value of the dependent variable (e.g., body weight). In logistic regression, by contrast, the dependent variable is dichotomous, and the model estimates the probability that it takes the value 1. To do this, it is necessary to restrict the value range for the prediction to the range between 0 and 1. To ensure that only values between 0 and 1 are possible, the logistic function f is used.
Logistic (sigmoid) function
The logistic model is based on the logistic function f(z) = 1 / (1 + e^(-z)). The special thing about the logistic function is that, for values between minus and plus infinity, it only ever takes values between 0 and 1.
So the logistic function is well suited to describing the probability P(y=1). If the logistic function is applied to the regression equation above, the result is:
P(y=1) = 1 / (1 + e^(-(a + b1 · x1 + b2 · x2 + ... + bk · xk)))
This ensures that, no matter in which range the x values lie, only values between 0 and 1 come out. Plotted against the linear predictor, the resulting curve has the characteristic S shape of the sigmoid function.
For given values of the independent variables, the probability that the dichotomous dependent variable y equals 1 is therefore given by the expression above, and P(y=0) = 1 - P(y=1).
To calculate the probability of a person being sick or not using logistic regression for the example above, the model parameters b1, b2, b3 and a must first be determined. Once these have been estimated, the equation for the example becomes:
P(diseased) = 1 / (1 + e^(-(a + b1 · age + b2 · gender + b3 · smoking status)))
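As a small sketch of how the fitted model turns the linear predictor into a probability for the disease example, consider the Python snippet below; the coefficient values a, b1, b2 and b3 are invented placeholders, not estimates from any real data.

```python
# Turning the logistic regression equation into a probability P(y = 1).
import numpy as np

a, b1, b2, b3 = -6.0, 0.09, 0.4, 1.1   # assumed, illustrative coefficients

def prob_diseased(age, gender, smoker):
    """P(y = 1) via the logistic (sigmoid) function."""
    z = a + b1 * age + b2 * gender + b3 * smoker   # linear predictor
    return 1.0 / (1.0 + np.exp(-z))                # maps z into (0, 1)

# Example: a 55-year-old, gender coded as 1, who smokes (smoker = 1).
print(round(prob_diseased(age=55, gender=1, smoker=1), 3))
```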
Evaluation Metrics
Evaluation metrics play an important role in obtaining the best possible classifier during classification training. Performance metrics for model evaluation shed light on:
● Model efficiency
● Production readiness of the model
● Potential performance enhancement with more training data
● Overfitting or underfitting tendencies of the model
In short, they tell us whether the model is:
● Accurate
● Consistent
● Reliable on new data
Example:
A model that predicts “no fraud” 99% of the time in a dataset with 1% fraud will have
99% accuracy — but it’s useless for detecting fraud.
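The fraud example can be reproduced in a few lines; the simulated labels below are an assumption used only to show how a model that always predicts "no fraud" reaches roughly 99% accuracy while detecting nothing.

```python
# The accuracy paradox: high accuracy, zero usefulness on the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # about 1% fraud cases
y_pred = np.zeros_like(y_true)                     # always predict "no fraud"

print("accuracy:", accuracy_score(y_true, y_pred))             # close to 0.99
print("recall on fraud cases:", recall_score(y_true, y_pred))  # 0.0
```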
Confusion Matrix
The confusion matrix is a table that shows the performance of a classification model by comparing its predictions against the actual labels. Imagine a model that predicts whether a person has COVID based on symptoms. The table then has four cells:
● True positive (TP): the person has COVID and the model predicts COVID.
● False negative (FN): the person has COVID but the model predicts no COVID.
● False positive (FP): the person does not have COVID but the model predicts COVID.
● True negative (TN): the person does not have COVID and the model predicts no COVID.
Accuracy
Accuracy is the proportion of all predictions that are correct:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 Score
The F1 score is the harmonic mean of precision and recall, and it is more informative than accuracy on imbalanced data:
F1 = 2 · (Precision · Recall) / (Precision + Recall), where Precision = TP / (TP + FP) and Recall = TP / (TP + FN)
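As a sketch, the quantities above can be computed directly with scikit-learn; the true and predicted labels below are invented for illustration (1 = has COVID, 0 = does not).

```python
# Confusion matrix, accuracy, and F1 score for a small set of predictions.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # actual condition (assumed labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions (assumed labels)

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
```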
Cross-Validation
Rather than relying on a single train/test split, cross-validation helps us estimate how well our model performs on unseen data.
K-Fold Cross-Validation: the data is split into k equally sized folds. The model is trained on k - 1 folds and validated on the remaining fold, and this is repeated k times so that every fold serves as the validation set exactly once. The k validation scores are then averaged.
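The following is a minimal sketch of 5-fold cross-validation; the synthetic dataset and the choice of logistic regression as the model are assumptions made purely for illustration.

```python
# 5-fold cross-validation: each fold is held out once for validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Train on 4 folds, validate on the remaining fold, repeat 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```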
Regression Metrics
Mean Absolute Error (MAE)
Definition:
The average of the absolute differences between predicted and actual values. It measures the average magnitude of errors in predictions without considering their direction.
Formula:
MAE = (1/n) · Σ |yi - ŷi|
When to Use:
● When you want an error measure in the same units as the target variable.
● When large errors should not be penalized more heavily than small ones (MAE is less sensitive to outliers).
Mean Squared Error (MSE)
Definition:
The average of the squared differences between predicted and actual values. It penalizes larger errors more than smaller ones.
Formula:
MSE = (1/n) · Σ (yi - ŷi)²
When to Use:
● When large errors are particularly undesirable and should be penalized more strongly.
R² (Coefficient of Determination)
Definition:
Indicates how well the model explains the variability of the target variable. R² = 1 means perfect prediction; R² = 0 means no explanatory power.
Formula:
R² = 1 - (sum of squared residuals) / (total sum of squares)
When to Use:
● When you want to understand how well your model fits the data.
● Good for comparing different regression models on the same dataset.
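These three metrics can be computed directly with scikit-learn, as in the sketch below; the true and predicted values are made up for illustration.

```python
# MAE, MSE, and R-squared for a handful of predictions.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # actual values (assumed)
y_pred = [2.5,  0.0, 2.0, 8.0]   # model predictions (assumed)

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R^2:", round(r2_score(y_true, y_pred), 3))
```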
Classification
Goal: Assign the correct class (like "spam" or "not spam", "cat" or "dog", "disease" or "no disease").
Algorithm notes:
● Logistic Regression: Good for binary outcomes (Yes/No)
Decision trees
A decision tree is built from the following components:
● Root Node: The starting point of the tree, representing the entire dataset.
● Internal Nodes: Represent decision points or conditions based on features.
● Branches: Links between nodes, representing the possible outcomes of a decision.
● Leaf Nodes: Terminal nodes that represent the final predictions or classifications.
As an example, imagine that you are trying to decide whether or not you should go surfing; you might work through a sequence of such branching decision rules to make a choice.
Decision tree learning employs a divide and conquer strategy by conducting a greedy search to
identify the optimal split points within a tree. This process of splitting is then repeated in a
top-down, recursive manner until all, or the majority of records have been classified under
specific class labels.
Whether or not all data points are classified as homogenous sets is largely dependent on the
complexity of the decision tree. Smaller trees are more easily able to attain pure leaf nodes—i.e.
data points in a single class. However, as a tree grows in size, it becomes increasingly difficult to
maintain this purity, and it usually results in too little data falling within a given subtree. When
this occurs, it is known as data fragmentation, and it can often lead to overfitting.
As a result, decision tree algorithms tend to prefer small trees, which is consistent with the principle of parsimony in Occam’s Razor; that is, “entities should not be multiplied beyond necessity.” Said differently, decision trees should add complexity only if necessary, as the simplest explanation is often the best. To reduce complexity and prevent overfitting, pruning is usually employed; this is a process that removes branches that split on features with low importance. The model’s fit can then be evaluated through the process of cross-validation.
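One common way to post-prune in practice is cost-complexity pruning; the sketch below uses scikit-learn's ccp_alpha parameter (a specific mechanism the text does not mention) together with cross-validation to choose the pruning strength. The dataset and the grid of alpha values are illustrative assumptions.

```python
# Post-pruning via cost-complexity pruning, evaluated with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Larger ccp_alpha prunes more branches, giving a smaller, simpler tree.
for alpha in [0.0, 0.005, 0.01, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"ccp_alpha={alpha}: mean CV accuracy = {score:.3f}")
```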
Types of decision trees
Hunt’s algorithm, which was developed in the 1960s to model human learning in psychology,
forms the foundation of many popular decision tree algorithms, such as the following:
- ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate candidate splits.
- C4.5: This algorithm is considered a later iteration of ID3, which was also developed by
Quinlan. It can use information gain or gain ratios to evaluate split points within the decision
trees.
- CART: The term CART is an abbreviation for “classification and regression trees” and was introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal attribute to split on. Gini impurity measures how often a randomly chosen data point would be misclassified if it were labeled randomly according to the class distribution in the dataset. When evaluating with Gini impurity, a lower value is more ideal.
Key Concepts:
● Inductive Bias: A set of assumptions that the algorithm uses to make predictions.
● Information Gain: A measure of how much a feature helps reduce uncertainty in the
data.
● Gini Index/Gini Impurity: A measure of the impurity or disorder in a dataset, used to
determine the best split.
● Overfitting: When a tree becomes too complex and learns the training data too well,
leading to poor performance on new data.
● Pruning: A technique to remove unnecessary branches from the tree to prevent
overfitting.
Algorithm Steps:
● Feature Selection: Identifying the most relevant features for splitting the data.
● Splitting: Dividing the data based on the chosen feature, creating sub-trees.
● Recursion: Applying the same splitting process to each sub-tree until a stopping
condition is met (e.g., maximum depth, minimum number of samples per leaf).
Entropy can be calculated with the following formula:
Entropy(S) = - Σ p(c) · log2 p(c)
where the sum runs over the classes c in the dataset S, and p(c) is the proportion of data points in S that belong to class c.
Information gain represents the difference in entropy before and after a split on a given attribute. The attribute with the highest information gain will produce the best split, as it does the best job of classifying the training data according to its target classification. Information gain is usually represented with the following formula:
Gain(S, a) = Entropy(S) - Σ (|Sv| / |S|) · Entropy(Sv)
where the sum runs over the possible values v of attribute a, Sv is the subset of S for which attribute a has value v, |Sv| / |S| is the proportion of data points in that subset, and Entropy(Sv) is the entropy of that subset.
For this dataset, the entropy is 0.94. This can be calculated by finding the proportion of days
where “Play Tennis” is “Yes”, which is 9/14, and the proportion of days where “Play Tennis” is
“No”, which is 5/14. Then, these values can be plugged into the entropy formula above.
We can then compute the information gain for each of the attributes individually. For example,
the information gain for the attribute, “Humidity” would be the following:
Gain(Tennis, Humidity) = 0.94 - (7/14) · 0.985 - (7/14) · 0.592 = 0.151
- 7/14 represents the proportion of values where humidity equals “high” to the total number of
humidity values. In this case, the number of values where humidity equals “high” is the same as
the number of values where humidity equals “normal”.
Then, repeat the calculation for information gain for each attribute in the table above, and select
the attribute with the highest information gain to be the first split point in the decision tree. In
this case, outlook produces the highest information gain. From there, the process is repeated for
each subtree.
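The arithmetic above can be checked with a few lines of Python; the per-value class counts (3 "Yes" and 4 "No" days when humidity is "high", 6 "Yes" and 1 "No" when it is "normal") are assumed from the standard Play Tennis dataset and are consistent with the 0.985 and 0.592 entropies quoted above.

```python
# Entropy and information gain for the "Humidity" attribute in Play Tennis.
import math

def entropy(p_yes, p_no):
    """Entropy of a two-class distribution, in bits."""
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

H_dataset = entropy(9/14, 5/14)   # whole dataset: 9 Yes, 5 No   -> about 0.94
H_high    = entropy(3/7, 4/7)     # humidity = high: 3 Yes, 4 No  -> about 0.985
H_normal  = entropy(6/7, 1/7)     # humidity = normal: 6 Yes, 1 No -> about 0.592

gain = H_dataset - (7/14) * H_high - (7/14) * H_normal
print(round(gain, 2))             # about 0.15, matching the value computed above
```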
Disadvantages
● Prone to overfitting: Complex decision trees tend to overfit and do not generalize well to
new data. This scenario can be avoided through the processes of pre-pruning or
post-pruning. Pre-pruning halts tree growth when there is insufficient data while
post-pruning removes subtrees with inadequate data after tree construction.
● High variance estimators: Small variations within the data can produce a very different decision tree. Bagging, or the averaging of estimates across many trees trained on bootstrap samples, can be a method of reducing the variance of decision trees (see the sketch after this list). However, this approach is limited, as it can lead to highly correlated predictors.
● More costly: Given that decision trees take a greedy search approach during construction,
they can be more expensive to train compared to other algorithms.
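As a closing sketch of the bagging idea mentioned in the list above, the snippet below compares a single decision tree with a bagged ensemble of trees via cross-validation; the synthetic dataset and all settings are assumptions made for illustration.

```python
# Bagging as a variance-reduction technique for decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=1)

single_tree = DecisionTreeClassifier(random_state=1)
# 50 trees, each trained on a bootstrap sample, with their predictions combined.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                                 n_estimators=50, random_state=1)

print("single tree CV accuracy:", round(cross_val_score(single_tree, X, y, cv=5).mean(), 3))
print("bagged trees CV accuracy:", round(cross_val_score(bagged_trees, X, y, cv=5).mean(), 3))
```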