Creative Programming
Spring 2025
CUL1122 Lecture #14
Statistical Problems:
Introduction to Machine Learning
Today
❖Machine Learning
▪ Traditional Programming vs. Machine Learning: A Comparison
▪ The Basic Paradigm of Machine Learning
❖Understanding Distance Measures
❖Feature Representation and Engineering
❖Exercise: Classifying Reptiles
Machine Learning
❖A computer program that ‘automatically learns’ something.
❖Early definition of machine learning:
▪ “A field of study that gives computers the
ability to learn without being explicitly
programmed.” – Arthur Samuel (1959)
❖Arthur Samuel, a computer pioneer,
wrote the first self-learning program,
which played checkers and learned from experience.
❖In 1956, his checkers program, running on an IBM 701 computer, was
demonstrated to the public on television.
Traditional Programming vs. Machine Learning
❖In traditional programming, a programmer provides instructions to the
computer.
❖A program consists of a series of commands that tell the computer
what to do and in what order.
Traditional Programming vs. Machine Learning
❖Machine learning is an automated process that enables computers to solve
problems by analyzing data rather than by following a preset program.
❖In machine learning, we provide a sample set of input-output pairs,
which allows the system to learn a method for mapping inputs to correct
outputs, effectively creating a program.
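❖A minimal sketch of this idea (not from the lecture; the sample pairs are hypothetical): the "learning" step below is just a least-squares line fit, but the pattern is the same — infer a rule from input-output samples, then apply it to unseen inputs.

import numpy as np

# Hypothetical training samples: inputs x and observed outputs y.
# No explicit rule is programmed; the mapping is inferred from the pairs.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])      # roughly y = 2x, but we pretend not to know that

# "Learning": fit a line y = a*x + b to the samples by least squares.
a, b = np.polyfit(x, y, deg=1)

# "Prediction": apply the learned rule to a previously unseen input.
print(a * 5.0 + b)                      # close to 10, although x = 5 was never observed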
Basic Paradigm of Machine Learning
❖Observe a set of samples known as training data.
❖Infer something about the process that generated the data.
❖Use this inference to make predictions on previously unseen test data.
Basic Paradigm of Machine Learning
❖Variations on the Paradigm:
▪ 1) Supervised Learning: Given a set of feature-label pairs, the goal is to find a
rule that predicts the label associated with a previously unseen input.
▪ 2) Unsupervised Learning: Given a set of feature vectors without labels, the
objective is to group them into “natural clusters.”
(Figure: "Supervised (w/ labels)" vs. "Unsupervised (w/o labels)")
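❖A minimal sketch of the two settings, assuming scikit-learn is available (the library and the toy data are assumptions, not part of the lecture):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Four 2-D feature vectors (e.g., [ear shape, nose size]).
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]

# Supervised: labels are given, so we learn a rule that predicts them.
y = ['cat', 'cat', 'dog', 'dog']
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[0.95, 0.9]]))       # -> ['dog']

# Unsupervised: no labels; the vectors are only grouped into "natural clusters".
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                       # e.g., [0 0 1 1] (cluster ids, not class names)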
Basic Paradigm
❖Examples of Two Variations in Machine Learning Techniques:
Supervised Learning
❖1) Classification: Predict a discrete value (label) associated with a
feature vector.
❖2) Regression: Predict a continuous value (real number) associated with
a feature vector.
How Should We Classify the Data?
❖We aim to determine the “similarity” of examples, with the goal of
predicting the label associated with a previously unseen input.
❖Similarity refers to the quality or
state of being similar, characterized
by likeness or resemblance, such as
a similarity of features.
❖Although similarity is difficult to define
precisely, in machine learning it is made
operational as a distance measure.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects.
The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2)
(Figure: example distance values computed between two objects, gene1 and gene2.)
Defining Distance Measures
❖Euclidean distance
❖Minkowski distance
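❖The formulas appeared as figures on the original slide; as a reminder of the standard definitions, a small sketch: the Euclidean distance is the square root of the sum of squared coordinate differences, and the Minkowski distance of order p is the p-th root of the sum of absolute differences raised to the power p (p = 2 gives Euclidean, p = 1 gives Manhattan).

def minkowskiDistance(v1, v2, p):
    # (sum of |v1_i - v2_i|**p) ** (1/p); p = 1 is Manhattan, p = 2 is Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1 / p)

def euclideanDistance(v1, v2):
    # Special case of the Minkowski distance with p = 2.
    return minkowskiDistance(v1, v2, 2)

print(euclideanDistance([1, 1, 1, 1, 0], [0, 1, 0, 1, 0]))   # 1.414...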
Attribute-Based Labeling through Distance Measure
❖For example, in the following scenario, similarity is determined based
on ear shape and nose size.
Feature Representation
❖Features never fully describe a situation.
▪ For example, ear shape and nose size alone cannot fully describe dogs or cats.
❖Feature engineering involves representing examples using feature
vectors, which facilitates generalization.
❖For instance, suppose you want to use 100 existing samples to predict
which students will receive an A in this course.
▪ Some features are undoubtedly helpful, such as GPA and prior programming
experience (though not perfect predictors).
▪ However, others might lead to overfitting, such as birth month or eye color.
An Example Process of Feature Representation
❖Initial model with 5 features
Name            | Egg-laying | Scales | Poisonous | Cold-blooded | # legs | Reptile (label)
Cobra           | True       | True   | True      | True         | 0      | Yes
Rattlesnake     | True       | True   | True      | True         | 0      | Yes
Boa constrictor | False      | True   | False     | True         | 0      | Yes
Chicken         | True       | True   | False     | False        | 2      | No
Alligator       | True       | True   | False     | True         | 4      | Yes
Dart frog       | True       | False  | True      | False        | 4      | No
Salmon          | True       | True   | False     | True         | 0      | No
Python          | True       | True   | False     | True         | 0      | Yes
→ The boa constrictor does not fit this model.
An Example Process of Feature Representation
❖Refined model with 3 features: scales, cold-blooded, 0 legs
Name            | Egg-laying | Scales | Poisonous | Cold-blooded | # legs | Reptile (label)
Cobra           | True       | True   | True      | True         | 0      | Yes
Rattlesnake     | True       | True   | True      | True         | 0      | Yes
Boa constrictor | False      | True   | False     | True         | 0      | Yes
Chicken         | True       | True   | False     | False        | 2      | No
Alligator       | True       | True   | False     | True         | 4      | Yes
Dart frog       | True       | False  | True      | False        | 4      | No
Salmon          | True       | True   | False     | True         | 0      | No
Python          | True       | True   | False     | True         | 0      | Yes
→ The alligator does not fit this model.
An Example Process of Feature Representation
❖Refined model with 3 features: scales, cold-blooded, 0 or 4 legs
Name            | Egg-laying | Scales | Poisonous | Cold-blooded | # legs | Reptile (label)
Cobra           | True       | True   | True      | True         | 0      | Yes
Rattlesnake     | True       | True   | True      | True         | 0      | Yes
Boa constrictor | False      | True   | False     | True         | 0      | Yes
Chicken         | True       | True   | False     | False        | 2      | No
Alligator       | True       | True   | False     | True         | 4      | Yes
Dart frog       | True       | False  | True      | False        | 4      | No
Salmon          | True       | True   | False     | True         | 0      | No
Python          | True       | True   | False     | True         | 0      | Yes
→ No (easy) way to classify the salmon and the python.
An Example Process of Feature Representation
❖Current model: scales, cold-blooded; not perfect but no false negatives
Name            | Egg-laying | Scales | Poisonous | Cold-blooded | # legs | Reptile (label)
Cobra           | True       | True   | True      | True         | 0      | Yes
Rattlesnake     | True       | True   | True      | True         | 0      | Yes
Boa constrictor | False      | True   | False     | True         | 0      | Yes
Chicken         | True       | True   | False     | False        | 2      | No
Alligator       | True       | True   | False     | True         | 4      | Yes
Dart frog       | True       | False  | True      | False        | 4      | No
Salmon          | True       | True   | False     | True         | 0      | No
Python          | True       | True   | False     | True         | 0      | Yes
→ Anything classified as "not reptile" is correctly labeled.
→ Some animals may be incorrectly labeled as reptiles.
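❖The "scales and cold-blooded" model above can be written as a tiny rule-based classifier (a sketch, not the lecture's code; the function name is made up):

def isReptile(eggLaying, scales, poisonous, coldBlooded, numLegs):
    # Current model: only 'scales' and 'cold-blooded' are used.
    # It produces no false negatives on the table above, but it mislabels the salmon.
    return scales and coldBlooded

print(isReptile(True, True, False, True, 0))    # salmon  -> True (a false positive)
print(isReptile(True, True, False, False, 2))   # chicken -> False (correct)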
Feature Engineering
❖We need a way to measure the distance between feature vectors.
❖This involves deciding which features to include and identifying those
that may add noise to the classifier.
❖Additionally, we must define how to measure the distance between training
examples, and later between a new instance and the examples the classifier was
built from.
❖Furthermore, we need to decide how to weigh the relative importance of the
different dimensions of the feature vector, since this changes the definition of
distance (one common approach is sketched below).
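❖One common way to encode that relative importance is a weighted Euclidean distance; a sketch (the weights here are invented purely for illustration):

import math

def weightedEuclidean(v1, v2, weights):
    # Each squared difference is scaled by a per-dimension weight before summing,
    # so heavily weighted features contribute more to the overall distance.
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, v1, v2)))

# Hypothetical weights that down-weight the '# legs' dimension (last entry).
weights = [1, 1, 1, 1, 1 / 16]
print(weightedEuclidean([1, 1, 1, 1, 0], [1, 1, 0, 1, 4], weights))   # ~1.414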
Measuring Distance between Animals
❖We can consider our animal samples as consisting of four binary
features and one integer feature.
Name            | Egg-laying | Scales | Poisonous | Cold-blooded | # legs | Reptile (label)
Rattlesnake     | True       | True   | True      | True         | 0      | Yes
Boa constrictor | False      | True   | False     | True         | 0      | Yes
Dart frog       | True       | False  | True      | False        | 4      | No

Rattlesnake     = [1, 1, 1, 1, 0]
Boa constrictor = [0, 1, 0, 1, 0]
Dart frog       = [1, 0, 1, 0, 4]
Euclidean Distance between Animals
❖One way to distinguish reptiles from non-reptiles is to measure the distance
between pairs of samples and then assign unlabeled samples to the same class as
the nearby samples they cluster with.
❖For example, using Euclidean distance, a rattlesnake and a boa
constrictor are much closer to each other than either is to a dart frog.
Rattlesnake     = [1, 1, 1, 1, 0]
Boa constrictor = [0, 1, 0, 1, 0]
Dart frog       = [1, 0, 1, 0, 4]

                | Rattlesnake | Boa constrictor | Dart frog
Rattlesnake     | --          | 1.414           | 4.243
Boa constrictor | 1.414       | --              | 4.472
Dart frog       | 4.243       | 4.472           | --
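❖The values in this table can be reproduced in a few lines (math.dist requires Python 3.8+):

import math

rattlesnake = [1, 1, 1, 1, 0]
boa         = [0, 1, 0, 1, 0]
dartFrog    = [1, 0, 1, 0, 4]

print(round(math.dist(rattlesnake, boa), 3))        # 1.414
print(round(math.dist(rattlesnake, dartFrog), 3))   # 4.243
print(round(math.dist(boa, dartFrog), 3))           # 4.472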
Add an Alligator
alligator = Animal('alligator', [1, 1, 0, 1, 4])
animals.append(alligator)
compareAnimals(animals, 3)
Add an Alligator
❖The alligator is closer to the dart frog than to the snakes. Why?
▪ The alligator differs from the dart frog in three features, whereas it differs from
the boa constrictor in only two features.
▪ However, the scale for the “legs” feature ranges from 0 to 4, while the scales
for the other features range from 0 to 1.
▪ As a result, the “legs” dimension dominates the distance (a quick numeric check
follows the table).
                | Rattlesnake | Boa constrictor | Dart frog | Alligator
Rattlesnake     | --          | 1.414           | 4.243     | 4.123
Boa constrictor | 1.414       | --              | 4.472     | 4.123
Dart frog       | 4.243       | 4.472           | --        | 1.732
Alligator       | 4.123       | 4.123           | 1.732     | --
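❖A quick check of the arithmetic makes this concrete: the single 0-vs-4 "legs" difference contributes 16 to the sum of squared differences, swamping the 0/1 features.

import math

# Alligator vs. boa constrictor: one 0/1 difference plus the 0-vs-4 legs difference.
print(math.sqrt(1**2 + 4**2))            # 4.123...

# Alligator vs. dart frog: three 0/1 differences, identical leg counts.
print(math.sqrt(1**2 + 1**2 + 1**2))     # 1.732...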
Using Binary Features
❖Now, the alligator is closer to snakes than it is to the dart frog.
❖This highlights the importance of feature engineering!
With # legs (0-4):                        With binary "has legs" (0/1):
Rattlesnake     = [1, 1, 1, 1, 0]         Rattlesnake     = [1, 1, 1, 1, 0]
Boa constrictor = [0, 1, 0, 1, 0]         Boa constrictor = [0, 1, 0, 1, 0]
Dart frog       = [1, 0, 1, 0, 4]         Dart frog       = [1, 0, 1, 0, 1]
Alligator       = [1, 1, 0, 1, 4]         Alligator       = [1, 1, 0, 1, 1]

                | Rattlesnake | Boa constrictor | Dart frog | Alligator
Rattlesnake     | --          | 1.414           | 1.732     | 1.414
Boa constrictor | 1.414       | --              | 2.236     | 1.414
Dart frog       | 1.732       | 2.236           | --        | 1.732
Alligator       | 1.414       | 1.414           | 1.732     | --
Exercise: Classifying Reptiles
Exercise #1: Reptile Classification
❖Develop a reptile classification model.
▪ Create a script that categorizes animals as reptiles or non-reptiles based on five
features: egg-laying, scales, poisonous, cold-blooded, and number of legs.
▪ Use Euclidean distance to calculate the similarity between animals and display
the results as a distance table like the ones shown earlier.
Exercise #1: 1) Define the Animal Class
❖Define the Animal class, which should include a feature vector and a
method for measuring the distance between features.
import math
import numpy

class Animal(object):
    def __init__(self, name, features):
        # Assume name is a string and features is a list of numbers
        self.name = name
        self.features = numpy.array(features)
    def getFeatures(self):
        # Accessor used by distance()
        return self.features
    def distance(self, other):
        # Return the Euclidean distance between the feature vectors of self and other
        return math.dist(self.getFeatures(), other.getFeatures())
    …
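❖A quick usage check for the class above (the cobra and python vectors are taken from the earlier table):

cobra  = Animal('cobra',  [1, 1, 1, 1, 0])
python = Animal('python', [1, 1, 0, 1, 0])
print(cobra.distance(python))   # 1.0 — they differ only in the 'poisonous' feature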
Exercise #1: 2) Calculate Similarity between Animals
❖Define a function that computes the similarity between animals.
def compareAnimals(animals, precision):
    …
    # Get distances between pairs of animals
    for a1 in animals:                    # For each row
        row = []
        for a2 in animals:                # For each column
            if a1 == a2:
                row.append('--')
            else:
                distance = a1.distance(a2)
                row.append(str(round(distance, precision)))
        tableVals.append(row)
    …
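❖The elided parts of compareAnimals presumably build the displayed table; as a minimal text-only alternative sketch, the same numbers can be printed as a plain distance matrix (the printing format here is an assumption, not the lecture's code):

def compareAnimalsText(animals, precision):
    # Print a plain-text distance matrix instead of drawing a table figure.
    names = [a.name for a in animals]
    print('\t' + '\t'.join(names))
    for a1 in animals:
        row = [a1.name]
        for a2 in animals:
            if a1 is a2:
                row.append('--')
            else:
                row.append(str(round(a1.distance(a2), precision)))
        print('\t'.join(row))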
Exercise #1: 3) Add Samples
❖Provide animal samples to calculate their similarities.
rattlesnake = Animal('rattlesnake', [1, 1, 1, 1, 0])
boa = Animal('boa_constrictor', [0, 1, 0, 1, 0])
dartFrog = Animal('dart frog', [1, 0, 1, 0, 4])
animals = [rattlesnake, boa, dartFrog]
compareAnimals(animals, 3)
alligator = Animal('alligator', [1, 1, 0, 1, 4])
animals.append(alligator)
compareAnimals(animals, 3)
Exercise #1: 4) Improve Features
❖Change “Number of Legs” to a boolean indicating leg presence, and
recalculate the animal similarities.
rattlesnake = Animal('rattlesnake', [1, 1, 1, 1, 0])
boa = Animal('boa_constrictor', [0, 1, 0, 1, 0])
dartFrog = Animal('dart frog', [1, 0, 1, 0, 1])
alligator = Animal('alligator', [1, 1, 0, 1, 1])
animals = [rattlesnake, boa, dartFrog, alligator]
compareAnimals(animals, 3)
Thank you for your hard work!