
Exercises to ML, DHBW Stuttgart – WS2020

Homework / Exercises to the Lecture "ML - Concepts & Algorithms"

by

Dr. Hermann Völlinger and Others

Status: 22 December 2022

Goal: Documentation of all solutions to the homework/exercises in the lecture "ML - Concepts & Algorithms".

Contents
Numbers of Exercises per Chapter .................................................................................................... 3
Links to Further Literature: .................................................................................................................. 3
Exercises to Lesson ML0: General Remarks and Goals of Lecture (ML) ................................... 4
Homework H0.1- “Three Categories of Machine Learning” ................................................ 4
Exercises to Lesson ML1: Introduction to Machine Learning (ML) ............................................... 6
Homework H1.1 - “Most Popular ML Technologies + Products” ....................................... 6
Homework H1.2 - “Ethics in Artificial Intelligence” ............................................................. 17
Homework H1.3 (optional)- “Create Painting with DeepArt” ............................................ 20
Homework H1.4 (optional) - Summary of video “What is ML?” ....................................... 20
Homework H1.5 (optional)– Summary of video “Supervised- & Unsupervised-
Learning” ........................................................................................................................................ 20
Exercises to Lesson ML2: Concept Learning: Version Spaces & Candidate Elimination ....... 33
Homework H2.1 – "Version Space for EnjoySport" ............................................................. 33
Homework H2.2 – "Version Space – Second example" ...................................................... 33
Exercises to Lesson ML3: Supervised and Unsupervised Learning .......................................... 34
Homework H3.1 - “Calculate Value Difference Metric”....................................................... 34
Homework H3.2 – “Bayes Learning for Text Classification” ............................................ 35
Homework H3.3 (advanced)* – “Create in IBM Cloud two services Voice Agent and
Watson Assistant Search Skill with IBM Watson Services” ............................................. 41
Homework H3.4* – “Create a K-Means Clustering in Python” ......................................... 47
Homework H3.5 – “Repeat + Calculate Measures for Association” ............................... 55
Exercises to Lesson ML4: Decision Tree Learning ....................................................................... 60
Homework H4.1 - “Calculate ID3 and CART Measures”..................................................... 60


Homework H4.2 - “Define the Decision Tree for UseCase “Predictive Maintenance”
(slide p.77) by calculating the GINI Indexes” ........................................................................ 75
Homework H4.3* - “Create and describe the algorithm to automate the calculation of
the Decision Tree for UseCase “Predictive Maintenance” ................................................ 80
Homework H4.4* - “Summary of the Article … prozessintegriertes
Qualitätsregelungssystem…” ................................................................................................... 84
Homework H4.5* - “Create and describe the algorithm to automate the calculation of
the Decision Tree for the Use Case “Playing Tennis” using ID3 method” .................. 87
Exercises to Lesson ML5: simple Linear Regression (sLR) & multiple Linear Regression
(mLR) .................................................................................................................................................... 91
Homework H5.1 - “sLR manual calculations of R² & Jupyter Notebook (Python)” .... 91
Homework H5.2*- “Create a Python Pgm. for sLR with Iowa Houses Data” ................ 97
Homework H5.3 – “Calculate Adj.R² for MR” ........................................................................ 98
Homework H5.4 - “mLR (k=2) manual calculations of Adj.R² & Jupyter Notebook
(Python) to check results” ......................................................................................................... 99
Homework H5.5* - Decide (SST=SSE+SSR) => optimal sLR- line? ............................... 105
Exercises to Lesson ML6: Convolutional Neural Networks (CNN) ........................................... 106
Homework H6.1 – “Power Forecasts with CNN in UC2” .................................................. 106
Homework H6.2 – “Evaluate AI Technology of UC3” ........................................................ 106
Homework H6.3* – “Create Summary to GO Article”........................................................ 106
Homework H6.4* – “Create Summary to BERT Article” ................................................... 106
Exercises to Lesson ML7: BackPropagation for Neural Networks ........................................... 110
Homework H7.1 – “Exercise of an Example with Python” .............................................. 110
Homework H7.2 – “Exercise of an Example with Python” .............................................. 110
Exercises to Lesson ML8: Support Vector Machines (SVM) ..................................................... 111
Homework H8.1 – “Exercise of an Example with Python” .............................................. 111
Homework H8.2 – “Exercise of an Example with Python” .............................................. 111
Homework H8.3 – “Exercise of an Example with Python” .............................................. 111
Homework H8.4 – “Exercise of an Example with Python” .............................................. 111


Numbers of Exercises per Chapter


Counting the exercises in this document for each chapter of the lecture gives the following result:

Links to Further Literature:


1. [HVö-3]: Hermann Völlinger: MindMap of the Lecture "Machine Learning:
Concepts & Algorithms"; DHBW Stuttgart; WS2020
2. [HVö-5]: Hermann Völlinger: Script of the Lecture "Machine Learning:
Concepts & Algorithms"; DHBW Stuttgart; WS2020
3. [HVö-6]: Hermann Völlinger: GitHub to the Lecture "Machine Learning:
Concepts & Algorithms"; see in: https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020


Exercises to Lesson ML0: General Remarks and Goals of


Lecture (ML)
Homework H0.1 - "Three Categories of Machine Learning"

Groupwork (2 Persons). Compare the differences of the three categories, see slide
"goal of lecture (2/2)":

1. Supervised Learning (SVL)

2. Unsupervised Learning (USL)

3. Reinforcement Learning (RIF)

See the information on the internet, for example the following link:

https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f

Give short descriptions of the categories and explain the differences (~5 minutes for each category).
First Solution:


Second Solution: R. Scholz, N. Breuninger; WS2020


Exercises to Lesson ML1: Introduction to Machine Learning


(ML)
Homework H1.1 - “Most Popular ML Technologies + Products”

Groupwork (3 Persons). Look at the three most used ML technologies/products (see information on the internet):
1. IBM Watson Machine Learning - https://www.ibm.com/cloud/machine-learning
2. Microsoft Azure ML Studio - https://azure.microsoft.com/en-us/services/machine-learning-studio/
3. Google Cloud Machine Learning Platform - https://cloud.google.com/ml-engine/docs/tensorflow/technical-overview

Give a short overview of the products and their features (~10 minutes each) and give a comparison matrix of the 3 products and an evaluation. What is your favorite product? (~5 minutes)
First Solution:




Second Solution:




Third Solution: R. Mader, N. Bross, S Yurttadur; WS2020:






Homework H1.2 - “Ethics in Artificial Intelligence”


Groupwork (2 Persons) - Evaluate the interview with Carsten Kraus (founder of Omikron, Pforzheim, Germany): „Deep Neural Networks könnten eigene Moralvorstellungen entwickeln" ("Deep neural networks could develop their own moral concepts").
https://ecommerce-news-magazin.de/e-commerce-news/e-commerce-interviews/interview-mit-carsten-kraus-deep-neural-networks-koennten-eigene-moralvorstellungen-entwickeln/
The victory of the Google-developed DeepMind software AlphaGo against the South Korean Go world champion Lee Sedol does not simply ring in the next round of the industrial revolution. According to IT expert Carsten Kraus, the era of superiority of Deep Neural Networks (DNN) over human intelligence has now begun.

Solution: B. Storz, L. Mack; WS2020:




Homework H1.3 (optional) - "Create Painting with DeepArt"

1 Person – Create your own painting by using the DeepArt service (company in Tübingen, https://deepart.io/). What ML method did you use to create the "paintings"?
Solutions:

Homework H1.4 (optional) - Summary of video "What is ML?"

1 Person - Summarize the results of the first YouTube video "What is Machine Learning" by Andrew Ng in a report of 10 minutes. Create a small PowerPoint presentation. See: https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN

Solutions:

Homework H1.5 (optional) – Summary of videos "Supervised- & Unsupervised-Learning"

Groupwork (2 Persons) - Summarize the results of the second and third YouTube videos "Supervised Learning" and "Unsupervised Learning" by Andrew Ng in a report of 15 minutes. Create a small PowerPoint presentation. See:
https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN


Solutions:









Second Solution:





Exercises to Lesson ML2: Concept Learning: Version Spaces &


Candidate Elimination
Homework H2.1 – "Version Space for EnjoySport"
Create the Version Space for the EnjoySport concept learning problem with the training examples in the following table; see [TMitch], Ch. 2 or
https://www.youtube.com/watch?v=cW03t3aZkmE

Solutions:
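The solution slides are images and are not reproduced here. As an illustration (not the submitted solution), here is a minimal Candidate-Elimination sketch in Python for the four standard EnjoySport training examples from [TMitch], Ch. 2; '?' matches any value and '0' is the empty value, as in the lecture:

# Candidate Elimination for the EnjoySport examples from [TMitch], Ch. 2.
# A hypothesis is a tuple over the six attributes; '?' matches any value,
# '0' matches nothing (the initial, most specific hypothesis).

POSITIVE, NEGATIVE = "Yes", "No"

# (Sky, AirTemp, Humidity, Wind, Water, Forecast), EnjoySport
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), POSITIVE),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), POSITIVE),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), NEGATIVE),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), POSITIVE),
]
n = len(examples[0][0])
domains = [sorted({x[i] for x, _ in examples}) for i in range(n)]

def covers(h, x):
    # True if hypothesis h matches instance (or more specific hypothesis) x
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

def generalize(s, x):
    # minimal generalization of s so that it covers the positive example x
    return tuple(xv if sv == "0" else (sv if sv == xv else "?")
                 for sv, xv in zip(s, x))

def specialize(g, x):
    # minimal specializations of g that exclude the negative example x
    return [g[:i] + (v,) + g[i + 1:]
            for i in range(n) if g[i] == "?"
            for v in domains[i] if v != x[i]]

S = ("0",) * n        # specific boundary (stays a single hypothesis here)
G = [("?",) * n]      # general boundary

for x, label in examples:
    if label == POSITIVE:
        G = [g for g in G if covers(g, x)]
        S = generalize(S, x)
    else:
        G = [s for g in G if covers(g, x) for s in specialize(g, x)
             if covers(s, S)] + [g for g in G if not covers(g, x)]

print("S boundary:", S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print("G boundary:", G)   # [('Sunny', ?, ...), (?, 'Warm', ?, ...)]

The final boundaries agree with Mitchell's book: S = {⟨Sunny, Warm, ?, Strong, ?, ?⟩} and G = {⟨Sunny, ?, ?, ?, ?, ?⟩, ⟨?, Warm, ?, ?, ?, ?⟩}.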

Homework H2.2– “Version Space – Second example*********”

*********** placeholder********************

Solutions:
….


Exercises to Lesson ML3: Supervised and Unsupervised


Learning

Homework H3.1 - “Calculate Value Difference Metric”


Calculate d := Value Difference Metric (VDM) for the fields "Refund" and "Marital Status". Remember the following formula and see also the details of VDM on the internet (1 person, 10 minutes):

Solutions:
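The formula and the data table on the slide are images and are not reproduced here. As a stand-in, the following minimal sketch computes the VDM d(v1, v2) = Σc |P(c|v1) − P(c|v2)|^q for one categorical attribute; the ten (Refund, Marital Status, Cheat) records below are an assumed example table, not necessarily the lecture's:

from collections import Counter

# (refund, marital_status, cheat) -- hypothetical stand-in records
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"),
    ("No", "Single", "No"), ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),
    ("No", "Married", "No"), ("No", "Single", "Yes"),
]
classes = sorted({c for *_, c in records})

def vdm(attr_index, v1, v2, q=1):
    # VDM distance between two values of one attribute:
    # sum over classes of |P(c|v1) - P(c|v2)|^q
    def cond_probs(v):
        rows = [r for r in records if r[attr_index] == v]
        counts = Counter(r[-1] for r in rows)
        return {c: counts[c] / len(rows) for c in classes}
    p1, p2 = cond_probs(v1), cond_probs(v2)
    return sum(abs(p1[c] - p2[c]) ** q for c in classes)

print("d(Refund=Yes, Refund=No) =", vdm(0, "Yes", "No"))
print("d(Single, Married)       =", vdm(1, "Single", "Married"))
print("d(Single, Divorced)      =", vdm(1, "Single", "Divorced"))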


Homework H3.2 – “Bayes Learning for Text Classification”

1 Person: Review the example about Bayes Learning in this lesson. Use the same training data as in the lesson together with the newly tagged text. Run the Bayes text-classification calculation for the sentence "Hermann plays a TT match" and tag this sentence.

Additional question: What will happen if we change the target to "Hermann plays a very clean game"?
Optional* (1 P.): Define an algorithm in Python (use a Jupyter Notebook) to automate the calculations. Use the description under: https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b

Solution: by A. Gholami, J. Schwarz; ML-Lecture WS2020



Solution to Optional: by A. Gholami, J. Schwarz; ML-Lecture WS2020

1 Naive Bayes Text Classification


We made a simple algorithm to try and classify sentences into either Sports or Not Sports sentences. We start with a couple of sentences classed either "Sports" or "Not Sports" and try to classify new sentences based on that. At the end we compare which class ("Sports" or "Not Sports") the new sentence is more likely to end up in.
1.1 What happens here:
1. Import everything we need.
2. Provide training data and do transformations.
3. Create dictionaries and count the words in each class.
4. Calculate the probabilities of the words.
To evaluate a new sentence…
5. Vectorize and transform all sentences.
6. Count all words.
7. Transform the new sentence.
8. Perform Laplace smoothing, so we don't multiply by 0.
9. Calculate the probability of the new sentence for each class.
10. Output what's more likely.

[1]: # This notebook was created by Alireza Gholami and Jannik Schwarz
# Importing everything we need
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
# Import library time to check execution with date + time information
import time
# check versions of libraries
print('pandas version is: {}'.format(pd.__version__))
import sklearn
print('sklearn version is: {}'.format(sklearn.__version__))

[2]: # Naming the columns
columns = ['sentence', 'class']
# Our training data
rows = [['A great game', 'Sports'],
        ['The election was over', 'Not Sports'],
        ['Very clean match', 'Sports'],
        ['A clean but forgettable game', 'Sports'],
        ['It was a close election', 'Not Sports'],
        ['A very close game', 'Sports']]

# the data inside a dataframe
training_data = pd.DataFrame(rows, columns=columns)
print(f'The training data:\n{training_data}\n')

[3]: # Turns the data into vectors
def vectorisation(my_class):
    # my_docs contains the sentences for a class (sports or not sports)
    my_docs = [row['sentence'] for index, row in training_data.iterrows()
               if row['class'] == my_class]
    # creates a vector that counts the occurrence of words in a sentence;
    # the token pattern is set so that one-letter words like 'a' are kept
    my_vector = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    # transform the sentences
    my_x = my_vector.fit_transform(my_docs)
    # tdm = term document matrix | create the matrix with the vectors for a class
    tdm = pd.DataFrame(my_x.toarray(), columns=my_vector.get_feature_names())
    return tdm, my_vector, my_x

[4]: # Here we are actually creating the matrices for sport and not-sport sentences
tdm_sport, vector_sport, X_sport = vectorisation('Sports')
tdm_not_sport, vector_not_sport, X_not_sport = vectorisation('Not Sports')
print(f'Sport sentence matrix: \n{tdm_sport}\n')
print(f'Not sport sentence matrix: \n{tdm_not_sport}\n')
print(f'Amount of sport sentences: {len(tdm_sport)}')
print(f'Amount of not sport sentences: {len(tdm_not_sport)}')
print(f'Total amount of sentences: {len(rows)}')

[5]: # creates a dictionary for each class
def make_list(my_vector, my_x):
    my_word_list = my_vector.get_feature_names()
    my_count_list = my_x.toarray().sum(axis=0)
    my_freq = dict(zip(my_word_list, my_count_list))
    return my_word_list, my_count_list, my_freq

[6]: # create lists
# word_list_sport = word list ['a', 'but', 'clean', 'forgettable', 'game', 'great', 'match', 'very']
# count_list_sport = occurrence of words [2 1 2 1 2 1 1 1]
# freq_sport = combining the two to create a dictionary
word_list_sport, count_list_sport, freq_sport = make_list(vector_sport, X_sport)
word_list_not_sport, count_list_not_sport, freq_not_sport = make_list(vector_not_sport, X_not_sport)
print(f'sport dictionary: \n{freq_sport}\n')
print(f'not sport dictionary: \n{freq_not_sport}\n')

[7]: # calculate the probability of a word in a sentence of a class
def calculate_prob(my_word_list, my_count_list):
    my_prob = []
    for my_word, my_count in zip(my_word_list, my_count_list):
        my_prob.append(my_count / len(my_word_list))
    prob_dict = dict(zip(my_word_list, my_prob))
    return prob_dict

[8]: # probabilities of the words in a class


prob_sport_dict = calculate_prob(word_list_sport, count_list_sport)
prob_not_sport_dict = calculate_prob(word_list_not_sport, count_list_not_sport)
print(f'probabilities of words in sport sentences: \n{prob_sport_dict}\n')
print(f'probabilities of words in not sport sentences: \n{prob_not_sport_dict}')

[9]: # all sentences again
docs = [row['sentence'] for index, row in training_data.iterrows()]
# vectorizer
vector = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
# transform the sentences
X = vector.fit_transform(docs)
# counting the words
total_features = len(vector.get_feature_names())
total_counts_features_sport = count_list_sport.sum(axis=0)
total_counts_features_not_sport = count_list_not_sport.sum(axis=0)
print(f'Amount of distinct words: {total_features}')
print(f'Total word count in sport sentences: {total_counts_features_sport}')
print(f'Total word count in not sport sentences: {total_counts_features_not_sport}')

[10]: # a new sentence


new_sentence = 'Hermann plays a TT match'
# gets tokenized
new_word_list = word_tokenize(new_sentence)

[11]: # We're using Laplace smoothing: if a new word occurred, its probability would be 0,
# so every word counter gets incremented by one
def laplace(freq, total_count, total_feat):
    prob_sport_or_not = []
    for my_word in new_word_list:
        if my_word in freq.keys():
            counter = freq[my_word]
        else:
            counter = 0
        # total_count is the number of words in the class's sentences and
        # total_feat the total number of distinct words
        prob_sport_or_not.append((counter + 1) / (total_count + total_feat))
    return prob_sport_or_not

[12]: # probability for the new words
prob_new_sport = laplace(freq_sport, total_counts_features_sport, total_features)
prob_new_not_sport = laplace(freq_not_sport, total_counts_features_not_sport, total_features)
print(f'probability that the word is in a sport sentence: {prob_new_sport}')
print(f'probability that the word is in a not sport sentence: {prob_new_not_sport}')

[13]: # multiplying the probabilities of each word
new_sport = list(prob_new_sport)
sport_multiply_result = 1
for i in range(0, len(new_sport)):
    sport_multiply_result *= new_sport[i]
# multiplying the result with the ratio of sport sentences to the total amount of sentences (here: 4/6)
sport_multiply_result *= (len(tdm_sport) / len(rows))

# multiplying the probabilities of each word
new_not_sport = list(prob_new_not_sport)
not_sport_multiply_result = 1
for i in range(0, len(new_not_sport)):
    not_sport_multiply_result *= new_not_sport[i]
# multiplying the result with the ratio of not-sport sentences to the total amount of sentences (here: 2/6)
not_sport_multiply_result *= (len(tdm_not_sport) / len(rows))

[14]: # comparing what's more likely
print(f'The probability of the sentence "{new_sentence}":\n'
      f'Sport vs not sport\n{sport_multiply_result} vs {not_sport_multiply_result}\n\n')
if not_sport_multiply_result < sport_multiply_result:
    print("Verdict: It's probably a sports sentence!")
else:
    print("Verdict: It's probably not a sports sentence!")

[15]: # print current date and time


print("Date & Time:",time.strftime("%d.%m.%Y %H:%M:%S"))
print ("*** End of Homework-H3.2_Bayes-Learning... ***")

Homework H3.3 (advanced)* – “Create in IBM Cloud two services Voice


Agent and Watson Assistant Search Skill with IBM Watson Services”

Homework for 2 Persons: Log in to the IBM Cloud and follow the tutorial descriptions (see links):
1. "Voice Agent" (1 person)
a. Set up the required IBM Cloud services
b. Configure the TWILIO account
c. Configure the Voice Agent on the IBM Cloud and import a skill by uploading either
• skill-banking-balance-enquiry.json or
• skill-pizza-order-book-table.json

See tutorial: https://github.com/FelixAugenstein/digital-tech-tutorial-voice-agent

2. “Assistant Search Skill” (1 person)


a. Configuring Watson Assistant & Discovery Service on the IBM Cloud
b. Configuring Watson Assistant & Search Skill on the IBM Cloud
c. Deploy the Assistant with Search Skill

See tutorial:
https://github.com/FelixAugenstein/digital-tech-tutorial-watson-assistant-search-skill

Remark: You can integrate the two skills, such that when the dialog skill has no answer you show the search results. The reading of texts from the search results of the search skill is unfortunately not (yet) possible. Watson can only display the search result with title/description etc. as on Google. The tutorial in the cloud docs on the same topic is also helpful: https://cloud.ibm.com/docs/assistant?topic=assistant-skill-search-add

Solutions:
Ad1: by Hermann Völlinger; 12.3.2020

For creating a "voice agent" I activated the 4 services "Speech2Text", "Text2Speech", "Voice Agent" and "Watson Assistant" on IBM Watson. See the following screenshot:

Next, you have to configure a Twilio account, including the steps:


1. Register for Twilio and Start a free Trial.
2. Confirm your email.
3. Verify your phone number. Therefore, use the phone number you will use to call
the Watson Voice Agent later on.

You link the phone-number with your solution “Watson-Voice Agent Tutorial”, see:


Finally, you can see the final configuration by opening the service app “Watson-Voice
Agent Tutorial”. See the following screenshot:

By opening the Watson Assistant, we see all available solutions, i.e. dialog- and
search skills. Under “my second assistant” we see the two dialog skills “hermann
skill” and “voice”:

After opening "voice" we see all intents (number = 12). Some were imported by the JSON file. Others were created by myself, like #machine, #FirstExample or #SecondExample:


You can define questions (see #machine) and also answers of the voice assistant
(“chatbot”):

So one gets the final flow chart of the dialog skill for the Voice Agent "voice". See here the response to the question "What is Machine Learning?":


Similarly, you see here the logic of the question "What is my Balance?":

Ad 2: By Niklas Gysinn & Maximilian Wegmann, DHBW Stg. SS2020 (4.3.2020)
Creating a Watson Search (Discovery) Skill using the IBM Cloud
Source used: https://github.com/FelixAugenstein/digital-tech-tutorial-watson-assistant-search-skill


First of all, we created two services. One service for crawling and indexing the
website information and one for providing the assistant functionality.

The discovery service uses various news sites (e.g. German “Tagesschau”) to
retrieve the latest articles and make them available to the assistant.


This information can then be accessed via a "chat" provided by the IBM Watson
Assistant service.
Homework H3.4* – “Create a K-Means Clustering in Python”

Homework for 2 Persons: Create a Python algorithm (in a Jupyter Notebook) which clusters the following points:

Following the description of https://benalexkeen.com/k-means-clustering-in-python/

to arrive at 3 clear clusters with 3 means at the center of these clusters: We'll do this manually first (1 person), then show how it's done using scikit-learn (1 person).


Solutions: by L. Krauter and M. Limbacher; ML Lecture - WS2020

1 Create a K-Means Clustering Algorithm in Python


By: Markus Limbacher & Lucas Krauter; 20 October 2020
This solves Homework H3.4 from the lecture "Machine Learning - Concepts & Algorithms", DHBW Stuttgart, WS2020.
Following the implementation of Ben Keen (2017) from https://benalexkeen.com/k-means-clustering-in-python/
1.1 Content
This notebook is split into three parts: 1. Section 1.2: preparations 2. Section 1.3: program each step manually 3. Section 1.4: use the scikit-learn library to run the algorithm
1.1.1 Summary of the K-Means Algorithm:
1. Select random starting points (one for each cluster) = centroids
2. Assign each datapoint to its closest centroid
3. Use the new mean of each cluster as its new centroid
4. Repeat steps 2 and 3 until no more modifications to the centroids are made

1.2 Preparations
1.2.1 Import of libraries
The first step is to import the necessary library packages.

[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import copy
import sklearn as sk
from sklearn.cluster import KMeans
# to check the time of execution, import function time
import time
# check versions of libraries
print('pandas version is: {}'.format(pd.__version__))
print('numpy version is: {}'.format(np.__version__))
print('sklearn version is: {}'.format(sk.__version__))

1.2.2 Dataset
The second step is defining data to work with. The data frame contains two arrays of x and y
coordinates. These build several points in a two-dimensional space.

[2]: # Definition of Dataset (see Homework H3.4)
df = pd.DataFrame({'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
                   'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]})
# Check that the definition of the dataset is OK
print("**** data frame ****")
print("First column = No.")
print(df)

**** data frame ****
First column = No.
     x   y
0   12  39
1   20  36
2   28  30
3   18  52
4   29  54
5   33  46
6   24  55
7   45  59
8   45  63
9   52  70
10  51  66
11  52  63
12  55  58
13  53  23
14  55  14
15  61   8
16  64  19
17  69   7
18  72  24

1.3 K-Means manually


Start with selecting the count of clusters k. Select one random Starting Point i for each cluster.
These center points are called centroids.

[3]: # Number of clusters ==> k
k = 3
np.random.seed(42)
# centroids[i] = [x, y]
centroids = {
    i + 1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}

1.3.1 Display dataset


Print the centroids and the values of the data frame in a two-dimensional coordinate system.

[4]: fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

1.3.2 Assignment Stage

Assign each datapoint to its closest centroid. Since this step will be repeated, we write a function. The distance between two points [x1, y1] and [x2, y2] is calculated by the following formula: d = √((x1 − x2)² + (y1 − y2)²)

[5]: # Function to determine the closest centroid for the dataset df
def assignment(df, centroids):
    # Iterating over every centroid in centroids
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt((df['x'] - centroids[i][0]) ** 2 + (df['y'] - centroids[i][1]) ** 2))
    # select and save the closest centroid for each datapoint
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    df['closest'] = df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    # select the color of the cluster depending on the centroid
    df['color'] = df['closest'].map(lambda x: colmap[x])
    # return data frame with additional information
    return df

# call assignment function
df = assignment(df, centroids)
print(df)
x y distance_from_1 distance_from_2 distance_from_3 closest color
0 12 39 46.324939 62.625873 35.902646 3 b
1 20 36 38.013156 56.364883 38.000000 3 b
2 28 30 28.017851 52.430907 44.721360 1 r
3 18 52 50.328918 53.600373 22.090722 3 b
4 29 54 45.650849 42.426407 21.931712 3 b
5 33 46 36.715120 40.496913 30.870698 3 b
6 24 55 49.091751 47.265209 19.416488 3 b
7 45 59 45.398238 26.019224 29.154759 2 g
8 45 63 49.365980 26.172505 27.313001 2 g
9 52 70 56.008928 21.470911 32.249031 2 g
10 51 66 52.000000 20.880613 32.015621 2 g
11 52 63 49.010203 19.235384 33.837849 2 g
12 55 58 44.181444 16.124515 38.483763 2 g
13 53 23 9.219544 41.146081 60.745370 1 r
14 55 14 4.000000 48.703183 69.462220 1 r
15 61 8 11.661904 52.952809 77.698134 1 r
16 64 19 13.928388 41.593269 70.434367 1 r
17 69 7 19.313208 53.037722 83.006024 1 r
18 72 24 23.259407 36.013886 72.138755 1 r

1.3.3 Display modified dataset with color assigned to closest centroid.


Create a function to display the new data frame with the additional information. Draw each cluster
in a different color.

[6]: # Function to display the data frame
def displayDataset(df, centroids):
    fig = plt.figure(figsize=(5, 5))
    # display data frame
    plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
    # display each centroid
    for i in centroids.keys():
        plt.scatter(*centroids[i], color=colmap[i])
    plt.xlim(0, 80)
    plt.ylim(0, 80)
    plt.show()

# invoke display function
displayDataset(df, centroids)

1.3.4 Update Stage


Update the position of the centroids of the cluster. For the purpose of tracking the difference
between the positions the old positions will be saved in old_centroids. The update function
calculates a new mean of each cluster for its new centroid.

[7]: # Copies current centroids for demonstration purposes
old_centroids = copy.deepcopy(centroids)

# Calculate the mean of each separate cluster as the new centroid position
def update(k):
    # for each centroid
    for i in centroids.keys():
        # calculate and save the new mean
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return k

# start update
centroids = update(centroids)

1.3.5 Display updated centroids


Display the new positions of the centroids. The change of positions is indicated with arrows.

[8]: fig = plt.figure(figsize=(5, 5))
ax = plt.axes()
# draw datapoints
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
# draw centroids
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
# add arrows
for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()
plt.show()

1.3.6 Repeat Assignment


Repeat the assignment stage with the new centroid positions.
[9]: # assign closest centroid to each point in the dataframe
df = assignment(df, centroids)
# Plot results
displayDataset(df, centroids)

1.3.7 Repeat Assignment and Update Steps


Repeat the previous steps until there is no more modification in the assignment of the closest
centroids.

[10]: # Loop until no assignment changes anymore
while True:
    # copy the old centroid assignment
    closest_centroids = df['closest'].copy(deep=True)
    # calculate the new means of each cluster
    centroids = update(centroids)
    # assign each datapoint to its nearest centroid
    df = assignment(df, centroids)
    # if the old assignment equals the new one => no modification made => exit loop
    if closest_centroids.equals(df['closest']):
        break

# display result
displayDataset(df, centroids)

1.4 K-Means using scikit-learn


Use the scikit-learn k-Means implementation to build the clusters of the data frame.
Preparations:
Create the same data frame as above so that it is fresh.

[11]: # Dataset
df = pd.DataFrame({
'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24] })

1.4.1 K-Means training



Invoke the imported k-Means constructor with the number of clusters (here 3). Then train the
model with the dataset.

[12]: # invoke constructor


kmeans = KMeans(n_clusters=3)
# Fitting K-Means model
print(kmeans.fit(df))

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,


n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)

1.4.2 K-Means prediction


Use the model to calculate a prediction for the same data frame. Each datapoint will be labeled with its chosen cluster.

[13]: # create label for each datapoint in data frame


labels = kmeans.predict(df)
# save centroids of each cluster
centroids = kmeans.cluster_centers_

1.4.3 Display the result


Display the positions of the centroids and the data frame. The color depends on the assigned label for each datapoint.
[14]: # Display result
fig = plt.figure(figsize=(5, 5))
# set color for each datapoint
colmap = {1: 'b', 2: 'g', 3: 'r'}
colors = list(map(lambda x: colmap[x + 1], labels))
# draw each datapoint
plt.scatter(df['x'], df['y'], color=colors, alpha=0.5, edgecolor='k')
# draw each centroid
for idx, centroid in enumerate(centroids):
    plt.scatter(*centroid, color=colmap[idx + 1])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

[15]: # print current date and time


print("date & time:",time.strftime("%d.%m.%Y %H:%M:%S"))
print ("*** End of Homework-H3.4_k-Means_Clustering ***")
date & time: 19.10.2020 17:44:45

*** End of Homework-H3.4_k-Means_Clustering ***

Homework H3.5 – “Repeat + Calculate Measures for Association”

1. Remember and give explanations of the measures for association: support, confidence and lift (1 Person, 10 min).
2. Calculate the measures for the following 8 item sets of a shopping basket (1 person, 10 min):
{ Milch, Limonade, Bier }; { Milch, Apfelsaft, Bier }; { Milch, Apfelsaft, Orangensaft }; { Milch, Bier, Orangensaft, Apfelsaft }; { Milch, Bier }; { Limonade, Bier, Orangensaft }; { Orangensaft }; { Bier, Apfelsaft }
a. What is the support of the item set { Bier, Orangensaft }?
b. What is the confidence of { Bier } => { Milch }?
c. Which association rules have support and confidence of at least 50%?

First Solution: Dr. Hermann Völlinger DHBW Stuttgart, SS2019

To 2a:
We have 8 market baskets => Support(Bier => Orangensaft) = frq(Bier, Orangensaft)/8.
We see two baskets which have Bier and Orangensaft together
=> Support = 2/8 = 1/4 = 25%
To 2b:
We see that frq(Bier) = 6 and frq(Bier, Milch) = 4 => Conf(Bier => Milch) = 4/6 = 2/3 = 66.7%
To 2c:
To have support >= 50% we need items/products which occur in at least 4 baskets.
We see for example that Milch is in 5 baskets (we write: #Milch = 5), #Bier = 6, #Apfelsaft = 4,
#Orangensaft = 4 and #Limonade = 2.
Only the 2-pair #(Milch, Bier) = 4 reaches the minimum of 4 occurrences. We see this by
calculating the frequency matrix (frq(X => Y)) for all tuples (X, Y):

It is easy to see that there are no 3-pairs with a minimum of 4 occurrences: only
Sup(Bier, Milch) is >= 50%, but for all X: Sup({Bier, Milch}, X) < 50%.
We see from the above matrix that: Supp(Milch => Bier) = Supp(Bier => Milch) = 4/8 = 1/2 = 50%
We now calculate: Conf(Milch => Bier) = 4/#Milch = 4/5 = 80%
From Question 2b, we know that Conf(Bier => Milch) = 66.7%

Solution: Only the two association rules (Bier=>Milch) and (Milch=>Bier) have support
and confidence >=50%.
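To make these calculations reproducible, here is a small Python sketch (standard library only, written for this document) that recomputes support, confidence and lift for the eight baskets and confirms the values above:

from itertools import combinations

baskets = [
    {"Milch", "Limonade", "Bier"}, {"Milch", "Apfelsaft", "Bier"},
    {"Milch", "Apfelsaft", "Orangensaft"},
    {"Milch", "Bier", "Orangensaft", "Apfelsaft"}, {"Milch", "Bier"},
    {"Limonade", "Bier", "Orangensaft"}, {"Orangensaft"}, {"Bier", "Apfelsaft"},
]

def support(items):
    # fraction of baskets containing all the given items
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    # conf(lhs => rhs) = support(lhs united with rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    # lift(lhs => rhs) = confidence(lhs => rhs) / support(rhs)
    return confidence(lhs, rhs) / support(rhs)

print("support({Bier, Orangensaft}) =", support({"Bier", "Orangensaft"}))  # 0.25
print("conf(Bier => Milch) =", confidence({"Bier"}, {"Milch"}))            # 0.666...
print("lift(Bier => Milch) =", lift({"Bier"}, {"Milch"}))                  # ~1.067
# all rules X => Y (single items) with support and confidence >= 50%
items = sorted(set().union(*baskets))
for x, y in combinations(items, 2):
    for lhs, rhs in [({x}, {y}), ({y}, {x})]:
        if support(lhs | rhs) >= 0.5 and confidence(lhs, rhs) >= 0.5:
            print(f"{lhs} => {rhs}: supp={support(lhs | rhs):.2f}, "
                  f"conf={confidence(lhs, rhs):.2f}")

The final loop prints exactly the two rules (Bier => Milch) and (Milch => Bier), as derived manually above.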

Second Solution: Anna-Lena Volkhardt, DHBW Stuttgart, SS2020 (4.3.2020)


Third Solution: R. Beer & A. Joukhadar, DHBW Stuttgart, WS2020 (20.10.2020)




Exercises to Lesson ML4: Decision Tree Learning


Homework H4.1 - "Calculate ID3 and CART Measures"
Groupwork (2 Persons). Calculate the measures of the decision tree "Playing Tennis Game":
1. ID3 (Iterative Dichotomiser 3) method, using the entropy function & information gain.
2. CART (Classification and Regression Trees), using the Gini index as metric.

First Solution with ID3 (Hermann Völlinger, Feb. 2020): missing calculations on the ID3 method (see the page number of the corresponding lecture slides at the top right):
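The slide-by-slide calculations of this first solution are images and are not reproduced here. As a compact cross-check (an illustrative sketch, not the original solution), the following Python code computes both measures for the root split on "Outlook", assuming the standard 14-example Playing-Tennis table (9 Yes / 5 No) used in the lecture:

from collections import Counter
from math import log2

# (Outlook, Play) -- only the attribute needed for the root-node example
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]
labels = [y for _, y in data]

def entropy(ys):
    # ID3 measure: -sum_c p_c * log2(p_c)
    return -sum((c / len(ys)) * log2(c / len(ys)) for c in Counter(ys).values())

def gini(ys):
    # CART measure: 1 - sum_c p_c^2
    return 1 - sum((c / len(ys)) ** 2 for c in Counter(ys).values())

for name, measure in [("Entropy", entropy), ("Gini", gini)]:
    weighted = 0.0
    for v in {x for x, _ in data}:
        subset = [y for x, y in data if x == v]
        weighted += len(subset) / len(data) * measure(subset)
    print(f"{name}: whole set = {measure(labels):.4f}, "
          f"after Outlook split = {weighted:.4f}, "
          f"gain = {measure(labels) - weighted:.4f}")
# Entropy: 0.9403 -> 0.6935, gain 0.2467 (the classic ID3 value for Outlook)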







Second Solution with ID3 (Lars Gerne & Nils Hauschel, 03/31/20):








First Solution with CART: missing calculations on the CART method using the GINI index as a metric (see the page number of the corresponding lecture slides at the top right): see the Notes Page in the lecture presentation.

Second Solution with CART (from [email protected], SS2020):



Homework H4.2 - "Define the Decision Tree for Use Case "Predictive Maintenance" (slide p. 77) by calculating the GINI Indexes"
Groupwork (3 Persons): Calculate the decision tree for the use case "Predictive Maintenance" on slide p. 77. Do the following steps (one person per step):
1. Calculate the frequency matrices for the features "Temp.", "Druck" and "Füllst."
2. Define the root node by calculating the GINI index for all values of the three features. Define the optimal split value for the root node (see slide p. 67).
3. Finalize the decision tree by calculating the GINI index for the remaining values of the features "Temp." and "Füllst."

Optional*: Create and describe the algorithm to automate the calculation of steps
1. to 3.


First Solution (H.Völlinger):

Ad 1:

We calculate first the matrix for Druck by looking at the data table:

Nr. Anl Typ Temp. Druck Füllst. Fehler


1001 123 TN 244 140 4600 NO
1002 123 TO 200 130 4300 NO
1009 128 TSW 245 108 4100 YES
1028 128 TS 250 112 4100 NO
1043 128 TSW 200 107 4200 NO
1088 128 TO 272 170 4400 YES
1102 128 TSW 265 105 4100 NO
1119 123 TN 248 138 4800 YES
1122 123 TM 200 194 4500 YES

When we strictly follow the approach of slide 67, we have to consider intervals for the classes "<=" and ">" and a split point in the middle of each interval. See slide p. 67:

So we get the following matrices (the GINI rows are computed under Ad 2 below):

Druck
Values       105   107   108   112   130   138   140   170   194
Error        NO    NO    YES   NO    NO    YES   NO    YES   YES
Split-Point  104   106   107,5 110   121   134   139   155   182   206
Interval     <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >
NO           0  5  1  4  2  3  2  3  3  2  4  1  4  1  5  0  5  0  5  0
YES          0  4  0  4  0  4  1  3  1  3  1  3  2  2  2  2  3  1  4  0

We calculate next the matrix for Temp.:

Temp.
Values       200, 200, 200   244   245   248   250   265   272
Error        NO, NO, YES     NO    YES   YES   NO    NO    YES
Split-Point  178   222   244,5 246,5 249   257,5 268,5 275,5
Interval     <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >
NO           0  5  2  3  3  2  3  2  3  2  4  1  5  0  5  0
YES          0  4  1  3  1  3  2  2  3  1  3  1  3  1  4  0

Finally we calculate the matrix for Füllst.:

Füllst.
Values       4100, 4100, 4100   4200  4300  4400  4500  4600  4800
Error        NO, NO, YES        NO    NO    YES   YES   NO    YES
Split-Point  4050  4150  4250  4350  4450  4550  4700  4900
Interval     <= >  <= >  <= >  <= >  <= >  <= >  <= >  <= >
NO           0  5  2  3  3  2  4  1  4  1  4  1  5  0  5  0
YES          0  4  1  3  1  3  1  3  2  2  3  1  3  1  4  0

Ad2:

We calculate first for all values of Druck the GINI- Index:


See the following matrix, which shows the results.

Druck
Values 105 107 108 112 130 138 140 170 194
Error NO NO YES NO NO YES NO YES YES
Split-Point 104 106 107,5 110 121 134 139 155 182 206
Interval <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
NO 0 5 1 4 2 3 2 3 3 2 4 1 4 1 5 0 5 0 5 0
YES 0 4 0 4 0 4 1 3 1 3 1 3 2 2 2 2 3 1 4 0
GINI 0.494 0.444 0.381 0.481 0.433 0.344 0.444 0.317 0.417 0.494

First we calculate Gini(Druck) for the value 139:
Gini(Druck) = 6/9*Gini(<=139) + 3/9*Gini(>139)
= 2/3*(1 - (4/6)² - (2/6)²) + 1/3*(1 - (1/3)² - (2/3)²)
= 2/3*((36-16-4)/36) + 1/3*((9-1-4)/9) = 8/27 + 4/27 = 4/9 ≈ 0.444

Second we calculate Gini(Druck) for the value 155:
Gini(Druck) = 7/9*Gini(<=155) + 2/9*Gini(>155)
= 7/9*(1 - (2/7)² - (5/7)²) + 2/9*(1 - (2/2)² - (0/2)²)
= 7/9*((49-4-25)/49) + 0 = 7/9*(20/49) = 20/63 ≈ 0.317

Third we calculate Gini(Druck) for the value 182:

Gini(Druck) = 8/9*Gini(<=182) + 1/9*Gini(>182)
= 8/9*(1 - (3/8)² - (5/8)²) + 1/9*(1 - (1/1)² - (0/1)²)
= 8/9*((64-9-25)/64) + 0 = 8/9*(30/64) = 5/12 ≈ 0.417

For the rest of the calculations see the following screenshot:


We calculate next for all values of Temp. the GINI- Index:


See the following matrix, which shows the results:

Temp.
Values 200, 200, 200 244 245 248 250 265 272
Error NO, NO,YES NO YES YES NO NO YES
Split-Point 178 222 244,5 246,5 249 257,5 268,5 275,5
Interval <= > <= > <= > <= > <= > <= > <= > <= >
NO 0 5 2 3 3 2 3 2 3 2 4 1 5 0 5 0
YES 0 4 1 3 1 3 2 2 3 1 3 1 3 1 4 0
GINI 0.494 0.481 0.433 0.489 0.481 0.492 0.417 0.494

We see that the value of the GINI index only depends on the distribution of the YES's and NO's:
for the values 178, 222, 244,5, 249, 268,5 and 275,5 we can use the GINI of Druck, since the distributions of YES's and NO's are the same.
So we only need to calculate GINI(Temp.) for the values 246,5 and 257,5.

First we calculate Gini(Temp.) for the value 246,5:

Gini(Temp.) = 5/9*Gini(<=246,5) + 4/9*Gini(>246,5)
= 5/9*(1 - (3/5)² - (2/5)²) + 4/9*(1 - (2/4)² - (2/4)²)
= 5/9*((25-9-4)/25) + 4/9*(1 - 1/4 - 1/4) = 5/9*(12/25) + 4/9*(1/2) = 4/15 + 2/9 = 22/45 ≈ 0.489

Second we calculate Gini(Temp.) for the value 257,5:
Gini(Temp.) = 7/9*Gini(<=257,5) + 2/9*Gini(>257,5)
= 7/9*(1 - (4/7)² - (3/7)²) + 2/9*(1 - (1/2)² - (1/2)²)
= 7/9*((49-16-9)/49) + 1/9 = 7/9*(24/49) + 1/9 = 8/21 + 1/9 = 31/63 ≈ 0.492

Finally we calculate all values of Füllst. the GINI- Index:


See the following matrix, which shows the results:
Füllst.
Values 4100, 4100,4100 4200 4300 4400 4500 4600 4800
Error NO, NO, YES NO NO YES YES NO YES
Split-Point 4050 4150 4250 4350 4450 4550 4700 4900
Interval <= > <= > <= > <= > <= > <= > <= > <= >
NO 0 5 2 3 3 2 4 1 4 1 4 1 5 0 5 0
YES 0 4 1 3 1 3 1 3 2 2 3 1 3 1 4 0
GINI 0.494 0.481 0.433 0.344 0.444 0.492 0.417 0.494

All values of the GINI indexes are calculated above.

For example, GINI(Füllst.) for the value 4450 is the same as GINI(Druck) for the value 139.

*************************************************************************************************************
RESULT: Considering the lowest GINI, we see it is 0.317 for the feature DRUCK at the value 155.
=> DRUCK = root node and the split value is at 155. Our decision tree is now:

Ad 3:

We need to calculate the GINI indexes for all remaining 7 values (the records where Druck <= 155) for the features Temp. and Füllst.:


Temp.
Values 200, 200 244 245 248 250 265
Error NO, NO NO YES YES NO NO
Split-Point 178 222 244,5 246,5 249 257,5 272,5
Interval <= > <= > <= > <= > <= > <= > <= >
NO 0 5 2 3 3 2 3 2 3 2 4 1 5 0
YES 0 2 0 2 0 2 1 1 2 0 2 0 2 0

GINI 0.408 0.343 0.286 0.405 0.343 0.381 0.408

GINI(178) = GINI(272,5) = 0/7*Gini(<=178) + 7/7*Gini(>178) = 0 + 1 - (5/7)² - (2/7)² = (49-4-25)/49 = 20/49 ≈ 0.408

GINI(222) = GINI(249) = 2/7*(1-0-1) + 5/7*(1 - (3/5)² - (2/5)²) = 5/7*((25-9-4)/25) = 5/7*(12/25) = 12/35 ≈ 0.343

GINI(244,5) = 3/7*(1-0-1) + 4/7*(1 - (1/2)² - (1/2)²) = 0 + 4/7*(1/2) = 2/7 ≈ 0.286

GINI(246,5) = 4/7*(1 - (3/4)² - (1/4)²) + 3/7*(1 - (1/3)² - (2/3)²) = 4/7*((16-9-1)/16) + 3/7*((9-1-4)/9) = 3/14 + 4/21 = 17/42 ≈ 0.405

GINI(257,5) = 6/7*(1 - (4/6)² - (2/6)²) + 1/7*(1-0-1) = 6/7*(4/9) + 0 = 8/21 ≈ 0.381

The final task is to calculate the table for Füllst.:


Füllst.
Values 4100, 4100, 4100 4200 4300 4600 4800
Error NO, NO, YES NO NO NO YES
Split-Point 4050 4150 4250 4450 4700 4900
Interval <= > <= > <= > <= > <= > <= >
NO 0 5 2 3 3 2 4 1 5 0 5 0
YES 0 2 1 1 1 1 1 1 1 1 2 0
GINI 0.408 0.405 0.405 0.371 0.238 0.408

For the values 4050, 4150, 4250 and 4900 we can use the GINI calculations from Temp.

So we only need to calculate the GINI for 4450 and 4700:

GINI(4450) = 5/7*(1 - (4/5)² - (1/5)²) + 2/7*(1 - (1/2)² - (1/2)²) = 5/7*((25-16-1)/25) + 2/7*(1/2) = 8/35 + 1/7 = 13/35 ≈ 0.371

GINI(4700) = 6/7*(1 - (5/6)² - (1/6)²) + 1/7*(1-0-1) = 6/7*((36-25-1)/36) = (6/7)*(10/36) = 5/21 ≈ 0.238

Homework H4.3* - “Create and describe the algorithm to automate the


calculation of the Decision Tree for UseCase “Predictive Maintenance”

Groupwork (2 Persons): Create and describe the algorithm to automate the calculation of steps 1 to 3 of Homework H4.2. Do the following steps (following the algorithm described in the lecture):
1. Calculate the frequency matrices for the features "Temp.", "Druck" and "Füllst."
2. Define the root node by calculating the GINI index for all values of the three features. Define the optimal split value for the root node (see slide p. 67).
3. Finalize the decision tree by calculating the GINI index for the remaining values of the features "Temp." and "Füllst."

Solution: Created by H. Fritze & P. Mäder (DHBW, SS2020) and H. Völlinger (DHBW, WS2020). The following screenshots are from a Jupyter Notebook (using Python 3):
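The notebook itself is only available as screenshots, which are not reproduced here. As a compact, hedged sketch of the same automation idea (written for this document, not the notebook from the screenshots), one can enumerate the midpoint split candidates per feature, compute the weighted GINI index for each, and pick the minimum; this reproduces the manual result of H4.2 (root split Druck <= 155 with GINI ≈ 0.317):

import pandas as pd

# the nine training records of the "Predictive Maintenance" use case
data = pd.DataFrame({
    "Temp":   [244, 200, 245, 250, 200, 272, 265, 248, 200],
    "Druck":  [140, 130, 108, 112, 107, 170, 105, 138, 194],
    "Füllst": [4600, 4300, 4100, 4100, 4200, 4400, 4100, 4800, 4500],
    "Fehler": ["NO", "NO", "YES", "NO", "NO", "YES", "NO", "YES", "YES"],
})

def gini(labels):
    # Gini index 1 - sum_c p_c^2 of a label column
    p = labels.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

def best_split(df, feature, target="Fehler"):
    values = sorted(df[feature].unique())
    # candidate split points: midpoints between consecutive distinct values
    # (the endpoint candidates of the slide, e.g. 104 and 206, give trivial splits)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    scored = []
    for s in candidates:
        left, right = df[df[feature] <= s], df[df[feature] > s]
        g = (len(left) * gini(left[target]) +
             len(right) * gini(right[target])) / len(df)
        scored.append((g, s))
    return min(scored)  # (weighted Gini, split point)

for f in ["Temp", "Druck", "Füllst"]:
    g, s = best_split(data, f)
    print(f"{f}: best split at {s} with weighted Gini {g:.3f}")
# Druck wins (split 155.0, Gini ~0.317) and becomes the root node; applying the
# same function to the remaining records (Druck <= 155) gives the next level.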





Homework H4.4* - “Summary of the Article … prozessintegriertes


Qualitätsregelungssystem…”

Groupwork (2 Persons) – Read and create a short summary about a special part of the article/dissertation by Hans W. Dörmann Osuna: "Ansatz für ein prozessintegriertes Qualitätsregelungssystem für nicht stabile Prozesse".
Link to the article: http://d-nb.info/992620961/34
For the two chapters (1 Person, 15 minutes):
• Chapter 7.1 "Aufbau des klassischen Qualitätsregelkreises" ("Structure of the classical quality control loop")
• Chapter 7.2 "Prädiktive dynamische Prüfung" ("Predictive dynamic testing")

First Solution: by Adrian Koslowski; 1.4.2020:


Task: Summary of the chapter "Aufbau des klassischen Qualitätsregelkreises" of Hans W. Dörmann Osuna's "Ansatz für ein prozessintegriertes Qualitätsregelungssystem für nicht stabile Prozesse"

Subheadings
• „Aufgaben“
• „Voraussetzungen für die Datenerfassung“
• „Datenauswertung“
▪ „Data Understanding“
▪ „Data Preparation“
▪ „Modellierung und Datenanalyse“
▪ „Implementierung“

„Aufgaben“ - Functions
During production data is collected and compared to target values. If the values do not
match, the system automatically acts to correct itself:

„Voraussetzungen für die Datenerfassung“ – Requirements for data collection


• Process must be formally describable
• Data must be measurable
• Values must be processable

„Datenauswertung“ – Data processing


4 phases:
1. Plan

2. Do

3. Check

4. Act

„Data Understanding“
• What variables are relevant for my process?
• What must be taken into consideration?

„Data Preparation “
• Goal: Creation of a table with which current data can be compared to target values
• Generation of initial target values by testing and measurements as well as opinions of
specialists and more

„Modellierung und Datenanalyse“ – Modeling and Data Analysis


• Creation of a model of the real process
• Search for dependencies and causalities

• CART- and CHAID- decision trees as well as rule-based System as possible methods

„Implementierung“ - Implementation
• Creation of new variables and target values based on new solutions
• Adaptation of existing target values to accommodate new knowledge and rules

************************************************************************************************

Second Solution: by Kevin Kretschmar & Krister Wolfhard; 27.10.2020:


Homework H4.5* - "Create and describe the algorithm to automate the calculation of the Decision Tree for the Use Case "Playing Tennis" using the ID3 method"
Groupwork (2 Persons) - Calculate the measures of the decision tree "Playing Tennis Game" by creating a Python program (e.g. using a Jupyter Notebook) with the "ID3 (Iterative Dichotomiser 3)" method using the entropy function & information gain.


First Solution: by Daniel Rück & Brian Brandner; 27.10.2020:



See the rest of this Jupyter Notebook with the name "Homework_H4.5-DecTree_ID3.ipynb" (as PDF: "Homework_H4.5-DecTree_ID3.pdf") in [HVö-6]:
GitHub/HVoellinger: https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020


Exercises to Lesson ML5: simple Linear Regression (sLR) &


multiple Linear Regression (mLR)
Homework H5.1 - “sLR manual calculations of R² & Jupyter Notebook
(Python)”
Consider we have the 3 points P1 = (1|2), P2 = (3|3) and P3 = (2|2) in the xy-plane.

Part b: 1 Person; Rest: 1 Person

Part a: Calculate the sLR measure R-Square (R²) for the two estimated sLR lines y = 1,5 + 0,5*x and y = 1,25 + 0,5*x. Which estimation (red or green) is better? (1 Person, 15 minutes). (Hint: R² = 1 - SSE/SST.)

Part b: Calculate the optimal regression line y = a + b*x by using the formulas developed in the lesson for the coefficients a and b. What is R² for this line?

Part c: Build a Jupyter Notebook (Python) to check the manual calculations of Part b. You can use the approach of the lesson by using the scikit-learn Python library.
Optional*: Please plot a picture of the "mountain landscape" for R² over the (a,b)-plane.

Part d: Sometimes in the literature or in YouTube videos you see the formula "SST = SSR + SSE" (SSE, SST see lesson, and SSR := Σᵢ (f(xᵢ) - mean(y))²).
Theorem (ML5-2): "This formula is only true if we have the optimal regression line. For all other lines it is wrong!" Check this for the two lines of Part a (red and green) and the optimal regression line calculated in Part b.
Solutions:

Part a: (H.Völlinger & Sam Matsa, INF17B, 5.4.2020):


We calculate for the “center of mass” [M(x), M(y)] = [2, 7/3]:

y(2) =1,5 + 0,5*2 = 2,5 > M(y)

y(2) = 1,25+0,5*2 = 2,25 < M(y)

Make some comments concerning the condition SST = SSE +SSR:


Part b:
A detailed description and an Excel document with the integrated formulas for the calculation of the coefficients a, b can be found on GitHub/HVoellinger:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020

The Excel file is named "LR-Calculation of Coeff.xlsx":


y=4/3 + 0.5*x is the Regression-Line. R² =3/4.

Part c:

A detailed description and the code can be found on GitHub:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020
The Jupyter Notebook has the name "Homework-ML5_1c-LinReg.ipynb":
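The notebook pages are only available as images. A minimal, self-contained sketch of the same check, assuming only the three points from the task, looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [3], [2]])   # x-coordinates of P1, P2, P3
y = np.array([2, 3, 2])         # y-coordinates

model = LinearRegression().fit(X, y)
print("a (intercept) =", model.intercept_)   # 4/3 = 1.333...
print("b (slope)     =", model.coef_[0])     # 1/2
print("R²            =", model.score(X, y))  # 3/4

This confirms the manual result of Part b: y = 4/3 + 0,5*x with R² = 3/4.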




Homework H5.2* - "Create a Python Pgm. for sLR with Iowa Houses Data"
2 Persons: See the video, which shows the coding using the Keras library & Python: https://www.youtube.com/watch?v=Mcs2x5-7bc0. Repeat the coding with the dataset "Iowa Homes" to predict the "House Price" based on "Square Feet". See the result:

Solutions:
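The submitted solutions are not reproduced here. A hedged sketch of the Keras approach shown in the video could look as follows; the file name "iowa_homes.csv" and the column names "SquareFeet" and "SalePrice" are assumptions and must be adapted to the actual dataset:

import pandas as pd
from tensorflow import keras

df = pd.read_csv("iowa_homes.csv")          # assumed file name
x = df[["SquareFeet"]].values / 1000.0      # scale inputs for stable training
y = df["SalePrice"].values / 100000.0       # scale targets as well

# one dense unit = the linear regression y = w*x + b
model = keras.Sequential([keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=keras.optimizers.Adam(0.1), loss="mse")
model.fit(x, y, epochs=200, verbose=0)

w, b = model.layers[0].get_weights()
print("slope:", float(w[0][0]), "intercept:", float(b[0]))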


Homework H5.3 – “Calculate Adj.R² for MR”

See also the YouTube video: "Regression II: Degrees of Freedom EXPLAINED | Adjusted R-Squared"; https://www.youtube.com/watch?v=4otEcA3gjLk

Task:
• Part A: Calculate Adj.R² for the given R² of a "Housing Price" example (see table below). Do you see a "trend"?
• Part B: What would be the best model if n=25 and if n=10 (use Adj.R²)?

First Solution (H. Völlinger):

Part A:

1st row: Adj.R² = 1 - (1 - R²)*((n-1)/(n-k-1)) = 1 - 0,29*(24/20) = 1 - 0,348 = 0,652

… the rest is analogous …

You get the final result:

Part B:

n=25: you get the best model for k=6 (Adj.R² = 0.7067)

n=10: you get the best model for k=4 (Adj.R² = 0.4780)
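The result table itself is an image; a small Python helper, assuming the same formula, reproduces the first row and can fill in the rest:

def adj_r2(r2, n, k):
    # Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adj_r2(0.71, n=25, k=4), 3))   # 0.652, as in the first row above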


Second Solution (Lukas Petric, 8.4.2020):

Homework H5.4 - "mLR (k=2) manual calculations of Adj.R² & Jupyter Notebook (Python) to check results"

Part a: 1 Person; Parts b + c: 1 Person

Consider the 4 points P1=(1|2|3), P2=(3|3|4), P3=(2|2|4) and P4=(4|3|6) in the 3-dimensional space:

Part a: Calculate the mLR measure Adj.R² for the two hyperplanes H1 := plane defined by {P1,P2,P3} and H2 := plane defined by {P2,P3,P4}. Which plane (red or green) is the better mLR estimation? (Hint: calculate Adj.R².)

Part b: What is the optimal regression plane z = a + b*x + c*y? Use the formulas developed with the "Least Square Fit for mLR" method for the coefficients a, b and c. What is Adj.R² for this plane? (Hint: a=17/4, b=3/2, c=-3/2; R² ≈ 0.9474 and Adj.R² ≈ 0.8421)

Part c: Build a Jupyter Notebook (Python) to check the manual calculations of Part b. You can use the approach of the lesson by using the scikit-learn Python library.


First Solution: by Hermann Völlinger, 29.10.2020


Part a:

Part c:


Adj.R² := 1 – (1 - R²) * (3/1) = 1 - (1 - 0,94736)*3 ~ 0,84208
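Part c can be checked with a few lines of scikit-learn; the following minimal sketch assumes only the four points of the task and reproduces a = 17/4, b = 3/2, c = -3/2, R² ≈ 0.9474 and Adj.R² ≈ 0.8421:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [3, 3], [2, 2], [4, 3]])   # (x, y) of P1..P4
z = np.array([3, 4, 4, 6])                       # z-coordinates

model = LinearRegression().fit(X, z)
r2 = model.score(X, z)
n, k = len(z), X.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("a =", model.intercept_, "b, c =", model.coef_)  # 4.25, [1.5, -1.5]
print("R² =", r2, "Adj.R² =", adj_r2)                  # ~0.9474, ~0.8421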


Second Solution: by A. Wermerskirch, N. Baitinger and P. Jaworski, 2.11.2020
Part a+b:





Part c:

For the rest, see [HVö-6]: Dr. Hermann Völlinger: GitHub to the Lecture "Machine Learning: Concepts & Algorithms"; see: https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020

Homework H5.5* - Decide: (SST=SSE+SSR) => optimal sLR line?

Examine this direction of the (SST=SSE+SSR) condition. We could assume that the condition "SST = SSR + SSE" (*) also implies that y(x) is an optimal regression line. In many examples this is true (see Homework H5.1, Part a).

Task: Decide between the two possibilities a) and b): (2 Persons, one for each step)
a. The statement is true, so you have to prove it. I.e. show that the vanishing of the "mixed term" of the equation (Σᵢ (fᵢ - yᵢ)*(fᵢ - M(y)) = 0) implies an optimal sLR line.
b. To prove that it is wrong, it is enough to construct a counterexample: define a training set TS = {observation points} and an sLR line which satisfies condition (*) but is not an optimal sLR line.


Exercises to Lesson ML6: Convolutional Neural Networks


(CNN)

Homework H6.1 – “Power Forecasts with CNN in UC2”

Groupwork (2 Persons): Evaluate and explain in more detail the CNN in "UC2 - Fraunhofer + enercast: Power forecasts for renewable energy with CNN":
https://www.enercast.de/wp-content/uploads/2018/04/whitepaper-prognosen-wind-solar-kuenstliche-intelligenz-neuronale-netze_110418_EN.pdf

Solutions:
…..

Homework H6.2 – “Evaluate AI Technology of UC3”

Groupwork (2 Persons) – Evaluate and find the underlying AI technology which is used in "UC3 – Semantic Search: Predictive Basket with Fact-Finder":
https://youtu.be/vSWLafBdHus

Solutions:
……

Homework H6.3* – “Create Summary to GO Article”

Groupwork (2 Persons) - Read and create a summary of the main results of the article "Mastering the game of Go with deep neural networks and tree search":
https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf

Solutions:
…..

Homework H6.4* – “Create Summary to BERT Article”

Groupwork (2 Persons): Read and summarize the main results of the article about BERT. See Ref. [BERT]: Jacob Devlin et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; Google (USA); 2019.


Solutions: by Robert Merk and Joshua Franz; 3.11.2020


Exercises to Lesson ML7: BackPropagation for Neural


Networks

Homework H7.1 – “Exercise of an Example with Python”

*********** placeholder********************

Solutions:
….

Homework H7.2 – “Exercise of an Example with Python”

*********** placeholder********************

Solutions:
….


Exercises to Lesson ML8: Support Vector Machines (SVM)

Homework H8.1 – “Exercise of an Example with Python”

*********** placeholder********************

Solutions:
….

Homework H8.2 – “Exercise of an Example with Python”

*********** placeholder********************

Solutions:
….

Homework H8.3 – “Exercise of an Example with Python”

*********** placeholder********************

Solutions:
….

Homework H8.4 – “Exercise of an Example with Python”

*********** placeholder********************

Solutions:
….
