
Dr. Hermann Völlinger, Mathematics & IT-Architecture

Machine Learning (ML)


Concepts & Algorithms
Status: 1 December 2020

DHBW – Fakultät Technik-Informatik, Stuttgart, Autumn 2020


Dr. Hermann Völlinger, Mathematics & IT Architecture
http://www.dhbw-stuttgart.de/~hvoellin/

www.dhbw-stuttgart.de

Lecture Information in Moodle Account of Course 18c:


https://elearning.dhbw-stuttgart.de/moodle/course/view.php?id=8210

Status: 1 December 2020 Page: 1



ML0 – General Remarks and Goals of Lecture (ML)


Status: 1 December 2020 Page: 2



General Remarks to Lecture Machine Learning (ML) (1/3)

[Overview diagram] Wahlmodul Informatik (elective module): Data Mining (Martin Clement, Dr. Toni Bollinger) and Machine Learning (Dr. Hermann Völlinger, Prof. Dr. Dirk Reichardt) together carry a 50% evaluation share; an additional freely chosen lecture carries the other 50%.

“Wahlmodul Informatik II (T3INF4902)” (Elective Module Computer Science II), see:
https://www.dhbw.de/fileadmin/user/public/SP/STG/Informatik/Angewandte_Informatik.pdf

Module description „Data Mining und Grundlagen des Maschinellen Lernens“ (Data Mining and Foundations of Machine Learning):

Machine Learning
• Introduction to machine learning
• Symbolic learning methods
• Foundations of neural networks
• Probabilistic learning models
• Advanced concepts and deep learning
• Design and implementation of selected techniques for an application

Data Mining:
• Data and data analysis
• Clustering
• Classification
• Association analysis
• Further methods, e.g.:
• Regression
• Deviation detection
• Visualization
Status: 1 December 2020 Page: 3


General Remarks to Lecture Machine Learning (ML) (2/3)

• Time: online, Tuesdays (6 sessions), 29.9.–3.11.20, 10:00–12:30, i.e. 105 minutes of lecture, 30 minutes of exercises/homework (everyone should present their homework at least once), and a 10–15 minute break; for more details see my DHBW homepage.
• Examination: seminar paper, 2 persons, 8–10 pages, in English, due 11.12.2020, 18:00. The grade, combined with Data Mining, appears on the DHBW Bachelor certificate.
• My DHBW lecture homepage: http://www.dhbw-stuttgart.de/~hvoellin/ contains a script of the lecture and also solutions („Musterlösungen“) to the homeworks/exercises. The homepage also includes sample data for the exercises and a lot of other interesting facts about Machine Learning.

Status: 1 December 2020 Page: 4



General Remarks to Lecture Machine Learning (ML) (3/3)


• The homework solutions (“Musterlösungen”) are published on the homepage. More than 30 homeworks are included. If a student agrees, his or her solutions can also be included in the solutions script.
• The lecture script is in English, since the common IT language is English. Some dedicated slides/pages are in German, so as not to lose the “look and feel” of the slide.
• Prerequisites: Python or R programming (i.e. usage of Jupyter Notebooks), the lectures “Datenbanken 1”, “Mathematik 1” and “Mathematik 2”, and some basic knowledge of Data Warehousing (fifth semester) and Data Mining (attended in parallel).
• In the WS2020 lecture we have only 18 SW (lecture hours), so we will skip some chapters of this script, which is designed for 36 SW (see next slide). But feel free to have a look at these chapters as well.

Mathematik 2 / teaching and learning units:

Applied Mathematics
- Foundations of differential and integral calculus of real functions of several variables, as well as of differential equations and systems of differential equations
- Numerical methods and further examples of mathematical applications in computer science
Statistics
- Descriptive statistics
- Random experiments, probabilities and special distributions – inductive statistics
- Applications in computer science

Data Warehouse & Data Mining:

Data Mining / foundations
- Data and data analysis
- Clustering
- Classification
- Association analysis
- Further methods, e.g. regression, deviation detection
- Visualization

Status: 1 December 2020 Page: 5



Goals of Lecture (1/3)


The lecture’s aim is to introduce Machine Learning (ML) as part of Artificial Intelligence. We learn the most important methods used in Machine Learning (ML), presented with their essential features. Several references point to in-depth applications or further information via internet links or literature. In many places concrete implementation examples with Jupyter Notebooks* are given. In particular, we cover the following “List of Topics”:

• Introduction to ML (with a categorization of ML) and ethical aspects of AI and ML.
• Concept Learning: Version Spaces & Candidate Elimination Algorithm.
• Classification with Bayes Learning and K-Means Clustering (e.g. for the Iris dataset).
• Building Decision Trees with the GINI-index method or with the ID3 algorithm (incl. a Python Jupyter Notebook).
• Decision Trees – usage for predictive maintenance in production (use case: „Gießerei“, i.e. a foundry).
* The name Jupyter is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R.


Machine Learning, Deep Learning, Cognitive Computing – Artificial Intelligence technologies are spreading rapidly. The background is that today the computing and storage capacities are available that make AI scenarios possible.
What is Artificial Intelligence?
The term Artificial Intelligence (AI) was coined more than 60 years ago by the US computer scientist John McCarthy. He submitted an application for a research project on machines that played chess, solved mathematical problems and learned on their own. In the summer of 1956, he presented his findings to other scientists. The British mathematician Alan Turing had already developed the "Turing Test" six years earlier, which can determine whether the other party is a human or a machine posing as a human being.
But it took decades to advance AI research, because it required significantly more computational power. In the mid-1990s the time had come, and scientists devoted themselves to tasks such as image recognition. Gradually, artificial neural networks were developed that capture information such as language and images, recognize patterns and develop their own solutions.
Artificial intelligence is now part of everyday life, whether in search engines, smart language assistants, medical diagnoses or self-driving vehicles. The military also discusses using AI for military purposes, such as war robots. The United Nations (UN) has debated how to handle killer robots that autonomously decide on military operations.

Status: 1 December 2020 Page: 6



Goals of Lecture (2/3)


Continue the “List of Topics”:
• Applications of regression methods: simple Linear Regression (sLR) & multiple Linear Regression (mLR).
• Convolutional Neural Networks: concepts and algorithms (e.g. for image recognition) and several other use cases.
• Deep Learning in Google AlphaGo (with self-learning AI) & autonomous driving.
• Back Propagation for Neural Networks (incl. definition and implementation with a Jupyter Notebook).
• Support Vector Machines (SVM) and detailed concrete examples with a Jupyter Notebook.

See Ref. [HVö-3]: http://wwwlehre.dhbw-stuttgart.de/~hvoellin/MindMap_Machine_Learning_(ML)_Lecture.pdf


Status: 1 December 2020 Page: 7



Goals of Lecture (3/3) – Understand 3 Categories of ML


See Ref. [HHeid]: https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f


Supervised Learning
Supervised learning is the most popular paradigm for machine learning. It is the easiest to
understand and the simplest to implement. It is very similar to teaching a child with the
use of flash cards.
Given data in the form of examples with labels, we can feed a learning algorithm these
example-label pairs one by one, allowing the algorithm to predict the label for each
example, and giving it feedback as to whether it predicted the right answer or not.
Unsupervised Learning
Unsupervised learning is very much the opposite of supervised learning. It features no
labels. Instead, our algorithm would be fed a lot of data and given the tools to understand
the properties of the data. From there, it can learn to group, cluster, and/or organize the
data in a way such that a human (or other intelligent algorithm) can come in and make
sense of the newly organized data.
Reinforcement Learning
Reinforcement learning is learning from mistakes. Place a reinforcement learning algorithm into any environment and it will make a lot of mistakes in the beginning. As long as we provide some sort of signal to the algorithm that associates good behaviors with a positive signal and bad behaviors with a negative one, we can reinforce our algorithm to prefer good behaviors over bad ones. Over time, our learning algorithm learns to make fewer mistakes than it used to. A minimal code illustration of these paradigms follows below.
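To make the first two categories concrete in code, here is a minimal, hedged illustration in Python with scikit-learn. The toy numbers are invented: a supervised classifier learns from example-label pairs, while an unsupervised clusterer receives the same examples without labels. Reinforcement learning is omitted, because it additionally needs an environment that emits reward signals.

# Supervised vs. unsupervised learning on the same toy data (invented numbers).
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],    # one group of points
     [4.0, 4.2], [3.9, 4.1], [4.2, 3.8]]    # a second group of points
y = [0, 0, 0, 1, 1, 1]                      # labels, used only in the supervised case

clf = LogisticRegression().fit(X, y)        # supervised: learns from (example, label) pairs
print(clf.predict([[4.1, 4.0]]))            # -> [1]

km = KMeans(n_clusters=2, n_init=10).fit(X) # unsupervised: sees only X, never y
print(km.labels_)                           # the two groups are recovered without labels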

Status: 1 December 2020 Page: 8



Overview of the Lessons in Machine Learning (ML)

We have 8 chapters, 3 of which are skipped (shown in gray on the slides). The dates can change if necessary:
1. ML1: Introduction to Machine Learning - 29.09.2020, 10:00-12:30
2. ML2: Concept Learning: Version Spaces & Candidate Elim. Algorithm
3. ML3: Supervised and Unsupervised Learning - 06.&13.10.20, 10:00-12:30
4. ML4: Decision Tree Learning - 20.10.2020, 10:00-12:30
5. ML5: Simple Linear - & Multiple Regression - 27.10.2020, 10:00-12:30
6. ML6: Neural Networks: Convolutional NN - 03.11.2020, 10:00-12:30
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)
Actual time plan:
https://rapla.dhbw-stuttgart.de/rapla?key=txB1FOi5xd1wUJBWuX8lJoG0cr9RVi1zB7e5WYTczgq3qJZTian3jkGZb9hTVbzP


Status: 1 December 2020 Page: 9



References to Machine Learning (1/3)


1. [EAlp] Ethem Alpaydin: "Maschinelles Lernen", De Gruyter (Oldenbourg), 2nd edition (May 2019)
2. [Bei+Kern] Christoph Beierle, Gabriele Kern-Isberner: "Methoden Wissensbasierter Systeme - Grundlagen - Algorithmen - Anwendungen", Springer-Vieweg Verlag, 6th edition (August 2019)
3. [BERT] Jacob Devlin et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; Google (USA); 2019
4. [CW] COMPUTERWOCHE – Digitalisierung/Machine Learning: „Teil 1: Die Grundlagen - darum geht es“, „Teil 2: Die Pläne der Anwender - das haben deutsche Unternehmen vor“, „Teil 3: Anwendungen und Plattformen - die Technik“: https://www.computerwoche.de/a/machine-learning-darum-geht-s,3330413
5. [FAZ] FAZ article „TECHNIK DER ZUKUNFT: Die Hälfte der Deutschen weiß nicht, was KI ist“: http://www.xing-news.co/reader/news/articles/1857263?newsletter_id=39283&toolbar=true&xng_share_origin=web
6. [GI-RgStg] Gesellschaft für Informatik, Arbeitskreis Künstliche Intelligenz, Regionalgruppe Stuttgart: http://rg-stuttgart.gi.de/veranstaltungen/mo-14012019-themenabend-kuenstliche-intelligenz.html
7. [HHeid] Hunter Heidenreich: "What are the Types of Machine Learning"; internet blog: https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f


Status: 1 December 2020 Page: 10



References to Machine Learning (2/3)

8. [HVö-1] Hermann Völlinger: Script of the lecture "Introduction to Data Warehousing"; DHBW Stuttgart; WS2019
9. [HVö-2] Hermann Völlinger et al.: Exercises & Solutions of the lecture "Introduction to Data Warehousing"; DHBW Stuttgart; WS2019
10. [HVö-3] Hermann Völlinger: MindMap of the lecture "Machine Learning: Concepts & Algorithms"; DHBW Stuttgart; WS2020
11. [HVö-4] Hermann Völlinger et al.: Exercises & Solutions of the lecture "Machine Learning: Concepts & Algorithms"; DHBW Stuttgart; WS2020
12. [HVö-5] Hermann Völlinger: Script of the lecture "Machine Learning: Concepts & Algorithms"; DHBW Stuttgart; WS2020
13. [HVö-6] Hermann Völlinger: GitHub repository to the lecture "Machine Learning: Concepts & Algorithms"; see: https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020
14. [MatLab] MatLab eBook: "Reinforcement Learning with MATLAB - Understanding Training and Deployment"; MathWorks 2019; https://www.slideshare.net/HiteshMohapatra/reinforcement-learning-ebook-part3
15. [SfUni-1] Stanford University (USA) – Machine Learning course, by Andrew Ng: https://www.coursera.org/learn/machine-learning


Status: 1 December 2020 Page: 11



References to Machine Learning (3/3)


16. [SfUni-2] Stanford University (USA) – series of 112 YouTube videos about Machine Learning, by Andrew Ng (see above): https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN
17. [Sift] Sift Science Engineering – article: "Deep Learning for Fraud Detection - Introduction": https://engineering.siftscience.com/deep-learning-fraud-detection/
18. [TDWI] TDWI eBook: "ARTIFICIAL INTELLIGENCE: PROJEKTE UND WIRTSCHAFTLICHKEIT"; SIGS DATACOM GmbH, 53842 Troisdorf, info@sigs-datacom.de; 2020
19. [TMitch] Tom Mitchell: "Machine Learning", McGraw Hill, 1997: https://www.cs.cmu.edu/~tom/mlbook.html
20. [TMun] Toshinori Munakata: "Fundamentals of the new Artificial Intelligence", Springer Verlag, 2nd edition (January 2011)
21. [TU-Darm] TU Darmstadt: Data Mining und Maschinelles Lernen - WS 17/18, „Einführung in maschinelles Lernen und Data Mining“: http://www.ke.tu-darmstadt.de/lehre/ws-17-18/mldm


Besides the information you get from the literature, you also have the chance to learn from other ML experts in your town. For example, you can visit a Meetup (e.g. under the “Cyber Valley” label) in Stuttgart/Tübingen.

Status: 1 December 2020 Page: 12



Exercises to Lesson ML0

Homework H0.1 – “Three Categories of Machine Learning”

Groupwork (2 persons). Compare the differences of the three categories; see the slide “Goals of Lecture (3/3)”:

1. Supervised Learning (SVL)

2. Unsupervised Learning (USL)

3. Reinforcement Learning (RFL)

See the information on the internet, for example the following link:
https://towardsdatascience.com/what-are-the-types-of-machine-learning-e2b9e5d1756f

Give short descriptions of the categories and explain the differences (~5 minutes per category).


Complete solutions are found in “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 13


1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML1 - Introduction to Machine Learning (ML)

https://www.youtube.com/watch?v=5dLG3JDk2VU https://www.youtube.com/watch?v=XHjIqQBsPjk


Status: 1 December 2020 Page: 14



Motivation1: the New Revolution through Digitalization

Like the dream of flying, artificial intelligence (AI) has long been a dream and a vision. Muscle power was not enough to enable man to fly (see the "Schneider von Ulm"). Even the Wright brothers would not have gotten far using a steam engine weighing tons; only the light combustion engine brought the breakthrough.
Similarly, in computing, only the quantum leap in performance enabled the new digital revolution. See the next slide.

Status: 1 December 2020 Page: 15



Motivation2: Why Machine Learning (ML) now?

There are three main drivers for digitalization (i.e. ML) compared to the end of the last century (1997–1998), only about 20 years ago:
~ 100 times more CPU power available, using cloud-based infrastructure
~ 100 times more data (and more use cases) available than 20 years ago
~ 100 times better mathematical algorithms and models
⟹ combined, roughly 100³ ≈ 1 million times more possibilities
Moore's law is the observation that the number of transistors in a dense integrated circuit doubles about every two years.
https://en.wikipedia.org/wiki/Moore%27s_law


https://www.youtube.com/watch?v=OmJ-4B-mS-Y

Status: 1 December 2020 Page: 16



Motivation3: Germans are skeptical about the use of AI

The main obstacles that prevent ML in Germany from really taking off are:

1. It is not clear to many people what ML and digitization are and how they will change their daily work.
2. There is a "fear" of new technologies. Some fear losing familiar procedures; others even fear losing their jobs entirely.
3. German companies (in the majority of industries) are technologically, organizationally and intellectually unprepared for ML (i.e. digitization).

In most cases where people were interviewed about the usage of AI, it was not used in the production processes of German companies (see chemistry, travel, logistics, machine building, etc.). Only in automotive (20%) or finance (~10%) do we see some progress in this area (compare slide 23).
Currently the topic of artificial intelligence divides opinion. On the one hand, it brings great progress; on the other, it carries risks that are difficult to assess. This becomes evident in the discussion about self-driving cars, which would make road traffic much safer but are highly debated and feared to be partly unpredictable.
Tesla founder Elon Musk warns against the use of artificial intelligence, which could be more dangerous for humanity than nuclear weapons. The Germans are also skeptical about the use of artificial intelligence in general, as the YouGov survey above clearly shows.

Status: 1 December 2020 Page: 17



Understanding - What is Machine Learning (ML)?


Traditional Software vs. Machine Learning:

Status: 1 December 2020 Page: 18



Definition of Machine Learning (ML), see Ref. [TMitch]


See the book "Machine Learning" by Tom Mitchell, McGraw Hill, 1997:
https://www.cs.cmu.edu/~tom/mlbook.html
Machine Learning is the study of computer algorithms that improve automatically
through experience. Applications range from datamining programs that discover
general rules in large data sets, to information filtering systems that automatically
learn users' interests. This book provides a single source introduction to the
field. It is written for advanced undergraduate and graduate students, and for
developers and researchers in the field. No prior background in artificial
intelligence or statistics is assumed.
See the following two examples:
1. Chess Playing, where Task T is playing chess. Performance measure P is
percent of games won against opponents and Training experience E is
playing practice games against itself.
2. Robot Driving, where Task T is driving on public four-lane highways using
vision sensors. Performance measure P is average distance traveled before
an error (as judged by human overseer) and Training experience E is a
sequence of images and steering commands recorded while observing a
human driver.

Status: 1 December 2020 Page: 19



Example1 of ML - Learning to Play Backgammon


GNU Backgammon (GNUbg) plays and analyzes backgammon games and matches. It is able to play and analyze both money games and tournament matches, evaluate and roll out positions, and more.
Driven by a command-line interface, it displays an ASCII rendering of a board on text-only terminals, but also allows the user to play games and manipulate positions with a graphical GTK+ interface. GNU Backgammon is extensible on platforms which support Python.

Status: 1 December 2020 Page: 20



Example2 of ML - Recognizing Spam-Mail


Status: 1 December 2020 Page: 21



Example3 of ML - Market Basket Analysis


Status: 1 December 2020 Page: 22



Use-Case 1: Advanced Analytics & Machine Learning (1/3)




"Advanced Analytics (AA)":


The basis of the technical and technical content was a YouTube video that
can be found on the Internet on this topic:
https://www.youtube.com/watch?v=oNNk9-tmsZY
As an introductory slide for a better understanding and overview of AA, we
are used to mention the six most important methods of "Advanced
Analytics". Their complexity increases from bottom to top and at the same
time the professional added value that can be achieved with it increases
from left to right. The simplest and least valuable subject is thus the
reporting or also called ("reporting") (bottom left).
The most difficult to implement but also the most technically valuable
method is the descriptive predictive model "Prescriptive" (top right).
For Forecasting, Predictive Analysis and Precriptive Analysis we will use ML.

Status: 1 December 2020 Page: 23



Use-Case 1: Advanced Analytics & Machine Learning (2/3)


Prescriptive analytics incorporates both structured and unstructured data, and uses a combination of advanced analytic techniques and disciplines to predict, prescribe, and adapt. While the term prescriptive analytics was first coined by IBM and later trademarked by Ayata, the underlying concepts have been around for hundreds of years.
The technology behind prescriptive analytics synergistically combines hybrid data and business rules with mathematical models and computational models.
The data inputs to prescriptive analytics may come from multiple sources: internal, such as inside a corporation, and external, also known as environmental data. The data may be structured, which includes numbers and categories, as well as unstructured, such as texts, images, sounds, and videos. Unstructured data differs from structured data in that its format varies widely and cannot be stored in traditional relational databases without significant effort at data transformation. More than 80% of the world's data today is unstructured, according to IBM.

Status: 1 December 2020 Page: 24



Use-Case 1: Advanced Analytics & Machine Learning (3/3)


In addition to this variety of data types and growing data volume, incoming data can also evolve with respect to velocity, that is, more data being generated at a faster or a variable pace. Business rules define the business process and include objectives, constraints, preferences, policies, best practices, and boundaries. Mathematical models and computational models are techniques derived from the mathematical sciences, computer science and related disciplines such as applied statistics, machine learning, operations research, natural language processing, computer vision, pattern recognition, image processing, speech recognition, and signal processing.

The correct application of all these methods and the verification of their results implies
the need for resources on a massive scale including human, computational and
temporal for every Prescriptive Analytic project.

In order to spare the expense of dozens of people, high performance machines and
weeks of work one must consider the reduction of resources and therefore a reduction
in the accuracy or reliability of the outcome. The preferable route is a reduction that
produces a probabilistic result within acceptable limits.

Status: 1 December 2020 Page: 25



Use-Case 2: Use AI for creating Art (ex. V. van Gogh) (1/3)

https://youtu.be/ARotWjhRpjE

Cambridge Consultants — A new artificial intelligence system can turn simple sketches
into paintings reminiscent of works by great artists of the 19th and 20th centuries,
researchers say.
The artificial intelligence (AI) system, dubbed Vincent, learned to paint by "studying"
8,000 works of art from the Renaissance up to the 20th century. According to the
system's creators — engineers from the United Kingdom-based research and innovation
company Cambridge Consultants — Vincent is unique not only in its ability to make art
that is actually enjoyable but also in its capability to respond promptly to human input.
"Vincent allows you to draw edges with a pen, edges of a picture you can imagine in
your mind, and from those pictures, it produces a possible painting based on its
training," said Monty Barlow, director of machine learning at Cambridge Consultants,
who led the project. "There is this concern that artificial intelligence will start replacing
people doing things for them, but Vincent allows humans to take part in the decisions of
the creativity of artificial intelligence."

Status: 1 December 2020 Page: 26



Use-Case 2: Use AI for creating Art (ex. V. van Gogh) (2/3)

https://blog.netapp.com/how-vincent-ai-learned-to-paint/

Teaching Vincent
Barlow said that using only 8,000 works of art to train Vincent is by itself a major
achievement. Previously, a similar system would have needed millions, or even
billions, of samples to learn to paint.
"Most machine learning deployed today has been about classifying and feeding lots
and lots of examples into a system," Barlow said. "It's called supervised
learning. You show a million photos of a face, for example, and a million photos of
not a face, and it learns to detect faces."
Vincent uses a more sophisticated technique that allows the machine to teach itself
automatically, without constant human input. The system behind Vincent's abilities
is based on the so-called generative adversarial network, which was first described
in 2014.
The technique uses two neural networks that compete with each other. At the
beginning, both networks are trained, for example, on images of birds.
Subsequently, one network is tasked with producing more images of birds that
would persuade the other network that they are real. Gradually, the first network
gets better at producing realistic images, while the second one gets better at
spotting fakes, according to the researchers.
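The adversarial training described here can be condensed into a short loop. The following is a minimal, illustrative sketch in PyTorch on invented 2-D toy data; it is not the system behind Vincent, and the layer sizes, learning rates and step counts are assumptions chosen only for demonstration.

# Minimal generative adversarial network (GAN) loop on toy 2-D data.
import torch
import torch.nn as nn

real_data = torch.randn(1024, 2) * 0.5 + torch.tensor([2.0, -1.0])   # toy "real" samples

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))     # generator: noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))     # discriminator: sample -> logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Train the discriminator to tell real samples from generated ones.
    fake = G(torch.randn(128, 8)).detach()
    real = real_data[torch.randint(0, 1024, (128,))]
    loss_d = bce(D(real), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the generator to make the discriminator call its output "real".
    loss_g = bce(D(G(torch.randn(128, 8))), torch.ones(128, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(5, 8)))   # generated points should drift towards the real cluster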

Status: 1 December 2020 Page: 27



Use-Case 2: Use AI for creating Art (ex. V. van Gogh) (3/3)

https://deepart.io/

Status: 1 December 2020 Page: 28



Use-Case 3: Use AI/Digitalization in the Medicine of the Future


Of course, when studying the current applications of AI, medicine is a large and broad field. Three important fields of application are:
1. Monitoring: when the smartphone notices the disease before it breaks out.
Apple Watch 4: with the new product generation, the tech company is expanding the device into a health guard. With a software update, Apple released a new ECG function. Two sensors on the underside of the watch record the electrical impulses of the heartbeat. The measurement results are recorded in an app on the iPhone. The application will tell you if it detects any signs of cardiac arrhythmia after about 30 seconds. The app stores the ECG results on the iPhone; the owner can later share them with the doctor as a PDF.

2. Diagnosis: when computer and doctor read x-rays together.
Above the strongly enlarged, purple-colored cells, light green circles float in the picture. They mark where Google’s Artificial Intelligence found cancerous tissue in the sample. As Bob McDonald, Technical Program Manager at the software company, moves the small glass plate under the microscope, he scours through the next segment of the cells…

3. Therapy: when every patient gets his own medication.
Not only for prophylaxis and diagnosis, but also for completely new therapies, the IT revolution in medicine can pave the way. This is shown, for example, by a project that is currently being driven forward with high investments by the biotech company Biontech in Mainz. In cooperation with the US company Genentech, a subsidiary of the Roche Group, Biontech is working on the development of completely individualized vaccines designed to trigger an immune reaction of the body against tumor cells.

The Handelsblatt has written a good report: “Digitization and Health - The Medicine of the Future - How AI protects us against cancer and heart attack”: http://a.msn.com/05/de-de/BBT0lCR?ocid=se

1. Monitoring: Significant deviations in essential body data will be automatically analyzed in the future: is the increased heart rate a normal consequence of the staircase that has just been climbed? Or does it indicate cardiovascular disease when combined with other data and the patient's history? Thus, diseases can be detected in early stages and treated effectively.
2. Diagnostics: Where today it depends almost exclusively on the knowledge and analytical ability of the physician whether, for example, a cancer metastasis on an X-ray image is recognized as such, in the future artificially intelligent systems will support the physician, becoming a little smarter with each analyzed X-ray image. The probability of error in the diagnosis decreases; the accuracy of the subsequent treatment increases.
3. Therapy: Big data & AI have the potential to make the search for new medicines and other treatment methods significantly more efficient. Today, countless combinations of molecules must first be tested for their effectiveness in the Petri dish, then in animal experiments, and finally in clinical trials, until at the end there may be a new drug: a billion-dollar roulette game in which the chances of winning can be significantly increased by computer-aided forecasting, which in turn accesses an unprecedented wealth of research data. Examples:
• „KI entwickelt Medikament“: https://t3n.de/news/vollstaendig-ki-entwickeltes-1248880/
• “Corona Virus”: https://futurezone.at/science/coronavirus-ki-warnte-schon-2019-vor-dem-ausbruch/400738542

Status: 1 December 2020 Page: 29



What does Machine Learning look like?


Status: 1 December 2020 Page: 30



The 3 major Types of Machine Learning algorithms (1/2)


Status: 1 December 2020 Page: 31



The 3 major Types of Machine Learning algorithms (2/2)


Status: 1 December 2020 Page: 32



10 Fundamental Terms of Machine Learning (1/2)


Status: 1 December 2020 Page: 33



10 Fundamental Terms of Machine Learning (2/2)


Status: 1 December 2020 Page: 34



What is the status of ML and what are the future plans?


Today, machine-learning algorithms are mainly used in the field of image analysis
and recognition. In the future, speech recognition and processing will become more
important. The processing and analysis of large amounts of data is a core task of
such a digital infrastructure platform.
Therefore, IT managers must ensure that their IT can handle different artificial
intelligence processes. Server, storage and network infrastructures must be designed
for new ML-based workloads. Data management must also be prepared so that ML-
as-a-Service offerings in the cloud can be used. In the context of ML, alternative
hardware components such as GPU-based clusters from Nvidia, Google's Tensor
Processing Unit (TPU) or IBM's TrueNorth processor have become popular in recent
months.
Companies have to decide whether they want to invest themselves or use the services of corresponding cloud providers. One of the major uses for ML is speech recognition and processing. Amazon Alexa is currently moving into households; Microsoft, Google, Facebook and IBM have invested a large part of their research and development funds here and purchased specialist firms.
It is foreseeable that natural-language communication at the customer interface will become more natural. The operation of digital products and enterprise IT solutions will also be possible via voice command. This affects both the customer frontend and the IT backend.

Status: 1 December 2020 Page: 35



In which ML tasks customers need external skills


With large cloud providers including ML services and products in their service portfolio, it's relatively easy for users to get started. Amazon Machine Learning, Microsoft Azure Machine Learning, IBM Bluemix, and Google Machine Learning provide cost-effective access to related services through the public cloud.

So users do not need their own supercomputers, statistical experts or dedicated infrastructure management. They can start with a few commands via the APIs of the big public cloud providers. There they will find various machine learning techniques, services and tools such as graphical programming models and storage services.

The more they get involved, however, the greater the risk of vendor lock-in. Therefore, users should think about their strategy before starting. IT service providers and managed service providers can also deploy and operate ML systems and infrastructures, making independence from the public cloud providers and their SLAs equally possible.

Status: 1 December 2020 Page: 36



Status of Industries in the Usage of ML Technologies


The usage behavior of ML is very different not only between, but also within the industries. In the
automotive industry, for example, there are big gaps between the pioneers and the latecomers.
Real-time image and video analysis and statistical methods and mathematical models from
machine learning and deep learning are widely used for the development and production of self-
driving cars. Some methods are also used to detect manufacturing defects.
The share of innovators who already use ML to a large extent is largest in the automotive industry, at around 20 percent. In contrast, however, 60 percent deal with ML but are still in the evaluation and planning phase. Thus it turns out that in the automotive industry a few lighthouse projects shape the picture, but nationwide adoption is not yet in sight.
Among mechanical and plant engineering companies, half (53 percent) are likewise in the evaluation and planning phase. Nearly one-third use ML productively in selected applications, and 18 percent are currently building prototypes. Next are the commercial and consumer goods companies, of which 44 percent are testing ML in initial projects and prototypes. This is not surprising given that these
companies usually have well-maintained data sets and a lot of experience in business intelligence
and data warehouses. If they succeed in measurably improving pricing strategies, product
availability or marketing campaigns, ML is seen as a welcome innovation tool for existing big-data
strategies.
The same applies to the IT, telecoms and media industries: there ML processes have long been
used, for example, for playing online advertising, calculating purchase probabilities (conversion
rates) or personalizing web content and shopping recommendations. For professional service
providers, measuring and improving customer loyalty, quality of service and on-time delivery play
an important role, as these are the competitive differentiating factors.

Status: 1 December 2020 Page: 37



ML platforms & ML products


When it comes to selecting platforms and products, solutions from the public cloud play an increasingly important role (ML as a Service). In order to avoid complexity, and because the major cloud providers are also the leading innovators in this field, many users are choosing these cloud solutions. While 38.1 percent of the respondents prefer solutions from the public cloud, 19.1 percent choose proprietary solutions from selected providers and 18.5 percent open-source alternatives. The rest either follow a hybrid strategy (15.5 percent) or have not yet formed an opinion (8.8 percent).
Among the cloud-based solutions, AWS (https://aws.amazon.com/de/machine-learning/) has the highest level of awareness: 71 percent of the decision makers indicate that they know Amazon in this context. Microsoft, Google and IBM are also known to more than two-thirds of the survey respondents in the ML environment. Interestingly enough, only 17 percent of respondents use AWS cloud services in the context of evaluation, design and production operations for ML.
About one third of the respondents in each case deal with IBM Watson ML (https://www.ibm.com/cloud/machine-learning), Microsoft Azure ML Studio (https://azure.microsoft.com/en-us/services/machine-learning-studio) or the Google Cloud ML Platform (https://cloud.google.com/ml-engine/docs/tensorflow/technical-overview). The analysts believe that this has a lot to do with the manufacturers' marketing efforts. Accordingly, IBM and Microsoft are investing heavily in their cognitive and AI strategies. Both have strong SME and wholesale distribution and a large network of partners. Google, however, owes its position to its image as a huge data and analytics machine that drives the market through many innovations - such as TensorFlow, many ML APIs and its own hardware. After all, HP Enterprise with "Haven on Demand" is also one of the relevant ML players and is used by 14%.

Status: 1 December 2020 Page: 38



Ethics in Artificial Intelligence (AI)


Lecture on GI Regional Group Stg./BB from 14.1.19: "How 'ethics' can be integrated into AI
programs" by Michael Mörike, (board member Integrata-Stiftung Tübingen):

The development of artificial intelligence is rapid and explosive at the same time: Algorithms that decide
for us, and the capabilities of machines that surpass us humans already raise many ethical questions:
How should an autonomous car behave in everyday life? Which rules will apply to robots in the future? Is it
possible to include ethics in AI?
The latter question is at the center of the lecture. The author assumes this could work and tries to explain
how to do that. He differentiates between human ethics in dealing with AI, which he calls external ethics,
and ethics, which he calls machinery morality. In the lecture the necessity and the benefit as well as the
feasibility are discussed.
In the Special Interest Group AI at bwcon, we address ethical issues and challenges in the rapid
development of artificial intelligence with business representatives from southern Germany.
More about Special Interest Group

See also Stanley Kubrick’s famous movie from 1968: “2001: A Space Odyssey“
https://www.youtube.com/watch?v=XHjIqQBsPjk
HAL 9000: "I'm sorry Dave, I'm afraid I can't do that“ & “Deactivation of HAL 9000”
https://www.youtube.com/watch?v=ARJ8cAGm6JE&t=42s
https://www.youtube.com/watch?v=c8N72t7aScY&list=PLawr1rgf_CvSiNsWPbLOOrMKbcZRHJud7&index=25


The ethics of AI and ML are very important and should also be addressed in a lecture about ML. A better understanding of ML methods also gives the students the ability to better understand and evaluate the ethical aspects of the technology.
Examples such as these cause skepticism as to whether algorithms judge more objectively than humans. They are only as good as the developers who program them - and the data they are fed with. Therefore, they must be understandable, because AI is not always intelligent, let alone meaningful. On the other hand, people err too: especially in personnel matters, gut decisions are made, or applicants who resemble the hiring person are preferred. Intelligent programs can counteract this if they have been cleverly designed and scrutinized. They can give a chance to applicants who would otherwise never have been invited. So there is no simple solution. Which program is clever, and which tasks it should and may take over - that is still for people to decide.
Consider an interview with Carsten Kraus: „Deep Neural Networks könnten eigene Moralvorstellungen entwickeln“:
https://ecommerce-news-magazin.de/e-commerce-news/e-commerce-interviews/interview-mit-carsten-kraus-deep-neural-networks-koennten-eigene-moralvorstellungen-entwickeln/

Status: 1 December 2020 Page: 39



Exercises to Lesson ML1


Homework H1.1 – “Most Popular ML Technologies + Products”

Groupwork (3 persons). Look at the three most used ML technologies/products (see information on the internet):

1. IBM Watson Machine Learning - https://www.ibm.com/cloud/machine-learning

2. Microsoft Azure ML Studio - https://azure.microsoft.com/en-us/services/machine-learning-studio/

3. Google Cloud Machine Learning Platform - https://cloud.google.com/ml-engine/docs/tensorflow/technical-overview

Give a short overview of the products and their features (~10 minutes each), and give a comparison matrix of the 3 products with an evaluation. What is your favorite product? (~5 minutes)


Complete solutions are found in “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 40



Exercises to Lesson ML1


Homework H1.2 – “Ethics in Artificial Intelligence”
Groupwork (2 persons) - evaluate the interview with Carsten Kraus (founder of Omikron, Pforzheim, Germany): „Deep Neural Networks könnten eigene Moralvorstellungen entwickeln“.
https://ecommerce-news-magazin.de/e-commerce-news/e-commerce-interviews/interview-mit-carsten-kraus-deep-neural-networks-koennten-eigene-moralvorstellungen-entwickeln/
The victory of the Google-developed DeepMind software AlphaGo against South Korean Go world champion Lee Sedol does not simply ring in the next round of the industrial revolution. According to IT expert Carsten Kraus, the era of superiority of Deep Neural Networks (DNN) over human intelligence has now begun.

Homework H1.3 (optional) – “Create Painting with DeepArt”

1 person – create your own painting by using the service of the DeepArt company in Tübingen (https://deepart.io/). What ML method did you use to create the “paintings”?


Not only in Silicon Valley, but also in the northern Black Forest, artificial intelligence is being driven forward. The medium-sized company Omikron wrestles with Google for specialists and sells its search technology to the big players: Renault, Fresenius and Siemens rely on the machine learning from Pforzheim.

Omikron founder Carsten Kraus understood early on how to make calculating machines more efficient and intelligent. A gifted prodigy, he founded the company while still at school. He has preserved the startup mentality until today.
Complete solutions are found in “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 41



Exercises to Lesson ML1

Homework H1.4 (optional) – Summary of video “What is ML?”

1 person - summarize the results of the first YouTube video “What is Machine Learning” by Andrew Ng in a report of 10 minutes. Create a small PowerPoint presentation. See:
https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN

Homework H1.5 (optional) – Summary of videos “Supervised - & Unsupervised Learning”

Groupwork (2 persons) - summarize the results of the second and third YouTube videos “Supervised Learning” and “Unsupervised Learning” by Andrew Ng in a report of 15 minutes. Create a small PowerPoint presentation. See:
https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN


Complete solutions are found in “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 42


1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML2 – Concept Learning: Version Spaces & Candidate Elimination Algorithm


References:
• T. Mitchell, 1997, Chapter 2.
• P. Winston, "Learning by Managing Multiple Models", in P.
Winston, Artificial Intelligence, Addison-Wesley Publishing Company,
1992, pp. 411-422.
See also:
http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/vspace/3_vspace.html

Status: 1 December 2020 Page: 43



Motivation

***************** placeholder ***********


Status: 1 December 2020 Page: 44



Introduction

***************** placeholder ***********


Status: 1 December 2020 Page: 45



Definition of AQ Learning

Definition
AQ learning is a form of supervised machine learning of rules from examples and background knowledge, performed by the well-known AQ family of programs and other machine learning methods. AQ learning pioneered the separate-and-conquer approach to rule learning, in which examples are sequentially covered until a complete class description is formed. Derived knowledge is represented in a highly expressive form of attributional rules.
Theoretical Background
The core of AQ learning is a simple version of the Aq (algorithm quasi-optimal) covering algorithm, developed by Ryszard S. Michalski in the late 1960s (Michalski 1969). The algorithm was initially developed for the purpose of minimization of logic functions, and later adapted for rule learning and other machine learning applications.
Simple Aq Algorithm
The Aq algorithm realizes a form of supervised learning. Given a set of positive events (examples) P, a set of negative events N, and a...


Status: 1 December 2020 Page: 46



Example of Candidate Elimination Algorithm (1/3)


https://onlinelibrary.wiley.com/doi/abs/10.1002/wics.78

Status: 1 December 2020 Page: 47



Example of Candidate Elimination Algorithm (2/3)


Status: 1 December 2020 Page: 48



Example of Candidate Elimination Algorithm (3/3)


Status: 1 December 2020 Page: 49



Definition of AQ Learning

***************** placeholder ***********


Status: 1 December 2020 Page: 50



Exercises to Lesson ML2


Homework H2.1 – “Version Space for EnjoySport”:
Create the version space for the EnjoySport concept learning problem with the training examples from the table in [TMitch], Ch. 2, or see
https://www.youtube.com/watch?v=cW03t3aZkmE

Homework H2.2 – “Version Space – second example”

***************** placeholder ***********


Complete solutions are found in “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 51



Exercises to Lesson ML2

Homework 2.3 – “Exercise of an Example with Python”

***************** placeholder ***********

Homework 2.4 – “Exercise of an Example with Python”

***************** placeholder ***********


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 52


1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML3 – Supervised- and Unsupervised Learning

https://images.app.goo.gl/ZiyTSGRYcnSzBsB59


In the ML3 chapter we list the most common concepts and algorithms of Supervised (SVL) and Unsupervised Learning (USVL):
In particular, under SVL we see classification methods such as Lazy Learning (Rote Learning, the kNN algorithm, etc.) and Bayes Learning for text classification, and also regression methods (e.g. simple linear regression).
In USVL we discuss clustering algorithms (e.g. K-Means clustering) and association algorithms (e.g. predictive market basket analysis). A minimal K-Means preview follows below.
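As a small preview of the clustering discussion, here is a hedged K-Means sketch on the Iris dataset using scikit-learn (the parameter choices are illustrative):

# K-Means clustering of the Iris data into 3 clusters (one per species).
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                         # 150 flowers, 4 features each
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)               # one 4-dimensional centroid per cluster
print(kmeans.labels_[:10])                   # cluster index assigned to the first 10 flowers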

Status: 1 December 2020 Page: 53



Different Learning Scenarios – Supervised v. Unsupervised

https://en.wikipedia.org/wiki/Machine_learning#Supervised_and_semi-supervised_learning


Supervised: All data is labeled and the algorithms learn to predict the output from the input data.
Unsupervised: All data is unlabeled and the algorithms learn the inherent structure from the input data.
Semi-supervised: Some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.

Status: 1 December 2020 Page: 54



SVL Classification (1/8): Lazy Learning / k-Nearest Neighbor


Rote Learning Example


In machine learning, lazy learning is a learning method in which generalization of the training data is, in theory, delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries.
Rote learning is a memorization technique based on repetition. The idea is that one will be able to quickly recall the meaning of the material the more one repeats it. Some of the alternatives to rote learning include meaningful learning, associative learning, and active learning.
The primary motivation for employing lazy learning, as in the k-nearest neighbors algorithm used by online recommendation systems ("people who viewed/purchased/listened to this movie/item/tune also ..."), is that the data set is continuously updated with new entries (e.g., new items for sale at Amazon, new movies to view at Netflix, new clips at YouTube, new music at Spotify or Pandora). Because of the continuous updates, the "training data" would be rendered obsolete in a relatively short time, especially in areas like books and movies, where new bestsellers or hit movies/music are published/released continuously. Therefore, one cannot really talk of a "training phase". A minimal kNN sketch follows below.
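A minimal k-nearest-neighbors sketch with scikit-learn; the tiny data set is invented for illustration. Note that fit() here essentially just stores the examples, which is exactly the lazy-learning property described above:

# k-NN as a lazy learner: [height_cm, weight_kg] -> shirt size (toy data).
from sklearn.neighbors import KNeighborsClassifier

X = [[158, 58], [160, 59], [163, 61], [180, 80], [183, 84], [178, 77]]
y = ["S", "S", "S", "L", "L", "L"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                     # "training" only memorizes the examples

print(knn.predict([[170, 68]]))   # majority vote among the 3 nearest neighbors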

Status: 1 December 2020 Page: 55



SVL Classification (2/8): Value Difference Metric (VDM)


Value Difference Metric (VDM): The VDM metric described here addresses the problem of classification. A system is presented with a training set T, which consists of n instances. Each instance has an input vector x and an output class c. The problem of classification is to decide what the output class of a new input vector y should be, based on what was learned from the training set, even if y did not appear in the training set.

Business meaning: the VDM metric describes the distance in the behavior of different values of the attribute a; see, for example, the outcome of the concrete calculations in homework ML3.1. A small code sketch follows below.
https://pdfs.semanticscholar.org/f72c/bf9f16f244f5643273fa04c25e2697fe66b9.pdf
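A hedged sketch of the VDM computation for a single attribute a, following the standard formula vdm_a(x, y) = Σ_c |N(a,x,c)/N(a,x) − N(a,y,c)/N(a,y)|^q; the toy observations below are invented for illustration.

# Value Difference Metric for one attribute over a toy training set.
observations = [("red", "yes"), ("red", "yes"), ("red", "no"),
                ("blue", "yes"), ("blue", "no"), ("blue", "no")]

def vdm(x, y, q=1):
    classes = {c for _, c in observations}
    n_x = sum(1 for v, _ in observations if v == x)               # N(a,x)
    n_y = sum(1 for v, _ in observations if v == y)               # N(a,y)
    total = 0.0
    for c in classes:
        n_xc = sum(1 for v, cc in observations if v == x and cc == c)
        n_yc = sum(1 for v, cc in observations if v == y and cc == c)
        total += abs(n_xc / n_x - n_yc / n_y) ** q
    return total

print(vdm("red", "blue"))   # |2/3 - 1/3| + |1/3 - 2/3| = 2/3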

Status: 1 December 2020 Page: 56



SVL Classification (3/8): Bayes Learning Concept


Bayes Learning (conditional probability):
In probability theory and statistics, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event:
P(A|B) = P(B|A) · P(A) / P(B)
The theorem is named after Reverend Thomas Bayes (/beɪz/; 1701?–1761).
More details later in this chapter.


Example: see https://en.wikipedia.org/wiki/Bayes%27_theorem


A particular test for whether someone has been using cannabis is 90% sensitive and 80% specific, meaning it leads to 90% true positive results ("yes, he used cannabis") for cannabis users and 80% true negative results for non-users, but also generates 20% false positives for non-users. Q: Assuming 5% of people actually do use cannabis, what is the probability that a random person who tests positive is really a cannabis user?
Solution: Let P(User|Positive) mean "the probability that someone is a cannabis user given that he tests positive". Then, with the formula above:
P(User|Positive) = P(Positive|User) · P(User) / P(Positive) = (0.90 × 0.05) / (0.90 × 0.05 + 0.20 × 0.95) = 0.045 / 0.235 ≈ 19%.
(Figure: the test outcomes shown as a confusion matrix.)
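A quick numeric check of this result in Python:

# Bayes' theorem for the cannabis-test example (all numbers from the text).
sensitivity, false_positive_rate, base_rate = 0.90, 0.20, 0.05

p_positive = sensitivity * base_rate + false_positive_rate * (1 - base_rate)
print(sensitivity * base_rate / p_positive)   # 0.1914..., i.e. about 19%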

Remark: Bayes Learning is called Naive Bayes when the value of a particular feature
is independent of the value of any other feature, given the class variable. For example, a fruit may be
considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier
considers each of these features to contribute independently to the probability that this fruit is an
apple, regardless of any possible correlations between the color, roundness, and diameter features.

Status: 1 December 2020 Page: 57



SVL Classification (4/8): Bayes L. for Text Classification


Task: Let’s see how this works in practice with a simple example. Suppose we are building a classifier that says
whether a text is about sports or not. Our training data has 5 sentences:
Training-Text and Target-Text: No. Training-Text Label
1 "A great game” Sports
2 “The election was over” Not Sports
3 “Very clean match” Sports
4 “A clean but forgettable game” Sports
5 “It was a close election” Not Sports
Target-Text
new "A very close game" ??????????

Bayes’ Theorem is useful when working with conditional probabilities (like we are doing here), because it provides us with a
way to reverse them. In our case we want P(Sports | text), so using this theorem we can reverse the conditional probability:
P(Sports | text) = P(text | Sports)·P(Sports) / P(text).


More details:
https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data.
The Naive Bayes classifier is a simple classifier that classifies based on probabilities
of events. It is commonly applied to text classification. Though it is a simple
algorithm, it performs well in many text classification problems.
Other pros include less training time and less training data, that is, less CPU and
memory consumption.
As with any machine learning model, we need to have an existing set of examples
(training set) for each category (class).
Let us consider sentence classification, to classify a sentence as either ‘Sports’ or
‘Not Sports’. In this case, there are two classes (“Sports” and “Not Sports”). With
the training set, we can train a Naive Bayes classifier which we can use to
automatically categorize a new sentence.

Important Use-Case: Recognizing Spam-Mail


Consider the problem of classifying documents by their content, for example
into spam and non-spam e-mails. See also Example 2 of the ML definition by Tom
Mitchell (chapter ML1).


SVL Classification (5/8): Bayes L. for Text Classification


So here comes the Naive part: we assume that every word in a sentence is independent of the other ones. This
means that we’re no longer looking at entire sentences, but rather at individual words. So for our purposes, “this
was a fun party” is the same as “this party was fun” and “party fun was this”.

Frequencies of the Training Data:

Calculating Probabilities:
The final step is just to calculate every probability and see which one turns out to be larger. Calculating a probability is just
counting in our training data. First, we calculate the a priori probability of each tag: for a given sentence in our training data, the
probability that it is Sports = P(Sports)=3/5. Then, P(Not Sports)= 2/5. That’s easy enough.
Then, calculating P(game|Sports) means counting how many times the word “game” appears in Sports texts (2) divided by the
total number of words in sports (11). Therefore, P(game|Sports)=2/11.
However, we run into a problem here: “close” doesn’t appear in any Sports text! That means that P(close|Sports)=0. This is
rather inconvenient since we are going to be multiplying it with the other probabilities, so we’ll end up with zero.


One can also write a Jupyter Notebook to automate the calculations:

To find the total number of times a word appears in a class, we can use
CountVectorizer from sklearn. CountVectorizer gives a Term-Document Matrix
(TDM) for each class. A term-document matrix (TDM) consists of a list of word
frequencies appearing in a set of documents. Next, let’s compute the Term-
Document Matrix (TDM) for the ‘Sports’ class.
We follow the description of the above link (page before):

See also: https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf
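A minimal sketch of this TDM step with scikit-learn (the three 'Sports' sentences are the ones from the lesson; everything else is our assumption about how the linked notebook might look):

from sklearn.feature_extraction.text import CountVectorizer

sports = ["A great game", "Very clean match", "A clean but forgettable game"]

# Keep one-letter words such as 'a' (the default token pattern would drop them):
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
tdm = vectorizer.fit_transform(sports)        # term-document matrix of the 'Sports' class

counts = tdm.sum(axis=0).tolist()[0]          # total occurrences of each word in the class
print(dict(zip(vectorizer.get_feature_names_out(), counts)))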


SVL Classification (6/8): Bayes L. for Text Classification

How do we do it? By using something called Laplace smoothing: we add 1 to every count so it’s never zero. To
balance this, we add the number of possible words to the divisor, so the division will never be greater than 1. In
our case, the possible words are (see notes page): [
'a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].
Since the number of possible words is 14 (I counted them!), applying smoothing we get
that P(game|Sports)=(2+1)/(11+14)=3/25. The full results are:

Given: P(Sports) = 3/5, P(Not Sports) = 2/5; #(Words|Sports) = 11, #(Words|Not Sports) = 9; #(possible words) = 14 (see notes).

Word    P(word|Sports)   P(word|Not Sports)
a       3/25             2/23
very    2/25             1/23
close   1/25             2/23
game    3/25             1/23

⇒ P(a|Sports)·P(very|Sports)·P(close|Sports)·P(game|Sports)·P(Sports)
= (3/25)·(2/25)·(1/25)·(3/25)·(3/5) = (54/125)·(1/25)³ ≈ 2.7648e-5

Analogously: P(a|Not Sports)·P(very|Not Sports)·P(close|Not Sports)·P(game|Not Sports)·P(Not Sports)
= (2/23)·(1/23)·(2/23)·(1/23)·(2/5) = (8/115)·(1/23)³ ≈ 0.5718e-5

⇒ the sentence "a very close game" is tagged as Sports.
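The whole smoothed calculation also fits into a few lines of plain Python; a small sketch that reproduces the two products above (the variable names are ours):

counts_sports = {"a": 3, "very": 2, "close": 1, "game": 3}   # counts already including +1 smoothing
counts_not    = {"a": 2, "very": 1, "close": 2, "game": 1}
n_sports, n_not, vocab = 11, 9, 14

p_sports, p_not = 3/5, 2/5
for w in ["a", "very", "close", "game"]:
    p_sports *= counts_sports[w] / (n_sports + vocab)
    p_not    *= counts_not[w] / (n_not + vocab)
print(p_sports, p_not)   # ≈ 2.7648e-05 vs ≈ 0.5718e-05  ⇒ tag 'Sports'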

https://www.youtube.com/watch?v=exHwwy9kVcg


To calculate the number of possible words, see the following table (14 distinct words; the running sum counts all 20 word occurrences in the training set):

No.  word         #word (if > 1)  running sum
1    a            3               3
2    great                        4
3    very                         5
4    over                         6
5    it                           7
6    but                          8
7    game         2               10
8    election     2               12
9    clean        2               14
10   close                        15
11   the                          16
12   was          2               18
13   forgettable                  19
14   match                        20


SVL Classification (7/8): Support Vector Machines (SVM)


In machine learning, support-vector machines (SVMs, also support-vector networks [1]) are supervised
learning models with associated learning algorithms that analyze data used for classification and regression
analysis. The Support Vector Machine (SVM) algorithm is a popular machine learning tool that offers
solutions for both classification and regression problems.

See: Chapter ML8


In machine learning, support-vector machines (SVMs, also support-vector networks[1])


are supervised learning models with associated learning algorithms that analyze data used
for classification and regression analysis. The Support Vector Machine (SVM) algorithm is a
popular machine learning tool that offers solutions for both classification and regression
problems. Developed at AT&T Bell Laboratories by Vapnik and colleagues (Boser et al.,
1992, Guyon et al., 1993, Vapnik et al., 1997), it presents one of the most robust
prediction methods, based on the statistical learning framework or VC theory proposed by
Vapnik and Chervonenkis (1974) and Vapnik (1982, 1995). Given a set of training examples,
each marked as belonging to one or the other of two categories, an SVM training
algorithm builds a model that assigns new examples to one category or the other, making
it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist
to use SVM in a probabilistic classification setting). An SVM model is a representation of
the examples as points in space, mapped so that the examples of the separate categories
are divided by a clear gap that is as wide as possible. New examples are then mapped into
that same space and predicted to belong to a category based on the side of the gap on
which they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification using what is called the kernel trick, implicitly mapping their inputs into high-
dimensional feature spaces….
More details: https://en.wikipedia.org/wiki/Support_vector_machine
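A hedged sketch of an SVM classifier in scikit-learn (the toy data and the linear kernel are our choices, not the lecture's):

from sklearn import svm

X = [[0, 0], [0, 1], [2, 2], [2, 3]]        # two linearly separable toy classes
y = [0, 0, 1, 1]
clf = svm.SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)                  # the examples that define the maximum-margin gap
print(clf.predict([[1.8, 2.1]]))             # new points are classified by the side of the gap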


SVL Classification (8/8): Natural Language Processing (NLP)

Computational Linguistics (also called Natural
Language Processing, NLP): from many training
examples a system learns to understand human
speech and transform it to text, or vice versa.
Today speech assistants (e.g. Alexa) are very common.
For example:
▪ Build a “Voice Assistant” by using the Text /
Speech Recognition services of the IBM Cloud
platform:
1. Watson Voice Agent
2. Watson Assistant
3. Speech to Text
4. Text to Speech

See also Homework H3.3 for more details about building the services.


Log in to IBM Cloud and follow the tutorial descriptions (see links):
Create a “Voice Agent” by running the following steps:
• Set up the required IBM Cloud services
• Configure the TWILIO account
• Configure the Voice Agent on the IBM Cloud and import a skill by uploading
either skill-banking-balance-enquiry.json or skill-pizza-order-book-
table.json


Supervised Learning (SVL) – List of Regression Concepts

Link: https://en.wikipedia.org/wiki/Regression_analysis


In statistical modeling, regression analysis is a set of statistical processes


for estimating the relationships among variables. It includes many techniques
for modeling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or more independent
variables (or 'predictors'). More specifically, regression analysis helps one
understand how the typical value of the dependent variable (or 'criterion
variable') changes when any one of the independent variables is varied, while
the other independent variables are held fixed.


Supervised Learning (SVL) – Simple Linear Regression Example


Given a data set {yᵢ, xᵢ₁, …, xᵢₚ} for i = 1, …, n of n statistical units, a linear regression model assumes that the
relationship between the dependent variable y and the p-vector of regressors x is linear. This
relationship is modeled through a disturbance term or error variable ε, an unobserved random variable
that adds "noise" to the linear relationship between the dependent variable and regressors.

See more details to this in chapter ML5.


Definition of Unsupervised Learning (USVL)


First and most common USVL Example - Clustering (1/4)


First USVL Example – Clustering Concept (2/4)


First USVL Example: K-Means-Clustering Algorithm (3/4)


K-means Clustering is one of the simplest and popular unsupervised machine learning
algorithms. Typically, unsupervised algorithms make inferences from datasets using only
input vectors without referring to known, or labelled, outcomes. The objective of K-
means is simple: group similar data points together and discover underlying patterns. To
achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points aggregated together because of certain
similarities. You’ll define a target number k, which refers to the number of centroids you
need in the dataset. A centroid is the imaginary or real location representing the center
of the cluster. Every data point is allocated to each of the clusters through reducing the
in-cluster sum of squares.
In other words, the K-means algorithm identifies k number of centroids, and then
allocates every data point to the nearest cluster, while keeping the centroids as small as
possible. The ‘means’ in the K-means refers to averaging of the data; that is, finding the
centroid.
K-means algorithm: let’s see how the K-means machine learning algorithm
works using Python. We’ll use the Scikit-learn library and some random data to
illustrate K-means clustering. See more details under:
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
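A minimal sketch along the lines of the linked article (the random blobs and parameters are our assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per discovered group
print(km.inertia_)           # the in-cluster sum of squares that K-means minimizes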


Cluster Ex. & K-Means Clusters of IRIS Dataset (4/4)

https://github.com/bhattbhavesh91/k_means_iris_dataset/blob/master/K_in_K_means_Clustering.ipynb


How the K-means algorithm works: to process the learning data, the K-means algorithm
in data mining starts with a first group of randomly selected centroids, which are used as
the beginning points for every cluster, and then performs iterative (repetitive)
calculations to optimize the positions of the centroids. It halts creating and optimizing
clusters when either the centroids have stabilized (there is no change in their values
because the clustering has been successful) or the defined number of iterations has been
achieved.
K-means clusters of the IRIS dataset: the Iris dataset contains the data for 50 flowers from
each of the 3 species: Setosa, Versicolor and Virginica.
http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html
The data gives the measurements in centimeters of the variables sepal length and width
and petal length and width for each of the flowers. The goal of the study is to perform
exploratory analysis on the data and build a K-means clustering model to cluster them
into groups. Here we have assumed we do not have the species column to form clusters,
and then used it to check our model performance. Since we are not using the species
column, we have an unsupervised learning method. A Python program using the
Scikit-learn library can be seen under:
https://github.com/bhattbhavesh91/k_means_iris_dataset/blob/master/K_in_K_means_Clustering.ipynb


UseCase MS Azure ML: k-Means Algorithm on UCI Iris data


YouTube videos:
https://www.youtube.com/watch?v=Cifl6cuEwMw
https://www.youtube.com/watch?v=nnp77iFxjrE
https://www.youtube.com/watch?v=pH3hQc585WQ
Use Case Description:
https://gallery.azure.ai/Experiment/a7299de725a141388f373e9d74ef2f86
This sample demonstrates how to perform clustering using the k-means algorithm on the UCI Iris
data set. We also apply multi-class logistic regression to perform multi-class classification
and compare its performance with k-means clustering.
Clustering: Group Iris Data
This sample demonstrates how to perform clustering using the k-means algorithm on the
UCI Iris data set. In this experiment, we perform k-means clustering using all the features in
the dataset, and then compare the clustering results with the true class label for all samples.
We also use the Multiclass Logistic Regression module to perform multiclass classification
and compare its performance with that of k-means clustering.
Data
We used the Iris data set, a well-known benchmark dataset for multiclass classification from
the UCI repository. This dataset has 150 samples with 4 features and 1 label (the last
column). All features are numeric except the label, which is a string.


2. Ex. of USVL: Association Learning – General Form (1/4)


What is Association Learning?


Association learning is a rule-based machine learning and data mining technique that
finds important relations between variables or features in a data set. Unlike
conventional association algorithms measuring degrees of similarity, association rule
learning identifies hidden correlations in databases by applying some measure of
interestingness to generate an association rule for new searches.

Practical Uses of Association Learning


• Basket data analysis – Whether planning product placement in a storefront,
running a marketing campaign or designing a business catalog, association mining is a
useful tool to take the guesswork out of what your customers are looking for.
• Web usage mining and intrusion detection – Finding these hidden correlations is a
powerful predictive tool to discover brand new security threats and network
performance issues that haven’t been analyzed first by a human.
• Bioinformatics – From biology to engineering and everything in between,
association mining is one of the go-to foundational tools for spotting overlooked and
potentially useful techniques.


2. Ex. USVL: Association Rules – Repeat Measure Def’s (2/4)

Properties:
• Sup(X=>Y) = Sup(Y=>X)
• Lift(X=>Y) = Lift(Y=>X)

Question:
• How many rules do you have to consider in this example?
• Prove the answer: you have to consider 80 rules (40 for Support and Lift).


N = 5
Support(A⇒D) := frq(A,D)/5 = 2/5
Support(C⇒A) := frq(C,A)/5 = 2/5
Support(A⇒C) := frq(A,C)/5 = 2/5
Support(B&C⇒D) := frq(B&C,D)/5 = 1/5

Confidence(A⇒D) := frq(A,D)/frq(A) = (2/5)/(3/5) = 2/3
Confidence(C⇒A) := frq(C,A)/frq(C) = (2/5)/(4/5) = 2/4 = 1/2
Confidence(A⇒C) := frq(A,C)/frq(A) = (2/5)/(3/5) = 2/3
Confidence(B&C⇒D) := frq(B&C,D)/frq(B&C) = (1/5)/(3/5) = 1/3

Lift(A⇒D) := Sup(A⇒D)/(Sup(A)·Sup(D)) = (2/5)/(3/5·3/5) = (2/5)/(9/25) = 10/9
Lift(C⇒A) := Sup(C⇒A)/(Sup(C)·Sup(A)) = (2/5)/(4/5·3/5) = (2/5)/(12/25) = 10/12 = 5/6
Lift(A⇒C) := Sup(A⇒C)/(Sup(A)·Sup(C)) = (2/5)/(3/5·4/5) = (2/5)/(12/25) = 10/12 = 5/6
Lift(B&C⇒D) := Sup(B&C⇒D)/(Sup(B&C)·Sup(D)) = (1/5)/(3/5·3/5) = (1/5)/(9/25) = 5/9


2. Ex. USVL: Use-Case: Predictive Market Basket Analysis (3/4)


2. Ex. USVL: Check Associations for Business Context (4/4)


The diagram (example) suggests that the temperature of the ocean
depends on the number of pirates on the oceans. One might come to the
conclusion: "Let's increase the number of pirates and the temperature of
the oceans will drop again."
Business experts must therefore always check which metrics are merely
correlated and which have a causal business context.


Exercises to Lesson ML3


Homework H3.1 – “Calculate Value Difference Metric”
Calculate d := Value Difference Metric (VDM) for the fields “Refund” and “Marital
Status”. Remember the following formula and see also the details of VDM on the internet (1
person, 10 minutes):
n₁,c = the frequency of attribute value 1 in class c
n₁ = the frequency of attribute value 1 over all classes
Since no numeric values are present, set
k = 1

With data table:

Hint: d(single, married), d(single, divorced), d(married, divorced); d(refund=yes, refund=no)


Complete solutions are found in “Exercises2Lecture.pdf”


Exercises to Lesson ML3


Homework H3.2 – “Bayes Learning for Text Classification”

1 Person: Review the example about Bayes learning in this lesson. Use the same training data as
in the lesson, together with the newly tagged text. Run the Bayes text-classification calculation for
the sentence “Hermann plays a TT match” and tag this sentence.
No. Training-Text Label

1 "A great game” Sports

2 “The election was over” Not Sports


3 “Very clean match” Sports

4 “A clean but forgettable game” Sports


5 “It was a close election” Not Sports
6 "A very close game" Sports
Target-Text
new "Hermann plays a TT match" ??????????

Additional Question: What will happen if we change the target to “Hermann plays a very clean
game”

Optional* (1 P.): Define an algorithm in Python (use a Jupyter Notebook) to automate the calculations.
Use the description under: https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data.


Exercises to Lesson ML3


Homework H3.3 – Create in IBM Cloud two services “Voice Agent” and
”Watson Assistant Search Skill” with IBM Watson Services
Homework for 2 persons: Log in to IBM Cloud and follow the tutorial descriptions (see links):
1. “Voice Agent” (1 person):
a. Set up the required IBM Cloud services
b. Configure the TWILIO Account
c. Configure the Voice Agent on the IBM Cloud and import Skill by uploading either
 skill-banking-balance-enquiry.json or
 skill-pizza-order-book-table.json
See tutorial: https://github.com/FelixAugenstein/digital-tech-tutorial-voice-agent
2. “Assistant Search Skill” (1 person):
a. Configuring Watson Assistant & Discovery Service on the IBM Cloud
b. Configuring Watson Assistant & Search Skill on the IBM Cloud
c. Deploy the Assistant with Search Skill
See tutorial: https://github.com/FelixAugenstein/digital-tech-tutorial-watson-assistant-search-skill
Remark: You can integrate the two skills, such that when the dialog skill has no answer you show the search
results. The reading of texts from the search results of the search skill is unfortunately not (yet) possible.
Watson can only display the search result with title/description etc. as on Google. The tutorial in the cloud
docs on the same topic is also helpful: https://cloud.ibm.com/docs/assistant?topic=assistant-skill-search-add


A further example of a finance chatbot is BOTTO, which runs on MS Azure at Fiducia
AG (Karlsruhe). See a presentation from Fiducia and Adesso (Dortmund):
https://www.adesso.de/adesso/adesso-de/branchen/banken-finanzdienstleister/sonderthemen/forum-banken/praesentation-chatbot-botto-g-weber-fiducia-gad-it-ag.pdf


Exercises to Lesson ML3


Homework H3.4 – “Create a K-Means Clustering in Python”
Homework for 2 persons: Create a Python algorithm (in a Jupyter Notebook) which clusters the
following points:

Follow the description of https://benalexkeen.com/k-means-clustering-in-python/ to come to 3
clear clusters with 3 means at the centers of these clusters.
We’ll do this manually first (1 person), then show how it’s done using scikit-learn (1 person).


Repeat K-means clustering (DM lesson or internet). Describe and explain the 4
necessary steps to reach the final clusters:
1. The centroids.
2. Assigning the first clusters.
3. Calculating the centers of gravity and iterating.
4. The final clusters.


Exercises to Lesson ML3


Homework H3.5 – “Repeat + Calculate Measures for Association”

Remember and give explanations of the measures for association: support,
confidence and lift (1 person, 10 min):

Calculate these measures for following 8 item sets of a shopping basket (1 person, 10
minutes):
{ Milch, Limonade, Bier }; { Milch, Apfelsaft, Bier }; { Milch, Apfelsaft, Orangensaft };{
Milch, Bier, Orangensaft, Apfelsaft };{ Milch, Bier };{ Limonade, Bier, Orangensaft }; {
Orangensaft };{ Bier, Apfelsaft }
1. What is the support of the item set { Bier, Orangensaft }?
2. What is the confidence of { Bier } ➔ { Milch } ?
3. Which association rules have support and confidence of at least 50%?


Complete solutions are found in “Exercises2Lecture.pdf”


1. ML1: Introduction to Machine Learning (ML)


2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML4 – Decision Tree Learning


Decision tree learning uses a decision tree as a predictive model which maps
observations about an item to conclusions about the item's target value.
It is one of the predictive modelling approaches used in statistics, data
mining and machine learning. Tree models where the target variable can take a
finite set of values are called classification trees. In these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead
to those class labels. Decision trees where the target variable can take continuous
values (typically real numbers) are called regression trees.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree describes
data but not decisions; rather the resulting classification tree can be an input for
decision making.


Definition of Decision Trees


A decision tree is a tree where each node represents a feature (attribute), each link (branch)
represents a decision (rule) and each leaf represents an outcome (a categorical or continuous
value).
The whole idea is to create such a tree (like the Play Tennis example in this lesson) for the entire
data and process a single outcome at every leaf.


Usage of Decision Trees

• The decision tree is one of the most popular machine learning
algorithms used all along.
• Decision trees are used for both classification and regression
problems; in this lesson we will talk about classification.
• We have a couple of other algorithms in ML, so why do we have to
choose decision trees? Here are a few arguments why we use
decision trees:
✓ Decision trees often mimic human-level thinking, so it is simple
to understand the data and make some good interpretations.
✓ Decision trees actually let you see the logic by which the data is
interpreted (unlike black-box algorithms such as SVM, NN, etc.).


Build a Decision Tree with Different Methods

There are a couple of algorithms to build a decision tree; the most important are:
1. ID3 (Iterative Dichotomiser 3) → uses the entropy function and information gain as metrics.
2. CART (Classification and Regression Trees) → uses the Gini index (classification) as metric.
We will build two decision trees with these two methods using the Playing Tennis Game data set (see next foil).

• Information entropy is the average rate at which information is produced by a stochastic source of data.
• Information Gain is the change in information entropy H from a prior state to a state that takes some information as given:
IG(T, a) = H(T) − H(T | a),
where H(T | a) is the conditional entropy of T given the value of attribute a.
• Consider the Gini index with a binary target variable (Yes and No); then we have 4 possible combinations (probabilities) with sum = 1:
P(Target=1)·P(Target=1) + P(Target=1)·P(Target=0) + P(Target=0)·P(Target=1) + P(Target=0)·P(Target=0) = 1
⇒ P(Target=1)·P(Target=0) + P(Target=0)·P(Target=1) = 1 − P²(Target=0) − P²(Target=1).
Therefore Gini index = 1 − P²(Target=0) − P²(Target=1).


What are the differences between ID3, C4.5 and CART?


• ID3, or Iterative Dichotomizer, was the first of three Decision Tree implementations
developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn.
1, 1 (Mar. 1986), 81-106.)
• CART, or Classification And Regression Trees is often used as a generic acronym for the
term Decision Tree, though it apparently has a more specific meaning. In sum, the CART
implementation is very similar to C4.5; the one notable difference is that CART constructs
the tree based on a numerical splitting criterion recursively applied to the data, whereas
C4.5 includes the intermediate step of constructing rule sets.
• C4.5, Quinlan's next iteration. The new features (versus ID3) are: (i) accepts both
continuous and discrete features; (ii) handles incomplete data points; (iii) solves over-
fitting problem by (very clever) bottom-up technique usually known as "pruning"; and (iv)
different weights can be applied the features that comprise the training data. Of these, the
first three are very important--and (i) would suggest that any DT implementation you
choose have all three. The fourth (differential weighting) is much less important.
What is the Gini index (source: Wikipedia): In economics, the Gini coefficient, sometimes called
the Gini index or Gini ratio, is a measure of statistical dispersion intended to represent
the income or wealth distribution of a nation's residents, and is the most commonly used
measurement of inequality. It was developed by the Italian
statistician and sociologist Corrado Gini (1884–1965) and published in his 1912
paper ”Variability and Mutability”.[1][2]


„Playing Tennis Game“ data set


Let’s just take a famous dataset in the machine learning world, which is the weather dataset (playing
a game Y or N based on the weather conditions). For smarter work, count the frequencies of all feature
values in dependency of YES/NO.
Definition of “Frequency Table” X → y:

feature   value 1 (v1)   value 2 (v2)   value 3 (v3)   sum
YES       #v1&YES        #v2&YES        #v3&YES        #YES
NO        #v1&NO         #v2&NO         #v3&NO         #NO
sum       #value 1       #value 2       #value 3       #counts

Example: feature = outlook

outlook overcast sunny rainy sum


YES 4 2 3 9
NO 0 3 2 5
sum 4 5 5 14


We have four X values (also called “features”) = {outlook, temp, humidity, windy}, all being
categorical, and one y value (“target”) = {play: Y or N}, also being categorical. So we need to
learn the mapping between X and y (what machine learning always does).
To work better with the data set we first have to summarize the number of results
(yes/no) of the target variable y = “play” for each value of all features (attributes)
X = {outlook; temp.; humidity; windy}:

humidity high normal sum


YES 3 6 9
NO 4 1 5
sum 7 7 14

temperature hot mild cool sum


YES 2 4 3 9
NO 2 2 1 5
sum 4 6 4 14

windy FALSE TRUE sum


YES 6 3 9
NO 2 3 5
sum 8 6 14


Build the Tree using the ID3 algorithm (1/7)

To create a tree, we need to have a root node first and we know that nodes are
features/attributes (outlook, temp, humidity and windy),
So, which one do we need to pick first??

Answer: determine the attribute that best classifies the training data; use this attribute
at the root of the tree. Repeat this process for each branch.

This means we are performing top-down, greedy search through the space of possible
decision trees.
Okay, so how do we choose the best attribute?

Answer: use the attribute with the highest information gain in ID3

In order to define information gain precisely, we begin by defining two measures


commonly used in information theory, called entropy and information gain


Let’s first start with the ID3 algorithm:

1. Definition of the root node: it is defined by the highest information gain in ID3.
a) We need the two measures “Entropy of a data set S” = H(S) and “Information Gain” = IG.
b) …. see next slides


ID3 algorithm (2/7) – Measures (see Wikipedia)


Remarks: entropy for a binary classification problem, H(S) = −p₊·log₂(p₊) − p₋·log₂(p₋),
where p₊ and p₋ are the proportions of positive and negative examples:

• If all examples are positive or all are negative, then the entropy will be zero, i.e.
low.

• If half of the examples are of the positive class and half are of the negative class,
then the entropy is one, i.e. high.


ID3 algorithm (3/7) – Steps to Build Tree


1. Compute the entropy for the data set = H(S) (use the above
formula) and fill the frequency tables for all features.
2. For every attribute/feature:
i. Calculate entropy for all categorical values
ii. Take average information entropy for the current
attribute
iii.Calculate gain for the current attribute
3. Pick the highest gain attribute.
4. Repeat until we get the tree we desired.

Example: Compute the entropy for the weather data set
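With 9 Yes and 5 No examples in the 14-day weather data, this gives the standard value used on the following slides:
H(S) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.94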


See the calculation of the measure IG(S,Outlook) for the “Playing Tennis” data:
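A worked fill-in from the outlook frequency table above (the slide figure is not reproduced here):
H(S|sunny) = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) ≈ 0.971; H(S|overcast) = 0; H(S|rainy) ≈ 0.971
IG(S, Outlook) = H(S) − (5/14)·0.971 − (4/14)·0 − (5/14)·0.971 = 0.94 − (10/14)·0.971 ≈ 0.25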


ID3 algorithm (4/7) – Entropy & Information Gain

Similarly, we can calculate it for the other two attributes (Humidity and Temp).


Detailed calculations are found in “Exercises2Lecture.pdf”


See for example the calculation of IG(S, Temperature):

which results in: IG(S, Temp) = 0.94 − (4/14)·1 − (6/14)·0.918 − (4/14)·0.811 ≈ 0.029


ID3 algorithm (5/7) – Results for Root Node

Pick the highest gain attribute as root node.


Detailed calculations are found in “Exercises2Lecture.pdf”


ID3 algorithm (6/7) – Calculate Node|outlook=sunny

Remark: details of the calculation of InfoGain(temperature | outlook=sunny), etc.
can be found in the notes page of this slide or in the solution of exercise ML3.1.


Detailed calculations are found in “Exercises2Lecture.pdf”


ID3 algorithm (7/7) – Final decision tree

Remark: details of the calculation of InfoGain(temperature | outlook=rainy), etc.
can be found in the notes page of this slide or in the solution of exercise ML3.1.

Finally we get a tree something like this:


Detailed calculations are found in “Exercises2Lecture.pdf”


Build Tree with Gini Index (1/8) - Definitions

In CART we use the Gini index as a metric: it is the cost
function used to evaluate splits in the dataset.
Gini index for a binary target variable:

A Gini score gives an idea of how good a split is by how mixed the classes are in the
two groups created by the split. A perfect separation results in a Gini score of 0,
whereas the worst-case split, resulting in 50/50 classes, gives:

For a binary target variable, max Gini index value = 1 − (1/2)² − (1/2)²
= 1 − 2·(1/2)² = 1 − 2·(1/4) = 1 − 0.5 = 0.5
Similarly, if the target variable is a categorical variable with multiple levels, the Gini
index will still be similar. If the target variable takes k different values, the
maximum value of the Gini index is 1 − 1/k.

The minimum value of the Gini index is 0, when all observations belong to one label.


The GINI index is also used in the Data Mining lecture, so this is a repetition of the
concepts. By using a different example than in the DM lecture, the GINI measure is
better understood.
Compare also the following YouTube video “Gini index based Decision Tree” about
the calculation of a Gini-index-based decision tree:
https://www.youtube.com/watch?v=2lEcfRuHFV4


Build Tree with Gini Index (2/8) – Continuous Attributes


 For efficient computation: for each attribute,
◼ Sort the attribute on values
◼ Linearly scan these values, each time updating the count matrix and computing the Gini index
◼ Choose the split position that has the least Gini index

Sorted values (Taxable Income): 60 70 75 85 90 95 100 120 125 220
Class (Cheat):                  No No No Yes Yes Yes No No No No
Candidate split positions:      55 65 72 80 87 92 97 110 122 172 230

Split   Yes (<= / >)   No (<= / >)   Gini
55      0 / 3          0 / 7         0.420
65      0 / 3          1 / 6         0.400
72      0 / 3          2 / 5         0.375
80      0 / 3          3 / 4         0.343
87      1 / 2          3 / 4         0.417
92      2 / 1          3 / 4         0.400
97      3 / 0          3 / 4         0.300
110     3 / 0          4 / 3         0.343
122     3 / 0          5 / 2         0.375
172     3 / 0          6 / 1         0.400
230     3 / 0          7 / 0         0.420

Remark: For the calculation of the Gini index for each cell see the notes page of this slide.

First calculate the Gini index of the first cell; we write Gini(55):
Gini(55) = Frq(<=55)·Gini(<=55) + Frq(>55)·Gini(>55), where Frq(X) := #(values in X)/#(values in cell).
We calculate: Gini(<=55) = 1 − 0² − 0² = 1; Gini(>55) = 1 − (3/10)² − (7/10)² = (100−9−49)/100 = 0.42
⇒ Gini(55) = 0/10·1 + 10/10·0.42 = 0.420
Second cell, Gini(65): we calculate Gini(<=65) = 1 − 0² − 1² = 0; Gini(>65) = 1 − (3/9)² − (6/9)² = (81−9−36)/81 = 36/81 = 4/9
⇒ Gini(65) = 1/10·0 + 9/10·4/9 = 0.400
Similarly:
Gini(72) = 2/10·(1−0²−1²) + 8/10·(1−(3/8)²−(5/8)²) = 4/5·(64−9−25)/64 = 4/5·30/64 = 3/8 = 0.375
Gini(80) = 3/10·(1−0²−1²) + 7/10·(1−(3/7)²−(4/7)²) = 7/10·(49−9−16)/49 = 7/10·24/49 = 12/35 ≈ 0.343
Gini(87) = 4/10·(1−(1/4)²−(3/4)²) + 6/10·(1−(2/6)²−(4/6)²) = 4/10·(16−1−9)/16 + 6/10·(36−4−16)/36
= 3/20 + 4/15 = (9+16)/60 = 25/60 = 5/12 ≈ 0.417
Gini(92) = 5/10·(1−(2/5)²−(3/5)²) + 5/10·(1−(1/5)²−(4/5)²) = 5/10·12/25 + 5/10·8/25 = 6/25 + 4/25 = 10/25 = 0.400
Gini(97) = 6/10·(1−(3/6)²−(3/6)²) + 4/10·(1−(0/4)²−(4/4)²) = 6/10·1/2 + 4/10·0 = 3/10 = 0.300
By the symmetry of the cell values you can see that Gini(110) = Gini(80) = 0.343;
Gini(122) = Gini(72) = 0.375; Gini(172) = Gini(65) = 0.400 and Gini(230) = Gini(55) = 0.420.

Result: the best split value is 97, as this has the lowest Gini index.

Build Tree with Gini Index (3/8) – Gini (Node)


Calculate the Gini index of each of the attributes/features of the “Play Tennis” example. Use
the frequency table X → y:

Gini(outlook) = 4/14·Gini(overcast) + 5/14·Gini(sunny) + 5/14·Gini(rainy)
= 4/14·(1−(4/4)²−(0/4)²) + 5/14·(1−(2/5)²−(3/5)²) + 5/14·(1−(3/5)²−(2/5)²)
= 4/14·0 + 5/14·(12/25) + 5/14·(12/25) = 10/14·12/25 = 2/7·6/5 = 12/35 ≈ 0.343

Gini(temp) = 4/14·Gini(hot) + 6/14·Gini(mild) + 4/14·Gini(cool)
= 4/14·(1−(2/4)²−(2/4)²) + 6/14·(1−(4/6)²−(2/6)²) + 4/14·(1−(3/4)²−(1/4)²)
= 4/14·1/2 + 6/14·(16/36) + 4/14·(6/16) = 1/7 + 4/21 + 3/28 = 37/84 ≈ 0.44

Analogously: Gini(windy) ≈ 0.429; Gini(humidity) ≈ 0.367 (see slide p.68)


Build Tree with Gini Index (4/8) – Algorithm Steps

1. Compute the Gini index for the data set. This value is
not needed for building (modeling) the tree, but
provides a quality measure for the tree. Fill the
frequency table for all features (all values).
2. For every attribute/feature:
i. Calculate the Gini index for all categorical values
ii. Take the weighted average Gini index for the current
attribute
iii. Calculate the Gini gain (lowest Gini is best)
3. Pick the best Gini-gain attribute.
4. Repeat until we get the tree we desired.

We calculate it for every row and split the data accordingly in our binary tree.
We repeat this process recursively.


Build Tree with Gini Index (5/8) – Step1


The calculations are similar to ID3, except the formula changes.
Compute the Gini index for the data set = Gini(S); fill the frequency tables:
outlook overcast sunny rainy sum
YES 4 2 3 9
NO 0 3 2 5
sum 4 5 5 14

humidity high normal sum


YES 3 6 9
NO 4 1 5
sum 7 7 14

temperature hot mild cool sum


YES 2 4 3 9
NO 2 2 1 5
sum 4 6 4 14

windy FALSE TRUE sum


YES 6 3 9
NO 2 3 5
sum 8 6 14


Build Tree with Gini Index (6/8) – Step2 (GiniSplit)

“Playing Tennis” example – calculate Gini(outlook) (use the frequency table):

outlook overcast sunny rainy sum
YES 4 2 3 9
NO 0 3 2 5
sum 4 5 5 14

Gini(outlook) = 4/14·Gini(overcast) + 5/14·Gini(sunny) + 5/14·Gini(rainy)
= 4/14·(1−(4/4)²−(0/4)²) + 5/14·(1−(2/5)²−(3/5)²) + 5/14·(1−(3/5)²−(2/5)²)
= 4/14·0 + 5/14·(12/25) + 5/14·(12/25) = 10/14·12/25 = 12/35 ≈ 0.343

Similarly: Gini(windy) ≈ 0.429; Gini(temp) ≈ 0.44; Gini(humidity) ≈ 0.367
⇒ choose outlook as the root node


Calculate the Gini index of each of the attributes/features of the “Play Tennis” example. Use the
frequency table X → y:
Gini(temp)
= 4/14·Gini(hot) + 6/14·Gini(mild) + 4/14·Gini(cool)
= 4/14·(1−(2/4)²−(2/4)²) + 6/14·(1−(4/6)²−(2/6)²) + 4/14·(1−(3/4)²−(1/4)²)
= 4/14·1/2 + 6/14·(16/36) + 4/14·(6/16) = 1/7 + 4/21 + 3/28 = 37/84 ≈ 0.44
Gini(windy)
= 8/14·Gini(false) + 6/14·Gini(true)
= 8/14·(1−(6/8)²−(2/8)²) + 6/14·(1−(3/6)²−(3/6)²)
= 8/14·(16/16 − 9/16 − 1/16) + 6/14·(1 − 1/4 − 1/4) = 8/14·6/16 + 6/14·1/2
= 3/14 + 3/14 = 6/14 = 3/7 ≈ 0.429
Gini(humidity)
= 7/14·Gini(high) + 7/14·Gini(normal)
= 7/14·(1−(3/7)²−(4/7)²) + 7/14·(1−(6/7)²−(1/7)²)
= 7/14·(49/49 − 9/49 − 16/49) + 7/14·(49/49 − 36/49 − 1/49) = 7/14·24/49 + 7/14·12/49 = 12/49 + 6/49 = 18/49 ≈ 0.367

⇒ outlook = root node


Build Tree with Gini Index (7/8) – Step2-4

In the same way, we can follow the other steps to build the tree with the Gini index:

Gini(A) = 17/35, Gini(B) = 13/35

➔ option B is the better choice


Calculate the above numbers:

Case A:
Gini(N1) = 1 − (4/7)² − (3/7)² = 49/49 − 16/49 − 9/49 = 24/49 ≈ 0.4898
Gini(N2) = 1 − (2/5)² − (3/5)² = 25/25 − 4/25 − 9/25 = 12/25 = 0.48

⇒ Gini(A) = 7/12·24/49 + 5/12·12/25 = 2/7 + 1/5 = 17/35

Case B:
Gini(N1) = 1 − (1/5)² − (4/5)² = 25/25 − 1/25 − 16/25 = 8/25 = 0.32
Gini(N2) = 1 − (5/7)² − (2/7)² = 49/49 − 25/49 − 4/49 = 20/49 ≈ 0.4082

⇒ Gini(B) = 5/12·8/25 + 7/12·20/49 = 2/15 + 5/21 = 14/105 + 25/105 = 39/105 = 13/35

⇒ option B is the better choice.


Build Tree with Gini Index (8/8) – Step4


Calculate the left node for the outlook feature = sunny.

Frequency table (temp | outlook=sunny):
temperature with outlook=sunny hot mild cool sum
YES 0 1 1 2
NO 2 1 0 3
sum 2 2 1 5

Gini(temp|outlook=sunny)
= 2/5·(1−(0/2)²−(2/2)²) + 2/5·(1−(1/2)²−(1/2)²) + 1/5·(1−(1/1)²−(0/1)²)
= 2/5·(1−0−1) + 2/5·(1−1/4−1/4) + 1/5·(1−1−0)
= 2/5·1/2 = 1/5 = 0.2

Similarly: Gini(windy|sunny) = 7/15 ≈ 0.467; Gini(humidity|sunny) = 0
⇒ choose humidity as the left node

windy with outlook=sunny FALSE TRUE sum


YES 1 1 2
NO 2 1 3
sum 3 2 5

humidity with outlook=sunny high normal sum


YES 0 2 2
NO 3 0 3
sum 3 2 5

Gini(windy|sunny)
= 3/5·(1−(1/3)²−(2/3)²) + 2/5·(1−(1/2)²−(1/2)²) = 3/5·(9/9 − 1/9 − 4/9) + 2/5·(1 − 1/4 − 1/4)
= 3/5·4/9 + 2/5·1/2 = 4/15 + 3/15 = 7/15 ≈ 0.467

Gini(humidity|sunny)
= 3/5·(1−(0/3)²−(3/3)²) + 2/5·(1−(2/2)²−(0/2)²) = 3/5·(1−0−1) + 2/5·(1−1−0) = 0

⇒ humidity is the left node (it has the lowest conditional Gini index)


Final Decision Tree - Summary


Calculate Gini(temp|rainy) ≈ 0.467 and Gini(windy|rainy) = 0 ➔ choose windy as the right node


temperature with outlook=rainy hot mild cool sum
YES 0 2 1 3
NO 0 1 1 2
sum 0 3 2 5

Gini(temp|rainy)
= 0* + 3/5·(1−(2/3)²−(1/3)²) + 2/5·(1−(1/2)²−(1/2)²) = 3/5·(4/9) + 2/5·1/2
= 4/15 + 3/15 = 7/15 ≈ 0.467
Remark*: there is no data record for the value “hot” → the training set is too small.

windy with outlook=rainy FALSE TRUE sum
YES 3 0 3
NO 0 2 2
sum 3 2 5

Gini(windy|rainy)
= 3/5·(1−(3/3)²−(0/3)²) + 2/5·(1−(0/2)²−(2/2)²) = 3/5·(1−1−0) + 2/5·(1−0−1) = 0

⇒ windy is the right node


Overfitting in Decision Trees


One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A
tree that is too large risks overfitting the training data and poorly generalizing to new samples.
A small tree might not capture important structural information about the sample space.
However, it is hard to tell when a tree algorithm should stop because it is impossible to tell if the
addition of a single extra node will dramatically decrease error.
This problem is known as the horizon effect. A common strategy is to grow the tree until each node
contains a small number of instances then use pruning to remove nodes that do not provide
additional information.[1]


Overfitting and pruning show limitations of ML decision tree methods and give
the reason to consider special ML methods for special problems. An ML method
which fits all applications does not exist.

In statistics, overfitting is "the production of an analysis that corresponds too


closely or exactly to a particular set of data, and may therefore fail to fit additional
data or predict future observations reliably".[1]
An overfitted model is a statistical model that contains more parameters than can
be justified by the data.[2]
The essence of overfitting is to have unknowingly extracted some of the residual
variation (i.e. the noise) as if that variation represented underlying model
structure.
See also for more details:
https://www.youtube.com/watch?v=i_0-5rdxsfg


Decision Tree Pruning

Pruning is a technique in machine learning that reduces the size of decision trees by removing
sections of the tree that provide little power to classify instances.
Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by
the reduction of overfitting.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as
measured by a cross-validation set. There are many techniques for tree pruning that differ in the
measurement that is used to optimize performance.

Common Method: Cost Complexity Pruning
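A hedged sketch of cost-complexity pruning with scikit-learn (the Iris data is our stand-in; ccp_alpha is scikit-learn's knob for this method, and in practice one would pick it via cross-validation):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

full   = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.tree_.node_count, "->", pruned.tree_.node_count)   # fewer nodes after pruning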


To lessen the chance of, or amount of, overfitting, several techniques are available
(e.g. model comparison, cross-validation, regularization, early
stopping, pruning, Bayesian priors, or dropout).
The basis of some techniques is either (1) to explicitly penalize overly complex
models or (2) to test the model's ability to generalize by evaluating its performance
on a set of data not used for training, which is assumed to approximate the typical
unseen data that a model will encounter.


UseCase: Predictive Maintenance in Production (1/10)

Decision Tree

Nr.  Anl  Typ  Temp.  Druck  Füllst.  Fehler
1001 123  TN   244    140    4600     nein
1002 123  TO   200    130    4300     nein
1009 128  TSW  245    108    4100     ja
1028 128  TS   250    112    4100     nein
1043 128  TSW  200    107    4200     nein
1088 128  TO   272    170    4400     ja
1102 128  TSW  265    105    4100     nein
1119 123  TN   248    138    4800     ja
1122 123  TM   200    194    4500     ja

(Column names from the German source system: Anl = plant, Typ = type, Temp. = temperature, Druck = pressure, Füllst. = fill level, Fehler = error; ja/nein = yes/no.)


The use case for Predictive Analysis (Maintenance) is a concrete project of


IBM and BMW with the goal to improve the quality of production processes.


UseCase: Predictive Maintenance in Production (2/10)


You can either do the calculation of the Decision Tree (using GINI Index) by hand (manually), see
Homework H4.2 or by using Python Program with a SKLearn Library - see Homework H4.3:
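A minimal sketch of the scikit-learn route (the DataFrame mirrors the table on the first slide of this use case; restricting to the three numeric features and the other parameters are our assumptions, not necessarily the homework's notebook):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Temp":    [244, 200, 245, 250, 200, 272, 265, 248, 200],
    "Druck":   [140, 130, 108, 112, 107, 170, 105, 138, 194],
    "Fuellst": [4600, 4300, 4100, 4100, 4200, 4400, 4100, 4800, 4500],  # "Fuellst" = Füllst. (fill level)
    "Fehler":  ["nein", "nein", "ja", "nein", "nein", "ja", "nein", "ja", "ja"],
})
clf = DecisionTreeClassifier(criterion="gini").fit(df[["Temp", "Druck", "Fuellst"]], df["Fehler"])
print(export_text(clf, feature_names=["Temp", "Druck", "Fuellst"]))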


UseCase: Predictive Maintenance in Production (3/10)


UseCase: Predictive Maintenance in Production (4/10)


UseCase: Predictive Maintenance in Production (5/10)


UseCase: Predictive Maintenance in Production (6/10)


UseCase: Predictive Maintenance in Production (7/10)


UseCase: Predictive Maintenance in Production (8/10)

Example “Gießerei” (foundry):
Root cause analysis and result prediction

Rule (translated from the German original):
IF opening time (Öffnungszeit) > 1287
AND heating circuit 12 (Heizkreis 12) <= 598.8
THEN 100% scrap (Ausschuss) is produced


Another link to the paper can be found at:


https://nbn-resolving.org/urn:nbn:de:gbv:ilm1-2008000255


UseCase: Predictive Maintenance in Production (9/10)

http://d-nb.info/992620961/34


UseCase: Predictive Maintenance in Production (10/10)


Example from IBM Watson- Machine Learning (1/5)


https://console.bluemix.net/dashboard/apps/?cm_mmc=Email_Nurture-_-Cloud_Platform-_-WW_WW-_-LoginBodyLoginButton&cm_mmca1=000002FP&cm_mmca2=10001675&cm_mmca3=M00010245&cvosrc=email.Nurture.M00010245&cvo_campaign=Cloud_Platform-WW_WW


Example from IBM Watson- Machine Learning (2/5)


Example from IBM Watson- Machine Learning (3/5)


Example from IBM Watson- Machine Learning (4/5)


Example from IBM Watson- Machine Learning (5/5)


Exercises to Lesson ML4

Homework H4.1 - “Calculate ID3 and CART Measures”


Groupwork (2 Persons) - Calculate the measures of decision tree “Playing Tennis
Game”:
1. ID3 (Iterative Dichotomiser 3) method using Entropy Fct. & Information Gain.
2. CART (Classification) → using Gini Index(Classification) as metric.

Homework H4.2 - “Define the Decision Tree for UseCase Predictive


Maintenance (see slide p.77) by calculating the GINI Indexes”
Groupwork (3 Persons): Calculate the Decision Tree for UseCase “Predictive
Maintenance” on slide p.77. Do the following steps (one person per step):
1. Calculate the Frequency Matrices for the features „Temp.“, „Druck“ and „Füllst.“
2. Define the Root-node by calculating the GINI-Index for all values of the three features. Define
the optimal split-value for the root-node (see slide p.88)
3. Finalize the decision tree by calculation the GINI-Index for the remaining values for the
features “Temp.” and “Füllst.”


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”


See also the Jupyter notebooks for Homework H4.2 in [HVö-6], GitHub/HVoellinger:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020


Exercises to Lesson ML4


Homework H4.3 (advanced)*-“Create and describe the algorithm to
automate the calculation of the Decision Tree for the Use Case
“Predictive Maintenance”
Groupwork (2 Persons): Create and describe the algorithm to automate the calculation of steps
1. to 3. of homework H4.2. See more detailed description of the steps in the lecture:
1. Calculate the Frequency Matrices for the features „Temp.“, „Druck“ and „Füllst.“
2. Def. Root-node by calculating GINI-Index of the three features & find the optimal split-value for the root-node.
3. Finalize the decision tree by calculation the GINI-Index for the remaining values for the features “Temp.” and
“Füllst.”

Homework H4.4* - “Summary of the Article …prozessintegriertes


Qualitätsregelungssystem...”
Groupwork (2 Persons) – read and create a short summary about a special part of
article/dissertation from Hans W. Dörmann Osuna: “Ansatz für ein prozessintegriertes
Qualitätsregelungssystem für nicht stabile Prozesse“. Link to article: http://d-nb.info/992620961/34
For the two chapters (1 Person each Chapter, 15 Minutes):
▪ Chapter 7.1 „Aufbau des klassischen Qualitätsregelkreises”
▪ Chapter 7.2. “Prädiktive dynamische Prüfung”


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

Hint to H4.3: see also the Jupyter notebook for Homework H4.3 with the name
“ML4-Homework-H4_3.ipynb” in [HVö-6], GitHub/HVoellinger:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020

Hint to H4.4*: Another link to the paper you will find in:
https://nbn-resolving.org/urn:nbn:de:gbv:ilm1-2008000255


Exercises to Lesson ML4


Homework H4.5* - “Create and describe the algorithm to automate the
calculation of the Decision Tree for the Use Case “Playing Tennis” using
ID3 method”
Groupwork (2 Persons) - Calculate the measures of decision tree “Playing Tennis
Game” by creating a Python Program (i.e. using Jupyter Notebook) with “ID3 (Iterative
Dichotomiser 3)” method using Entropy Fct. & Information Gain


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML5 – simple Linear Regression (sLR) & multiple


Linear Regression (mLR)

https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020/blob/master/QYuIc.gif


Regression methods support the concept of building a "good" model
when you only know test data (or measurement points). This chapter also shows how to
calculate the error and thereby gives you a measure for the quality of the
model. The concepts use well-known mathematics (linear algebra and n-dimensional
geometry).

For more details see also the following YouTube video:
https://www.youtube.com/watch?v=aq8VU5KLmkY

Status: 1 December 2020 Page: 121


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Simple Linear Regression: Variable’s Roles and sLR Model


sLR Example: Training-Data x=income, y=consumption


Business Question: What Properties could explain a family’s Consumption?


sLR Example: Get the LR Result with a Tool


Definition of the Measure of the sLR Method: “R-Squared” R²


https://en.wikipedia.org/wiki/Coefficient_of_determination

For the definition of R² we need some measures of the training set (the set of observation points). These measures are the Sum of
Squares Total (SST), the Sum of Squares Error (SSE) and the Sum of Squares Regression (SSR); for a regression function f and the
mean value M(y) of the yi they are SST = sum[(yi - M(y))²], SSE = sum[(yi - f(xi))²] and SSR = sum[(f(xi) - M(y))²].
SSR is not needed for the definition of R², but we will use it later in the chapter:

The better the linear regression (on the right) fits the data in comparison to the
simple average (on the left graph), the closer the value of R² is to 1. The areas
of the blue squares represent the squared errors (SSE) with respect to the linear
regression. The areas of the red squares represent the squared totals with respect
to the average:
R² = 1 - SSE/SST
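As a small illustration (my own sketch, not from the lecture tool), the definition translates directly into Python:

import numpy as np

def r_squared(y, y_hat):
    # R² = 1 - SSE/SST for observed values y and model predictions y_hat
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)     # squared errors w.r.t. the regression
    sst = np.sum((y - y.mean()) ** 2)  # squared totals w.r.t. the simple average
    return 1.0 - sse / sst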


Properties & Theorems about the Measure R² (1/3)


The question is how well the independent variables are suited to explain the
variance of the dependent variable or to predict its values. This is where R² comes
into play. It is a measure that cannot be less than 0 and not greater than 1. Since
R² is a share value, it is also often given in percent.
If a regression has an R² close to 0, it means that the chosen independent
variables are not well suited to predict the dependent variable. One then also
speaks of a bad model adaptation (“poor model fit”).


Properties & Theorems about the Measure R² (2/3)


To find the optimal sLR-Regression-Line y(x) = a + b*x, you have
to find the maximum of function R²= R²(a, b). R² is a function in
the two variables a (= “intercept”) and b (= “slope”).
Geometrically this is a “surface” of degree 2 over the (a, b)-
plane; see sketch “R² over (a, b)-plane”. The maximum in the R²-
surface looks like the “summit in a hill landscape” (see below):

You get a better impression of the surface, when you check the
slices where intercept or slope are constant. See the sketches:
“Slice-const. slope” & “Slice-const. intercept” (see the right side).
What are the conditions for R² to be maximal? From Calculus we
know that a function in variables a, b has extreme values, if the
differential dR²=(dR²/da)*da+(dR²/db)*db=0; max. if d(d(R²))<0.
See the tangential lines in the sketches on the right side. To
calculate the two variables a and b you have to solve the two
equations: dR²/da=0 and dR²/db=0 & check maximum condition.
We will use these conditions for the calculations of coefficients a,
b in y = a + b*x (see “Least Square Fitting” method).


In the homework we will try to visualize the “R²-mountain landscape”.
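A minimal sketch of such a visualization (the three training points are assumed from Homework (H5.1) later in this chapter; any other training set works the same way):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 3.0, 2.0])   # training set assumed from Homework (H5.1)
y = np.array([2.0, 3.0, 2.0])

a, b = np.meshgrid(np.linspace(0, 3, 100), np.linspace(-1, 2, 100))
sst = np.sum((y - y.mean()) ** 2)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # SSE for every (a, b) pair
r2 = 1 - sse / sst   # the "R² mountain landscape" over the (a, b)-plane

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(a, b, r2, cmap="viridis")
ax.set_xlabel("a (intercept)"); ax.set_ylabel("b (slope)"); ax.set_zlabel("R²")
plt.show()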

Properties & Theorems about the Measure R² (3/3)
Corollary (C5.1): “Simple linear regression without an intercept term (a = 0)”
Sometimes it is appropriate to force the optimal sLR line to pass through the origin, because x and y are assumed
to be proportional; then: b = Mean(x*y)/Mean(x²)
Proof:
Let y = b*x => R² = 1 - SSE/SST = 1 - (1/SST)*sum[(yi - b*xi)²] ➔ 0 = dR²/db = -2*sum[(yi - b*xi)*xi] (up to the positive factor 1/SST)
= -2*sum[xi*yi - b*xi²] ➔ sum[xi*yi] = b*sum[xi²] ➔ b = sum[xi*yi]/sum[xi²] = n*Mean(x*y)/(n*Mean(x²)) ➔
b = Mean(x*y)/Mean(x²). q.e.d.

We now need some helpful formulas about sums and mean values, because we will use them later in the calculation
of the “optimal” coefficients a, b (see the “Least Square Fit” (LSF) method):

Proposition (P5.1): You can easily prove that the following equations are valid (let M(x) := Mean(xi)).
(i) Sum[(xi - M(x))²] = sum(xi²) - n*M(x)²
(ii) Sum[(yi – M(y))²] = sum(yi²) - n*M(y)²
(iii) Sum[(xi - M(x))*(yi - M(y))] = sum(xi*yi) – n*M(x)*M(y)

Proof: ”straightforward”
(i) sum[(xi - M(x))²] = sum[xi² - 2*M(x)*xi + M(x)²] (binomial formula)
= sum(xi²) - 2*M(x)*sum(xi) + sum(M(x)²) = sum(xi²) - 2*n*M(x)² + n*M(x)² = sum(xi²) - n*M(x)² (because sum(xi) = n*M(x))
(ii) analogously: sum[(yi - M(y))²] = sum(yi²) - 2*M(y)*sum(yi) + sum[M(y)²] = sum(yi²) - 2*n*M(y)² + n*M(y)² = sum(yi²) - n*M(y)²

(iii) analogously: sum[(xi - M(x))*(yi - M(y))] = sum[xi*yi - M(x)*yi - xi*M(y) + M(x)*M(y)] (“multiply out all factors”)
= sum(xi*yi) - n*M(x)*M(y) - n*M(x)*M(y) + n*M(x)*M(y) = sum(xi*yi) - n*M(x)*M(y) q.e.d.


“Least Square Fitting” of a sLR-Line y = a + b*x (1/4)


https://www.analyzemath.com/calculus/multivariable/linear-least-squares-fitting.html

Task: Calculate for a simple Linear Regression (sLR) line f(x) = a + b*x the coefficients a and b such
that f(x) is optimal.
Solution:
The condition f(x) = optimal is equivalent to R² = 1 - SSE/SST = max. (T5.1) => SSE = min.
We know from the initial lecture Mathematics 1 (“extreme value problems”) that the first derivative
of S (“dS”) must be zero, and that for a minimum, additionally, the second derivative (d(dS)) must be less than zero.
Start with the first derivative: dS = dS/da*da + dS/db*db = 0 => dS/da = 0 (1) & dS/db = 0 (2)
We write “sum” for the summation symbol Σ and set S := SSE = sum[(yi - a - b*xi)²]:

Execute now formula (1): 0 = d/da(S) = d/da[sum(yi - a - b*xi)²] = 2*sum(yi - a - b*xi)*(-1)
= -2*[sum(yi) - sum(a) - sum(b*xi)]
=> 0 = sum(yi) - n*a - b*sum(xi) = n*M(y) - n*a - n*b*M(x)
=> a + b*M(x) = M(y) (3)

Similarly for formula (2): 0 = d/db(S) = d/db[sum(yi - a - b*xi)²] = 2*sum(yi - a - b*xi)*(-xi)
= -2*[sum(xi*yi) - sum(a*xi) - sum(b*xi²)]
=> 0 = sum(xi*yi) - n*a*M(x) - b*sum(xi²)
=> a*n*M(x) + b*sum(xi²) = sum(xi*yi) (4)


“Least Square Fitting” of a sLR-Line y = a + b*x (2/4)

Solving equations (3) and (4) for a and b gives the two formulas used on the next slide
(reconstructed here from the proof of Corollary (C5.2) below), with det := sum(xi²) - n*M(x)²:
Formula (I): a = b0 = [M(y)*sum(xi²) - M(x)*sum(xi*yi)] / det
Formula (II): b = b1 = [sum(xi*yi) - n*M(x)*M(y)] / det

“Least Square Fitting” of a sLR-Line y = a + b*x (3/4)


Example of a "Least Squares Fitting" calculation for simple LR

Find the "least square fit" y = b0 + b1*x for the experimental data points: {(1, 2), (3, 4), (2, 6), (4, 8), (5, 12), (6, 13), (7, 15)}

Solution:
Number of points n = 7; mean values ("Mittelwerte"): [M(x), M(y)] = [28/7, 60/7] ≈ [4, 8.5714]
Set up a table with the quantities needed for b0 and b1 and for the calculation of R² (last three columns; check SST = SSE + SSR):

  i | xi | yi | xi*yi | xi² |  y(xi)  | (yi-y(xi))²  | (yi-M(y))²   | (y(xi)-M(y))²
  1 |  1 |  2 |    2  |   1 |  2.0357 |   0.001274   |   43.1833    |   42.7101
  2 |  3 |  4 |   12  |   9 |  6.3929 |   5.726      |   20.8977    |    4.7459
  3 |  2 |  6 |   12  |   4 |  4.2143 |   3.1887     |    6.6121    |   18.9843
  4 |  4 |  8 |   32  |  16 |  8.5715 |   0.3266     |    0.3265    |    0
  5 |  5 | 12 |   60  |  25 | 10.7501 |   1.5623     |   11.7553    |    4.7467
  6 |  6 | 13 |   78  |  36 | 12.9287 |   0.00661    |   19.6125    |   18.9861
  7 |  7 | 15 |  105  |  49 | 15.1073 |   0.01151    |   41.3269    |   42.718
sum | 28 | 60 |  301  | 140 |         | SSE=10.8230  | SST=143.7143 | SSR=132.8911

Substitute these values into Formula I and II (right column: comparison with the Python output):

b0 = (140*60/7 - 4*301)/(140 - 7*16) = -28/196 = -1/7 ≈ -0.14286      (Python intercept: -0.14285714285714057)
b1 = (301 - 7*4*(60/7))/28 = 61/28 ≈ 2.1786                           (Python slope: [2.17857143])

----> Regression line: y = -1/7 + (61/28)*x

R² = 1 - sum((yi-y(xi))²)/sum((yi-M(y))²) = 1 - (10.822994/143.7143) ≈ 0.9247   (Python coeff. of determination: 0.9247017892644135)
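The Python comparison values above can be reproduced, e.g., with Scikit-learn (a minimal sketch):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 3, 2, 4, 5, 6, 7]).reshape(-1, 1)   # the 7 experimental points
y = np.array([2, 4, 6, 8, 12, 13, 15])

model = LinearRegression().fit(x, y)
print(model.intercept_)   # -0.14285714...  (= -1/7)
print(model.coef_)        # [2.17857143]    (= 61/28)
print(model.score(x, y))  # 0.92470...      (R², coefficient of determination)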


“Least Square Fitting” of a sLR-Line y = a + b*x (4/4)


Corollary (C5.2): “LSF model-based properties, e.g. center-of-mass point”

Let y = a + b*x be an optimal sLR line; then we have the following properties:

i. Let M(x) and M(y) be the mean values of all xi and yi ➔ the sLR line goes through the center of mass
(“Schwerpunkt”) point = (M(x), M(y)), if the model includes an intercept term (i.e. is not forced to go
through the origin).

ii. The sum of the errors ei := yi - y(xi) is zero if the model includes an intercept term: sum(ei) = 0

iii. The values ei and xi are uncorrelated (whether or not there is an intercept term in the model): sum(xi*ei) = 0

Proof:
Part (i): Use the equations of the “least square fit” method for sLR, take formulas (I) and (II) for the coefficients a and
b and insert them into the regression line to prove y(M(x)) = a + b*M(x) = M(y):
(I),(II) => y(M(x)) = a + b*M(x) = (1/det)*[M(y)*sum(xi²) - M(x)*sum(xi*yi)] + (1/det)*[sum(xi*yi) - n*M(x)*M(y)]*M(x) (1)
where det := sum(xi²) - n*M(x)² (2)
The mixed terms -M(x)*sum(xi*yi) and +M(x)*sum(xi*yi) cancel (the “red parts” on the slide), so
(1) => y(M(x)) = (M(y)/det)*[sum(xi²) - n*M(x)²] = (M(y)/det)*det = M(y), using (2).

Part (ii) and (iii): see equations (2) and (3) in Theorem (T5.2) q.e.d.

Remark: With the “center of mass” condition, we have a “quick check” whether a line is a candidate for an optimal
sLR model. Since only one of the two conditions for the determination of an optimal regression line is fulfilled with the
“center of mass”, it is a necessary but not a sufficient condition (“…notwendig aber nicht hinreichend…”).


Remark & Theorem to the “SST=SSR+SSE“ Condition (1/3)


https://math.stackexchange.com/questions/709419/prove-sst-ssessr

Sometimes in the literature or in YouTube videos you see the formula “SST = SSR + SSE” (SSE, SST: see the slides
before; SSR := sumi[(f(xi) - Mean(y))²]). Wikipedia (https://en.wikipedia.org/wiki/Coefficient_of_determination):
“… In some cases the total sum of squares equals the sum of the two other sums of squares defined above…”.
We can prove that this formula is true if we have the “optimal” regression line. (Take care!)

Theorem (T5.2): Let y(x) = a + b*x be a regression line with a = intercept and b = slope, with b <> 0; then:
y(x) is optimal ==> SST = SSR + SSE (Equation (E5.1))
Proof: Show that SST = SSR + SSE holds under the condition that y(x) is optimal, i.e. dR²/da = 0 and dR²/db = 0 for y(x).


Details to Theorem(T5.2) see: https://www.youtube.com/watch?v=-fP7VasT1Oc

See also: https://math.stackexchange.com/questions/709419/prove-sst-ssessr

Prerequisites from Mathematik 1 / teaching units:

Applied Mathematics
- Foundations of differential and integral calculus of real functions of several variables,
as well as of differential equations and systems of differential equations
- Numerical methods and further examples of mathematical applications in computer science
Statistics
- Descriptive statistics
- Random experiments, probabilities and special distributions
- Inductive statistics
- Applications in computer science


Remark & Theorem to the “SST=SSR+SSE“ Condition (2/3)


Remark (R5.1):
With the statement “SST = SSR + SSE” (*) you could rewrite R² = 1 - SSE/SST = (SST - SSE)/SST =(*)= SSR/SST. You
could now think R² = SSR/SST is also a good metric, but this is wrong (see below: example of Homework H5.1_a).
ATTENTION: In some documents or YouTube videos you see this as a metric. Please don’t use this metric !!!

Example (see Homework H5.1): manual calculation for two sLR lines (green, red) (Homework H5.1_a), comparison with the optimal sLR line (Homework H5.1_b), and a check of the results with the alternative metric R² = SSR/SST.

Decide which is the “better” sLR line: y = 1.5 + 0.5*x (green) or y = 1.25 + 0.5*x (red)?
Solution:
Number of points n = 3; mean values ("Mittelwerte"): [M(x), M(y)] = [2, 7/3].
Set up tables with the quantities needed for the calculation of R² (and check SST = SSE + SSR):

Green line y = 1.5 + 0.5*x:
  i | xi | yi | xi*yi | xi² | y(xi) | (yi-y(xi))² | (yi-M(y))² | (y(xi)-M(y))²
  1 |  1 |  2 |   2   |  1  | 2.00  |  0.00       |  0.1111    |  0.1111
  2 |  3 |  3 |   9   |  9  | 3.00  |  0.00       |  0.4444    |  0.4444
  3 |  2 |  2 |   4   |  4  | 2.50  |  0.25       |  0.1111    |  0.0278
sum |  6 |  7 |  15   | 14  |       | SSE=0.25    | SST=0.6667 | SSR=0.5833
1 - SSE/SST = 0.6250; SSR/SST = 0.8750; SSR + SSE = 0.8333 <> SST

Red line y = 1.25 + 0.5*x:
  i | y(xi) | (yi-y(xi))² | (yi-M(y))² | (y(xi)-M(y))²
  1 | 1.75  |  0.0625     |  0.1111    |  0.3403
  2 | 2.75  |  0.0625     |  0.4444    |  0.1736
  3 | 2.25  |  0.0625     |  0.1111    |  0.0069
sum |       | SSE=0.1875  | SST=0.6667 | SSR=0.5208
1 - SSE/SST = 0.71875; SSR/SST = 0.78125; SSR + SSE = 0.7083 <> SST

From Homework (H5.1_b) we get the data for the “optimal” sLR line y = 4/3 + 0.5*x:
  i | y(xi)  | (yi-y(xi))²    | (yi-M(y))²    | (y(xi)-M(y))²
  1 | 11/6   | (1/6)² = 1/36  | (-1/3)² = 1/9 | 1/4
  2 | 17/6   | (1/6)² = 1/36  | (2/3)² = 4/9  | 1/4
  3 | 14/6   | (-2/6)² = 4/36 | (1/3)² = 1/9  | 0
sum | 42/6=7 | SSE = 1/6      | SST = 2/3     | SSR = 1/2
1 - SSE/SST = 0.75; SSR/SST = 0.75; SSR + SSE = 2/3 = SST ✓

Result: with the definition R² := SSR/SST the green line would be the “best” sLR line of the three, while
with R² = 1 - SSE/SST it is the optimal (yellow) line => the metric SSR/SST is not applicable!


Remark &Theorem to the “SST=SSR+SSE“ Condition (3/3)

Conjecture → Theorem
I initially stated the following as a conjecture, since I could not prove it at first. → Proof done → Theorem

Conjecture (Conj5.1): “SST = SSR + SSE” is not sufficient for an optimal sLR line
(“…necessary but not sufficient…”). This would prove that the reverse direction “<==” of Theorem
(T5.2) is wrong!!

Idea of a proof:
If you do not believe that “<==” is correct, you can for example refute it by constructing a counterexample. This
would mean you have to find a training set TS with an sLR line satisfying (SST = SSR + SSE) which is not optimal. To
check this is a sub-task of Homework (H5.5)*.

******************** see Homework (H5.5)* ******************************


Plots of underfitted, well-fitted, and overfitted Models


This plot illustrates linear regression that has a low 𝑅². It might also be important that a straight line can’t take into account the fact that the actual response increases as 𝑥 moves away from 25 towards zero. This is likely an example of underfitting.

This plot shows a polynomial regression line with the degree equal to 2. In this instance, this might be the optimal degree for modeling this data. The model has a value of 𝑅² that is satisfactory in many cases and shows trends nicely (well-fitted).

This plot presents polynomial regression with the degree equal to 3. The value of 𝑅² is higher than in the preceding cases. This model behaves better with known data than the previous ones. However, it shows some signs of overfitting, especially for the input values close to 60, where the line starts decreasing, although actual data don’t show that.

In this plot, you may see the perfect fit: six points and the polynomial line of the degree 5 (or higher) yield 𝑅² = 1. Each actual response equals its corresponding prediction. In some situations, this might be exactly what you’re looking for. In many cases, however, this is an overfitted model. It is likely to have poor behavior with unseen data, esp. with the inputs larger than 50. For example, it assumes, without any evidence, that there is a significant drop in responses for 𝑥 > 50 and that 𝑦 reaches zero for 𝑥 near 60. Such behavior is the consequence of excessive effort to learn and fit the existing data.


Underfitting and Overfitting


Underfitting occurs when a model can’t accurately capture the dependencies among
data, usually as a consequence of its own simplicity. It often yields a low 𝑅² with
known data and bad generalization capabilities when applied with new data.
Overfitting happens when a model learns both dependencies among data and
random fluctuations. In other words, a model learns the existing data too well.
Complex models, which have many features or terms, are often prone to overfitting.
When applied to known data, such models usually yield high 𝑅². However, they often
don’t generalize well and have significantly lower 𝑅² when used with new data.
Polynomial Regression: You can regard this as a generalized case of linear regression.
In other words, in addition to linear terms like 𝑏₁𝑥₁, your regression function 𝑓 can
include non-linear terms such as 𝑏₂𝑥₁², 𝑏₃𝑥₁³, or even 𝑏₄𝑥₁𝑥₂, 𝑏₅𝑥₁²𝑥₂, and so on. In the
case of two variables and the polynomial of degree 2, the regression function has this
form: 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁+ 𝑏₂𝑥₂ + 𝑏₃𝑥₁² + 𝑏₄𝑥₁𝑥₂ + 𝑏₅𝑥₂². The simplest example of such
estimated regression function is a polynomial of degree 2: 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥 + 𝑏₂𝑥².
One very important question that might arise when you’re implementing polynomial
regression is related to the choice of the optimal degree of the polynomial
regression function. There is no straightforward rule for doing this. It depends on the
case. You should, however, be aware of two problems that might follow the choice of
the degree: underfitting and overfitting.


simple Linear Regression (sLR) with Scikit-learn (1/3)


Following ideas from: "Linear Regression in Python" by Mirko Stojiljkovic, 28.4.2020 (see
details: https://realpython.com/linear-regression-in-python/#what-is-regression)
There are five basic steps when you’re implementing linear regression:

1. Import the packages and classes you need.
2. Provide data to work with and eventually do appropriate transformations.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is satisfactory.
5. Apply the model for predictions.

These steps are more or less general for most of the regression approaches and implementations; a minimal sketch follows below.


Scikit-learn (formerly scikits.learn and also known as sklearn) is a free
software machine learning library for the Python programming language. It
features various classification, regression and clustering algorithms
including support vector machines, random forests, gradient boosting, k-means
and DBSCAN, and is designed to interoperate with the Python numerical
and scientific libraries NumPy and SciPy.

See also: “scikit-learn Machine Learning in Python”, https://scikit-learn.org/stable/

The complete solution can be found under:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020
notebook “sLR Example_Lecture-ML5.ipynb“


simple Linear Regression (sLR) with Scikit-learn (2/3)


simple Linear Regression (sLR) with Scikit-Learn (3/3)


Examples of sLR-lines: y = a + b*x (1/2)


Example (E5.1): Build the sLR (x,y) model “Students Examination Results”
Part a: Analyze y := “achieved points of the exam [%]” depending on the parameter x := “effort for exam preparation
[h]”. Find the "least square fit" y = b0 + b1*x with {(exam prep.[h], score[pt.])} = {(7, 41), (3, 27), (5, 35), (3, 26), (8, 48),
(7, 45), (10, 46), (3, 27), (5, 29), (3, 19)}

Part b: Do the same with a new x := “effort for homework [h]“: {(homework[h], score[pt.])} = {(5, 41), (4, 27), (5, 35), (3,
26), (9, 48), (8, 45), (10, 46), (5, 27), (3, 29), (3, 19)}

Task: Build the model sLR(x,y). Compare and check your result with the output of a Python program. Answer the
following questions:
1. Q1: How many points would a student achieve without any preparation or without doing any homework?
2. Q2: How many points would a student achieve with 10 hours of preparation for the exam or 10 hours of homework?
3. Q3: How much effort do you need to reach enough points (=25) to pass the exam?

Additional question/remark: Our calculation uses one of the two variables independently of the other variable. What
is the difference to the mLR model results? Is R² (calculated here) different from the Adj.R² we get in Example (E5.3)?


Example (E5.1)-First Part: {(exam prep.[h], score[pt.])} = {(7, 41), (3, 27), (5, 35), (3,
26), (8, 48), (7, 45), (10, 46), (3, 27), (5, 29) , (3, 19)}

Second Part: similar to first part build the above table with the new data:
{(homework[h], score[pt.])}= {(5, 41), (4, 27), (5, 35), (3, 26), (9, 48), (8, 45), (10, 46),
(5, 27), (3, 29), (3, 19)}


Examples of sLR-lines: y = a + b*x (2/2)


Solution of E5.1-a:

Complete solution: see the file LR-Calculation_of_Coeff.xlsx in my GitHub [HVö-6]:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020/blob/master/LR-Calculation_of_Coeff.xlsx
Solution of E5.1-b: see the same GitHub link.
Q1: 14.117 and 15.164; Q2: in both cases 50 points, which is a grade of 1.0; Q3: 2.91 [h] and 2.83 [h]

Example (E5.2): “Manual calculation for the sLR model with Iowa Houses data”
Take a subset of 10 data records and calculate a, b and R² manually, using the matrices of the lesson. Compare your
result with the results of Homework H5.3: coding with the dataset “Iowa Homes” to predict the “House Price”
based on “Square Feet” and a second variable (e.g. “Age of Home”). Compare your result with the output of a Python
program. See the same GitHub link as above in E5.1.


Example (E5.2): See Homework H5.3


Definition of the multiple Linear Regression Function (k=2)

[Slide sketch] df := Degrees of Freedom = Number of observations - 2 (slide quiz answer: 2)
n := Number of observations, k := Number of variables (slide quiz answer: 3)

Multiple or multivariate linear regression is a case of linear regression with two or


more independent variables.
If there are just two independent variables, the estimated regression function is
𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional
space. The goal of regression is to determine the values of the
weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close as possible to the actual
responses.
The case of more than two independent variables is similar, but more general. The
estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ +𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1
weights to be determined when the number of inputs is 𝑟.

See also the YouTube video “Regression II: Degrees of Freedom EXPLAINED |
Adjusted R-Squared”: https://www.youtube.com/watch?v=4otEcA3gjLk

Consider also the YouTube video from Andrew Ng, “Lecture 4.1 - Linear Regression
With Multiple Variables”, and the following Lectures 4.x:
https://www.youtube.com/watch?v=Q4GNLhRtZNc&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=18


Definition of the “Adjusted R-squared” Measure

Rule: if k > 1, choose the regression model where Adj.R² is maximum


R² only works as intended in a simple linear regression model with one explanatory
variable. With multiple regression made up of several independent variables (k>1),
the R-Squared must be adjusted.
The adjusted R² compares the descriptive power of regression models that include
diverse numbers of predictors. Every predictor added to a model increases R² and
never decreases it. Thus, a model with more terms may seem to have a better fit
just for the fact that it has more terms, while the adjusted R-squared compensates
for the addition of variables and only increases if the new term enhances the model
above what would be obtained by probability and decreases when a predictor
enhances the model less than what is predicted by chance.
➔ Rule: if K>1 choose the regression model where Adj.R² is max.
In an overfitting condition, an incorrectly high value of R², which leads to a
decreased ability to predict, is obtained. This is not the case with the adjusted R².

For a more detailed interpretation of the differences between R² and Adj.R², see also
the YouTube video “Regression II: Degrees of Freedom EXPLAINED | Adj. R²”:
https://www.youtube.com/watch?v=4otEcA3gjLk
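The adjustment is easy to compute by hand or in code (a minimal sketch of the formula used on the slide):

def adj_r2(r2, n, k):
    # Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1); needs n >= k + 2
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adj_r2(0.8, n=10, k=2))   # 0.74285...: the penalty grows with k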


Properties and Remarks of Adj.R² Measure


Remark (R5.2):
Adj.R² can only be defined when n - k - 1 <> 0, which is the same as k <> n - 1.
Also it is clear that the number of observation points n >= k + 2; if this is not the case, you can’t calculate a meaningful
regression hyperplane HP. We say “without loss of generality” (w.l.o.g.): k <= n - 2.

Theorem (T5.3): “Properties of Adj.R²”

(i): Adj.R² =< R² =< 1 “Limitation”

(ii): Adj.R² can become negative “Negativity”

Proof:

(i) Adj.R² = 1 - (1 - R²)*[(n-1)/(n-1-k)]
By Remark R5.2 we have n-1-k >= 1 > 0, and since k >= 1 also n-1 >= n-1-k, hence (n-1)/(n-1-k) >= 1.
With 1 - R² >= 0 (by T5.1) this gives (1 - R²)*(n-1)/(n-1-k) >= 1 - R², and therefore
Adj.R² =< 1 - (1 - R²) = R² =< 1 (by T5.1)

(ii) In Homework H5.3 we see for n=10, k=8: R² ≈ 0.8
Then Adj.R² = 1 - (1 - R²)*(9/1) ≈ 1 - 0.2*9 = 1 - 1.8 = -0.8


“Least Square Fitting” for mLR(k=2): z = a + b*x +c*y (1/4)


https://www.youtube.com/watch?v=XN0yKLRPlQQ


“Least Square Fitting” for mLR(k=2): z = a + b*x +c*y (2/4)

https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020/blob/master/QYuIc.gif


“Least Square Fitting” for mLR(k=2): z = a + b*x +c*y (3/4)


“Least Square Fitting” for mLR(k=2): z = a + b*x +c*y (4/4)


Corollary (C5.3): “mLR(k=2) center-of-mass point”

Let z = a + b*x + c*y be an optimal mLR(k=2) plane; then we have the following property:

Let M(x), M(y) and M(z) be the mean values of all xi, yi and zi ➔ the center of mass point (“Schwerpunkt”) =
(M(x), M(y), M(z)) lies on the mLR(k=2) plane.

Proof: The statement follows directly from the LSF calculation, formula (I).
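A minimal numerical sketch of the LSF for mLR(k=2), using the four points of Homework (H5.4) below; numpy's least-squares routine solves the same normal equations:

import numpy as np

# Observation points (x, y, z) taken from Homework (H5.4)
pts = np.array([[1, 2, 3], [3, 3, 4], [2, 2, 4], [4, 3, 6]], dtype=float)
X = np.column_stack([np.ones(len(pts)), pts[:, 0], pts[:, 1]])  # design matrix [1, x, y]

# Least square fit for a, b, c in z = a + b*x + c*y
(a, b, c), *_ = np.linalg.lstsq(X, pts[:, 2], rcond=None)
print(a, b, c)   # expected: 4.25, 1.5, -1.5 (= 17/4, 3/2, -3/2; see H5.4, Part b)

# Quick check of Corollary (C5.3): the plane passes through the center of mass
print(a + b * pts[:, 0].mean() + c * pts[:, 1].mean(), pts[:, 2].mean())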


Examples for mLR(k=2) Models: z = a + b*x +c*y (1/2)


Example (E5.3): Calculation of the mLR(k=2) model for “Students Examination Results”

Similar to Example (E5.1): Find the "least square fit" z = a + b*x + c*y for z := “achieved points (score) of the exam [pt.]”
depending on the two parameters x := “effort for exam preparation [h]” and y := “effort for homework [h]“. Data from the training
set TS = {(x, y; z) | (exam prep.[h], homework[h]; score[pt.])} = {(7,5;41), (3,4;27), (5,5;35), (3,3;26), (8,9;48), (7,8;45),
(10,10;46), (3,5;27), (5,3;29), (3,3;19)}
Task: Build the model mLR(x,y; z). Compare and check your result with the output of a Python program.
Answer the following three questions:
1. Q1: How many points would a student achieve without any preparation and without doing any homework?
2. Q2: How many points would a student achieve with 10 hours of preparation for the exam and 10 hours of homework?
3. Q3: How much effort do you need to reach enough points (=25) to pass the exam?
Additional question/remark: Our calculation uses both variables. What is the difference to our sLR model
results? Compare Adj.R² (calculated here) to the two R² values you got in Example (E5.1).

Solution: see https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020/blob/master/LR-Calculation_of_Coeff.xlsx


Examples for mLR(k=2) Models: z = a + b*x +c*y (2/2)


1. Q1: How many points would a student achieve without any preparation and without doing any homework? 13.264 points.
2. Q2: How many points would a student achieve with 10 hours of preparation for the exam and 10 hours of homework?
50 points ---→ grade 1.0.
3. Q3: How much effort do you need to reach enough points (=25) to pass the exam?
Consider the cases: x=0 -→ y = (25 - 13.264)/1.382 = 8.49 [h]; y=0 -→ x = (25 - 13.264)/2.488 = 4.72 [h]. When you spend 3 hours
on both tasks, you reach z = 13.264 + 2.488*3 + 1.382*3 = 24.874 ≈ 25 points, so you would pass the examination.

Example (E5.4): “Manual calculation for the mLR(k=2) model with Iowa Houses data”
Similar to Example (E5.2), we take a subset of 10 data records and calculate a, b, c and Adj.R² manually, using
the matrices of the lesson. Compare your result with the results of the homework: coding with the dataset “Iowa
Homes” to predict the “House Price” based on “Square Feet” and a second variable (e.g. “Age of Home”).
Compare your result with the output of a Python program.


multiple Linear Regression (mLR) with Scikit-learn (1/2)


Following ideas from "Linear Regression in Python" by Mirko Stojiljkovic, 28.4.2020 (see details:
https://realpython.com/linear-regression-in-python/#what-is-regression), you can obtain the properties of the model in the
same way as in the case of simple linear regression, in five basic steps (see before):
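A minimal sketch with the training set of Example (E5.3) above; compare the printed values with the solution of E5.3:

import numpy as np
from sklearn.linear_model import LinearRegression

# Training set TS of Example (E5.3): (exam prep.[h], homework[h]) -> score[pt.]
X = np.array([[7, 5], [3, 4], [5, 5], [3, 3], [8, 9],
              [7, 8], [10, 10], [3, 5], [5, 3], [3, 3]])
z = np.array([41, 27, 35, 26, 48, 45, 46, 27, 29, 19])

model = LinearRegression().fit(X, z)
print(model.intercept_, model.coef_)   # compare with a ≈ 13.264, b ≈ 2.488, c ≈ 1.382 (E5.3)
print(model.score(X, z))               # R²; rescale with (n-1)/(n-k-1) for Adj.R²
print(model.predict([[10, 10]])[0])    # compare with Q2 of E5.3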


Scikit-learn and the link to “scikit-learn Machine Learning in Python”: see the identical description in the notes to “simple Linear Regression (sLR) with Scikit-learn (1/3)” above.

The complete solution can be found under:
https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020
notebook “mLR - Example for ML05.ipynb”


multiple Linear Regression (mLR) with Scikit-learn (2/2)


Exercises to ML5
Homework H5.1 - “sLR manual calculations of R² & Jupyter Notebook (Python)”
Consider the 3 points P1=(1|2), P2=(3|3) and P3=(2|2) in the xy-plane. (Part b: 1 person; rest: 1 person)

Part a: Calculate the sLR measure R² for the two estimated sLR lines y = 1.5 + 0.5*x and y = 1.25
+ 0.5*x. Which estimation (red or green) is better? Check the “center of mass”. (Hint: R² := 1 - SSE/SST)

Part b: Calculate the optimal regression line y = a + b*x, using the formulas developed in the
lesson for the coefficients a and b. What is R² for this line?

Part c: Build a Jupyter Notebook (Python) to check the manual calculations of Part b. You can use
the approach of the lesson with the Scikit-learn Python library. Optional*: please plot a picture of
the “mountain landscape” for R² over the (a,b)-plane.

Part d: Sometimes in the literature or in YouTube videos you see the formula “SST = SSR + SSE”
(SSE, SST: see the lesson; SSR := sumi[(f(xi) - Mean(y))²]). Theorem (ML5-2): “This formula is only
true if we have the optimal regression line; for all other lines it is wrong!” Check this for the
two lines of Part a (red and green) and the optimal regression line calculated in Part b.


Solutions of Homework 5.1 are found in “Exercises2Lecture.pdf”


Check your results:
Part a: R² = 5/8 = 0.625 (green line); R² = 23/32 = 0.71875 (red line)
=> y = 1.25 + 0.5*x is the better regression model.
Part b: y = 4/3 + 0.5*x is the regression line, with R² = 3/4.
Part c: See in GitHub https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020
the Jupyter Notebook “Homework-ML5_1c-LinReg.ipynb”

Part d:


Exercises to ML5
Homework H5.2* - “Create a Python program for sLR with Iowa Houses data”:
2 persons: See the video, which shows the coding using the Keras library & Python:
https://www.youtube.com/watch?v=Mcs2x5-7bc0. Repeat the coding with the dataset “Iowa
Homes” to predict the “House Price” based on “Square Feet”. See the result:


Exercises to Lesson ML5


Homework H5.3 – “Calculate Adj.R² for mLR”
See also the YouTube video “Regression II: Degrees of Freedom EXPLAINED | Adjusted R-Squared”:
https://www.youtube.com/watch?v=4otEcA3gjLk

Task:
• Part A: Calculate Adj.R² for the given R² of a “Housing Price” example (see table below). Do
you see a “trend”?
• Part B: What would be the best model if n=25 and if n=10 (use Adj.R²)?


Part A: Check your results:

Complete solutions are found in “Exercises2Lecture.pdf”


Exercises to ML5
Homework H5.4 - “mLR(k=2) manual calculations of Adj.R² & Jupyter Notebook (Python)”

Consider the 4 points P1=(1|2|3), P2=(3|3|4), P3=(2|2|4) and P4=(4|3|6) in 3-dimensional space:

Part a: Calculate the mLR measure Adj.R² for the two hyperplanes H1 := plane defined
by {P1,P2,P3} and H2 := plane defined by {P2,P3,P4}. Which plane (red or green) is the
better mLR estimation?

Part b: What is the optimal regression plane z = a + b*x + c*y? Use the formulas
developed with the “Least Square Fit for mLR” method for the coefficients a, b and c. What
is Adj.R² for this plane? (Hint: a=17/4, b=3/2, c=-3/2; R² ≈ 0.9474 and Adj.R² ≈ 0.8421)

Part c: Build a Jupyter Notebook (Python) to check the manual calculations of Part b.
You can use the approach of the lesson with the Scikit-learn Python library.

Part a: 1 person; Parts b + c: 1 person


Solutions of Homework 5.4 are found in “Exercises2Lecture.pdf”

Check your results:
Part a: z = 4 + x - y and z = 4 + 2x - 2y; R² = 15/19 for both => Adj.R² is also equal for both
planes, = 7/19 (no decision about the better plane is possible).

Part b: z = 17/4 + (3/2)*x - (3/2)*y is the optimal regression plane; Adj.R² ≈ 0.8421

Part c: See in GitHub https://github.com/HVoellinger/Lecture-Notes-to-ML-WS2020
the Jupyter Notebook “Homework-ML5_4c-LinReg.ipynb”


Exercises to ML5

Homework H5.5* - Decide: (SST = SSE + SSR) => optimal sLR line?

Examine this direction of the (SST = SSE + SSR) condition. We could assume that the
condition “SST = SSR + SSE” (*) also implies that y(x) is an optimal regression line.
In many examples this is true! (See Homework H5.1_a.)

Task: Decide between the two possibilities a) and b): (2 persons, one for each step)

a. The statement is true, so you have to prove it, i.e. show that the vanishing of the “mixed term”
of the equation (sum[(fi - yi)*(fi - M(y))] = 0) implies an optimal sLR line.

b. To prove that it is wrong, it is enough to construct a counterexample: define a
training set TS = {observation points} and an sLR line which satisfies condition (*) but is
not an optimal sLR line.


Solutions of Homework H5.5 are found in “Exercises2Lecture.pdf”


1. ML1: Introduction to Machine Learning (ML)


2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML6 – Convolutional Neural Networks (CNN)

https://www.youtube.com/watch?v=3JQ3hYko51Y


See the also the following documents/scripts:

• Scikit-learn 0.23.1, Ch. 1.17 “Neural network models (supervised)”:
https://scikit-learn.org/stable/modules/neural_networks_supervised.html?highlight=back%20propagation

See also YouTube videos:

• Neural Network 3D Simulation: https://www.youtube.com/watch?v=3JQ3hYko51Y

• “Convolutional Neural Networks (CNNs) explained”


https://www.youtube.com/watch?v=YRhxdVk_sIs

• “Convolutional Neural Network Tutorial (CNN) | How CNN Works | Deep Learning
Tutorial | Simplilearn“ https://www.youtube.com/watch?v=Jy9-aGMB_TE


CNN simulate Biological Neurons


For more details see:


https://www.youtube.com/watch?v=oI2rvjbzVmI

CNN is the result of two elements: the well-known algorithm of
artificial neural networks plus a set of operations that we will call convolution.
By combining these two kinds of ideas we get convolutional neural
networks, or simply CNNs.
Recalling the concept: neural networks are composed of artificial neurons, which
simulate biological neurons in a limited way.


CNN Concepts – The Artificial Neuron


Let's take a look at the artificial neuron.

What we have here is a set of inputs x1, x2 up to xN, which are connected to an activation
function f(.). The connections between the inputs and the activation function are drawn with a
set of weights, represented by omega1, omega2 up to omegaN.
Besides this we have a bias, the letter b. The output of the activation function is
represented by the letter z: z = f(omega1*x1 + … + omegaN*xN + b), i.e. f applied to the inputs
weighted by all these elements plus the bias. In this way we connect several inputs and
have a single output.
Remark: the bias is a constant distortion.
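A minimal sketch of this artificial neuron in Python (the weights, bias and the ReLU activation are assumed example values):

import numpy as np

def neuron(x, w, b, f=lambda s: max(0.0, s)):
    # z = f(w1*x1 + ... + wN*xN + b); f defaults to ReLU = max(0, .)
    return f(float(np.dot(w, x)) + b)

print(neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1))  # f(0.5 - 0.5 + 0.1) = 0.1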


CNN – Convolution Operations on Image Pixels


The other concept is the convolution operation: we have a single input image and
we apply a filter (also called mask, kernel, template, or window). The red square marks
the input window, i.e. the pixels of the image at that position. By applying the filter,
which we represent here by omega, to these nine pixels of the image at that position
(a 3x3 window), we obtain an output value: the input window is weighted by the
values inside the filter. The result of the convolution at this position x is the
result of the sum (see the right side of the picture).
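A minimal sketch of this convolution operation (valid mode, stride 1, single channel; the 5x5 input and the averaging filter are assumed examples):

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output value is the sum of the
    # elementwise products of the input window and the filter weights (omega).
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)  # assumed 5x5 input "image"
print(conv2d(img, np.ones((3, 3)) / 9.0))       # 3x3 output map (averaging filter)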


CNN Architecture Example for Image Recognition


Here CNN is applied to an image classification example. It is an example of a
supervised deep learning method.
In the example we are using four special layers:
the first step, or first layer, of the CNN is composed of one or more layers of
convolution; after these we can apply one or more steps of pooling; after this we
design what we call a fully connected layer; and at the end, to obtain a classification,
we can apply some operation, for example SoftMax.


Image Recognition - First Convolutional Layer


Here we have the example of image recognition of the number “2”.

The slide shows the processing (i.e. filters, ReLU) in the first convolutional layer.
The ReLU layer applies an elementwise activation function, such as
the max(0,x) thresholding at zero. Output: 2 tensors (25x25).

In general, the convolutional layer is the core building block of a convolutional
network and does most of the computational heavy lifting.

In general, ReLU (Rectified Linear Unit) is an activation function. ReLU is
the most used activation function in the world right now, since it is used in
almost all convolutional neural networks and deep learning models.


Image Recognition - Second Convolutional Layer


The second convolutional layer again runs filters and the ReLU function.
Output: 3 tensors (25x25)


Image Recognition - Pooling Layer


The POOLING layer performs a down-sampling operation along the spatial
dimensions (width, height). Here we have a 4x4 filter with the max function,
resulting in 3 tensors (6x6).

In general, a pooling layer is another building block of a CNN. Its function is to
progressively reduce the spatial size of the representation, to reduce the
amount of parameters and computation in the network. The pooling
layer operates on each feature map independently. The most common
approach used in pooling is max pooling.
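A minimal sketch of non-overlapping max pooling (a 24x24 feature map is assumed so the 4x4 blocks divide evenly, giving a 6x6 output as on the slide):

import numpy as np

def max_pool(fmap, size=4):
    # Down-sample: keep the maximum of each non-overlapping size x size block
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

fmap = np.random.rand(24, 24)    # assumed feature map
print(max_pool(fmap, 4).shape)   # (6, 6)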


Image Recognition - Fully Connected Layer


The FC (i.e. fully-connected) layer computes the class scores, resulting in a volume of
size 1x1x10, where each of the 10 numbers corresponds to a class score, such as among the
10 categories of CIFAR-10. As with ordinary neural networks, and as the name
implies, each neuron in this layer is connected to all the numbers in the
previous volume.
Output: 4 neurons.

In general, the FC is the fully connected layer of neurons at the end of a CNN.
Neurons in a fully connected layer have full connections to all activations in the
previous layer, as seen in regular neural networks, and work in a similar way.


Image Recognition - SoftMax Equation


The SoftMax layer (SoftMax equation) reduces a 2-dimensional matrix (4x10) of
results/scores to a 1-dimensional result. The 4 rows of the matrix come from 4
neurons, each delivering 10 numbers/scores. The result is 10 values/scores
in the interval [0, 1] for the numbers 0, 1, …, 9.
In mathematics, the softmax function, also known as softargmax or normalized
exponential function, is a function that takes as input a vector of K real numbers, and
normalizes it into a probability distribution consisting of K probabilities. That is, prior
to applying softmax, some vector components could be negative, or greater than one;
and might not sum to 1; but after applying softmax, each component will be in
the interval [0,1] and the components will add up to 1, so that they can be interpreted
as probabilities. Furthermore, the larger input components will correspond to larger
probabilities. Softmax is often used in neural networks, to map the non-normalized
output of a network to a probability distribution over predicted output classes.
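A minimal sketch of the softmax function described above (the example scores are assumed):

import numpy as np

def softmax(scores):
    # Normalized exponential: maps raw scores to a probability distribution
    e = np.exp(scores - np.max(scores))  # shift the maximum to 0 for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())   # each component in [0, 1]; the components add up to 1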


CNN – General Architecture Overview


Regular Neural Nets: Neural Networks receive an input (a single vector), and transform
it through a series of hidden layers. Each hidden layer is made up of a set of neurons,
where each neuron is fully connected to all neurons in the previous layer, and where
neurons in a single layer function completely independently and do not share any
connections. The last fully-connected layer is called the “output layer” and in
classification settings it represents the class scores. Regular Neural Nets don’t scale well
to full images.
3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that
the input consists of images and they constrain the architecture in a more sensible way.
In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons
arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers
to the third dimension of an activation volume, not to the depth of a full Neural
Network, which can refer to the total number of layers in a network.) For example, the
input images in CIFAR-10 are an input volume of activations, and the volume has
dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the
neurons in a layer will only be connected to a small region of the layer before it, instead
of all of the neurons in a fully-connected manner. Moreover, the final output layer
would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet
architecture we will reduce the full image into a single vector of class scores, arranged
along the depth dimension.


Layers used to build ConvNets - Convolutional Layer


As we described above, a simple ConvNet is a sequence of layers, and every layer of


a ConvNet transforms one volume of activations to another through a differentiable
function. We use three main types of layers to build ConvNet
architectures: Convolutional Layer, Pooling Layer, and Fully-Connected
Layer (exactly as seen in regular Neural Networks). We will stack these layers to
form a full ConvNet architecture.


Layers used to build ConvNets - Pooling Layer


Layers used to build ConvNets - Fully-Connected Layer


Layers to build ConvNets–Example “Image Recognition”


Deep Neural Networks (DNN)


DNNs have already made breakthroughs in a number of practical applications, the most
widely used being speech recognition. Without DNNs, Siri would not have been possible.
DNNs are also used for playing games: what is new with DNNs is that systems like Google
AlphaGo have learned the game strategy independently. The victory of IBM Deep Blue
against Kasparov in chess was a victory of ingenious programmers in combination with
superior computing power. AlphaGo, on the other hand, has achieved its progress since
October - when it beat the European champion - by playing and learning against itself.
This is the new capability of Deep Neural Networks (DNN). In other games, DNNs have
already demonstrated how autonomous their learning has become: for example, Space
Invaders, where the DNN became a master player just by "watching" the screen pixels and
playing around with joystick moves.
See also: https://www.latimes.com/science/sciencenow/la-sci-sn-computer-learning-space-invaders-20150224-story.html
However, for many areas, such as autonomous driving or lending, the use of such
networks is extremely critical and risky due to their "black box" nature, as it is difficult to
interpret how or why the models come to particular conclusions. It is an open issue to
understand and explain the decision making of deep neural networks.


UC1 - Daimler AG: “Deep-Learning-und-Autonomes-Fahren”

Speaker: Eike Rehder (Daimler AG)

Abstract: Automated driving promises to reshape mobility. For an automated vehicle to
navigate safely in road traffic, it must be equipped with artificial intelligence. This
enables the vehicle to perceive its surroundings, analyze the traffic situation and make
decisions.
An essential building block of this new intelligence are artificial neural networks (ANNs).
Although these have been a subject of research for decades, only in recent years have they
reached a quality that brings their application in road traffic within reach. This goes back
to the development of so-called deep learning, with which ANNs with several million
parameters can be trained successfully. These deep networks can be used wherever explicit
modeling by humans is hardly possible. In particular, they are used in the processing of
sensor data and in the understanding of complex traffic scenarios.
This talk gives an introduction to machine learning, artificial neural networks and deep
learning. Applications from perception, situation interpretation and decision making in
automated driving serve as examples.

https://rg-stuttgart.gi.de/veranstaltung/deep-learning-und-autonomes-fahren/


UC2 – Fraunhofer IEE + enercast GmbH (Kassel, Germany):
“Power forecasts for renewable energy with CNN”

Using Big Data & Artificial Intelligence (CNN), Fraunhofer IEE and enercast GmbH have been
operating forecast services since 2011 via distributed and redundant data centers in Germany.
The currently 600 TB of managed data are processed via a robust, industry-ready Big Data
infrastructure - growth per day: 150 GB.
Hundreds of artificial neural networks can be trained on the Artificial Intelligence platform. For each
site, data points are transformed via this infrastructure into high-quality and relevant information (24/7
calculation, transmission and monitoring). See also the white paper.


Fraunhofer IEE has more than 15 years of experience in forecasting for volatile
energy producers. enercast GmbH (https://www.enercast.de/) has been developing
since 2011, together with Fraunhofer in a cooperation project, reliable and accurate
performance forecasts and extrapolations for wind + photovoltaic systems. The
special feature of the calculations is that they rely on artificial intelligence - through the use
of neural networks. All data is processed by an algorithm. For more details see also
the whitepaper: https://www.enercast.de/wp-content/uploads/2018/04/whitepaper-
prognosen-wind-solar-kuenstliche-intelligenz-neuronale-netze_110418_EN.pdf
Detail: The entire infrastructure was built up gradually - always with the objective of
achieving the maximum possible forecast quality. In fact, several networks are always used per
system (i.e. for a single wind turbine). For a complete wind farm this leads to hundreds of
networks at work. The first training of a network mostly runs on Cassandra clusters. Only
when a certain quality is achieved does the switch to operational mode take place - then
without Cassandra. The networks then change continuously and adapt to the respective local
conditions. That is, if there are changes in the local climate situation, the networks correct
themselves automatically to provide optimal performance prediction. Exactly this adaptability
is the big advantage over the classic, static models. The static models describe the situation
through a set of fixed parameters in an algorithm. Economically, the static models deliver
faster results, whereas the AI-based systems achieve higher forecasting quality in the
medium term.


UC3 - Semantic Search: “Predictive Market with Fact-Finder”
https://youtu.be/vSWLafBdHus


Maximize re-orders with the Predictive Basket. Online shopping without
rummaging and thinking: the Predictive Basket by FACT-Finder shows your
customers those products that they are likely to buy in their current session.
Whether click, search query or purchase - with every interaction in the shop your
customers leave traces in the form of data. With the tracking interface, this data is
captured by FACT-Finder and used for shop optimization. But what exactly does
FACT-Finder use the tracking data for? And what does shop tracking bring to your
customers and to you as a user? In this post you will find the answers. Your bestsellers
automatically move into the focus of the customers, based on the shop tracking
data. FACT-Finder learns which products are most popular with your customers - that
is, which ones they click on, add to the shopping cart and buy most often. This knowledge
can be incorporated into the sorting of your search results - by activating the
Automatic Search Optimization. Your bestsellers will automatically move up to the top
positions in the result ranking over time. This increases the purchase probability,
because the higher the relevant products appear, the faster they will catch your
customers’ attention. See also the FACT-Finder webinar: Dr. Holger Schmidt – „Wie Plattformen den
eCommerce disrupten“:
https://www.youtube.com/watch?v=9T9sOxRB9qg&feature=youtu.be


UC4 – Deep Neural Network – “Google AlphaGo”


https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf


The victory of the Google-developed DeepMind AlphaGo software against the South Korean
Go world champion Lee Sedol does not simply ring in the next round of the industrial
revolution. According to IT expert Carsten Kraus, the superiority of Deep Neural
Networks (DNN) with respect to human intelligence has probably even put an end to
all the upheavals …
What's so special about AlphaGo? The novelty is that AlphaGo has learned the game
strategy independently. The victory of IBM Deep Blue against Kasparov in chess was a
victory of ingenious programmers in combination with superior computing power.
AlphaGo, on the other hand, has achieved its progress since October - when it beat the
European champion - by playing and learning against itself. This is the new capability
of Deep Neural Networks (DNN). In other games, DNNs have already demonstrated how
autonomous their learning has become: for example, Space Invaders,
where the DNN became a master player just by "watching" the screen pixels and
playing around with joystick moves. See also the YouTube videos:
• “AlphaGo - The Movie | Full Documentary” (2016):
https://www.youtube.com/watch?v=WXuK6gekU1Y
• “New DeepMind AI Beats AlphaGo 100-0 | Two Minute Papers #201” (2017):
https://www.youtube.com/watch?v=9xlSy9F5WtE
• “Google's Deep Mind Explained! - Self Learning A.I.”:
https://www.youtube.com/watch?v=TnUYcTuZJpM


Example from Microsoft Azure- Machine Learning


https://gallery.azure.ai/Experiment/Compare-Multi-class-Classifiers-Letter-recognition-2


Exercises to Lesson ML6

Homework H6.1 – “Power Forecasts with CNN in UC2”


Groupwork (2 Persons): Evaluate and explain in more detail the CNN in
“UC2 – Fraunhofer + enercast: Power forecasts for renewable energy with
CNN”, see: https://www.enercast.de/wp-content/uploads/2018/04/whitepaper-
prognosen-wind-solar-kuenstliche-intelligenz-neuronale-netze_110418_EN.pdf

Homework H6.2 – “Evaluate AI Technology of UC3”


Groupwork (2 Persons): Identify and evaluate the underlying AI technology
used in “UC3 – Semantic Search: Predictive Basket with Fact-
Finder”. See: https://youtu.be/vSWLafBdHus


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 179


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Exercises to Lesson ML6

Homework H6.3* (advanced) – “Create Summary to GO Article”


Groupwork (2 Persons): Read the article “Mastering the game of Go with
deep neural networks and tree search” and summarize its main results.
See also:
https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf

Homework H6.4* (advanced) – “Create Summary to BERT Article”


Groupwork (2 Persons): Read the BERT article and summarize its main
results. See Ref. [BERT]: Jacob Devlin et al.: “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding”; Google (USA);
2019. For an introduction to BERT see the following video:
https://www.youtube.com/watch?v=xI0HHN5XKDo


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”. Details to Homework H6.4:


Bidirectional Encoder Representations from Transformers (BERT) is a technique for natural language processing (NLP) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google.[1][2] Google is leveraging BERT to better understand user searches.[3]
The original English-language BERT model comes in two pre-trained general types:[1] (1) the BERT-BASE model, a 12-layer, 768-hidden, 12-heads, 110M-parameter neural network architecture, and (2) the BERT-LARGE model, a 24-layer, 1024-hidden, 16-heads, 340M-parameter neural network architecture; both were trained on the BooksCorpus[4] with 800M words and a version of the English Wikipedia with 2,500M words.
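
As an illustration, both pre-trained types can be loaded with the Hugging Face transformers library (not part of this lecture's toolchain - shown only as a sketch; the checkpoint names are the library's published ones):

# Sketch: load pre-trained BERT and encode one sentence.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")      # 12 layers, ~110M params
# BERT-LARGE instead: BertModel.from_pretrained("bert-large-uncased")  # 24 layers, ~340M

inputs = tokenizer("Machine learning is fun.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, number of tokens, 768 hidden units)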

Status: 1 December 2020 Page: 180


Dr. Hermann Völlinger
Mathematics & IT-Architecture
1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML7 – BackPropagation for Neural Networks


See the following documents/scripts:

• “A Step by Step Backpropagation Example” by Matt Mazur (2018);


https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

• Scikit-learn 0.23.1, Ch. 1.17 “Neural network models (supervised)” (see the usage sketch below):
https://scikit-learn.org/stable/modules/neural_networks_supervised.html?highlight=back%20propagation

See following YouTube Videos:


• “Back Propagation in Neural Network with an Example | Machine Learning” (2019);
https://www.youtube.com/watch?v=GJXKOrqZauk
• “Back propagation neural network with example in Hindi and How it works?”
(31.12.2018); https://www.youtube.com/watch?v=0_2nA_WoSmE
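
A minimal usage sketch of the scikit-learn model referenced above: MLPClassifier trains a multi-layer perceptron, computing its gradients via backpropagation. The tiny XOR data set is only for illustration:

# Sketch: scikit-learn MLP trained on XOR; gradients come from backpropagation.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                          # XOR labels

clf = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                    solver="lbfgs", random_state=1, max_iter=2000)
clf.fit(X, y)
print(clf.predict(X))                     # ideally [0 1 1 0]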

Status: 1 December 2020 Page: 181


Dr. Hermann Völlinger
Mathematics & IT-Architecture

What is BackPropagation


https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
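
The Mazur post linked above works through one forward and one backward pass by hand on a 2-2-2 network. Here is a compact, runnable sketch of the same computation (initial weights, inputs, targets and learning rate follow Mazur's numbers; the vectorized implementation itself is a generic sketch, not his code):

# Backpropagation on Mazur's 2-2-2 example network: sigmoid units,
# squared-error loss, learning rate 0.5, biases kept fixed as in the post.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])            # inputs i1, i2
t = np.array([0.01, 0.99])            # targets for outputs o1, o2
W1 = np.array([[0.15, 0.20],          # w1, w2 -> hidden neuron h1
               [0.25, 0.30]])         # w3, w4 -> hidden neuron h2
W2 = np.array([[0.40, 0.45],          # w5, w6 -> output neuron o1
               [0.50, 0.55]])         # w7, w8 -> output neuron o2
b1, b2, lr = 0.35, 0.60, 0.5

# forward pass
h = sigmoid(W1 @ x + b1)              # hidden activations
o = sigmoid(W2 @ h + b2)              # output activations
E = 0.5 * np.sum((t - o) ** 2)        # total error, ~0.298371 as in the post

# backward pass: delta = dE/d(net input) for each layer
delta_o = (o - t) * o * (1 - o)
delta_h = (W2.T @ delta_o) * h * (1 - h)

# gradient-descent update of the weights
W2 -= lr * np.outer(delta_o, h)       # w5 becomes ~0.358916, as in the post
W1 -= lr * np.outer(delta_h, x)       # w1 becomes ~0.149781

print(E, W1[0, 0], W2[0, 0])

Repeating the forward and backward pass in a loop drives the error from 0.298 towards zero, exactly as the post demonstrates over 10,000 iterations.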

Status: 1 December 2020 Page: 182


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Simple Example of a Neural Network


Status: 1 December 2020 Page: 183


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Exercises to Lesson ML7

Homework H7.1 – “Exercise of an Example with Python”

*******placeholder*************

Homework H7.2 – “Exercise of an Example with Python”

*******placeholder*************


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 184


Dr. Hermann Völlinger
Mathematics & IT-Architecture
1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear
Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

ML8 – Support Vector Machines (SVM)


See also:
https://courses.cs.washington.edu/courses/cse573/05au/support-vector-machines.ppt

More details: https://en.wikipedia.org/wiki/Support_vector_machine

Status: 1 December 2020 Page: 185


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Motivation and Definition of SVM

*******placeholder*************


More details: https://en.wikipedia.org/wiki/Support_vector_machine
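
Since this slide is still a placeholder, here is at least the standard textbook formulation from the linked article, as a sketch: given training points x_i with labels y_i ∈ {−1, +1}, a (hard-margin) linear SVM searches for the separating hyperplane w·x − b = 0 with maximal margin 2/‖w‖, i.e.

% hard-margin linear SVM (standard formulation)
\min_{w,\,b} \; \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \, (w \cdot x_i - b) \,\ge\, 1, \qquad i = 1, \dots, n

The support vectors are exactly the training points for which this constraint holds with equality; they alone determine the resulting hyperplane.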

Status: 1 December 2020 Page: 186


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Linear Support Vector Machine (LSVM)

*******placeholder*************


Status: 1 December 2020 Page: 187


Dr. Hermann Völlinger
Mathematics & IT-Architecture

LSVM – Kernel Trick

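
As the slide notes are still empty, a sketch of the idea: instead of mapping the data into a high-dimensional feature space explicitly, an SVM only ever needs inner products, and a kernel function K(x, x') supplies these directly - e.g. the RBF kernel K(x, x') = exp(−γ‖x − x'‖²). In scikit-learn the trick is just a parameter:

# Sketch: kernel trick in scikit-learn. Two concentric circles are not
# linearly separable, but the RBF kernel separates them easily.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))   # clearly below 1.0
print("RBF kernel accuracy:   ", rbf.score(X, y))      # close to 1.0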

Status: 1 December 2020 Page: 188


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Examples of SVM – Python Algorithm

***************** placeholder ***********


https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html#sphx-glr-auto-examples-svm-plot-iris-svc-py
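
A condensed, text-only version of the linked gallery example - it fits the same four SVM variants on the first two iris features; the plotting code of the original is omitted here:

# Sketch following the linked scikit-learn example (plot_iris_svc).
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target      # sepal length and width only

models = {
    "SVC (linear kernel)":  svm.SVC(kernel="linear", C=1.0),
    "LinearSVC":            svm.LinearSVC(C=1.0, max_iter=10000),
    "SVC (RBF kernel)":     svm.SVC(kernel="rbf", gamma=0.7, C=1.0),
    "SVC (poly, degree 3)": svm.SVC(kernel="poly", degree=3, gamma="auto", C=1.0),
}
for name, clf in models.items():
    clf.fit(X, y)
    print(f"{name}: training accuracy = {clf.score(X, y):.3f}")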

Status: 1 December 2020 Page: 189


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Exercises to Lesson ML8

Homework H8.1 – “Exercise of an Example with Python”

*********** placeholder********************

Homework H8.2 – “Exercise of an Example with Python”

*********** placeholder ********************


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 190


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Exercises to Lesson ML8

Homework H8.3 – “Exercise of an Example with Python”

*********** placeholder ********************

Homework H8.4 – “Exercise of an Example with Python”

*******placeholder*************


Solutions are found in Ref. [HVö-4]: “Exercises2Lecture.pdf”

Status: 1 December 2020 Page: 191


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Appendix


Status: 1 December 2020 Page: 192


Dr. Hermann Völlinger
Mathematics & IT-Architecture

Appendix-1
1. ML1: Introduction to Machine Learning (ML)
2. ML2: Concept Learning: VSpaces & Cand. Elim. Algo.
3. ML3: Supervised and Unsupervised Learning
4. ML4: Decision Tree Learning
5. ML5: simple Linear Regression (sLR) & multiple Linear Regression (mLR)
6. ML6: Neural Networks: Convolutional NN
7. ML7: Neural Network: BackPropagation Algorithm
8. ML8: Support Vector Machines (SVM)

Status: 1 December 2020 Page: 193
