001-2023-0714 DLBDSIDS01 Course Book
DLBDSIDS01
INTRODUCTION TO DATA SCIENCE
MASTHEAD
Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.de
DLBDSIDS01
Version No.: 001-2023-0714
N. N.
PROF. DR. THOMAS ZÖLLER
Mr. Zöller teaches in the field of data science at IU International University of Applied Sciences. He focuses on the fields of advanced analytics and artificial intelligence and their key role in digital transformation.
After studying computer science with a minor in mathematics at the University of Bonn, Mr. Zöller received his doctorate with a thesis in the field of machine learning in image processing. This was followed by several years of application-oriented research, including time spent at the Fraunhofer Society. Throughout his professional career, Mr. Zöller has worked in various positions focusing on the fields of business intelligence, advanced analytics, analytics strategy, and artificial intelligence, while also gaining experience in the areas of defense technology, logistics, trade, finance, and automotive.
TABLE OF CONTENTS
INTRODUCTION TO DATA SCIENCE
Module Director
Introduction
Signposts Throughout the Course Book
Basic Reading
Further Reading
Learning Objectives
Unit 1
Introduction to Data Science
Unit 2
Data
Unit 3
Data Science in Business
Unit 4
Statistics
Unit 5
Machine Learning
Appendix
List of References
List of Tables and Figures
INTRODUCTION
WELCOME
SIGNPOSTS THROUGHOUT THE COURSE BOOK
This course book contains the core content for this course. Additional learning materials
can be found on the learning platform, but this course book should form the basis for your
learning.
The content of this course book is divided into units, which are divided further into sections. Each section contains only one new key concept to allow you to quickly and efficiently add new learning material to your existing knowledge.
At the end of each section of the digital course book, you will find self-check questions. These questions are designed to help you check whether you have understood the concepts in each section.
For all modules with a final exam, you must complete the knowledge tests on the learning platform. You will pass the knowledge test for each unit when you answer at least 80% of the questions correctly.
When you have passed the knowledge tests for all the units, the course is considered finished and you will be able to register for the final assessment. Please ensure that you complete the evaluation prior to registering for the assessment.
Good luck!
BASIC READING
Akerkar, R., & Sajja, P. S. (2016). Intelligent techniques for data science. New York, NY:
Springer International Publishing. Database: EBSCO
Hodeghatta, U. R., & Nayak, U. (2017). Business analytics using R—A practical approach.
New York, NY: Apress Publishing. Database: ProQuest
Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis.
New York, NY: Springer. Database: EBSCO
Skiena, S. S. (2017). The data science design manual. New York, NY: Springer International
Publishing. Database: EBSCO
FURTHER READING
UNIT 1
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century.
Harvard Business Review, 90, 70—76. Database: EBSCO Business Source Ultimate
Horvitz, E., & Mitchell, T. (2010). From data to knowledge to action: A global enabler for the
21st century. Washington, DC: Computing Community Consortium. (Available online).
UNIT 2
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From
big data to big impact. MIS Quarterly, 36(4), 1165—1188. Database: EBSCO
Cleveland, W. (2001). Data science: An action plan for expanding the technical areas of the
field of statistics. International Statistical Review, 69(1), 21—26. Database: EBSCO
UNIT 3
Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(1), 25—42. Database: EBSCO
UNIT 4
Mailund, T. (2017). Beginning data science in R, 125—204. New York, NY: Apress Publishing.
Database: ProQuest
Efron, B., & Hastie, T. (2016). Computer age statistical inference: Algorithms, evidence, and
data science. Cambridge: Cambridge University Press. (Available online).
UNIT 5
LEARNING OBJECTIVES
In the course book Introduction to Data Science, you will learn how and why data scientists extract important information from data. After an overview of the definition of data science and its benefits in various situations, you will learn how to characterize different data sources and how to outline the main activities of data science. Because predictive analysis, the core of data science, rests on an understanding of the underlying data characteristics, you will also learn the concepts of descriptive analytics and probability theory.
Afterwards, you will learn how to identify a data science use case in diverse organizations and how to derive the value proposition for each use case. Furthermore, you will learn how to assess the developed prediction model through evaluation metrics, as well as how to define the key performance indicators needed to determine whether or not its implementation in the business has been successful.
Because raw data come in different shapes and types and originate from several sources, you will learn about the quality issues that routinely degrade data, and the standard methods used to deal with missing values, irrelevant features, and duplicate records. The result is clean data that is valid for predictive analysis.
You will then become familiar with the different paradigms of machine learning and how a prediction model is developed. Finally, you will understand how the model's outputs can be presented effectively to the responsible business manager as a coherent picture of the underlying data, and how each of its parameters influences current and future performance. Hence, better decisions can be made and improved actions can be taken.
UNIT 1
INTRODUCTION TO DATA SCIENCE
STUDY GOALS
Case Study
Google is currently scanning and uploading the physical copies of books that have been published in the last 200 years so that their data is available online. This process is almost complete for 25 percent of all published books. The data acquired from these books is used to improve the search results when keywords are entered into the search engine. In addition to building this archive of data, Google launched a program, Google Ngrams, that allows us to observe language change throughout history by providing data such as when, and with what frequency, words and phrases have been published over time (Michel, 2011). With this data, questions about how the use of words and phrases has changed over time can be answered.
The above-mentioned example is a use case of “data science.” When the data is scanned
and uploaded, it is sorted into a specific field and undergoes a systematic analysis to
determine the information that can be taken from it.
Data Science
This is the combination of business, analytical, and programming skills that are used to extract meaningful insights from raw data.

The term “data science” applies to a wide variety of tools and techniques that help us to learn from data and solve problems with it. Like other scientific disciplines, data science is focused on the ways that people can understand data and use it for their benefit.
Data science is all about unlocking the real values and insights of the data. This is done by identifying complex fundamental behaviors, underlying trends, and hidden inferences. In a business setting, these analyses can enable companies to make smarter business decisions.
Modern technology is capable of collecting and storing huge volumes of data from, for example, customers, sensors, or social media. The amount of data that can be extracted from these sources could provide answers and solutions to many problems that businesses may have. Furthermore, the current advances in computing capabilities allow the innovative analysis of data related to longstanding problems.
Why Data Science?
In 2016, Visual Capitalist published a chart that shows the five largest publicly traded companies over a 15-year period. The ranking is based on the total dollar market value of each company's outstanding shares of stock, as given in the following figure (Desjardins, 2016).
As seen in the above figure, the top five companies have been replaced by companies that are involved in technology and online trade. These are Apple, Alphabet, Microsoft, Amazon, and Facebook. The key resource (and the product) of these five companies is “data”, and their daily work focuses on applying data science tools to that data.
Data science is not only implemented in technology related companies, but also in any
organization that has data to be analyzed. For example, a company that possesses data
about their users can apply data science to manage and analyze the data, gain meaningful
insights, and effectively extract useful information about the users.
The implementation of data science approaches can produce results from the data that
humans may not have previously picked up on. For example:
• One of the modern research methods in the field of biology is the use of data science (particularly deep learning techniques) to predict a human's age, blood pressure, smoking status, and more by analyzing images of their retina.
• The Canadian government is currently initiating a research program that will establish a prediction of suicide rates in the country using data science (mostly artificial intelligence techniques). Using the data collected from 160,000 anonymized social media accounts in Canada, the proposal is to identify underlying patterns associated with those who talk about or exhibit behaviors that could be linked to suicide (Pollock, 2018).

Deep Learning
The application of computational networks (with cascading layers of units) to learning tasks.

Artificial Intelligence
A set of approaches to enable a computer to emulate and thus automatize cognitive processes — often based on learning from data.
• Researchers at Rutgers University investigated how data science can be applied to more creative forms of data, which is one of the top challenges for machine intelligence. By using a dataset of more than 13,000 impressionist paintings, the researchers designed a tool that recognizes artistic features and applies them to other images. This tool is therefore able to produce images in the styles of famous painters (Saleh, 2014).
• Researchers at the University of Edinburgh and Stanford Hospital applied data science to skin cancer, the most common human malignancy. They utilized machine learning techniques on 129,450 clinical images and their associated disease labels. The developed tool was able to automatically identify skin cancer, including its most dangerous forms, from the images.

Machine Learning
A subset of artificial intelligence where mathematical models are developed to perform given tasks based on provided training examples.
Benefits of Data Science
The benefits of data science differ depending on the objectives of those applying it. The biggest advantage of using data science in an organization is that it enables the organization to improve its decision-making. When used in business, data-science-based decisions lead to increased profitability and to more efficient operations, business routines, and workflows. Concerning the customer-related aspects of business, data science helps companies recognize and understand their target audiences, and it supports the automation of HR recruitment tasks, such as shortlisting candidates during the hiring process, so that they can be performed more accurately.
Additionally, shipment companies can discover the optimum transportation modes, routes, and delivery times, and banking institutions can optimize the fraud detection process.
Figure 1: The Data Science Venn Diagram
Data science involves many diverse and/or overlapping subjects, which are, among others:
• machine learning,
• database storage and data processing,
• statistics,
• neuro-computing,
• knowledge discovery (KDD),
• data mining,
• pattern recognition, and
• data visualization.

Data Mining
This is the process of discovering patterns in large datasets.

These subjects work together to develop the complete analysis mechanism of data science which helps to discover useful information within the business’ data. This is presented in the extended data science Venn diagram, shown in the following figure.
Figure 2: The Extended Data Science Venn Diagram
Business Intelligence
This is a collection of routines that are used to analyze and deliver the business performance metrics.

It is worth noting that a commonly used term in the business management field is business intelligence (BI). However, BI mainly focuses on the descriptive analysis of the underlying data to explain the historical performance of the associated business, whereas data science is utilized when performing predictive analysis, and is used to predict future trends or provide evidence that can support strategic plans within the associated business.
The terms commonly used in data science are explained in the following overview:
Data Handling
Training Set: The dataset used by the machine learning model that will help it to learn its desired task.
Testing Set: These data are used to measure the performance of the developed machine learning model.
Outlier: A data record which is seen as exceptional and outside the distribution of the normal input data.
Data Cleansing: The process of removing redundant data, handling missing data entries, and removing, or at least alleviating, other data quality issues.

Data Features
Feature: An observable measure of the data, for example, the height, length, and width of a solid object. Other terms such as property, attribute, or characteristic are also used instead of feature.
Dimensionality Reduction: The process of reducing the dataset to fewer dimensions while ensuring that it conveys similar information.
Feature Selection: The process of selecting relevant features of the provided dataset.

Learning Paradigms
Machine Learning: Algorithms or mathematical models that use information extracted from data in order to achieve a desired task or function.
Supervised Learning: The subset of machine learning that is based on labeled data. It can be further distinguished into regression and classification.
Unsupervised Learning: The subset of machine learning that is based on unlabeled data. Typical unsupervised learning tasks are clustering and dimensionality reduction.
Deep Learning: The application of networks of computational units with cascading layers of information processing, used for learning tasks.

Model Development
Decision Model: A model that assesses the relationships between the elements of provided data to recommend a possible decision for a given situation.
Cluster Analysis: A type of unsupervised learning used to partition a set of data records into clusters. Records in a cluster are more similar to each other than to those in other clusters.

Model Performance
Probability: Quantification of how likely it is that a certain event occurs, or the degree of belief in a given proposition.
Standard Deviation: A measure of how spread out the data values are.
Type I Error: False positive output, meaning that it was actually negative but has been predicted as positive.
Type II Error: False negative output, meaning that it was actually positive but has been predicted as negative.
In 2016, Glassdoor published a list of the best jobs, taking into consideration their salaries,
career opportunities, and job openings. The profession “data scientist” was placed at the
top of the list.
The job of a data scientist starts with data exploration, and when they receive a challenging data-related question, they become detectives. They analyze the data and try to recognize patterns within it. This may require the application of a quantitative technique such as machine learning in order to delve further into the data and discover more information. This is a core process that provides strategic support to guide business managers who must decide how to act on the findings.
Effectively, a data scientist is someone who knows more about programming than a statistician, and more about statistics than a software engineer. A data scientist is able to manage data science projects. They store and clean large amounts of data, explore datasets to identify potential insights, build predictive machine learning models, and weave a story around the findings which can then be presented to the decision makers.
Major Activities of Data Science
The data science activities exist simultaneously in the three dimensions shown in the following figure. These are data flow, data curation, and data analytics. Each dimension represents a group of data science challenges, their associated solution methodologies, and their numerical techniques.
As a result, a data scientist follows a group of actions that encompasses all possible elements of the process that need to be addressed. This can be summarized as:
5. Analyze the data.
6. Communicate the results.
SUMMARY
Data science is a multidisciplinary field that has borrowed aspects from
statistics, pattern recognition, computer science, and operational
research. In short, data science derives information from data and
applies it to many different purposes, such as making predictions. The
importance of the extracted information depends on its application,
and, in general, provides a positive value when making decisions in an
associated organization.
UNIT 2
DATA
STUDY GOALS
Case Study
Human DNA consists of 3·10^9 base pairs that, in turn, are made of four building blocks (A, T, C, and G). While a complete sequence of this type can be stored in about 750MB of memory, reading and transcription of this blueprint into proteins is a complex process that can only be studied in detail by making use of the current advances in storage and computational capabilities. This allows biotechnology researchers to recognize complex DNA sequences, analyze the data for possible chronic diseases, and adapt medications according to a specific genomic structure. Hence, the relationships among the genetic features are investigated by predictive modeling techniques to provide physicians with a tool to automatically identify the important patterns within the DNA strands.
In 2016, CrowdFlower conducted a survey on 80 data scientists to find out “What do data
scientists spend the most time doing?” The outcome of this survey, as shown in the follow-
ing figure, indicates that 60 percent of their time is spent cleaning and organizing data,
and another 19 percent is spent collecting data sets (Glasson, 2017).
The facts, observations, assumptions, or incidences recorded in any business practice constitute the “data” of the underlying process. These data are processed to return the most important information about the associated business. This information represents the useful patterns and meaningful relationships among the data elements. An organization should use the information extracted from its business in order to enhance its sales, its marketing strategies, and its understanding of consumer needs. Therefore, any piece of information should be relevant, concise, error-free, and reliable so that it can serve this objective. Hence, efficient understanding and handling of the associated business data play a critical role; indeed, the most critical issue in any data science or modeling project is finding the right dataset. In this unit, a detailed discussion of the possible data types, sources, and shapes is presented, alongside the standard issues that routinely influence the quality of the collected data.
Types of Data
There are two types of data: quantitative and qualitative. Any characteristic of the collected data can be described as either a quantitative variable (i.e., numerical values), or a qualitative variable (i.e., non-numerical values). Examples of quantitative data are the number of people, students' GPA, and ambient temperature, whereas examples of qualitative data include customer feedback, softness of a product, and the answer to an open-ended question. A more detailed explanation about the differences between quantitative and qualitative data is provided in the table below.
Qualitative data: Data that describes qualities or characteristics.
Quantitative data: Data that can be expressed as a number or can be quantified.

Qualitative data types: words, objects, pictures, observations, and symbols.
Quantitative data types: numbers and statistics.

Questions that qualitative data answer: What characteristic or property is present?
Questions that quantitative data answer: How much? How often?

Purpose of qualitative data analysis: to identify important themes and the conceptual framework in an area of study.
Purpose of quantitative data analysis: to test hypotheses, develop predictions for the future, and check cause and effect.

Examples of qualitative data: happiness rating, gender, categories of plants, descriptive temperature of coffee (e.g., warm).
Examples of quantitative data: height of a student, duration of a green light, distance to planets, temperature of coffee (e.g., 30°C).
Shapes of Data
Data reveals itself in three shapes: structured, unstructured, and streaming. Structured data are those with a high degree of organization, and are arranged in tabular rows (containing the data transactions or data records) and columns (containing the data characteristics or data variables). Alternatively, unstructured data is considered to be the raw shape of the data with a non-uniform structure, which often includes text, numbers, and/or images. An e-mail is a simple example of an unstructured data shape, where the e-mail body may contain words, values, and some images. More complex mathematical tools are required to handle unstructured data and transform them into a format that reveals the information and patterns within the data. The following table shows a basic comparison between structured and unstructured data shapes.
Structured data characteristics: predefined data models; usually only text or numerical; easy to search.
Unstructured data characteristics: no predefined data models; may be text, images, or other formats; difficult to search.

Structured data applications: inventory control, airline reservation systems.
Unstructured data applications: word processing, tools for editing media.

Examples of structured data: phone numbers, customer names, transaction information.
Examples of unstructured data: reports, surveillance imagery, email messages.
If the data involve both the structured (tabular) shape and the unstructured shape, it is
called semi-structured data. The streaming data is continuously generated by different
sources (e.g., sensors, cameras, etc.), typically at high speeds. Such data is processed
incrementally without having access to all of the data. It allows users to access the content
immediately, rather than waiting for it to be downloaded. A particular feature of streaming
data is the large amount of data being created. This can be demanding in terms of its stor-
age and processing requirements.
Sources of Data
Data sources should be trustworthy enough to ensure that the collected data is high qual-
ity and robust enough for the next steps of processing. Common sources of data are
described in the following paragraphs.
Organizational and trademarked data sources
Large companies like Google and Facebook possess enormous amounts of data. They provide bulk downloads of public data for offline analysis in order to enrich the organization's market visibility. Google and Facebook also have internal data that their employees use. Almost all companies hold data themselves: the first and most important point of access is the set of internal systems recording the activities of their own business.
Federal governments are committed to open data so they can enable and enhance the
way that government fulfills its mission. Furthermore, governmental organizations release
demographic and economic data (e.g., population per area) every few years to be ana-
lyzed for the sake of better risk estimation.
Academic research creates large datasets, and many scientific journals require that these
datasets be made available to other researchers. Many fields are covered by the datasets,
including medical, economic, and historical research.
Webpages often provide valuable numerical and text data. For example, you can request all tweets with a certain hashtag from the Twitter webpage (e.g., #iPhoneX), and apply sentiment analysis to them in order to determine whether the majority of tweets containing that hashtag are positive or negative. The customer support division of an organization associated with this topic (e.g., Apple) can use this information to improve their business.
Media includes outputs such as video, audio, and podcasts, which all provide quantitative and qualitative insights concerning the characteristics of user interaction. Since media crosses all demographic borders, it is the quickest way for businesses to identify patterns and enhance their decision-making.
In data science, the underlying data is frequently big data. For example, in one minute, approximately 220,000 new photos are uploaded to Instagram and three hundred hours of video are uploaded to YouTube (Jogawath, 2015). The main characteristics that describe such data overloads, and the obstacles in handling them, are volume, variety, veracity, validity, and velocity (called the “5Vs of data”).
Volume
Volume refers to the amount and scale of the data. An airplane fitted with 5,000 sensors generates about 10GB of data for every second it is in flight (Rapolu, 2016). Current estimates put yearly data creation at around 50 zettabytes, while forecasts see a dramatic increase in the upcoming years (Holst, 2019). Current research and development in computational and storage technology aims to alleviate the resulting data handling challenges.
Variety
There is a considerable variety of data. Previously, most data were generated in a structured shape to simplify the forthcoming data science processes. Today, much of the created data can be considered unstructured, and requires more advanced mathematical techniques to handle it.
Velocity
Velocity of data refers to the speed at which the data is created, stored, analyzed, and visualized. Computational tools need significant amounts of time to process data and update databases.
Veracity
Veracity refers to the quality of the data. Given the volume, variety, and velocity with which data arrives in current information processing settings, it cannot be guaranteed that the processed data is perfectly correct or precise. Such quality impairments, often called noise, have to be accounted for when interpreting the outcomes of any form of analysis.
Validity
The validity of data is another important aspect. The data may be correct and noise-free,
but it may be outdated or otherwise unsuitable for the question at hand. If this is the case,
it will not result in any meaningful conclusions after its analysis.
2.3 Data Quality
Outlier
An outlier is a data record which is seen as exceptional and incompatible with the overall pattern of the data.

The collected data commonly suffers from quality issues because of imperfect data sources or issues in the data collection process. Such data is problematic due to the existence of values which are noisy, inaccurate, incomplete, inconsistent, missing, duplicate, or outliers. It is important to note that there are “true” outliers and “fake” outliers. The fake outliers are data records which do not seem to match the patterns followed by the majority of the other data records, but they are still possible, although unlikely, outcomes of the underlying process.
There are many approaches that can be used when handling data quality issues, and more
than 80 percent of a data scientist’s time is spent dealing with these.
In some data records, there may be values which have not been observed (i.e., missing
values) or were incorrectly observed (e.g. outliers) during data collection.
Several methods are routinely employed to resolve the issue of missing values and outli-
ers.
1. Removal of data records that contain missing values and/or outliers: This method is
recommended for large datasets where the removal of some records will not affect
the data as a whole. This method can only be used after it has been confirmed that
removing the chosen records will not influence the results. For example, under
unusual conditions, a sensor may be unable to deliver a normal value. In such a case,
removing the record might lead to the exclusion of an interesting aspect of the data-
set that would provide valuable information about the operation of the sensor.
2. Replacement of the missing value or outlier with an interpolated value from neighboring records: For example, we have a dataset of temperatures at different times of the day (time: {11:00, 11:01, 11:02}, temperature: {20 °C, x, 22.5 °C}) where x is a missing temperature value or an outlier (i.e., an out-of-range value). The value of x is replaced by the linearly interpolated value obtained from the recorded temperatures on either side of the missing value:
x = (20 + 22.5) / 2 = 21.25 °C
3. Replacement of the missing value or outlier with the average value of its variable
across all data records.
4. Replacement of the missing value or outlier with the most-often observed value for its
variable across all data records.
A new variable may be introduced into the dataset with a value of “0” for the normal data
records and “1” for the data records containing missing and/or outlier values that were
handled by one of the above methods. By doing so, we ensure that the original informa-
tion is not lost.
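To make these options concrete, the following Python sketch applies them to a tiny, made-up temperature table; it assumes the pandas and NumPy libraries are available, and all column names are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with a missing value at 11:01
df = pd.DataFrame({
    "time": ["11:00", "11:01", "11:02"],
    "temperature": [20.0, np.nan, 22.5],
})

# Flag the affected records first, so the original information is not lost
df["was_imputed"] = df["temperature"].isna().astype(int)

# Method 1: remove records that contain missing values/outliers
dropped = df.dropna(subset=["temperature"])

# Method 2: linear interpolation from neighboring records ((20.0 + 22.5) / 2 = 21.25)
interpolated = df.assign(temperature=df["temperature"].interpolate(method="linear"))

# Method 3: replace with the average value of the variable
mean_filled = df.assign(temperature=df["temperature"].fillna(df["temperature"].mean()))

# Method 4: replace with the most often observed value (mode; ties resolved by taking the first)
mode_filled = df.assign(temperature=df["temperature"].fillna(df["temperature"].mode().iloc[0]))
```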
Duplicate Records
If there are duplicate records within the dataset, they are removed before proceeding with
the data analysis in order to reduce computing time and prevent the distortion of the ana-
lytics outcome.
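A minimal sketch of this step (again assuming pandas and an invented transaction table) is shown below.

```python
import pandas as pd

# Invented transaction table with one fully identical (duplicate) record
transactions = pd.DataFrame({
    "customer": ["A", "B", "A", "A"],
    "amount":   [10.0, 25.0, 10.0, 7.5],
})

# Keep only the first occurrence of each identical record
deduplicated = transactions.drop_duplicates(keep="first")
```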
Redundancy
Other issues that may appear within the dataset are related to the existence of redundant
and irrelevant variables. We identify these issues by applying correlation analysis to each
pair of variables. This allows us to resolve redundancies without losing any important
information from the dataset by removing the variables that show high correlation
towards other variables. The correlation between two variables can be seen in the follow-
ing figure where the highly irregular shape on the left indicates that the variables are not
correlated, the somewhat dispersed shape in the middle figure indicates partial correla-
tion, and the line shape on the right indicates strong correlation.
The correlation coefficient (ρ) between two data variables x and y is calculated as:

$$\rho_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the average values of variables x and y, respectively, for a dataset of n records.
The correlation coefficient is a statistical measure of the degree of the linear relationship between two variables, and it lies in the range [−1, 1]. If ρ = 1, the two variables are fully correlated, whereas ρ = 0 indicates no (linear) correlation between the variables. Negative correlation coefficients imply that the variables are anti-correlated, meaning that when x goes up, y goes down, and vice versa. We can set a threshold on the value of |ρ|, and if the correlation exceeds this threshold, one of the two variables can be removed from the dataset with negligible influence on performance.
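The redundancy-removal procedure described above can be sketched as follows; the threshold of 0.9 and the column names are illustrative assumptions, not values prescribed by the text.

```python
import pandas as pd

# Invented dataset: "tax_paid" is a fixed fraction of "income", hence redundant
data = pd.DataFrame({
    "income":    [1200, 1500, 1800, 2100, 2400],
    "tax_paid":  [240, 300, 360, 420, 480],
    "purchases": [3, 7, 2, 9, 4],
})

corr = data.corr()   # pairwise correlation coefficients (rho)
threshold = 0.9      # illustrative cutoff on |rho|

# For every highly correlated pair, mark one of the two variables for removal
to_drop = set()
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if abs(corr.loc[col_i, col_j]) > threshold and col_j not in to_drop:
            to_drop.add(col_j)

reduced = data.drop(columns=sorted(to_drop))
print(reduced.columns.tolist())   # ['income', 'purchases']
```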
Data engineering focuses on the practical applications of data collection and analysis. For
all the work that data scientists do in order to answer questions using data, there should
be tools to gather and validate that data, and then apply it to real-world operations.
The building of a reliable system to handle the data (especially for big data processes) is not a straightforward task, and it may take 60 to 80 percent of the data scientist's effort before the data is ready for meaningful information and patterns to be extracted.
The collected data should be kept free of erroneous data samples and fake outliers, and the way in which missing data values will be handled should be determined. Improving data quality typically requires detailed knowledge of the domain in which the data are recorded.
It is also essential to protect the data and determine the legal frameworks and policies
that should be followed. Physical threats and human errors should be identified whenever
possible.
In some scenarios, data transformation is required to convert the collected dataset into a
form suitable for applying data science. The main transformation methods are variable
scaling, decomposition, and aggregation, which are shown in the following table.
Variable scaling: The dataset may include variables of mixed scales. For example, a dataset may contain income values in dollars, number of purchases per month, and amount of car fuel consumed per month. The modeling techniques work on scaled variable values, e.g., between −1 and 1, to ensure that all analyzed variables are weighted equally. The scaling may be performed by normalizing a variable's values with respect to their maximum value. The other option is to remove the variable's average and divide by the standard deviation of the variable.

Variable decomposition: Some variables may need to be further decomposed for better data representation. For example, a time variable may be decomposed into hour and minute variables. Furthermore, it may turn out that only one of the two variables (hour or minute) is relevant, so the irrelevant variable is removed from the dataset.

Variable aggregation: Alternatively, two variables may be more meaningful if they are merged (i.e., aggregated) into one variable. For example, “gross income” and “paid tax” variables may be aggregated into one variable, “net income.”
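The three transformation methods can be illustrated with a small pandas sketch; the column names and values below are invented for illustration only.

```python
import pandas as pd

# Invented records with a time stamp, gross income, and paid tax
df = pd.DataFrame({
    "time": ["08:15", "14:40"],
    "gross_income": [3000.0, 4200.0],
    "paid_tax": [600.0, 900.0],
})

# Variable decomposition: split "time" into hour and minute variables
df[["hour", "minute"]] = df["time"].str.split(":", expand=True).astype(int)

# Variable aggregation: merge gross income and paid tax into net income
df["net_income"] = df["gross_income"] - df["paid_tax"]

# Variable scaling: normalize by the maximum value, or standardize to z-scores
df["net_income_norm"] = df["net_income"] / df["net_income"].max()
df["net_income_std"] = (df["net_income"] - df["net_income"].mean()) / df["net_income"].std()
```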
The algorithms and calculations used in data processing must be highly accurate, well-
built, and correctly performed so that there is no negative effect on the decisions made
based on the results. The benefits of data processing, especially in medium and large
organizations, are:
A case study is carried out so that the online merchants can gain a complete picture about
how customers are utilizing their web services; essentially, they are looking for a 360
degree view of their customer. Therefore, a large set of unstructured data is collected and
combined with the structured customers’ transaction data. By applying data processing to
this case study, valuable information is obtained. This information could be, for example,
which pages a customer visits, how long a customer stays on a page, and which products
the customer buys. This information will lead to improved business management and bet-
ter decisions being made regarding a certain product and/or service.
Another case study is the Internet of Things (IoT), which includes connected devices and sensors on a platform. These devices are in the customer's environment and collect millions of data records every week about the usage of each device. This unstructured big data should be transformed into a set of structured data records in order to enable further analysis of the devices' performance.
In general, the top applications that obtain and use data while also effectively employing
data science are given below.
Industrial processes data applications
Data are obtained at different levels of the production process. Ingredients and actuators
data are obtained at the field level, signals data at the control level, monitoring sensor
data at the execution level, and indicators data at the planning level.
The main goal of applying data science to industrial processes is to automate and opti-
mize them, and to improve the competitive situation of the company.
Business data applications
Data are obtained and analyzed in many business domains, such as customer data, portfolio data, human resources data, marketing data, sales data, and pricing data.
The main goal of applying data science to business data is to better understand, motivate,
and drive the business processes. From the workflow of the management data system,
bottlenecks of the business processes can easily be identified and/or the sales predictions
can be estimated.
Text data applications
Data formatted as text serve as important information resources and are applied in many different settings. Examples are text documents, e-mails, and web documents. These data use particular organizational criteria, such as record fields or attribute structures.
The main goal when applying data science to text data is to filter, search, extract, and
structure information.
Image data applications
Data formatted as images are easily obtained nowadays due to the advances in imaging sensor technology. These sensors range from smartphone cameras to satellite cameras, providing large amounts of two-dimensional and/or three-dimensional image data.
The main goal of applying data science to image data is to find and recognize objects, ana-
lyze and classify scenes, and relate image data to other information sources.
Medical data applications
Data in the medical field are obtained at all stages of patient and medication laboratory experiments. Furthermore, patient health records and clinical care programs are also medical-field-related data.
The main goal of applying data science to medical data is to analyze, understand, and annotate the influences and side effects of medication in order to detect and predict different levels of certain diseases.
SUMMARY
Data is collected from many different sources, such as companies, gov-
ernments, web pages, and media platforms. These data are either quan-
titative or qualitative, and they are formatted as structured, unstruc-
tured, or streaming data. There are five characteristics to be taken into
consideration when handling big data. These are volume, variety, veloc-
ity, veracity/validity, and value of the collected data.
UNIT 3
DATA SCIENCE IN BUSINESS
STUDY GOALS
Case Study
The finance sector contains many interesting and highly valuable opportunities for the
application of data science methods. As an example, take the determination of the credit
worthiness of a customer. In this task, the goal is to estimate how likely it is that a loan
given to that particular customer is paid back with the contractually agreed interest. The
information that such a decision can be based on includes their monthly/annual earnings,
real estate ownership or rental, any debt, deposits, and more. No matter which concrete
measures are used, the challenge lies in the estimation of future behavior based on data
of past transactions with the customer. This characteristic places the problem squarely
into the field of predictive analytics.
Prediction techniques are applied through data science use cases (DSUCs) to extract valuable information from collected data. The DSUC in any business can be identified through three main points: effort, risk, and achieved value. The potential of a new project is often measured by how much improvement can be made to the operational business. Therefore, an organization should focus its analysis on reducing effort and increasing gain, as demonstrated in the following figure.
Figure 6: Identification of an Organization’s Use Cases
Organizations must identify which use cases they are going to tackle, and then ensure the
availability of suitable datasets. Hence, some important questions have to be answered:
• What is the value of the knowledge gained by applying data science tools to that data-
set?
• What will be discovered about the input dataset and its hypothesis?
• What value will be added to the organization through applying data science techniques?
• What will the organization’s decision be if the data science produces disappointing
results?
The following figures show the obtained value(s) after applying data science techniques to
some more common use cases.
Figure 7: Achieved Value by Data Science in “Customer”-Related Use Cases
Figure 9: Achieved Value by Data Science in “Financial Fraud”-Related Use Cases
After identifying the DSUC, a data scientist needs to look into all resources that are available to the business in order to find a relevant dataset. If one does not exist, then a new dataset is built from the available resources. Depending on the type of data and its uses, data could be sourced from internal or external databases, web scraping, or sensor data.
Data collection is often a tedious and costly task as it may require human intervention. Humans are involved in the data collection phase to study the data, label it, add valuable comments, correct errors, and even observe some data anomalies. They must then decide whether to manually correct these specific data points or just exclude them.
Once these tasks have been completed, preprocessing techniques (e.g., data cleansing, handling of missing values, etc.) are applied to the data to correct any kind of error or noise, and the data are scanned for redundant or missing values and records. The employees who carry out the data scrubbing should have significant knowledge of the domain of the data in order to make efficient decisions concerning the way to deal with the detected data errors. By the end of the preprocessing stage, any incorrect or redundant information should have been removed, and the size of the dataset itself will be reduced.
Next, machine learning methods are applied to build a prediction model on top of the refined dataset. The purpose of the model is to define the relationship between the data inputs and the DSUC values. This is achieved by establishing a mathematical function that accepts the inputs and produces outputs. The dataset is divided into two parts: the training set and the testing set. The training set is used in the building and learning part of the model, whereas the testing set is used to check the accuracy of the developed prediction model. A common way of performing this split is sketched below.
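The sketch below uses scikit-learn's train_test_split on a synthetic dataset; the 80/20 split ratio is an illustrative assumption, not a value prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 records with 3 features and a binary DSUC target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80% of the records for building/learning, 20% for checking accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```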
Once the model’s accuracy is acceptable, it is implemented to deliver the DSUC values for
the current dataset with the ability to predict future values if new data scenarios are
added to the dataset. If the nature of the input dataset begins to change due to the addi-
tion of new data records, then the model must be updated accordingly.
Developed by Dorard (2017), the machine learning canvas is a tool that is used to identify data science use cases and provide a visual user procedure. Managers can apply this to their business problems, which can then be analyzed by machine learning techniques, as shown in the following figure.
The tool has a simple single-page user interface that provides all of the required functionalities, ranging from identifying use cases to achieving the value proposition.
For example, the canvas can be used in the domain of real estate. It is useful when investi-
gating risky investments and comparing the real estate’s price predictions with the actual
prices to determine the best deals. The process of creating a machine learning prediction
model for this use case is shown below.
Figure 11: DSUC: Real Estate Problem
Model-Centric Evaluation: Performance Metrics
The output of the developed prediction model is either a class or category (classification model), or a number or probability (regression model). We will discuss the metrics that are routinely implemented to evaluate the performance of each of these model types.
For a DSUC designed with only two possible outputs {“yes”, “no”}, the decision of the out-
put is dependent on a threshold assigned to the model. When the model is applied to a
data record, there are only four possible outcomes. These are true positive, true negative,
false positive and false negative.
• True positive (TP): The classifier predicts the label “yes” for a data record whose actual label is “yes” (a correct prediction).
• True negative (TN): The classifier predicts the label “no” for a data record whose actual label is “no” (a correct prediction).
• False positive (FP): The classifier predicts the label “yes” for a data record whose actual label is “no” (an incorrect prediction).
• False negative (FN): The classifier predicts the label “no” for a data record whose actual label is “yes” (an incorrect prediction).
These four possible results are usually presented in a matrix form called the confusion matrix, as shown below.

                     Model Output “yes”    Model Output “no”
Actual label “yes”   True positive (TP)    False negative (FN)
Actual label “no”    False positive (FP)   True negative (TN)
For the four possible outputs, there are three performance metrics to measure the model quality. These are precision, accuracy, and recall, as explained in the following equations:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$
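As a small illustration (plain Python with invented confusion matrix counts), the three metrics can be computed directly from the four outcome counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    return tp / (tp + fn)

# Invented confusion matrix counts
tp, tn, fp, fn = 80, 90, 10, 20
print(precision(tp, fp))         # 0.888...
print(accuracy(tp, tn, fp, fn))  # 0.85
print(recall(tp, fn))            # 0.8
```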
To distinguish between the two classes of the classification model {“yes”, “no”}, a thresh-
old has to be applied. This cutoff value could be set to a certain percentage that is decided
upon during the analysis, and any output value that exceeds this cutoff value will be con-
sidered a “yes”, while all lower outputs will be considered a “no”. Therefore, the model
performance is dependent on the cutoff value which affects the number of true positives,
true negatives, false positives, and false negatives accordingly.
The receiver operator characteristic (ROC) curve shows how altering the cutoff value could
change the true positive and false positive rates. An ideal model would be able to com-
plete the classification operation with 100 percent accuracy, meaning that it could pro-
duce a true positive value of 100 percent and a false positive value of zero percent. Since
no model in reality can be that accurate, the ROC curve helps to find a more realistic
threshold value at which true positive is at its highest rate and false positive is at its lowest
rate. The following steps should be followed to create a ROC curve:
1. A cutoff value for the model output is chosen.
2. The false positive rate at this cutoff is calculated:
$$\text{False Positive Rate} = \frac{FP}{FP + TN}$$
3. The true positive rate at this cutoff is calculated:
$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
4. Every point on the ROC curve has coordinates of (False Positive Rate, True Positive Rate).
5. Another cutoff value is chosen, and steps 2 to 4 are repeated, resulting in the ROC curve shown below.
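The threshold sweep behind the ROC curve can be sketched with scikit-learn, which computes the false positive and true positive rates for all cutoff values at once; the labels and scores below are invented for illustration.

```python
from sklearn.metrics import roc_curve

# Invented true labels and predicted probabilities for the class "yes"
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.50]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"cutoff={th:.2f}  false positive rate={f:.2f}  true positive rate={t:.2f}")
```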
Figure 12: Receiver Operator Characteristic (ROC) Curve
The objective is to measure how close a regression model's output (y) is to the desired output (d). There are standard metrics that evaluate the accuracy and performance of the model: the absolute error, relative error, square error, mean absolute error, mean absolute percentage error, mean square error, and root mean square error, as given in the following equations.

$$\text{Absolute error: } \varepsilon = |d - y|$$

$$\text{Relative error: } \varepsilon^{*} = \frac{|d - y|}{d} \cdot 100\,\%$$

$$\text{Square error: } \varepsilon^{2} = (d - y)^{2}$$

$$\text{Mean absolute error: } MAE = \frac{1}{n} \sum_{i=1}^{n} |d_i - y_i|$$

$$\text{Mean absolute percentage error: } MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|d_i - y_i|}{d_i} \cdot 100\,\%$$

$$\text{Mean square error: } MSE = \frac{1}{n} \sum_{i=1}^{n} (d_i - y_i)^{2}$$

$$\text{Root mean square error: } RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - y_i)^{2}}$$
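A compact NumPy sketch of these error metrics, using invented target and prediction vectors, is shown below.

```python
import numpy as np

d = np.array([3.0, 5.0, 2.5, 7.0])   # desired outputs
y = np.array([2.5, 5.0, 3.0, 8.0])   # model outputs

mae  = np.mean(np.abs(d - y))              # mean absolute error
mse  = np.mean((d - y) ** 2)               # mean square error
rmse = np.sqrt(mse)                        # root mean square error
mape = np.mean(np.abs((d - y) / d)) * 100  # mean absolute percentage error in %

print(mae, mse, rmse, mape)
```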
The end user decides how to line up the model’s output in order to fit in the business goals
and objectives. For example, in fraud detection, the user has the ability to decide the
range of percentages at which a suspicious transaction or behaviour is considered to be a
true fraud. In this case, the selected threshold will have a tradeoff between false negatives
and false positives. These tradeoffs should be taken into consideration to maximize the
effectiveness of the model. Using different values for the threshold enables the business
managers to consider different scenarios.
In some cases, the end user may have to make a decision that directly impacts data
records, as well as a decision about the value of the thresholds. For example, one feature
that may exist in a dataset is product price. In some cases the price of the product may
need to be modified. On such an occasion, the model should be capable of accommodat-
ing these changes and should be able to be re-trained.
The ultimate goal of a smart model is the automation of user’s decisions. These decisions
are often dependent on the model’s ability of prediction. For example, a model could be
designed to analyze hotel reviews and decide whether these reviews are fake or not. If the
model’s predictions are highly accurate, then a review can automatically be accepted or
rejected without the need for any human intervention.
After a model has been evaluated successfully with the aforementioned evaluation metrics, it is ready for deployment. At this stage, the model should be able to produce a trusted DSUC value for the associated business problem. The decision makers must then be
confident that the DSUC is correctly implemented in a way that will help the company to
meet their business goals. Quantification of the model’s merit is achieved by defining so-
called Key Performance Indicators (KPIs). These are measurements that express to what
extent the business goals have been met or not. Most KPIs focus on increased efficiency,
reduced costs, improved revenue, and enhanced customer satisfaction.
Characteristics of effective KPIs
There are several characteristics that determine whether or not a KPI is a suitable measure of whether the business goals have been achieved. These characteristics are summarized as:
Examples of KPIs
Some effective and commonly utilized KPIs that measure the performance of a DSUC from a business-centric point of view are shown in the following figure.
The following table presents common cognitive biases and their associated de-biasing techniques.
Table 11: The Common Cognitive Biases and Their De-biasing Techniques

Anchoring: Occurs when the estimation of a numerical value is based on an initial value (anchor), which is then insufficiently adjusted to provide the final answer. De-biasing: remove anchors, provide numerous and counter anchors, use various experts with specific anchors.

Confirmation: Occurs when there is a desire to confirm one's belief, leading to unconscious selectivity in the acquisition and use of evidence. De-biasing: use multiple experts for assumptions, challenge probability assessments with counterfactuals, use sample evidence for alternative assumptions.

Insensitivity (to sample size): Sample sizes are ignored and extremes are considered equally likely in small and large samples. De-biasing: use statistics to determine the likelihood of extreme results in different samples, use the sample data to prove the logical reason behind extreme statistics.
SUMMARY
This unit focuses on the role and value of data science in business. Ele-
ments of a suitable use case are discussed, and the way that different
departments need to work together is explained. There is also an
explanation of how value is created and which types of use case are suit-
able for machine learning implementations. The evaluation of the per-
formance of a data science model is presented in the form of numerical
metrics, as well as the actual metrics which need to be monitored, such
as business critical KPIs. As a result, the machine learning predictions
can be turned into operational decisions in order to derive value from
them. From an evolutionary perspective, DSUCs are highly influenced by
cognitive biases, which can affect human judgment and the consequent
business decisions.
UNIT 4
STATISTICS
STUDY GOALS
Case Study
A data scientist working for a school is given a dataset that consists of the mathematics
grades that have been obtained by students over the last ten years. In order to extract the
main properties of the dataset and predict the students’ performances next year, statistics
and probability analysis must be applied to the underlying dataset.
Statistics can be broadly divided into two separate fields. Descriptive statistics delivers tools that capture important properties of the data and summarize the observations found within it. The main tools are the mean, minimum, count, sum, and median, which help us to reduce a large dataset to a few summary statistics. Meanwhile, probability theory offers a formal framework for considering the likelihood of possible events. This is used in inferential statistics to predict how likely it is that an event will occur in the future. In this case, it is a prediction of how likely it is that next year's students will achieve higher marks in mathematics.
Hence, statistical analysis is mainly used to check whether or not data observations make sense, while probability theory investigates the possible consequences of these observations.
The real world data found in most applications are neither deterministic (then the data
scientists would not need to apply statistical inference or machine learning on them), nor
completely random (then data scientists could not apply machine learning or any other
inferential technique and would therefore be unable to predict anything based on the
data). Nevertheless, real world data lies somewhere in between where the data scientists
are able to predict an output as well as the probability of its occurrence. Therefore, almost
all of the realistic systems can only be described in terms of probabilities.
For example, if the variable's values are (2, 3, 4, 5, 6, 7, and 9), then:

$$\text{Mean} = \frac{2 + 3 + 4 + 5 + 6 + 7 + 9}{7} = 5.14$$

$$\text{Median} = 5$$
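These values can be verified quickly with the Python standard library (a trivial sketch):

```python
import statistics

values = [2, 3, 4, 5, 6, 7, 9]
print(statistics.mean(values))    # 5.142857... ≈ 5.14
print(statistics.median(values))  # 5
```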
Probability Theory
Probability
Also written as (P), probability is simply defined as the chance of an event happening.

Probability theory is the core theory for many of the data science techniques. If the occurrence of an event is impossible, its probability is P = 0. However, if an event is certain, then its probability is P = 1. The probability of any event is a number between 0 and 1. A probability cannot be a negative value, and the sum of probabilities over all possible outcomes must always be 1.
Two contradicting events cannot happen to the same object at the same time. For example, if a client's account has made a profit, it cannot have simultaneously made a loss. Opposite events such as this are defined as mutually exclusive events, which are described in the following figure.
Meanwhile, two mutually independent events can happen simultaneously without affecting one another. For example, a company can make a profit and have legal issues at the same time. These two events do not impact each other. For independent events A and B:

$$P(A \text{ and } B) = P(A \cap B) = P(A) \cdot P(B)$$

$$P(A \text{ or } B) = P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Figure 15: Mutually Independent Events
Conditional probability
When two events are correlated, the conditional probability P(A|B) is defined as the probability of an event A, given that event B has already occurred, as illustrated in the following figure:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
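As a short worked example (not taken from the course text): let A be the event that a fair die shows an even number and B the event that it shows a value greater than 3. Then

$$P(B) = \frac{3}{6} = \frac{1}{2}, \qquad P(A \cap B) = P(\{4, 6\}) = \frac{2}{6} = \frac{1}{3}$$

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/3}{1/2} = \frac{2}{3}$$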
A correlation between two variables does not imply that one of these variables is caused
by the other variable. For example, if medicine is consumed by a group of people, there is
a high correlation between the fact that they are taking the medicine and the probability
that these people are sick, but the medicine does not cause the sickness.
Probability distribution function
Consider a random variable that can take on a given set of values. The occurrence of each
of these values has a certain probability. The function that maps outcomes with their
respective probability is called a probability distribution function.
The probability distribution function can be visualized as a graph, where the x-axis repre-
sents the possible values of the variable, and the y-axis indicates the probability of each
value. For example, if a given dataset contains results from the rolling of two dice, where
the random variable is the sum of the outcomes of the two dice in each roll, then this vari-
able will have a minimum value of 2 (when each of the two dice show “1” as their output),
and a maximum value of 12 (when each of the two dice show “6” as their output). For each
possible outcome of the first die (e.g., “4”), there are six possible outcomes for the second
die (“1”, “2”, “3”, “4”, “5”, or “6”). As a result, the total number of possible value combinations is 6 (first die outcomes) · 6 (second die outcomes) = 36 outcomes. However, not all outcomes produce a unique sum.
The probability of obtaining the value of 5 for our random variable (i.e., the sum of the two dice is 5) is based on obtaining one of these possible dice rolling outcomes:

(“first die”, “second die”) = (“1”, “4”), (“2”, “3”), (“3”, “2”), (“4”, “1”) = 4 events

This means the probability of the random variable having a value of 5 is:

p(5) = 4 possible outcomes / 36 outcomes = 0.11

Similarly, the value 6 can be obtained from 5 possible outcomes, so:

p(6) = 5 possible outcomes / 36 outcomes = 0.138
In the same manner, we can calculate the probability that we will get each of the variable’s
values and therefore, we can form the probability density function as shown below.
Figure 17: The Probability Density Function
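The calculation above can be verified with a few lines of Python. The following sketch (an illustration only, not part of the original example) enumerates all 36 equally likely outcomes and derives the probability of each possible sum.

from collections import Counter

# Enumerate all 36 equally likely outcomes of rolling two dice
# and count how often each sum occurs.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

# Convert counts into probabilities to obtain the distribution function.
distribution = {s: c / 36 for s, c in sorted(counts.items())}

print(round(distribution[5], 2))   # 0.11  (4/36)
print(round(distribution[6], 3))   # 0.139 (5/36)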
In principle, every random variable can have its own unique distribution. In reality, however, most random variables can be closely approximated by, or follow, a distribution from a set of well-known parametric distribution functions. Some examples that occur regularly are discussed below.
Normal distribution
Arguably one of the most common distributions, the normal distribution has a bell-shaped curve. Because, in many naturally occurring scenarios, attributes are distributed symmetrically around their mean value, this distribution represents a significant amount of real-life data. One such real-life example is the performance assessment of an organization’s employees: some employees are considered high performers, while others are considered low performers, but most employees’ performance will be around the average. This is represented in the normal distribution figure below. The normal distribution has about 68 percent of the possible values within one standard deviation of the mean, while two standard deviations cover 95 percent of the values. Finally, the interval of ±3 standard deviations contains 99.7 percent of the values.
Figure 18: The Normal Distribution
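The 68/95/99.7 rule quoted above can be checked empirically. The following Python sketch, assuming NumPy is available, draws a large random sample from a normal distribution with arbitrarily chosen parameters and measures the share of values within one, two, and three standard deviations of the mean.

import numpy as np

rng = np.random.default_rng(seed=0)
mu, sigma = 100.0, 15.0                       # illustrative mean and standard deviation
sample = rng.normal(mu, sigma, size=1_000_000)

# Fraction of values within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    inside = np.mean(np.abs(sample - mu) <= k * sigma)
    print(f"within ±{k} sigma: {inside:.3f}")   # approximately 0.683, 0.954, 0.997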
Binomial distribution
If a coin is tossed twice, what is the probability of “heads” occurring once? What is the
probability of “heads” occurring twice? What is the probability that “heads” will not occur
at all? The table below represents the possible outcomes when a coin is tossed twice.
Outcome    First Toss    Second Toss
1          Heads         Heads
2          Heads         Tails
3          Tails         Heads
4          Tails         Tails
The probability of “heads” occurring twice is one out of four possible outcomes: p(two heads) = 1/4 = 0.25. However, the probability of “heads” occurring once in two throws is two out of four possible outcomes: p(one head) = 2/4 = 0.5. The probability of “heads” not occurring at all is again one out of four possible outcomes: p(no heads) = 1/4 = 0.25. These probabilities can be represented by the binomial distribution, as demonstrated in the figure below:
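These coin-toss probabilities follow directly from the binomial formula. The short Python sketch below, using only the standard library, computes them; the helper function binom_pmf is introduced here purely for illustration.

from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two fair coin tosses (n=2, p=0.5): probability of 0, 1, or 2 heads.
for heads in range(3):
    print(heads, binom_pmf(heads, 2, 0.5))   # 0.25, 0.5, 0.25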
Poisson distribution
p(x) = (e^(−μ) · μ^x) / x!

where μ is the mean number of occurrences, and x is the required number of occurrences.
If an average of 10 calls per day are sent to a call center, what is the probability that the
call center will receive exactly seven calls on a given day?
p(7) = (e^(−10) · 10^7) / 7! = 0.09
Figure 20: The Poisson Distribution
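The call-center calculation can be reproduced with the Poisson formula given above. The following minimal Python sketch defines an illustrative helper function poisson_pmf for this purpose.

from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean number of occurrences is mu."""
    return exp(-mu) * mu**x / factorial(x)

# Call center receiving an average of 10 calls per day:
# probability of exactly 7 calls on a given day.
print(round(poisson_pmf(7, 10), 2))   # 0.09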
There are many processes that are considered to follow a Poisson distribution, e.g., sales records, cosmic rays, and radioactive decay. A Poisson distribution is justified when the occurring events are discrete (i.e., they can be counted), are independent, and cannot occur at exactly the same time, while the rate at which events occur can be considered constant.
Bayesian Statistics
Bayesian statistics is a unique branch of statistics that does not interpret probabilities as
frequencies of occurrences, but rather as an expectation of belief.
In general, the Bayes theorem is established by the following conditional probability equa-
tion for two random events A and B.
p(A|B) = (p(B|A) · p(A)) / p(B)

Here, p(A) is the prior belief about A, p(B|A) is the likelihood of observing the evidence B given A, and p(A|B) is the resulting posterior belief about the variable A after observing the evidence B. This is summarized in the following figure.
Figure 21: Bayesian Statistics
In any statistical setting, we typically consider the sample data to be created according to
a fixed yet unknown parametric probability distribution. For the task of statistical analysis,
however, all we have is the one sample realization, and we try to infer the properties of
that unknown distribution. Bayes theorem employs the rules of conditional probability to
probabilistically quantify our knowledge about the relevant parameters of the data gener-
ating process.
An example of Bayesian statistics is Helmenstine’s drug test analysis (2017), shown in the
following figure.
Figure 22: Drug Test Example
Here, U, ¬U, +, and − stand for a drug user, a non-user, a positive drug test, and a negative drug test, respectively. If 0.5 percent of the training set are drug users (P(U) = 0.5%), and the probability that a drug test will be positive when taken by a drug user is 99 percent (P(+|U) = 99%), then the probability that the outcome will be negative is the remaining one percent (P(−|U) = 1%). Consequently, all other conditional probabilities are reported in the above figure.
What will be the probability that a new data record (i.e., a new person in the training set)
with a positive drug test outcome is actually a drug user?
P(U|+) = (P(+|U) · P(U)) / P(+)

P(U|+) = (P(+|U) · P(U)) / (P(+|U) · P(U) + P(+|¬U) · P(¬U))

P(U|+) = (0.99 · 0.005) / (0.99 · 0.005 + 0.01 · 0.995) = 33.2%
Since the probability is only 33.2 percent, it implies that even if a test produces a positive
result for a person, it is more likely that the person is not a drug user.
This result applies if we assume that the person follows the same prior probability as the
general population (i.e., P(U)=0.5%). However, if we know more about a specific person
(e.g., if they have used drugs in the past, if they have a medical condition that makes rec-
reational drug use dangerous etc.), the prior changes, as does the posterior prediction.
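The drug-test posterior can be reproduced with a few lines of Python. The sketch below simply encodes the stated prior and conditional probabilities and applies Bayes’ theorem; the variable names are chosen for illustration.

# Prior and conditional probabilities from the drug test example.
p_user = 0.005              # P(U): prior probability of being a drug user
p_pos_given_user = 0.99     # P(+|U)
p_pos_given_nonuser = 0.01  # P(+|¬U)

# Total probability of a positive test (law of total probability).
p_pos = (p_pos_given_user * p_user
         + p_pos_given_nonuser * (1 - p_user))

# Bayes' theorem: posterior probability of being a user given a positive test.
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(f"P(U|+) = {p_user_given_pos:.1%}")   # 33.2%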
Hence, knowledge of the posterior is critical, as it represents our best state of knowledge after observing the evidence. This is important because, in machine learning, the training data represent our best knowledge about a problem, and it is critical that this dataset is as accurate as possible; assuring data quality is therefore very important.

As a concluding remark, the above example shows how the prior probability p(U) is updated to the posterior probability p(U|+) when taking into account the observed model output (i.e., a positive test result). This update is the kind of result produced by a classifier that is designed to predict the outcome for new data.
SUMMARY
In this unit, an overview of the importance of probability and statistics in data science is presented. The data in most real-world applications are neither deterministic (in which case the application of machine learning would be unnecessary) nor completely random (in which case no predicted outputs could be estimated). Instead, the data lie somewhere between these two extremes, where data scientists can apply machine learning techniques to obtain predictions together with probabilities that quantify their uncertainty (or volatility).
UNIT 5
MACHINE LEARNING
STUDY GOALS
Case Study
A company selling home appliances plans to offer an online purchasing service for its
products. The marketing team has proposed different advertisement campaigns to attract
customers that visited certain webpages. After a few months, the data analytics team was
able to see which campaign(s) brought in the highest revenue, as well as who the top cus-
tomers were. The data analytics team had to examine the purchasing data in order to find
the patterns that indicated which campaign had the highest revenue. After analyzing the
data, they could also make a prediction about the type of customer who bought a large
number of products or spent a lot of money. When dealing with a vast amount of purchasing data that is continually added to over time, such a task becomes almost impossible to carry out manually. Therefore, machine learning tools are introduced. They employ the computational capabilities of a machine to handle this large amount of purchasing data, to develop a model that learns from the data, and to accomplish both the pattern-finding and the prediction tasks automatically and efficiently.
Machine learning provides systems with the ability to improve and learn from experience
without being explicitly programmed. In machine learning, a learning algorithm is devel-
oped to extract knowledge and uncover the properties of the data. Ultimately, its goal is to
predict future outcomes, taking into account new data scenarios as they arrive. Results are evaluated using predefined accuracy metrics and then translated into improved business decisions using a given objective and key performance indicators.
To this end, machine learning employs descriptive statistics to summarize salient proper-
ties of the data and predictive analytical techniques to derive insights from training data
that are useful in subsequent applications.
In order to see why learning to perform a task from data is better than operating based on
a fixed set of instructions, consider the task of recognizing a person from an image.
Undoubtedly, it is difficult to write a classical computer program with the objective of rec-
ognizing a specific person. If, however, we have constructed a learning system that can use
example data in a given task, we can provide it with a large amount of human facial
images as well as the information about who is depicted in these images. Then, a mathe-
matical or algorithmic model is fit to the data in order to uncover the underlying relation
and use it to generate the outputs for new samples. The developed model is called a
machine learning model.
As the name already indicates, machine learning uses some form of computational resource or device to process the data. For example, the inputs could be advertising campaigns (i.e., independent variables), and the associated output could be the revenue (i.e., the dependent, or target, variable). The output of the machine learning process is then a model (i.e., a program) that is able to predict the returns for future campaigns, as schematically described in the following figure.
The developed machine learning models are applied in a variety of different settings such
as vision/language processing, forecasting, pattern recognition, games, data mining,
expert systems, and robotics. The applications of machine learning thus cover many different fields.
It is all about data! We are given a dataset of inputs (independent variables) and outputs
(labels or dependent variables), and machine learning is applied to discover the
patterns within the inputs and/or to predict the relationships between the inputs and out-
puts. These relationships can be utilized later to predict the outputs for new inputs. The
possible outputs in any dataset are either continuous or discrete values. For example, the
outputs in a dataset of students’ marks are continuous values ranging from 0 to 100, while
the outputs in a weather dataset may be discrete values such as [0: for windy, 1: for sunny,
and 2: for cloudy].
For continuous outputs, machine learning builds a prediction model called a regression
model. For discrete outputs, the prediction model is called a classification model, and the
outputs are classes, one for each possible discrete value.
On the other hand, when the objective is to discover the hidden patterns within the data
inputs (i.e., the outputs may not be provided), machine learning performs a clustering
analysis to group the inputs into clusters according to their level of similarity. This is based
on the values of the given independent variables.
There is a broad variety of machine learning paradigms in existence today, each of them
corresponding to a particular abstract learning task. For our purposes, the three most
important ones are supervised learning, unsupervised learning, and semi-supervised
learning. Supervised learning denotes the learning task when both the data inputs and the
desired outputs are provided, and includes classification and regression approaches. The
notion of unsupervised learning relates to the discovery of patterns in the data inputs and
includes the clustering analysis. Unsupervised learning is considered an important fron-
tier of machine learning because most big datasets do not come with labels (i.e., the
desired outputs are not known). Semi-supervised learning covers tasks that involve parti-
ally labelled data sets.
Supervised learning is a paradigm of machine learning that is used when the given dataset
contains both inputs (independent variables, xi) and desired output (dependent variable,
y). The objective of supervised learning is to develop a mathematical model (f) that
relates the output to the inputs, and can predict the output for future inputs, as clarified in
the following equation:
y = f(xi),  i = 1, …, n
Where n is the total number of variables (i.e., characteristics) of the data samples.
In classification, the output (y) belongs to a set of finite and discrete values that denote
the predicted class or classes. In regression, however, the output (y) belongs to a range of
infinite continuous values that define the numerical outcome(s).
During the model’s learning, its parameters are continuously updated until the optimum
setting is achieved. This updating process is governed by a specific loss function, and the
objective is to adjust the parameters so that this loss function is minimized. For regression
problems, this loss function can be the mean squared error (MSE), and for classification
problems, the loss function can be the number of wrongly classified instances.
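As a minimal illustration of these two loss functions, the following Python sketch computes the mean squared error for a small regression example and counts the misclassified instances for a small classification example; the arrays are invented for demonstration purposes.

import numpy as np

# Regression: mean squared error between predictions and targets.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 7.1])
mse = np.mean((y_true - y_pred) ** 2)

# Classification: number of wrongly classified instances.
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 1])
n_errors = np.sum(labels_true != labels_pred)

print(f"MSE = {mse:.3f}, misclassified = {n_errors} of {labels_true.size}")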
The structure of the supervised learning procedure is shown in the figure below. In the fig-
ure, the inputs from the training set are used to teach the model through one of the availa-
ble classification/regression algorithms. Afterwards, the model is implemented to predict
the output for the testing sample(s), which were not presented during the model’s train-
ing. If the predicted output is within an acceptable range of the desired output, the model
is accepted and can complete the prediction task in the future. Otherwise, the learning
process has to be repeated.
Classification analysis
An example dataset includes hundreds of e-mails and their attributes, where each e-mail
is labeled as either spam or not spam. Can we predict whether a newly received e-mail is
spam? Another example dataset related to red wine samples uses their physicochemical
variables as inputs, and their associated quality (poor, normal, or excellent) as outputs. If
a new wine sample is provided, along with its physicochemical variables, can we predict
its quality?
The two problems described above are typical classification problems, where the dataset
is a collection of labeled-data records in the form: {independent variables as inputs, and
the associated classes (i.e., labels) as outputs}. The task is to develop a machine learning
model to relate the inputs to the outputs, and to predict the class of new inputs. In classifi-
cation, the outputs are finite and categorical, and the developed model has to assign a
single class to new inputs.
In practice, the dataset is divided into two sets, the training set and the testing set. The
training set is employed to develop the classification model, while the testing set is uti-
lized afterwards to evaluate the accuracy of the developed model. The outputs of the
model can be presented in the form of a confusion matrix, and the evaluation of a classification model is done using metrics such as precision, recall, the ROC curve, or the AUC.
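A minimal sketch of such an evaluation, assuming scikit-learn is available and using invented spam-filter labels, could look as follows.

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical test-set labels for a spam classifier (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual classes, columns: predicted classes
print(precision_score(y_true, y_pred))    # share of predicted spam that really is spam
print(recall_score(y_true, y_pred))       # share of actual spam that was found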
Regression analysis
In regression problems, the task is to develop a machine learning model that predicts a
numerical value, not a class. Therefore, the desired outputs form a continuum of values. In
general, the developed model is a mathematical function that relates the outputs to the
inputs.
One example is the historical dataset of a real estate valuation. If multiple instances of
characteristics of houses in a city zone are provided, as well as the price for each of these
houses, can we predict the price of a different house if its characteristics are known?
As in the case of classification, there are many techniques that can be used to develop a regression model for a given problem. Regression evaluation metrics are routinely calculated so that the best model can be selected. Examples of suitable measures include the mean absolute percentage error (MAPE), the mean squared error (MSE), and the mean absolute error (MAE).
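These measures can be computed directly, as the following sketch with invented house-price predictions illustrates.

import numpy as np

# Hypothetical house prices (in thousands) and model predictions.
y_true = np.array([250.0, 300.0, 180.0, 420.0])
y_pred = np.array([240.0, 310.0, 200.0, 400.0])

mae = np.mean(np.abs(y_true - y_pred))                    # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)                     # mean squared error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # mean absolute percentage error

print(f"MAE = {mae:.1f}, MSE = {mse:.1f}, MAPE = {mape:.1f}%")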
There are many applications where supervised learning is implemented, as seen in the fol-
lowing table.
Example Dataset                      Prediction                                                  Type
Previous loans that were paid        Will this client default on a loan?                         Classification
Previous weeks’ visa applications    How many businesspeople will apply for a visa next week?    Regression
Decision tree based methods: A flowchart-like tree structure, shown in the following fig-
ure, where each internal node denotes a test on a particular variable of the dataset, each
branch denotes the outcome of the test, and each leaf node holds a class label.
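A minimal sketch of this method, assuming scikit-learn is available and using an invented toy dataset, could look as follows.

from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset: [income, years_as_customer] -> will default (1) or not (0).
X_train = [[20, 1], [35, 4], [50, 10], [22, 2], [60, 12], [28, 1]]
y_train = [1, 0, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# Each internal node of the fitted tree tests one variable; leaves hold class labels.
print(tree.predict([[40, 6]]))   # predicted class for a new record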
K-nearest neighbors method: The training set is represented as points in the Euclidean
space, and the class label for each element of the testing set is determined according to
the label of the K closest training points (i.e., nearest neighbors), as demonstrated in the
following figure.
Figure 26: K- Nearest Neighbors Method
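The same idea can be sketched in a few lines with scikit-learn; the two-dimensional toy points below are invented for illustration.

from sklearn.neighbors import KNeighborsClassifier

# Training points in a two-dimensional feature space with known class labels.
X_train = [[1.0, 1.2], [0.8, 1.0], [3.1, 3.0], [3.3, 2.9], [0.9, 0.7], [3.0, 3.2]]
y_train = ["A", "A", "B", "B", "A", "B"]

# A test point is assigned the majority label of its K nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[2.9, 3.1]]))   # -> ['B']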
Naïve Bayes method: This method is based on Bayes theorem, and is designed for catego-
rical data. An entity is sorted into the class with the highest posterior probability in rela-
tion to the values of the features in the corresponding data record. The different features
in the record are assumed to be independent random variables. This simplifies the calcu-
lation of probabilities to a tractable problem. The qualifier “naïve” in the name Naïve
Bayes stems from the fact that this presumed independence cannot be taken for granted,
yet is assumed anyway.
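A minimal naïve Bayes sketch for integer-encoded categorical data, assuming scikit-learn is available, could look as follows; the features and their encoding are invented for illustration.

from sklearn.naive_bayes import CategoricalNB

# Integer-encoded categorical features, e.g. [weather, day_type] -> purchase yes (1) / no (0).
X_train = [[0, 1], [1, 1], [0, 0], [2, 0], [1, 0], [2, 1]]
y_train = [1, 1, 0, 0, 0, 1]

# Each feature is treated as an independent categorical variable (the "naive" assumption).
model = CategoricalNB()
model.fit(X_train, y_train)
print(model.predict([[0, 1]]))        # most probable class
print(model.predict_proba([[0, 1]]))  # posterior probabilities per class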
Support Vector Machines (SVM) method: This is a binary classification method (i.e. a
method for separating the input data into two classes) that seeks to construct a linear boundary between the classes. In real-world settings, however, data points from different
classes are rarely linearly separable. The support vector method therefore addresses this
problem by projecting the data to a higher-dimensional space where a linear separation is
feasible. The SVM technique seeks to adjust the classification boundary so that the margin
is maximized in order to obtain the optimum separation between the two classes. Data
elements lying on the margin are called support vectors.
Figure 27: Support Vector Machines (SVM) Method
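The following sketch, assuming scikit-learn is available, fits an SVM with a radial basis function kernel to an invented two-class toy dataset and inspects its support vectors.

from sklearn.svm import SVC

# Two-class toy data; an RBF kernel implicitly maps the points to a
# higher-dimensional space where a linear separation is feasible.
X_train = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
y_train = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)

print(svm.support_vectors_)        # data points lying on the margin
print(svm.predict([[1.8, 2.0]]))   # class for a new point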
Linear regression method: Linear regression is used to find a linear function that best rep-
resents a set of given sample points. Denoting the target variable as y and the independ-
ent variables as [x1, x2, …, xm], the model is described by the following equation:
y = w0 + w1·x1 + w2·x2 + … + wm·xm + ε

where w0 is the so-called bias, and the coefficients (w1, w2, …, wm) are the weights. The goal is to optimize the bias and weights in such a way that the error term (ε) between the model output and the given target values of the training set is minimal.
Figure 28: Linear Regression Method
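A minimal linear regression sketch, assuming scikit-learn is available and using invented house data, could look as follows; the fitted intercept and coefficients correspond to the bias w0 and the weights.

from sklearn.linear_model import LinearRegression

# Invented house data: [size_sqm, rooms] -> price (in thousands).
X_train = [[50, 2], [80, 3], [120, 4], [65, 2], [150, 5]]
y_train = [150, 240, 360, 190, 450]

reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.intercept_)              # bias w0
print(reg.coef_)                   # weights w1, w2
print(reg.predict([[100, 3]]))     # predicted price for a new house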
Artificial neural network (ANN) method: The ANN was first proposed in the 1950s as a theoretical model of the computational processes in nerve cells. Compared to our current understanding of neural processes, this model can
only be seen as a rough analog, yet its application to machine learning problems has been
highly fruitful. The network is composed of many layers of computational units, so-called
neurons. The input layer is for the input values of the dataset variables, the output layer
produces the value of the target variable, and the intermediate layers are called the hid-
den layers, as demonstrated in the following figure. The construction of artificial neural
networks with numerous cascading hidden layers is called deep learning.
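A small feed-forward network can be sketched with scikit-learn as follows; the two hidden layers of eight neurons each and the toy data are chosen purely for illustration.

from sklearn.neural_network import MLPClassifier

# Small two-class dataset; the network has two hidden layers of 8 neurons each.
X_train = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8], [0.1, 0.2], [0.8, 0.9]]
y_train = [0, 0, 1, 1, 0, 1]

ann = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0)
ann.fit(X_train, y_train)
print(ann.predict([[0.85, 0.95]]))   # predicted class for a new input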
Unsupervised Learning
If you are given a basket with some unlabeled objects, and you plan to group objects that
are the same, you will pick a random object and select any physical characteristic of it,
such as its surface shape. Afterwards, you will pick all other objects that have similar
shapes to the initial object, and group them together. Then, you will repeat the process
until all objects are clustered into groups. This process is called unsupervised learning
because you do not know the name of any of the given objects.
Unsupervised machine learning is implemented in order to handle problems with unlabeled datasets. Thus, the provided dataset consists of inputs (independent variables, xi) while the output (dependent variable, y) is not known. One reason this situation is common is that acquiring labels can become expensive in many big data applications.
The aim of unsupervised learning is to discover the natural patterns within the given
inputs, which may result in dimensionality reduction, and/or clustering the data instances
into groups according to their relative similarity. The structure of unsupervised machine
learning is shown in the following figure.
Figure 31: Unsupervised Learning Structure
While supervised learning tries to find a functional relationship between dependent and
independent variables, unsupervised learning aims to find intrinsic structure or patterns
in the data. Additionally, unsupervised learning techniques are used to reduce the dimen-
sionality of data while retaining important structural properties.
The cost function in an unsupervised learning model can be the minimum quantization
error, the minimum distance between similar data instances, or the maximum likelihood
estimation of the correct cluster.
Clustering analysis
Unsupervised learning is utilized in situations where the outcomes are unknown.
Thus, we can either cluster the data to reveal meaningful partitions and hierarchies, or
find association rules that relate to the involved data’s features.
For example, a core theme in marketing is obtaining insights into the customer demo-
graphic. One way to achieve this is to find so-called customer segments, i.e. groups of sim-
ilar or comparable customers. Once these segments and their relative sizes are found,
marketing, or even product design efforts, can be targeted specifically for these segments.
Since there are no pre-defined labels which could inform such a segmentation, the definition of segments has to be derived entirely from patterns in the customer features.
Clustering is used to gather data records into natural groups (i.e., clusters) of similar sam-
ples according to predefined similarity/dissimilarity metrics, resulting in extracting a set of
patterns from the given dataset. The contents of any cluster should be very similar to each
other, which is called high intra-cluster similarity. However, the contents of any cluster
should be very different from the contents of other clusters. This is called low inter-cluster
similarity.
There are two commonly implemented, simple forms of the distance function, which are
the Euclidean distance and the Manhattan distance.
For two-dimensional datasets (i.e., having two features), the Euclidean and Manhattan distance functions are given by the following equations:

d_Euclidean(A, B) = √((xA − xB)² + (yA − yB)²)

d_Manhattan(A, B) = |xA − xB| + |yA − yB|

where (xA, yA) and (xB, yB) are the coordinates (i.e., features) of data records A and B respectively, and d is the value that represents the distance between the two data records.
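Both distance functions can be implemented in a few lines, as the following illustrative sketch shows.

from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two 2-D data records."""
    return sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def manhattan(a, b):
    """Sum of absolute coordinate differences between two 2-D data records."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

A, B = (1.0, 2.0), (4.0, 6.0)
print(euclidean(A, B))   # 5.0
print(manhattan(A, B))   # 7.0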
It is worth mentioning that features of a dataset whose scales have widely differing ranges should be standardized to the same scale before beginning the clustering analysis.
The clustering evaluation is usually completed by manual inspection of the results, bench-
marking on existing labels, and/or by distance measures to denote the similarity level
within a cluster and the dissimilarity level across the clusters.
The clustering analysis is applied in many fields including pattern recognition, image pro-
cessing, spatial data analysis, bio-informatics, crime analysis, medical imaging, climatol-
ogy, and robotics. One of the most famous areas for clustering applications is the market
segmentation, which focuses on grouping customers into clusters of different characteris-
tics (payment history, customers’ interests, etc.). Another common application is to imple-
ment clustering analysis in order to develop a recommendation system, for example to
cluster similar documents together or to recommend similar songs/movies.
Some examples of unsupervised learning applications can be seen in the following table.
Table 14: Unsupervised Learning Examples
Since clustering has been, and still is, an active area of research, there are many methods
and techniques that have been developed to determine how the grouping of data records
is performed. The basic clustering techniques are the K means clustering method and the
agglomerative clustering method.
The K-means clustering method is an algorithm used to group given N data records into K
clusters. The algorithm is straightforward and can be explained in the following steps:
1. Select K initial cluster centroids (e.g., K randomly chosen data records).
2. Calculate the distance between each data record and each of the K centroids:

d(i, c) = √((x1,i − x1,c)² + (x2,i − x2,c)² + … + (xM,i − xM,c)²)

where (x1, x2, …, xM) are the M data variables, while i and c denote the ith data record and the cluster’s centroid respectively.

3. Assign each data record to the cluster whose centroid has the minimum distance to it.
4. Recalculate the new centroid for each cluster by averaging its included data records.
5. Repeat steps (3) and (4) until there are no further changes in the calculated centroids.
6. The final clusters are formed by their included data records.
The agglomerative clustering method, in contrast, builds a hierarchy of clusters from the bottom up:

1. Assign each record of the given N data records to a unique cluster, forming N clusters.
2. Afterwards, the data records (i.e., clusters) with minimum Euclidean distance between
them are merged into a single cluster.
3. The process is repeated until we are left with one cluster, hence forming a hierarchy of
clusters.
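Both basic techniques are available in scikit-learn. The following sketch, using an invented two-dimensional dataset, applies K-means and agglomerative clustering with K = 2.

from sklearn.cluster import KMeans, AgglomerativeClustering

# Six two-dimensional data records forming two obvious groups.
X = [[1.0, 1.1], [0.9, 0.8], [1.2, 1.0], [8.0, 8.2], [8.1, 7.9], [7.8, 8.1]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))       # cluster index for each record
print(kmeans.cluster_centers_)     # final centroids

agglo = AgglomerativeClustering(n_clusters=2)
print(agglo.fit_predict(X))        # labels from the bottom-up merging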
Semi-Supervised Learning
In semi-supervised learning, only a small portion of the dataset is labeled, and a model is first trained on this labeled subset. The model trained on the supervised examples is then used to label the hitherto unlabeled data instances. From these newly labeled data points, the ones with the highest confidence are added to the supervised training set. Iteratively repeating this procedure finally leads to a classification boundary that makes use of all the available information, as shown in the following figure.
Figure 33: Semi-Supervised Learning (Clustering Step)
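The iterative procedure described above can be sketched as a simple self-training loop. The example below is a minimal illustration with an invented dataset and a K-nearest-neighbors base classifier; the confidence threshold of 0.9 is an arbitrary choice.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented data: a few labeled points (two classes) and several unlabeled ones.
X_labeled = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = np.array([[0.1, 0.3], [2.8, 3.1], [0.4, 0.2], [3.1, 3.3]])

base = KNeighborsClassifier(n_neighbors=1)
for _ in range(5):                            # a few self-training iterations
    if len(X_unlabeled) == 0:
        break
    base.fit(X_labeled, y_labeled)
    proba = base.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= 0.9      # keep only high-confidence predictions
    if not confident.any():
        break
    # Move the confidently labeled points into the labeled training set.
    X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
    y_labeled = np.concatenate([y_labeled, proba[confident].argmax(axis=1)])
    X_unlabeled = X_unlabeled[~confident]

print(len(X_labeled), "labeled records after self-training")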
The advantage is that a lot of effort and computational cost are saved, because collecting
and labelling large datasets can be very expensive. Furthermore, the patterns and similari-
ties among the data instances are discovered, which brings more insight into the dataset
structure.
A popular application for semi-supervised learning is speech analysis. Here, the task is to
identify words from audio files of utterances. While recording spoken words is easily
accomplished and data of this kind is abundant, labeling the data is a very time consum-
ing process.
SUMMARY
In this unit, an introduction to machine learning in data science is pre-
sented, giving an overview of the involved definitions and concepts.
Machine learning is an inductive process that automatically builds a pre-
diction model and extracts relevant patterns by learning the natural
structure of a given dataset.
The output of the developed model is a discrete value in classification
problems, and a continuous value in regression problems. Meanwhile, if
the datasets are not labeled with an output variable, the machine learn-
ing objective is to retrieve the important patterns by applying clustering
analysis.
BACKMATTER
LIST OF REFERENCES
Baldassarre, M. (2016). Think big: Learning contexts, algorithms and data science.
Research on Education and Media, 8(2), 69—83. Retrieved from https://content.sciendo
.com/view/journals/rem/8/2/article-p69.xml
Brownlee, J. (2019, September 18). How to create an ARIMA model for time-series forecast-
ing in Python [blog post]. Retrieved from https://machinelearningmastery.com/arima-
for-time-series-forecasting-with-python/
Dalinina, R. (2017, January 10). Introduction to forecasting with ARIMA in R [blog post].
Retrieved from https://www.datascience.com/blog/introduction-to-forecasting-with-
arima-in-r-learn-data-science-tutorials
Desjardins, J. (2016, August 12). The largest companies by market cap over 15 years
[chart]. Retrieved from https://www.visualcapitalist.com/chart-largest-companies-ma
rket-cap-15-years/
Dorard, L. (n.d.). The machine learning canvas [PDF document]. Retrieved from https://ww
w.louisdorard.com/machine-learning-canvas
Giasson, F. (2017, March 10). A machine learning workflow [blog post]. Retrieved from http
://fgiasson.com/blog/index.php/category/artificial-intelligence/
Hackernoon. (2018, June 2). General vs narrow AI [blog post]. Retrieved from https://hacke
rnoon.com/general-vs-narrow-ai-3d0d02ef3e28
Helmenstine, A. M. (2017, August 12). Bayes theorem definition and examples [blog post].
Retrieved from https://www.thoughtco.com/bayes-theorem-4155845
Jogawath, A. K. (2015, September 28). Introducing Hadoop—HDFS and map reduce [blog].
Retrieved from https://ajaykumarjogawath.wordpress.com/tag/big-data/
Le Dem, J. (2016). Efficient data formats for analytics with Parquet and Arrow [presenta-
tion slides]. Retrieved from https://2016.berlinbuzzwords.de/sites/2016.berlinbuzzwo
rds.de/files/media/documents/berlin_buzzwords_2016_parquet_arrow.pdf
Malhotra, A. (2018, February 1). Tutorial on feed forward neural network: Part 1. Retrieved
from https://medium.com/@akankshamalhotra24/tutorial-on-feedforward-neural-ne
twork-part-1-659eeff574c3
MerchDope. (2019, September 29). 37 mind blowing YouTube facts, figures, and statistics:
2019 [blog post]. Retrieved from https://merchdope.com/youtube-stats/
Michel, J., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, ... Aiden,
E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science,
331(6014), 176—182. Retrieved from https://science.sciencemag.org/content/331/601
4/176
Montibeller, G., & Winterfeldt, D. (2015). Cognitive and motivational biases in decision and
risk analysis. Risk Analysis, 35(7), 1230—1251.
Nau, R. (2014). Notes on nonseasonal ARIMA models [PDF document]. Retrieved from http:
//people.duke.edu/~rnau/Notes_on_nonseasonal_ARIMA_models--Robert_Nau.pdf
PeerXP. (2017, October 17). The 6 stages of data processing cycle [blog post]. Retrieved
from https://medium.com/peerxp/the-6-stages-of-data-processing-cycle-3c2927c466f
f
Pollock, N. J., Healey, G. K., Jong, M., Valcour, J. E., & Mulay, S. (2018). Tracking progress in
suicide prevention in Indigenous communities: A challenge for public health surveil-
lance in Canada. BMC Public Health, 18(1320). Retrieved from https://bmcpublichealth
.biomedcentral.com/articles/10.1186/s12889-018-6224-9
Polson, N., & Scott, S. (2011). Data augmentation for support vector machines. Bayesian
Analysis, 6(1), 1—23. Retrieved from https://projecteuclid.org/download/pdf_1/euclid.
ba/1339611936
Prakash, R. (2018, June 19). 5 different types of data processing [video]. Retrieved from htt
ps://www.loginworks.com/blogs/5-different-types-of-data-processing/
Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis.
Wiesbaden: Springer Vieweg.
Saleh, B., Abe, K., Arora, R. S., & Elgammal, A. (2014). Toward automated discovery of artis-
tic influence. Multimedia Tools and Applications, 75, 3565—3591.
Shaikh, F. (2017, January 19). Simple beginner’s guide to reinforcement learning & its
implementation [blog post]. Retrieved from https://www.analyticsvidhya.com/blog/2
017/01/introduction-to-reinforcement-learning-implementation/
Statista. (2020). Volume of data/information created worldwide from 2010 to 2025 [chart].
Retrieved from https://www.statista.com/statistics/871513/worldwide-data-created/
Thakur, D. (2017). What is data transmission? Types of data transmission [article].
Retrieved from http://ecomputernotes.com/computernetworkingnotes/communicati
on-networks/data-transmission
Tierney, B. (2012, June 13). Data science is multidisciplinary [blog post]. Retrieved from htt
ps://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html
Wenzel, F., Galy-Fajou, T., Deutsch, M., & Kloft, M. (2017). Bayesian nonlinear support vec-
tor machines for big data, presented at the European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery in Databases, Skopje, 2017.
Retrieved from https://arxiv.org/abs/1707.05532
LIST OF TABLES AND
FIGURES
Table 1: Top Five Traded Companies (2001–2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Table 10: The Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Table 11: The Common Cognitive Biases and Their De-biasing Techniques . . . . . . . . . . . . 47
Figure 31: Unsupervised Learning Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt
Mailing Address
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.org