
INTRODUCTION TO DATA SCIENCE
DLBDSIDS01
INTRODUCTION TO DATA SCIENCE
MASTHEAD

Publisher:
IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt

Mailing address:
Albert-Proeller-Straße 15-19
D-86675 Buchdorf
[email protected]
www.iu.de

DLBDSIDS01
Version No.: 001-2023-0714
N. N.

© 2023 IU Internationale Hochschule GmbH


This course book is protected by copyright. All rights reserved.
This course book may not be reproduced and/or electronically edited, duplicated, or dis-
tributed in any kind of form without written permission by the IU Internationale Hoch-
schule GmbH (hereinafter referred to as IU).
The authors/publishers have identified the authors and sources of all graphics to the best
of their abilities. However, if any erroneous information has been provided, please notify
us accordingly.

PROF. DR. THOMAS ZÖLLER

Mr. Zöller teaches in the field of data science at IU International University of Applied Scien-
ces. He focuses on the fields of advanced analytics and artificial intelligence and their key
role in digital transformation.

After studying computer science with a minor in mathematics at the University of Bonn, Mr.
Zöller received his doctorate with a thesis in the field of machine learning in image process-
ing. This was followed by several years of application-oriented research, including time spent
at the Fraunhofer Society. Throughout his professional career, Mr. Zöller has worked in vari-
ous positions focusing on the fields of business intelligence, advanced analytics, analytics
strategy, and artificial intelligence, while also gaining experience in the areas of defense tech-
nology, logistics, trade, finance, and automotive.

TABLE OF CONTENTS
INTRODUCTION TO DATA SCIENCE

Module Director . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Introduction
Signposts Throughout the Course Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Basic Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Unit 1
Introduction to Data Science 13

1.1 “Data Science” Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


1.2 Data Science’s Related Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Data Science’s Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Unit 2
Data 23

2.1 Data Types & Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


2.2 The 5Vs of Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Data Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Unit 3
Data Science in Business 35

3.1 Identification of Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


3.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Data-Driven Operational Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Cognitive Biases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Unit 4
Statistics 49

4.1 Importance of Statistics in Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50


4.2 Important Statistical Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Unit 5
Machine Learning 61

5.1 Role of Machine Learning in Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62


5.2 Overview of ML Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

Appendix
List of References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
List of Tables and Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

INTRODUCTION
WELCOME
SIGNPOSTS THROUGHOUT THE COURSE BOOK

This course book contains the core content for this course. Additional learning materials
can be found on the learning platform, but this course book should form the basis for your
learning.

The content of this course book is divided into units, which are divided further into sec-
tions. Each section contains only one new key concept to allow you to quickly and effi-
ciently add new learning material to your existing knowledge.

At the end of each section of the digital course book, you will find self-check questions.
These questions are designed to help you check whether you have understood the con-
cepts in each section.

For all modules with a final exam, you must complete the knowledge tests on the learning
platform. You will pass the knowledge test for each unit when you answer at least 80% of
the questions correctly.

When you have passed the knowledge tests for all the units, the course is considered fin-
ished and you will be able to register for the final assessment. Please ensure that you com-
plete the evaluation prior to registering for the assessment.

Good luck!

BASIC READING
Akerkar, R., & Sajja, P. S. (2016). Intelligent techniques for data science. New York, NY:
Springer International Publishing. Database: EBSCO

Hodeghatta, U. R., & Nayak, U. (2017). Business analytics using R—A practical approach.
New York, NY: Apress Publishing. Database: ProQuest

Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis.
New York, NY: Springer. Database: EBSCO

Skiena, S. S. (2017). The data science design manual. New York, NY: Springer International
Publishing. Database: EBSCO

FURTHER READING
UNIT 1

Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century.
Harvard Business Review, 90, 70—76. Database: EBSCO Business Source Ultimate

Horvitz, E., & Mitchell, T. (2010). From data to knowledge to action: A global enabler for the
21st century. Washington, DC: Computing Community Consortium. (Available online).

UNIT 2

Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From
big data to big impact. MIS Quarterly, 36(4), 1165—1188. Database: EBSCO

Cleveland, W. (2001). Data science: An action plan for expanding the technical areas of the
field of statistics. International Statistical Review, 69(1), 21—26. Database: EBSCO

UNIT 3

Dorard, L. (2017). The machine learning canvas [website]. (Available online).

Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Per-
spectives, 19(1), 25—42. Database: EBSCO

UNIT 4

Mailund, T. (2017). Beginning data science in R, 125—204. New York, NY: Apress Publishing.
Database: ProQuest

Efron, B., & Hastie, T. (2016). Computer age statistical inference: Algorithms, evidence, and
data science. Cambridge: Cambridge University Press. (Available online).

UNIT 5

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press. Database: ProQuest

Shalev-Shwartz, S. (2015). Understanding machine learning: From theory to algorithms. Cambridge: Cambridge University Press. (Available online).

LEARNING OBJECTIVES
In the course book Introduction to Data Science, you will learn how and why data scien-
tists extract important information from data. After an overview of the definition of data
science and its benefits when applied to various situations, you will learn how to characterize different data sources and how to outline the main activities of data science. Because predictive analysis, the basis of data science, relies on an understanding of the underlying data characteristics, you will also learn the concepts of descriptive analytics and probability theory.

Afterwards, you will learn how to identify a data science use case in diverse organizations
and how to obtain the value proposition for every use case. Furthermore, you will learn
how to assess the developed prediction model using evaluation metrics, as well as how to define the necessary key performance indicators in order to determine whether or not its implementation in the business has been successful.

Because raw data come in different shapes and types and from several sources, you will learn about the quality issues that routinely degrade data and the traditional methods used to deal with missing values, irrelevant features, and duplicate records. This results in clean data that is valid for predictive analysis.

You will also become aware of the different paradigms of machine learning and how a prediction model is developed. Finally, you will understand how the model's outputs can be effectively presented to the responsible business manager as a complete picture of the underlying data, and how each of its parameters influences current and future performance.
Hence, better decisions can be made and improved actions can be taken.

UNIT 1
INTRODUCTION TO DATA SCIENCE

STUDY GOALS

On completion of this unit, you will have learned …

– what is meant by data science.


– why we need data science.
– the main terms and definitions relating to data science.
– what the role of a data scientist is.
– the typical activities carried out within the field of data science.
1. INTRODUCTION TO DATA SCIENCE

Case Study
Google is scanning and uploading physical copies of books published over the last 200 years so that their data is available online; roughly 25 percent of all published books have already been processed. The data acquired from these books is used to improve search results when keywords are entered into the search engine. In addition to building this archive of data, Google launched a program, Google Ngrams, that allows us to observe language change throughout history by providing data on when, and with what frequency, words and phrases have been published over time (Michel, 2011). With this data, you can answer the following questions:

• How has language changed over time?


• How and why do new words become popular?
• How do the meanings of words change over time?
• How do spelling standards improve or weaken over time?
• Which words frequently appear together in publications?

The above-mentioned example is a use case of “data science.” When the data is scanned
and uploaded, it is sorted into a specific field and undergoes a systematic analysis to
determine the information that can be taken from it.

1.1 “Data Science” Definition


What is Data Science?

Data Science
This is the combination of business, analytical, and programming skills that are used to extract meaningful insights from raw data.

The term "data science" applies to a wide variety of tools and techniques that help us to learn from data and solve problems with it. Like other scientific disciplines, data science is focused on the ways that people can understand data and use it for their benefit.

Data science is all about unlocking the real values and insights of the data. This is done by identifying complex fundamental behaviors, underlying trends, and hidden inferences. In a business setting, these analyses can enable companies to make smarter business decisions.

Modern technology is capable of collecting and storing huge volumes of data from, for
example, customers, sensors, or social media. The amount of data that can be extracted
from these sources could provide answers and solutions to many problems that busi-
nesses may have. Furthermore, the current advances in computing capabilities allow the
innovative analysis of data related to longstanding problems.

Why Data Science?

In 2016, Visual Capitalist published a chart showing the top five publicly traded companies over 15 years. The ranking is based on market capitalization, i.e., the total dollar market value of each company's outstanding shares of stock, as given in the following table (Desjardins, 2016).

Table 1: Top Five Traded Companies (2001–2016)

2001 2006 2011 2016

#1 General Electric Exxon Exxon Apple

#2 Microsoft General Electric Apple Alphabet

#3 Exxon Total Petro China Microsoft

#4 Citi Microsoft Shell Amazon

#5 Walmart Citi ICBC Facebook

Source: Author, based on Desjardins (2016)

As seen in the above table, the top five companies have been replaced by companies that
are involved in technology and online trade. These are Apple, Alphabet, Microsoft, Ama-
zon, and Facebook. The key resource (and the product) of these five companies is “data”,
and their daily work focuses on applying data science tools to that data.

Data science is not only implemented in technology related companies, but also in any
organization that has data to be analyzed. For example, a company that possesses data
about their users can apply data science to manage and analyze the data, gain meaningful
insights, and effectively extract useful information about the users.

The implementation of data science approaches can produce results from the data that
humans may not have previously picked up on. For example:

• One of the modern research methods in the field of biology is the use of data science (particularly deep learning techniques) to predict a human's age, blood pressure, smoking status, and more by analyzing images of their retina.
• The Canadian government is currently initiating a research program that will establish a prediction of suicide rates in the country using data science (mostly artificial intelligence techniques). Using the data collected from 160,000 anonymized social media accounts in Canada, the proposal is to identify underlying patterns associated with those who talk about or exhibit behaviors that could be linked to suicide (Pollock, 2018).
• Researchers at Rutgers University investigated how data science can be applied to more creative forms of data, which is one of the top challenges for machine intelligence. Using a dataset of more than 13,000 impressionist paintings, the researchers designed a tool that recognizes artistic features and applies them to other images. This tool is therefore able to produce images in the styles of famous painters (Saleh, 2014).
• Skin cancer is the most common human malignancy. Researchers at the University of Edinburgh and Stanford Hospital applied machine learning techniques to 129,450 clinical images and their associated disease labels. The developed tool was able to automatically identify skin cancer and to distinguish its most dangerous forms.

Deep Learning
The application of computational networks (with cascading layers of units) to learning tasks.

Artificial Intelligence
A set of approaches to enable a computer to emulate and thus automatize cognitive processes, often based on learning from data.

Machine Learning
A subset of artificial intelligence where mathematical models are developed to perform given tasks based on provided training examples.

Benefits of Data Science

The benefits of data science differ depending on the objectives of those applying it. The
biggest advantage of using data science in an organization is that it enables the organiza-
tion to improve its decision-making. When used in business, data-science-based decisions lead to increased profitability and to improved operational efficiency, business routines, and workflows. Concerning the customer-related aspects of business, data science helps companies recognize and understand their target audiences, and it supports automated HR recruitment so that tasks such as shortlisting candidates throughout the hiring process can be performed more accurately.

Additionally, shipping companies can determine optimal transportation modes, routes, and delivery times, and banking institutions can optimize their fraud detection processes.

1.2 Data Science’s Related Fields


Data science is intrinsically data driven and is considered the intersection of statistics, computer science, and business management, as depicted in the following figure. The computer science field provides the platform on which to generate and share the data. The statistics field uses the data as its backbone and applies numerical techniques to organize and model the data. Statistics and computer science are employed together to maximize the useful information that can be gained from the business management data.

Figure 1: The Data Science Venn Diagram

Source: Created on behalf of IU (2023).

Data science involves many diverse and/or overlapping subjects, including, among others:

• machine learning,
• database storage and data processing,
• statistics,
• neuro-computing,
• knowledge discovery (KDD),
• data mining,
• pattern recognition, and
• data visualization.

Data Mining
This is the process of discovering patterns in large datasets.

These subjects work together to develop the complete analysis mechanism of data sci-
ence which helps to discover useful information within the business’ data. This is presen-
ted in the extended data science Venn diagram, shown in the following figure.

Figure 2: The Extended Data Science Venn Diagram

Source: Tierney (2012)

It is worth noting that a commonly used term in the business management field is business intelligence (BI). However, BI mainly focuses on the descriptive analysis of the underlying data to explain the historical performance of the associated business, whereas data science is utilized to perform predictive analysis, predicting future trends or providing evidence that can support strategic plans within the associated business.

Business Intelligence
This is a collection of routines that are used to analyze and deliver the business performance metrics.

Data Science Terms

The most commonly used terms in data science are explained in the following tables:

Table 2: Data Handling Terms

Data Handling

Training Set The dataset used by the machine learning model that will help it to learn
its desired task.

Testing Set These data are used to measure the performance of the developed
machine learning model.


Outlier A data record which is seen as exceptional and outside the distribution of
the normal input data.

Data Cleansing The process of removing redundant data, handling missing data entries
and removing, or at least alleviating, other data quality issues.

Source: Created on behalf of IU (2023).

Table 3: Data Features Terms

Data Features

Feature An observable measure of the data. For example, height, length, and
width of a solid object. Other terms such as property, attribute, or charac-
teristic are also used instead of feature.

Dimensionality Reduction The process of reducing the dataset to fewer dimensions, ensuring that
it conveys similar information.

Feature Selection The process of selecting relevant features of the provided dataset.

Source: Created on behalf of IU (2023).

Table 4: Artificial Intelligence Terms

Learning Paradigms

Machine Learning Algorithms or mathematical models that use information extracted from data in order to achieve a desired task or function.

Supervised Learning The subset of machine learning that is based on labeled data. It can be further distinguished into regression and classification.

Unsupervised Learning The subset of machine learning that is based on unlabeled data. Typical unsupervised learning tasks are clustering and dimensionality reduction.

Deep Learning The application of networks of computational units with cascading layers of information processing used for learning tasks.

Source: Created on behalf of IU (2023).

Table 5: Model Development Terms

Model Development

Decision Model A model that assesses the relationships between the elements of the provided data to recommend a possible decision for a given situation.

Regression A forecasting technique to estimate the functional dependence between input and output variables.

Cluster Analysis A type of unsupervised learning used to partition a set of data records into
clusters. Records in a cluster are more similar to each other than to those
in other clusters.


Classification A machine learning approach to categorize entities into predefined classes.

Source: Created on behalf of IU (2023).

Table 6: Model Performance Terms

Model Performance

Probability Quantification of how likely it is that a certain event occurs, or the degree
of belief in a given proposition.

Standard Deviation A measure of how spread out the data values are.

Type I Error False positive output, meaning that it was actually negative but has been
predicted as positive.

Type II Error False negative output, meaning that it was actually positive but has been
predicted as negative.

Source: Created on behalf of IU (2023).

1.3 Data Science’s Activities


What Do Data Scientists Do?

In 2016, Glassdoor published a list of the best jobs, taking into consideration their salaries,
career opportunities, and job openings. The profession “data scientist” was placed at the
top of the list.

The job of a data scientist starts with data exploration, and when they receive a challeng-
ing data related question, they become detectives. They analyze the data and try to recog-
nize patterns within it. This may require the application of a quantitative technique such
as machine learning in order to delve further into the data and discover more information.
This is a core process that provides strategic support to guide business managers who
must decide how to act on the findings.

Effectively, a data scientist is someone who knows more about programming than a statis-
tician, and more about statistics than a software engineer. A data scientist is able to man-
age data science projects. They store and clean large amounts of data, explore data sets to
identify potential insights, build predictive machine learning models, and weave a story
around the findings which can then be presented to the decision makers.

Major Activities of Data Science

The data science activities exist simultaneously in the three dimensions shown in the fol-
lowing figure. These are data flow, data curation, and data analytics. Each dimension rep-
resents a group of data science challenges, their associated solution methodologies, and
their numerical techniques.

Figure 3: Data Science Activities

Source: Created on behalf of IU (2023).

As a result, a data scientist follows a group of actions that encompasses all possible ele-
ments of the process that need to be addressed. This can be summarized as:

1. Understand the problem.


2. Collect enough data.
3. Process the raw data.
4. Explore the data.

5. Analyze the data.
6. Communicate the results.

SUMMARY
Data science is a multidisciplinary field that has borrowed aspects from
statistics, pattern recognition, computer science, and operational
research. In short, data science derives information from data and
applies it to many different purposes, such as making predictions. The
importance of the extracted information depends on its application,
and, in general, provides a positive value when making decisions in an
associated organization.

This unit is an introduction to data science, where the term is explained and the importance of data science applications in several domains is
discussed. The most commonly utilized terms and the data science rela-
ted fields are reviewed, along with the role and the daily activities of a
data scientist.

UNIT 2
DATA

STUDY GOALS

On completion of this unit, you will have learned …

– what is meant by data and information.


– the different types and shapes of data.
– the typical sources of data.
– the 5Vs of big data.
– the issues concerning data quality.
– the challenges associated with the data engineering process.
2. DATA

Case Study
Human DNA consists of 3·10⁹ base pairs that, in turn, are made of four building blocks (A, T, C, and G). While a complete sequence of this type can be stored in about 750MB of memory, reading and transcription of this blueprint into proteins is a complex process that can
only be studied in detail by making use of the current advances in storage and computa-
tional capabilities. This allows bio-technology researchers to recognize complex DNA
sequences, analyze the data for possible chronic diseases, and adapt medications accord-
ing to a specific genomic structure. Hence, the relationships among the genetic features
are investigated by predictive modelling techniques to provide the physicians with a tool
to automatically identify the important patterns within the DNA strands.
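
As a back-of-the-envelope check on the storage figure quoted above, each of the four building blocks can be encoded in two bits, so the full sequence fits in roughly 750MB. A minimal, purely illustrative sketch of the arithmetic in Python:

```python
# Back-of-the-envelope check of the storage estimate quoted above (illustrative only).
base_pairs = 3e9        # approximate number of base pairs in human DNA
bits_per_base = 2       # four building blocks (A, T, C, G) need two bits each

total_bytes = base_pairs * bits_per_base / 8
print(f"{total_bytes / 1e6:.0f} MB")  # prints: 750 MB
```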

In 2016, CrowdFlower conducted a survey of 80 data scientists to find out "What do data scientists spend the most time doing?" The outcome of this survey, as shown in the following figure, indicates that 60 percent of their time is spent cleaning and organizing data, and another 19 percent is spent collecting data sets (Giasson, 2017).

Figure 4: What Do Data Scientists Spend the Most Time Doing?

Source: Author, based on Giasson (2017)

The facts, observations, assumptions, or incidences of any business practice are defined as the associated "data" of the underlying process. These data are processed to return the most important information about the associated business. This information represents the useful patterns and meaningful relationships among the data elements. The organization should use the information extracted from its business in order to improve its sales, its marketing strategies, and its understanding of consumer needs. Therefore, any piece of information should be relevant, concise, error-free, and reliable so that it can fulfill this objective. Hence, the efficient understanding and handling of the associated business's data plays a critical role, and the most critical issue in any data science or modeling project is finding the right dataset. In this unit, a detailed discussion of the possible data types, sources, and shapes is presented, alongside the standard issues that routinely influence the quality of the collected data.

2.1 Data Types & Sources


The common methods of data collection are statistical populations, research experiments,
sample surveys, and byproduct operations. The collection and handling of data is not
always an easy task, particularly if there is redundancy or contradiction within the collec-
ted data. Significant domain knowledge may be required to correctly prepare the data,
and possession of this knowledge is important because data that is not carefully prepared
and screened can result in misleading information.

Types of Data

There are two types of data: quantitative and qualitative. Any characteristic of the collec-
ted data can be described as either a quantitative variable (i.e., numerical values), or a
qualitative variable (i.e., non-numerical values). Examples of quantitative data are number
of people, students’ GPA, and ambient temperature, whereas examples of qualitative data
include customer feedback, softness of a product, and the answer to an open-ended ques-
tion. A more detailed explanation about the differences between quantitative and qualita-
tive data is provided in the table below.

Table 7: Qualitative vs. Quantitative Data

Qualitative Data
• Data that describes qualities or characteristics.
• Data that cannot be counted.
• Data type: words, objects, pictures, observations, and symbols.
• Questions that the data answer: What characteristic or property is present?
• Purpose of data analysis: to identify important themes and the conceptual framework in an area of study.
• Examples: happiness rating, gender, categories of plants, descriptive temperature of coffee (e.g., warm).

Quantitative Data
• Data that can be expressed as a number or can be quantified.
• Data that can be counted.
• Data type: numbers and statistics.
• Questions that the data answer: "How much?" and "How often?"
• Purpose of data analysis: to test hypotheses, develop predictions for the future, and check cause and effect.
• Examples: height of a student, duration of green light, distance to planets, temperature of coffee (e.g., 30 °C).

Source: Created on behalf of IU (2023).
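
To illustrate this distinction in practice, the following minimal sketch (using pandas on a made-up mini-dataset; the column names are illustrative assumptions) separates quantitative and qualitative columns:

```python
import pandas as pd

# Hypothetical mini-dataset mixing quantitative and qualitative variables.
df = pd.DataFrame({
    "height_cm": [172, 181, 165],             # quantitative
    "coffee_temp_c": [30.0, 65.5, 48.2],      # quantitative
    "happiness": ["high", "medium", "high"],  # qualitative
    "gender": ["f", "m", "f"],                # qualitative
})

quantitative = df.select_dtypes(include="number")
qualitative = df.select_dtypes(exclude="number")
print(list(quantitative.columns))  # ['height_cm', 'coffee_temp_c']
print(list(qualitative.columns))   # ['happiness', 'gender']
```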

Shapes of Data

Data reveals itself in three shapes: structured, unstructured, and streaming. Structured
data are those with a high level of construction, and are shaped in tabular rows (to include
the data transactions or data records) and columns (to include the data characteristics or
data variables). Alternatively, unstructured data is considered to be the raw shape of the
data with non-uniform structure, which often includes text, numbers, and/or images. An e-
mail is a simple example of an unstructured data shape, where the e-mail body may con-
tain words, values, and some images. Complex mathematical tools are required to handle unstructured data and transform it into a format that reveals the information and
patterns within the data. The following table shows a basic comparison between struc-
tured and unstructured data shapes.

Table 8: Structured vs. Unstructured Data

Structured Data
• Characteristics: predefined data models; usually only text or numerical; easy to search.
• Applications: inventory control, airline reservation systems.
• Examples: phone numbers, customer names, transaction information.

Unstructured Data
• Characteristics: no predefined data models; may be text, images, or other formats; difficult to search.
• Applications: word processing, tools for editing media.
• Examples: reports, surveillance imagery, email messages.

Source: Created on behalf of IU (2023).

If the data involve both the structured (tabular) shape and the unstructured shape, it is
called semi-structured data. The streaming data is continuously generated by different
sources (e.g., sensors, cameras, etc.), typically at high speeds. Such data is processed
incrementally without having access to all of the data. It allows users to access the content
immediately, rather than waiting for it to be downloaded. A particular feature of streaming
data is the large amount of data being created. This can be demanding in terms of its stor-
age and processing requirements.
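
To make the conversion of unstructured data into a structured shape concrete, the following toy sketch (the e-mail texts are made up) uses scikit-learn's CountVectorizer to turn free text into a tabular word-count representation:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy unstructured data: the bodies of three hypothetical e-mails.
emails = [
    "Please send the invoice for order 1042.",
    "The invoice was paid yesterday, thanks.",
    "Order 1042 will ship on Monday.",
]

# Transform the free text into a structured (rows x columns) count matrix.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(emails)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (columns)
print(counts.toarray())                    # one row per e-mail, one column per word
```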

Sources of Data

Data sources should be trustworthy enough to ensure that the collected data is high qual-
ity and robust enough for the next steps of processing. Common sources of data are
described in the following paragraphs.

Organizational and trademarked data sources

Large companies like Google and Facebook possess enormous amounts of data. They pro-
vide bulk downloads of public data for offline analysis in order to enrich the organization’s
market visibility. Google and Facebook also have internal data that their employees use.

Almost all companies have data themselves. The first and most important point of access
is the various internal systems recording the activities of their own business.

Government data sources

Federal governments are committed to open data so they can enable and enhance the
way that government fulfills its mission. Furthermore, governmental organizations release
demographic and economic data (e.g., population per area) every few years to be ana-
lyzed for the sake of better risk estimation.

Academic data sources

Academic research creates large datasets, and many scientific journals require that these
datasets be made available to other researchers. Many fields are covered by the datasets,
including medical, economic, and historical research.

Webpage data sources

Webpages often provide valuable numerical and text data. For example, you can request
all tweets with a certain hashtag from the Twitter webpage (e.g., #iPhoneX), and apply
sentiment analysis on them in order to determine whether the majority of tweets contain-
ing that hashtag are positive or negative. The customer support division of an organiza-
tion associated with this topic (e.g., Apple) can use this information to improve their busi-
ness.
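
A production pipeline would fetch tweets via the Twitter API; as a self-contained toy sketch, the following scores a few invented tweet texts with a simple word-list approach (the tweets and word lists are illustrative assumptions, not real data or a real API call):

```python
# Toy sentiment scoring for tweets collected for a hashtag (e.g., #iPhoneX).
tweets = [
    "Loving the new camera, battery life is great",
    "Screen cracked after one day, terrible build quality",
    "Face unlock is fast and the display looks great",
]

positive_words = {"loving", "great", "fast", "good"}
negative_words = {"terrible", "cracked", "slow", "bad"}

def score(text: str) -> int:
    # Count positive words minus negative words in the tweet.
    words = text.lower().split()
    return sum(w in positive_words for w in words) - sum(w in negative_words for w in words)

labels = ["positive" if score(t) > 0 else "negative" if score(t) < 0 else "neutral" for t in tweets]
print(labels)  # ['positive', 'negative', 'positive'] -> majority positive
```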

Media data sources

Media includes outputs such as video, audio, and podcasts, all of which provide quantitative and qualitative insights into the characteristics of user interaction. Since media crosses all demographic borders, it is one of the quickest ways for businesses to identify patterns and enhance their decision making.

2.2 The 5Vs of Big Data


Big data is made up of high-volume, -velocity, and -variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight
and decision making. Big data is a term that identifies data sets which are too large and/or
complex to be handled efficiently by typical mathematical tools. Challenges within big
data processing include analysis, storage, visualization, and information retrieval.

In data science, the underlying data is frequently big data. For example, in one minute,
approximately 220,000 new photos are uploaded to Instagram and three hundred hours of
videos are uploaded to YouTube (Jogawath, 2015). The main dimensions used to describe big data and the challenges of handling it are volume, variety, velocity, veracity, and validity (the "5Vs of big data").

Volume

Volume refers to the amount and scale of the data. An airplane fitted with 5,000 sensors
generates about 10GB of data for every second it is in flight (Rapolu, 2016). Current estimates put yearly data creation at around 50 zettabytes, while forecasts see a dramatic increase in the upcoming years (Holst, 2019). Ongoing research and development in computational and storage technology aims to alleviate the resulting data handling challenges.

Variety

There is a considerable variety of data. Previously, most data were generated in a struc-
tured shape to simplify the forthcoming data science processes. Today, much of the cre-
ated data can be considered unstructured, and requires more advanced mathematical
techniques to handle it.

Velocity

Velocity of data refers to the speed at which the data is created, stored, analyzed, and
visualized. Computational tools need significant periods to process data and update data-
bases.

Veracity

Veracity refers to the quality of the data. Given the volume, variety, and velocity at which data arrives in current information processing settings, it cannot be guaranteed that the processed data is perfectly correct or precise. Such quality impairments, often
called noise, have to be accounted for when interpreting the outcomes of any form of
analysis.

Validity

The validity of data is another important aspect. The data may be correct and noise-free,
but it may be outdated or otherwise unsuitable for the question at hand. If this is the case,
it will not result in any meaningful conclusions after its analysis.

2.3 Data Quality
The collected data commonly suffers from quality issues because of imperfect data sources or issues in the data collection process. Such data is problematic due to the existence of values which are noisy, inaccurate, incomplete, inconsistent, missing, duplicate, or outliers. It is important to note that there are "true" outliers and "fake" outliers. Fake outliers are data records which do not seem to match the patterns followed by the majority of the other data records but are still possible, although unlikely, outcomes of the underlying process.

Outlier
An outlier is a data record which is seen as exceptional and incompatible with the overall pattern of the data.

There are many approaches that can be used when handling data quality issues, and more
than 80 percent of a data scientist’s time is spent dealing with these.

Missing Values and Outliers

In some data records, there may be values which have not been observed (i.e., missing
values) or were incorrectly observed (e.g. outliers) during data collection.

Several methods are routinely employed to resolve the issue of missing values and outli-
ers.

1. Removal of data records that contain missing values and/or outliers: This method is
recommended for large datasets where the removal of some records will not affect
the data as a whole. This method can only be used after it has been confirmed that
removing the chosen records will not influence the results. For example, under
unusual conditions, a sensor may be unable to deliver a normal value. In such a case,
removing the record might lead to the exclusion of an interesting aspect of the data-
set that would provide valuable information about the operation of the sensor.
2. Replacement of the missing value or outlier with an interpolated value from neighbor-
ing records: For example, we have a dataset for temperatures at different times of the
day (time: {11:00, 11:01, 11:02}, temperature: {20 °C, x , 22.5 °C}) where x is a miss-
ing temperature value or an outlier (i.e., out-of-range value). The value of x is replaced
by the linearly-interpolated value that is obtained from the recorded temperatures on
either side of the missing value.

x = (22.5 + 20) / 2 = 21.25 °C

3. Replacement of the missing value or outlier with the average value of its variable
across all data records.
4. Replacement of the missing value or outlier with the most-often observed value for its
variable across all data records.

A new variable may be introduced into the dataset with a value of “0” for the normal data
records and “1” for the data records containing missing and/or outlier values that were
handled by one of the above methods. By doing so, we ensure that the original informa-
tion is not lost.
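
A minimal sketch of these strategies, assuming a small pandas DataFrame with the temperature example from above, could look as follows; it applies interpolation, mean, and most-frequent replacement, and adds the indicator variable just described:

```python
import pandas as pd

# Hypothetical temperature readings with one missing value (as in the example above).
df = pd.DataFrame({
    "time": ["11:00", "11:01", "11:02"],
    "temperature": [20.0, None, 22.5],
})

# Indicator variable: 1 for records that originally contained a missing value, 0 otherwise.
df["was_missing"] = df["temperature"].isna().astype(int)

# Strategy 2: linear interpolation from the neighboring records -> 21.25 °C.
interpolated = df["temperature"].interpolate(method="linear")

# Strategy 3: replace with the average of the observed values.
mean_filled = df["temperature"].fillna(df["temperature"].mean())

# Strategy 4: replace with the most frequently observed value (mode).
mode_filled = df["temperature"].fillna(df["temperature"].mode()[0])

print(interpolated.tolist())  # [20.0, 21.25, 22.5]
```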

Duplicate Records

If there are duplicate records within the dataset, they are removed before proceeding with
the data analysis in order to reduce computing time and prevent the distortion of the ana-
lytics outcome.

Redundancy

Other issues that may appear within the dataset are related to the existence of redundant
and irrelevant variables. We identify these issues by applying correlation analysis to each
pair of variables. This allows us to resolve redundancies without losing any important
information from the dataset by removing the variables that show high correlation
towards other variables. The correlation between two variables can be seen in the follow-
ing figure where the highly irregular shape on the left indicates that the variables are not
correlated, the somewhat dispersed shape in the middle figure indicates partial correla-
tion, and the line shape on the right indicates strong correlation.

Figure 5: The Correlation Between Two Variables

Source: Created on behalf of IU (2023).

The correlation coefficient ρ between two data variables x and y is calculated as

\rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}

where x̄ and ȳ are the average values of variables x and y, respectively, for a dataset of n records.

The correlation coefficient is a statistical measure of the degree of the linear relationship between two variables, and it lies in the range [-1, 1]. If ρ = 1, the two variables are fully correlated, whereas ρ = 0 indicates no linear correlation (though not necessarily independence). Negative correlation coefficients imply that the variables are anti-correlated, meaning that when x goes up, y goes down, and vice versa. We can set a threshold on the absolute value of ρ, and if the correlation exceeds this threshold, one of the two variables can be removed from the dataset with negligible influence on performance.
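
A minimal sketch of this redundancy check, using pandas on a small hypothetical dataset and an illustrative threshold, might look as follows:

```python
import pandas as pd

# Hypothetical dataset: x2 is almost a copy of x1, while x3 is unrelated.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.1],
    "x3": [7.0, 3.0, 9.0, 1.0, 4.0],
})

corr = df.corr()    # pairwise correlation coefficients (rho)
threshold = 0.95    # illustrative cutoff

# Drop one variable of every pair whose absolute correlation exceeds the threshold.
to_drop = set()
cols = corr.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            to_drop.add(b)

reduced = df.drop(columns=sorted(to_drop))
print(sorted(to_drop))  # ['x2'] -> x2 is redundant with x1
```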

Furthermore, one of the dimensionality reduction approaches may be applied. Dimensionality reduction aims to simplify the data by removing data properties that are non-informative in relation to the analytical question at hand.
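
As an illustrative sketch of dimensionality reduction, the following applies principal component analysis (one common approach, here via scikit-learn on randomly generated data) to reduce ten partly redundant variables to two components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random hypothetical dataset: 100 records with 10 (partly redundant) variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # share of variance kept by two components
```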

2.4 Data Engineering


Continuing with the above discussion concerning the data related aspects, it is noted that
handling the given data and converting it into a suitable form for further analysis (i.e., data
engineering) may be a complex process.

Data engineering focuses on the practical applications of data collection and analysis. For
all the work that data scientists do in order to answer questions using data, there should
be tools to gather and validate that data, and then apply it to real-world operations.

The building of a reliable system to handle the data (especially for big data processes) is
not a straightforward task, and it may take 60 to 80 percent of the data scientist's effort
before the data is ready for meaningful information and patterns to be extracted.

The collected data should be kept free of erroneous samples and fake outliers, and it must be determined how missing data values will be handled. Improv-
ing data quality typically requires detailed knowledge of the domain in which the data are
recorded.

It is also essential to protect the data and determine the legal frameworks and policies
that should be followed. Physical threats and human errors should be identified whenever
possible.

In some scenarios, data transformation is required to convert the collected dataset into a
form suitable for applying data science. The main transformation methods are variable
scaling, decomposition, and aggregation, which are shown in the following table.

Table 9: Data Transformation Methods

Transformation Method Description

Variable scaling The dataset may include variables of mixed scales. For example, a
dataset may contain income values in dollars, number of purchases
per month, and amount of car fuel consumed per month. The mod-
eling techniques work on scaled variable values, e.g., between -1 and 1, to ensure that all analyzed variables are weighted equally.
The scaling may be performed by normalizing a variable’s value
with respect to its maximum value.
The other option is to remove the variable’s average and divide by
the standard deviation of the variable.


Variable decomposition Some variables may need to be further decomposed for better data
representation. For example, a time variable may be decomposed
into hour and minute variables. Furthermore, it may turn out that
only one of the two variables (hour or minute) is relevant, so the
irrelevant variable is removed from the dataset.

Variable aggregation Alternatively, two variables may be more meaningful if they are
merged (i.e., aggregated) into one variable. For example, “gross
income” and “paid tax” variables may be aggregated into one varia-
ble, “net income.”

Source: Created on behalf of IU (2023).
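
The following sketch illustrates the three transformation methods on a small made-up dataset (column names and values are illustrative assumptions): scaling an income variable, decomposing a timestamp, and aggregating gross income and paid tax into net income:

```python
import pandas as pd

# Hypothetical raw records.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-05 11:20", "2023-01-06 14:45"]),
    "gross_income": [4200.0, 5100.0],
    "paid_tax": [900.0, 1200.0],
})

# Variable scaling: standardize gross income (subtract the mean, divide by the standard deviation).
df["gross_income_scaled"] = (
    (df["gross_income"] - df["gross_income"].mean()) / df["gross_income"].std()
)

# Variable decomposition: split the timestamp into hour and minute variables.
df["hour"] = df["timestamp"].dt.hour
df["minute"] = df["timestamp"].dt.minute

# Variable aggregation: merge gross income and paid tax into net income.
df["net_income"] = df["gross_income"] - df["paid_tax"]

print(df[["gross_income_scaled", "hour", "minute", "net_income"]])
```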

The algorithms and calculations used in data processing must be highly accurate, well-
built, and correctly performed so that there is no negative effect on the decisions made
based on the results. The benefits of data processing, especially in medium and large
organizations, are:

• improved analysis and demonstration of the organization’s data,


• reduction of data so that only the most meaningful information is present,
• easier storage and distribution of data,
• simplified report creation,
• enhanced productivity and increased profits, and
• more accurate decision-making.

Real Life Examples

A case study is carried out so that the online merchants can gain a complete picture about
how customers are utilizing their web services; essentially, they are looking for a 360
degree view of their customer. Therefore, a large set of unstructured data is collected and
combined with the structured customers’ transaction data. By applying data processing to
this case study, valuable information is obtained. This information could be, for example,
which pages a customer visits, how long a customer stays on a page, and which products
the customer buys. This information will lead to improved business management and bet-
ter decisions being made regarding a certain product and/or service.

Another case study is the Internet of Things (IoT) which includes connected devices and
sensors on a platform. These devices are in the customer’s environment, and collect mil-
lions of data records every week about the usage of each device. This unstructured big
data should be transformed into a set of structured data records in order to enable further analysis of the devices' performance.

In general, the top applications that obtain and use data while also effectively employing
data science are given below.

Industrial processes data applications

Data are obtained at different levels of the production process. Ingredients and actuators
data are obtained at the field level, signals data at the control level, monitoring sensor
data at the execution level, and indicators data at the planning level.

The main goal of applying data science to industrial processes is to automate and opti-
mize them, and to improve the competitive situation of the company.

Business data applications

Data are obtained and analyzed in many business domains, such as customers’ data, port-
folio data, human resources data, marketing data, sales data, and pricing data.

The main goal of applying data science to business data is to better understand, motivate,
and drive the business processes. From the workflow of the management data system,
bottlenecks of the business processes can easily be identified and/or the sales predictions
can be estimated.

Text data applications

Data formatted in text serve as important information resources and are applied in many
different settings. Examples are text documents, e-mails, and web documents. These data
follow particular organizational criteria, such as record fields or attribute structures.

The main goal when applying data science to text data is to filter, search, extract, and
structure information.

Image data applications

Data formatted as images are easily obtained nowadays due to the advances in imaging
sensors technology. These sensors range from smartphone cameras to satellite cameras,
providing large amounts of two-dimensional and/or three-dimensional image data.

The main goal of applying data science to image data is to find and recognize objects, ana-
lyze and classify scenes, and relate image data to other information sources.

Medical data applications

Data in the medical field are obtained at all stages of patient care and medical laboratory experiments. Furthermore, patient health records and clinical care programs are also sources of medical data.

The main goal of applying data science to medical data is to analyze, understand, and
annotate the influences and side effects of medication in order to detect and predict dif-
ferent levels of certain diseases.

SUMMARY
Data is collected from many different sources, such as companies, gov-
ernments, web pages, and media platforms. These data are either quan-
titative or qualitative, and they are formatted as structured, unstruc-
tured, or streaming data. There are five characteristics to be taken into
consideration when handling big data. These are volume, variety, veloc-
ity, veracity, and validity of the collected data.

One of the most important steps in data science is preprocessing. In this step, the raw data are cleaned so that noise and errors are removed. The
missing values and outliers are handled either by removing them
entirely or estimating reasonable values for them. Duplicate records are
checked and then deleted to reduce the dataset size. Furthermore, a cor-
relation analysis is applied to avoid the presence of highly correlated
variables in the dataset.

Data processing can be applied in many different scenarios such as automating office environments, administrating event ticketing sys-
tems, and managing work time.

Data analysis primarily requires the transformation of a dataset's variables into more representative forms. The three main transformation
methods are scaling, decomposition, and aggregation. In a scaling trans-
formation, the variable is scaled to the same value range as the other
variables. For decomposition, the variable is split into more than one
variable in order to obtain a deeper overview of data variations. In an
aggregation transformation, the variable is merged with one or more of
the other variables for a better explanation of the dataset.

The importance level of the extracted information depends on the particular application and, in general, it provides a positive value in the
decision making of an associated organization.

UNIT 3
DATA SCIENCE IN BUSINESS

STUDY GOALS

On completion of this unit, you will have learned …

– what a data science use case is.


– about the machine learning canvas.
– about the model-centric performance evaluation.
– the role of KPIs in operational decisions.
– the influence of the cognitive biases.
3. DATA SCIENCE IN BUSINESS

Case Study
The finance sector contains many interesting and highly valuable opportunities for the
application of data science methods. As an example, take the determination of the credit
worthiness of a customer. In this task, the goal is to estimate how likely it is that a loan
given to that particular customer is paid back with the contractually agreed interest. The
information that such a decision can be based on includes their monthly/annual earnings,
real estate ownership or rental, any debt, deposits, and more. No matter which concrete
measures are used, the challenge lies in the estimation of future behavior based on data
of past transactions with the customer. This characteristic places the problem squarely
into the field of predictive analytics.

3.1 Identification of Use Cases


Data science is incredibly valuable since it enables businesses to refine crucial information from their transactional data. Businesses can become more valuable when they take an in-depth look at their data and identify suitable data science use cases (DSUC) for their business objectives. These data come with their own set of challenges that businesses also have to handle. A well-matched DSUC can give valuable insights that help to address the current state of the business, its challenges, the competition, and future improvements.

Prediction techniques are applied through DSUCs to extract valuable information from
collected data. The DSUC in any business can be identified through three main points:
effort, risk, and achieved value. The potential of a new project is often measured by how
much improvement can be made to the operational business. Therefore, an organization
should focus its analysis on reducing effort and increasing gain, as demonstrated in the
following figure.

Figure 6: Identification of an Organization’s Use Cases

Source: Created on behalf of IU (2023).

Organizations must identify which use cases they are going to tackle, and then ensure the
availability of suitable datasets. Hence, some important questions have to be answered:

• What is the value of the knowledge gained by applying data science tools to that data-
set?
• What will be discovered about the input dataset and its hypothesis?
• What value will be added to the organization through applying data science techniques?
• What will the organization’s decision be if the data science produces disappointing
results?

The following figures show the obtained value(s) after applying data science techniques to
some of the more common use cases.

Figure 7: Achieved Value by Data Science in “Customer”-Related Use Cases

Source: Created on behalf of IU (2023).

Figure 8: Achieved Value by Data Science in “Operational”-Related Use Cases

Source: Created on behalf of IU (2023).

Figure 9: Achieved Value by Data Science in “Financial Fraud”-Related Use Cases

Source: Created on behalf of IU (2023).

Data Handling and Analysis

After identifying the DSUC, a data scientist needs to look into all resources that are availa-
ble to the business in order to find a relevant dataset. If one does not exist, then a new
dataset is built from the available resources. Depending on the type of data and its uses,
data could be sourced from internal or external databases, web scraping, or sensor data.
Data collection is often a tedious and costly task as it may require human intervention.
Humans are involved in the data collection phase to study the data, label it, add valuable
comments, correct errors, and even observe some data anomalies. They must then decide
whether to manually correct these specific data points or just exclude them.

Once these tasks have been completed, the preprocessing techniques (e.g., data cleans-
ing, handling the missing values, etc.) are applied to the data to correct any kind of error
or noise; then the data are scanned for redundant or missing values and records. The employees who carry out the data scrubbing should have significant knowledge of the domain of this data in order to make efficient decisions concerning how to deal with the detected
data errors. By the end of the preprocessing stage, any incorrect or redundant information
should have been removed and the size of the dataset itself will be reduced.

The dataset variables/features could be numerical, textual, or categorical. Features have to be carefully selected, as not all of them are relevant to the DSUC values. Therefore, only relevant features are selected, as they play a main role in the quality of the output DSUC values.

Next, machine learning methods are applied when it comes to building a prediction model
on top of the refined dataset. The purpose of the model is to define the relationship
between data inputs and DSUC values. This is achieved by establishing a mathematical
function that accepts the inputs and produces outputs. The dataset is divided into two
parts: the training set and the testing set. The training set is used in the building and learn-
ing part of the model, whereas the testing set is used to check the accuracy of the devel-
oped prediction model.

Once the model’s accuracy is acceptable, it is implemented to deliver the DSUC values for
the current dataset with the ability to predict future values if new data scenarios are
added to the dataset. If the nature of the input dataset begins to change due to the addi-
tion of new data records, then the model must be updated accordingly.

There is no predefined machine learning approach for building a prediction model; the choice depends on the business's associated use case and the required output. If the output can be categorized into definitive classes, then a classification approach is used. However, if the output is a continuous variable, then a regression model is more suitable.
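
A minimal sketch of this model-building workflow with scikit-learn, using a synthetic dataset and a simple classifier for illustration (a regressor would be used instead if the output were continuous), might look as follows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for a refined DSUC dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split into a training set (model building) and a testing set (model evaluation).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(f"test accuracy: {accuracy_score(y_test, predictions):.2f}")
```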

Machine Learning Canvas

Developed by Dorard (2017), the machine learning canvas is a tool that is used to identify
data science use cases and provide a visual user procedure. Managers can apply this to
their business problems, which can then be analyzed by machine learning techniques, as
shown in the following figure.

Figure 10: Machine Learning Canvas

Source: Dorard (2017).

The tool has a simple single page user interface that provides all of the required function-
alities ranging from identifying use cases to achieving the value proposition.

For example, the canvas can be used in the domain of real estate. It is useful when investi-
gating risky investments and comparing the real estate’s price predictions with the actual
prices to determine the best deals. The process of creating a machine learning prediction
model for this use case is shown below.

Figure 11: DSUC: Real Estate Problem

Source: Created on behalf of IU (2023), based on Dorard (2017).

3.2 Performance Evaluation


There are two approaches that can be used to evaluate the success of a DSUC model and whether it accomplishes its business-oriented objectives. The first approach is to evaluate the model itself by comparing its output against a set of well-established numerical metrics. The second approach is to evaluate how the model influences the business by helping it to improve and achieve its goals.

Model-Centric Evaluation: Performance Metrics

The output of the developed prediction model is either a class or category (classification model) or a numerical value (regression model). We will discuss the metrics that are routinely used to evaluate the performance of each of these model types.

Evaluation metrics for a classification model

For a DSUC designed with only two possible outputs {“yes”, “no”}, the decision of the out-
put is dependent on a threshold assigned to the model. When the model is applied to a
data record, there are only four possible outcomes. These are true positive, true negative,
false positive and false negative.

• True positive (TP): The classifier produces the label “yes” for a data record whose actual label is “yes” (a correct prediction).
• True negative (TN): The classifier produces the label “no” for a data record whose actual label is “no” (a correct prediction).
• False positive (FP): The classifier produces the label “yes” for a data record whose actual label is “no” (an incorrect prediction).
• False negative (FN): The classifier produces the label “no” for a data record whose actual label is “yes” (an incorrect prediction).

These four possible results are usually presented in a matrix form called the confusion
matrix, as shown below.

Table 10: The Confusion Matrix

                         Model Output: YES       Model Output: NO
Desired Output: YES      Number of TPs           Number of FNs
Desired Output: NO       Number of FPs           Number of TNs

Source: Created on behalf of IU (2023).

For the four possible outputs, there are three performance metrics to measure the model
quality. These are precision, accuracy, and recall, as explained in the following equations.

Precision = number of TP / (number of TP + number of FP)

Accuracy = (number of TP + number of TN) / (number of TP + number of TN + number of FP + number of FN)

Recall = number of TP / (number of TP + number of FN)
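As a small illustration, the three metrics can be computed directly from the four counts; the numbers below are made up for demonstration.

# Illustrative confusion-matrix counts (assumed values, not taken from the course).
tp, tn, fp, fn = 80, 90, 10, 20

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"Precision: {precision:.2f}")  # 0.89
print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Recall:    {recall:.2f}")     # 0.80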

To distinguish between the two classes of the classification model {“yes”, “no”}, a thresh-
old has to be applied. This cutoff value could be set to a certain percentage that is decided
upon during the analysis, and any output value that exceeds this cutoff value will be con-
sidered a “yes”, while all lower outputs will be considered a “no”. Therefore, the model
performance is dependent on the cutoff value which affects the number of true positives,
true negatives, false positives, and false negatives accordingly.

The receiver operating characteristic (ROC) curve shows how altering the cutoff value changes the true positive and false positive rates. An ideal model would be able to complete the classification task perfectly, meaning that it could produce a true positive rate of 100 percent and a false positive rate of zero percent. Since no real-world model is that accurate, the ROC curve helps to find a realistic threshold value at which the true positive rate is as high as possible while the false positive rate is as low as possible. The following steps should be followed to create a ROC curve:

1. A cutoff value is chosen, ranging from 0 to 100 percent.
2. The model is applied to a test set, and the numbers of TP, TN, FP, and FN are recorded.
3. Calculate:

False Positive Rate = number of FP / (number of FP + number of TN)

and

True Positive Rate = number of TP / (number of TP + number of FN)

4. Every point on the ROC curve has the coordinates (False Positive Rate, True Positive Rate).
5. Another cutoff value is chosen, and steps 2 to 4 are repeated, resulting in the ROC curve shown below.
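The following Python sketch mirrors these steps for a small set of made-up model scores and true labels; it sweeps a range of cutoff values and collects the resulting (false positive rate, true positive rate) points.

import numpy as np

# Illustrative model scores (predicted probabilities of "yes") and true labels.
scores = np.array([0.95, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])

roc_points = []
for cutoff in np.linspace(0.0, 1.0, 21):          # steps 1 and 5: sweep cutoff values
    predictions = (scores >= cutoff).astype(int)  # step 2: apply the threshold
    tp = np.sum((predictions == 1) & (labels == 1))
    tn = np.sum((predictions == 0) & (labels == 0))
    fp = np.sum((predictions == 1) & (labels == 0))
    fn = np.sum((predictions == 0) & (labels == 1))
    fpr = fp / (fp + tn)                          # step 3: false positive rate
    tpr = tp / (tp + fn)                          #         true positive rate
    roc_points.append((fpr, tpr))                 # step 4: one point per cutoff

for fpr, tpr in roc_points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")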

Figure 12: Receiver Operating Characteristic (ROC) Curve

Source: Created on behalf of IU (2023).

Evaluation metrics for a regression model

The objective is to measure how close a regression model's output (y) is to the desired output (d). Standard metrics that evaluate the accuracy and performance of the model are the absolute error, relative error, mean absolute percentage error, square error, mean square error, mean absolute error, and root mean square error, as given in the following equations.

Absolute error: ε = |d − y|

Relative error: ε* = (|d − y| / d) · 100 %

Mean absolute percentage error: MAPE = (1/n) · Σ_{i=1}^{n} (|d_i − y_i| / d_i) · 100 %

Square error: ε² = (d − y)²

Mean square error: MSE = (1/n) · Σ_{i=1}^{n} (d_i − y_i)²

Mean absolute error: MAE = (1/n) · Σ_{i=1}^{n} |d_i − y_i|

Root mean square error: RMSE = sqrt( (1/n) · Σ_{i=1}^{n} (d_i − y_i)² )
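These error measures are straightforward to compute; the following Python sketch evaluates several of them for a handful of made-up desired values d and model outputs y.

import numpy as np

# Illustrative desired outputs (d) and model outputs (y).
d = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
y = np.array([11.0, 11.5, 14.0, 21.0, 20.0])

mae = np.mean(np.abs(d - y))              # mean absolute error
mse = np.mean((d - y) ** 2)               # mean square error
rmse = np.sqrt(mse)                       # root mean square error
mape = np.mean(np.abs(d - y) / d) * 100   # mean absolute percentage error

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}, MAPE = {mape:.1f} %")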

3.3 Data-Driven Operational Decisions


A crucial aspect of the practice of data science lies in the operationalization of the insights
derived from the employed analytical models. To this end, it is of vital importance that
analytics results are communicated and made available in such a way that they are useful
for the relevant decision makers inside an organization. Moreover, it is usually helpful to
explain the rationale behind the modeling approach so that the end-user can make an
informed interpretation of model results.

The end user decides how to align the model's output with the business goals and objectives. For example, in fraud detection, the user can decide the probability range at which a suspicious transaction or behavior is considered to be true fraud. In this case, the selected threshold creates a tradeoff between false negatives and false positives. These tradeoffs should be taken into consideration to maximize the effectiveness of the model, and using different values for the threshold enables business managers to consider different scenarios.

In some cases, the end user may have to make a decision that directly impacts data
records, as well as a decision about the value of the thresholds. For example, one feature
that may exist in a dataset is product price. In some cases the price of the product may
need to be modified. On such an occasion, the model should be capable of accommodat-
ing these changes and should be able to be re-trained.

The ultimate goal of a smart model is the automation of user’s decisions. These decisions
are often dependent on the model’s ability of prediction. For example, a model could be
designed to analyze hotel reviews and decide whether these reviews are fake or not. If the
model’s predictions are highly accurate, then a review can automatically be accepted or
rejected without the need for any human intervention.

Business-Centric Evaluation: The Role of KPIs

After a model has been evaluated successfully with the aforementioned evaluation met-
rics, it is ready for deployment. At this stage, the model should be able to produce a trus-
ted DSUC value for the associated business problem. The decision makers must then be
confident that the DSUC is correctly implemented in a way that will help the company to
meet their business goals. Quantification of the model’s merit is achieved by defining so-
called Key Performance Indicators (KPIs). These are measurements that express to what
extent the business goals have been met or not. Most KPIs focus on increased efficiency,
reduced costs, improved revenue, and enhanced customer satisfaction.

Characteristics of effective KPIs

There are several characteristics that determine whether a KPI is an effective measure of whether the business goals have been achieved. An effective KPI

• is easy to comprehend and simple to measure,
• assists in splitting the overall objective into the daily operations of the staff responsible for it,
• is visible across the entire organization,
• is able to indicate positive/negative deviations from the business objective,
• has a defined length of time, including start and end dates for its measurement, and
• is achievable through the available resources (e.g., machines, staff, etc.).

Examples of KPIs

Some effective and commonly utilized KPIs for measuring the performance of a DSUC from a business-centric point of view are shown in the following figure.

Figure 13: Some Commonly Utilized KPIs

Source: Created on behalf of IU (2023).

3.4 Cognitive Biases


Montibeller & Winterfeldt (2015, p. 1230) reported that “behavioral decision research has
demonstrated that judgments and decisions of ordinary people and experts are subject to
numerous biases.” From an evolutionary perspective, DSUCs are subject to cognitive biases which can strongly influence the judgment of the business's performance and/or settings. These biases seriously affect the quality of the developed prediction model. As a result, decision-making may be inaccurate.

The following table presents four common cognitive biases and their associated de-bias-
ing techniques.

Table 11: The Common Cognitive Biases and Their De-biasing Techniques

Anchoring
Description: Occurs when the estimation of a numerical value is based on an initial value (anchor), which is then insufficiently adjusted to provide the final answer.
De-biasing technique: Remove anchors, provide multiple and counter anchors, and use various experts with specific anchors.

Confirmation
Description: Occurs when there is a desire to confirm one's belief, leading to unconscious selectivity in the acquisition and use of evidence.
De-biasing technique: Use multiple experts for assumptions, challenge probability assessments with counterfactuals, and use sample evidence for alternative assumptions.

Desirability
Description: Favoring alternative options due to a bias that leads to underestimating or overestimating consequences.
De-biasing technique: Use multi-stakeholder studies of different perspectives, use multiple experts with different views, and apply appropriate transparency.

Insensitivity
Description: Sample sizes are ignored, and extremes are considered equally likely in small and large samples.
De-biasing technique: Use statistics to determine the likelihood of extreme results in samples of different sizes, and use the sample data to demonstrate the reasoning behind extreme statistics.

Source: Created on behalf of IU (2023).

SUMMARY
This unit focuses on the role and value of data science in business. Ele-
ments of a suitable use case are discussed, and the way that different
departments need to work together is explained. There is also an
explanation of how value is created and which types of use case are suit-
able for machine learning implementations. The evaluation of the performance of a data science model is presented in the form of numerical metrics, as well as the business-critical KPIs which need to be monitored. As a result, the machine learning predictions
can be turned into operational decisions in order to derive value from
them. From an evolutionary perspective, DSUCs are highly influenced by
cognitive biases, which can affect human judgment and the consequent
business decisions.

UNIT 4
STATISTICS

STUDY GOALS

On completion of this unit, you will have learned …

– the importance of statistics in data science.
– about probability and its relation to the prediction model’s outputs.
– about conditional probability and the probability density function.
– the different probability distributions.
– the basics of Bayesian statistics.
4. STATISTICS

Case Study
A data scientist working for a school is given a dataset that consists of the mathematics
grades that have been obtained by students over the last ten years. In order to extract the
main properties of the dataset and predict the students’ performances next year, statistics
and probability analysis must be applied to the underlying dataset.

Statistics can be broadly divided into two separate fields. Descriptive statistics delivers
tools that capture important properties of the data and summarize the observations
found within it. The main tools are the mean, minimum, count, sum, and median, which help us to reduce a large dataset to a small set of summary statistics. Meanwhile, probability theory
offers a formal framework for considering the likelihood of possible events. This is used in
inferential statistics to predict how likely it is that an event will occur in the future. In this
case, it is a prediction of how likely it is that next year’s students will achieve higher marks
in mathematics.

Hence, statistical analysis is mainly used to check whether the observed data make sense, while probability theory investigates the possible consequences of these observations.

The real-world data found in most applications are neither deterministic (in which case data scientists would not need to apply statistical inference or machine learning to them), nor completely random (in which case data scientists could not apply machine learning or any other inferential technique and would therefore be unable to predict anything based on the data). Instead, real-world data lie somewhere in between, where data scientists are able to predict an output as well as the probability of its occurrence. Therefore, almost all realistic systems can only be described in terms of probabilities.

4.1 Importance of Statistics in Data Science
Before the proliferation and broad accessibility of modern information processing and storage technology, data analysts sought to summarize sample data quantitatively in the form of a compact set of measures. These statistical parameters include the mean, maximum, minimum, median, and standard deviation. (The standard deviation is a measure of how spread out the data values are, and is typically applied to normally distributed data.)

The mean is the arithmetic average of the variable's values, while the median is the value located exactly at the middle point of a variable's sorted values. The mean is more sensitive to extreme values than the median.

For example, if the variable’s values are (2, 3, 4, 5, 6, 7, and 9), then:

Mean = (2 + 3 + 4 + 5 + 6 + 7 + 9) / 7 = 5.14

Median = 5
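As a quick check, the same summary statistics can be reproduced in Python, for example with NumPy:

import numpy as np

values = [2, 3, 4, 5, 6, 7, 9]
print(np.mean(values))    # 5.142857..., i.e., approximately 5.14
print(np.median(values))  # 5.0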

Probability Theory

Probability theory is the core theory for many of the data science techniques. (Probability, also written as P, is simply defined as the chance of an event happening.) If the occurrence of an event is impossible, its probability is P = 0. However, if an event is certain, then its probability is P = 1. The probability of any event is a number between 0 and 1. The probability cannot be a negative value, and the sum of probabilities for all possible outcomes must always be 1.

Two contradicting events cannot happen to the same object at the same time. For exam-
ple, if a client’s account has made a profit, it cannot have simultaneously made a loss.
Opposite events such as this are defined as mutually exclusive events, which are descri-
bed in the following figure.

Figure 14: Mutually Exclusive Events

Source: Created on behalf of IU (2023).

Meanwhile, two mutually independent events can happen simultaneously without affect-
ing one another. For example, a company can make profit and have legal issues at the
same time. These two events do not impact each other.

P(A and B) = P(A ∩ B) = P(A) · P(B)
P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Figure 15: Mutually Independent Events

Source: Created on behalf of IU (2023).

Conditional probability

When two events are correlated, the conditional probability p(A|B) is defined as the prob-
ability of an event A, given that event B has already occurred, as illustrated in the follow-
ing figure.

p(A|B) = p(A ∩ B) / p(B)

Figure 16: Conditional Probability

Source: Created on behalf of IU (2023).

Correlation vs. causation

A correlation between two variables does not imply that one of these variables is caused
by the other variable. For example, if medicine is consumed by a group of people, there is
a high correlation between the fact that they are taking the medicine and the probability
that these people are sick, but the medicine does not cause the sickness.

Probability distribution function

Consider a random variable that can take on a given set of values. The occurrence of each
of these values has a certain probability. The function that maps outcomes with their
respective probability is called a probability distribution function.

The probability distribution function can be visualized as a graph, where the x-axis repre-
sents the possible values of the variable, and the y-axis indicates the probability of each
value. For example, if a given dataset contains results from the rolling of two dice, where
the random variable is the sum of the outcomes of the two dice in each roll, then this vari-
able will have a minimum value of 2 (when each of the two dice show “1” as their output),
and a maximum value of 12 (when each of the two dice show “6” as their output). For each
possible outcome of the first die (e.g., “4”), there are six possible outcomes for the second
die (“1”, “2”, “3”, “4”, “5”, or “6”). As a result, the total number of possible value combinations is 6 (first die outcomes) · 6 (second die outcomes) = 36 combinations. However, not all combinations produce a unique sum.

The probability of obtaining the value of 5 for our random variable (i.e., the sum of the two dice is 5) will be based on obtaining one of these possible dice-rolling outcomes:

(first die, second die) = (“1”, “4”), (“2”, “3”), (“3”, “2”), (“4”, “1”) = 4 events

This means the probability of the random variable having a value of 5 is:

p(5) = 4 possible outcomes / 36 outcomes = 0.11

Furthermore, to get a sum of 6, the possible outcomes are:

(first die, second die) = (“1”, “5”), (“2”, “4”), (“3”, “3”), (“4”, “2”), (“5”, “1”) = 5 outcomes

which results in:

p(6) = 5 possible outcomes / 36 outcomes = 0.138

In the same manner, we can calculate the probability that we will get each of the variable’s
values and therefore, we can form the probability density function as shown below.

Figure 17: The Probability Density Function

Source: Created on behalf of IU (2023).
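The same distribution can be derived by enumerating all 36 equally likely combinations; the following Python sketch (an illustration, not part of the original example) prints the probability of every possible sum.

from itertools import product
from collections import Counter

# Enumerate all 36 equally likely (first die, second die) combinations
# and count how often each sum occurs.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in sorted(counts):
    print(f"p({total}) = {counts[total]}/36 = {counts[total] / 36:.3f}")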

4.2 Important Statistical Concepts


Probability Distributions

In principle, every random variable can have its own unique distribution. In reality, however, most random variables can be closely approximated by, or follow, a distribution from a set of well-known parametric distribution functions. Some examples that regularly occur are discussed below.

Normal distribution

Arguably one of the most common distributions, the normal distribution has a bell-shaped curve. Since, in many naturally occurring scenarios, attributes are distributed symmetrically around their mean value, this distribution represents a significant amount of real-life data. One such real-life example would be the performance assessment of an organization's employees. Some employees are considered to be high performers, while others can be considered low performers. Most of the employees' performances will be around the average. This is represented in the normal distribution figure below. The normal distribution has about 68 percent of the possible values within one standard deviation of the mean, while two standard deviations cover 95 percent of the values. Finally, the interval of ±3 standard deviations contains 99.7 percent of the values.

Figure 18: The Normal Distribution

Source: Created on behalf of IU (2023).
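These coverage percentages can be verified numerically; the following sketch uses the standard normal distribution from SciPy, which is an illustrative choice of tooling.

from scipy.stats import norm

# Probability mass of a standard normal distribution within ±1, ±2, and ±3
# standard deviations of the mean.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"±{k} standard deviation(s): {coverage * 100:.1f} %")
# Prints roughly 68.3 %, 95.4 %, and 99.7 %.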

Binomial distribution

The binomial distribution is the probability distribution of the number of successes in a sequence of independent trials that each can be described by a binary random variable, i.e., a random variable that can only take on one of two possible values. An example of a binomial distribution is repeatedly tossing a coin with the possible outcomes heads or tails.

If a coin is tossed twice, what is the probability of “heads” occurring once? What is the
probability of “heads” occurring twice? What is the probability that “heads” will not occur
at all? The table below represents the possible outcomes when a coin is tossed twice.

Table 12: Possible Outcomes of Tossing a Coin

Outcome 1st toss 2nd toss

1 Heads Heads

2 Heads Tails

3 Tails Heads

4 Tails Tails

Source: Created on behalf of IU (2023).

The probability of “heads” occurring twice is only one out of four possible outcomes: p(two heads) = 1/4 = 0.25. However, the probability of “heads” occurring once in two throws is two out of four possible outcomes: p(one head) = 2/4 = 0.5. The probability of “heads” not occurring at all is only one out of four possible outcomes: p(no heads) = 1/4 = 0.25. These probabilities can be represented by the binomial distribution as demonstrated in the figure below:

Figure 19: The Binomial Distribution

Source: Created on behalf of IU (2023).
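The same probabilities follow from a binomial distribution with n = 2 trials and success probability p = 0.5; the short SciPy sketch below is purely illustrative.

from scipy.stats import binom

n, p = 2, 0.5  # two tosses of a fair coin
for heads in range(n + 1):
    print(f"p({heads} heads) = {binom.pmf(heads, n, p):.2f}")
# p(0 heads) = 0.25, p(1 head) = 0.50, p(2 heads) = 0.25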

Poisson distribution

The Poisson distribution quantifies the probability of a given number of independent events occurring in a fixed time interval. Depending on the mean number of occurrences, the probabilities of various event counts are calculated. The Poisson distribution is represented as follows:

p(x) = (e^(−μ) · μ^x) / x!

where μ is the mean number of occurrences, and x is the required number of occurrences.

If an average of 10 calls per day are sent to a call center, what is the probability that the
call center will receive exactly seven calls on a given day?

p(7) = (e^(−10) · 10^7) / 7! = 0.09

A Poisson distribution of the number of calls is shown in the following graph:

Figure 20: The Poisson Distribution

Source: Created on behalf of IU (2023).

There are many processes which are considered to follow a Poisson distribution, e.g., sales records, cosmic rays, and radioactive decay. A Poisson distribution is justified when the events are discrete (i.e., can be counted), independent, and no two events can occur at exactly the same time, while the rate at which events occur can be considered constant.
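The call center example can be reproduced numerically; the following sketch uses SciPy's Poisson distribution (an illustrative choice of tooling) with the stated mean of 10 calls per day.

from scipy.stats import poisson

mu = 10  # mean number of calls per day
print(f"p(7 calls) = {poisson.pmf(7, mu):.2f}")  # approximately 0.09

# Probabilities for a range of daily call counts, as plotted in the figure above.
for calls in range(21):
    print(calls, round(poisson.pmf(calls, mu), 3))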

Bayesian Statistics

Bayesian statistics is a unique branch of statistics that does not interpret probabilities as
frequencies of occurrences, but rather as an expectation of belief.

In general, the Bayes theorem is established by the following conditional probability equa-
tion for two random events A and B.

p(A|B) = p(B|A) · p(A) / p(B)

Here, p(A) is the prior belief about A, p(B|A) is the likelihood of observing the evidence B given A, and p(A|B) is the posterior belief about the variable A after observing the evidence B. This is summarized in the following figure.

Figure 21: Bayesian Statistics

Source: Created on behalf of IU (2023).

In any statistical setting, we typically consider the sample data to be created according to
a fixed yet unknown parametric probability distribution. For the task of statistical analysis,
however, all we have is the one sample realization, and we try to infer the properties of
that unknown distribution. Bayes theorem employs the rules of conditional probability to
probabilistically quantify our knowledge about the relevant parameters of the data gener-
ating process.

An example of Bayesian statistics is Helmenstine’s drug test analysis (2017), shown in the
following figure.

Figure 22: Drug Test Example

Source: Helmenstine (2017)

Here, U, ¬U, +, and − stand for a drug user, a non-user, a positive drug test, and a negative drug test, respectively. If 0.5 percent of the training set are drug users (P(U) = 0.5%), and the probability that a drug test will be positive when taken by a drug user is 99 percent (P(+|U) = 99%), then the probability that a drug user's test outcome will be negative is the remaining one percent (P(−|U) = 1%). Consequently, all other conditional probabilities are reported in the above figure.

What will be the probability that a new data record (i.e., a new person in the training set)
with a positive drug test outcome is actually a drug user?

P(U|+) = P(+|U) · P(U) / P(+)

P(U|+) = P(+|U) · P(U) / [P(+|U) · P(U) + P(+|¬U) · P(¬U)]

P(U|+) = (0.99 · 0.005) / (0.99 · 0.005 + 0.01 · 0.995) = 33.2%

Since the probability is only 33.2 percent, it implies that even if a test produces a positive
result for a person, it is more likely that the person is not a drug user.
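The posterior can also be computed directly from the stated probabilities; the following Python sketch simply restates the calculation in code.

# Prior and conditional probabilities as stated in the example.
p_user = 0.005              # P(U)
p_nonuser = 1 - p_user      # P(not U)
p_pos_given_user = 0.99     # P(+|U)
p_pos_given_nonuser = 0.01  # P(+|not U)

# Total probability of a positive test, P(+).
p_pos = p_pos_given_user * p_user + p_pos_given_nonuser * p_nonuser

# Bayes theorem: posterior probability of being a drug user given a positive test.
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(f"P(U|+) = {p_user_given_pos:.1%}")  # approximately 33.2 %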

This result applies if we assume that the person follows the same prior probability as the
general population (i.e., P(U)=0.5%). However, if we know more about a specific person
(e.g., if they have used drugs in the past, if they have a medical condition that makes rec-
reational drug use dangerous etc.), the prior changes, as does the posterior prediction.

Hence, knowledge of the posterior is critical, as it represents our best knowledge. This is important because, in machine learning, the training data represent our best knowledge about a dataset, and it is critical that the dataset is as accurate as possible; ensuring data quality is therefore very important.

As a concluding remark, the above example shows how the prior probability p(U) is updated to the posterior probability p(U|+) when taking into account the observed evidence (i.e., the positive test result). This update is the result of a classifier which is designed to predict the outcome for new data records.

SUMMARY
In this unit, an overview of the importance of probability and statistics in data science is presented. The data in most real-world applications are neither deterministic (in which case the application of machine learning would be obsolete) nor completely random (in which case no outputs could be predicted). Instead, the data lie somewhere in between these two extremes, where data scientists can apply machine learning techniques to obtain predictions together with probabilities that quantify their uncertainty.

The unit describes the importance of statistics in data science, as well as the importance of probability theory. Conditional probability and the probability density function are explained as major concepts of probability analysis. Commonly occurring probability distributions are also explained, and Bayesian statistics is discussed in order to establish its importance for classification analysis and the predictions of machine learning models.

UNIT 5
MACHINE LEARNING

STUDY GOALS

On completion of this unit, you will have learned …

– what is meant by machine learning.
– the different applications of machine learning.
– the concepts of classification and regression.
– the difference between each of the machine learning paradigms.
– the basic machine learning approaches.
5. MACHINE LEARNING

Case Study
A company selling home appliances plans to offer an online purchasing service for its
products. The marketing team has proposed different advertisement campaigns to attract
customers that visited certain webpages. After a few months, the data analytics team were
able to see which campaign(s) brought in the highest revenue, as well as who the top cus-
tomers were. The data analytics team had to examine the purchasing data in order to find
the patterns that indicated which campaign had the highest revenue. After analyzing the
data, they could also make a prediction about the type of customer who bought a large
number of products or spent a lot of money. When dealing with a vast amount of purchas-
ing data, which is continually added to over time, the task becomes almost impossible for
human or manual implementations. Therefore, machine learning tools are introduced.
They are able to employ the high performance capabilities of a personal computer (i.e.,
machine) to handle this large amount of purchasing data, to develop a model that learns
this data, and to automatically and efficiently achieve both the finding of the patterns and
the prediction tasks.

Machine learning provides systems with the ability to improve and learn from experience
without being explicitly programmed. In machine learning, a learning algorithm is devel-
oped to extract knowledge and uncover the properties of the data. Ultimately, its goal is to
predict future outcomes, taking into account any new data scenarios that are inserted.
Results are evaluated using predefined accuracy metrics, then they are transformed into
an enhanced business decision using a given objective and key performance indicators.

5.1 Role of Machine Learning in Data Science
According to Samuel (1959), machine learning is a “field of study that gives computers the
ability to learn without being explicitly programmed.” Thus, machine learning is a mathe-
matical algorithmic approach that builds a generalized model from the given data to per-
form data-driven predictions. Because of this, the learned model is able to make predic-
tions about new data scenarios that it has not seen before.

To this end, machine learning employs descriptive statistics to summarize salient proper-
ties of the data and predictive analytical techniques to derive insights from training data
that are useful in subsequent applications.

In order to see why learning to perform a task from data is better than operating based on
a fixed set of instructions, consider the task of recognizing a person from an image.
Undoubtedly, it is difficult to write a classical computer program with the objective of rec-
ognizing a specific person. If, however, we have constructed a learning system that can use example data in a given task, we can provide it with a large amount of human facial
images as well as the information about who is depicted in these images. Then, a mathe-
matical or algorithmic model is fit to the data in order to uncover the underlying relation
and use it to generate the outputs for new samples. The developed model is called a
machine learning model.

As the name already indicates, machine learning uses some form of computational resource or device to process the data. For example, the inputs could be advertising campaigns (i.e., the independent variables) and their associated output the revenue (i.e., the dependent, or target, variable). The output of the machine learning process is then a model (i.e., a program) that is able to predict the returns for future campaigns, as schematically described in the following figure.

Figure 23: Machine Learning

Source: Created on behalf of IU (2023).

The developed machine learning models are applied in a variety of different settings such
as vision/language processing, forecasting, pattern recognition, games, data mining,
expert systems, and robotics. The applications of machine learning cover many fields such
as:

• recognizing patterns for facial identification and emotion detection,
• recognizing handwritten characters and speech,
• recognizing objects of interest in medical images,
• detection of motion in consequent images,
• recognizing unusual credit card transactions, unusual sensor readings, unusual engine
sounds, etc.,
• prediction of future stock prices or currency exchange rates,
• detection of fraud and spam filtering,
• establishing a recommendation system,
• grouping documents or images with similar contents, and
• dimensionality reduction by displaying a huge database in a revealing way.

It is all about data! We are given a dataset of inputs (independent variables) and outputs
(labels or dependent variables), and machine learning is implemented to discover the
patterns within the inputs and/or to predict the relationships between the inputs and out-
puts. These relationships can be utilized later to predict the outputs for new inputs. The possible outputs in any dataset are either continuous or discrete values. For example, the
outputs in a dataset of students’ marks are continuous values ranging from 0 to 100, while
the outputs in a weather dataset may be discrete values such as [0: for windy, 1: for sunny,
and 2: for cloudy].

For continuous outputs, machine learning builds a prediction model called a regression
model. For discrete outputs, the prediction model is called a classification model, and the
outputs are classes, one for each possible discrete output value.

On the other hand, when the objective is to discover the hidden patterns within the data
inputs (i.e., the outputs may not be provided), machine learning performs a clustering
analysis to group the inputs into clusters according to their level of similarity. This is based
on the values of the given independent variables.

There is a broad variety of machine learning paradigms in existence today, each of them
corresponding to a particular abstract learning task. For our purposes, the three most
important ones are supervised learning, unsupervised learning, and semi-supervised
learning. Supervised learning denotes the learning task when both the data inputs and the
desired outputs are provided, and includes classification and regression approaches. The
notion of unsupervised learning relates to the discovery of patterns in the data inputs and
includes the clustering analysis. Unsupervised learning is considered an important fron-
tier of machine learning because most big datasets do not come with labels (i.e., the
desired outputs are not known). Semi-supervised learning covers tasks that involve parti-
ally labelled data sets.

5.2 Overview of ML Approaches


Supervised Learning

Supervised learning is a paradigm of machine learning that is used when the given dataset
contains both inputs (independent variables, xi) and desired output (dependent variable,
y). The objective of supervised learning is to develop a mathematical model (f) that
relates the output to the inputs, and can predict the output for future inputs, as clarified in
the following equation:

y = f(xi),   i = 1, …, n

Where n is the total number of variables (i.e., characteristics) of the data samples.

In classification, the output (y) belongs to a set of finite and discrete values that denote
the predicted class or classes. In regression, however, the output (y) belongs to a range of
infinite continuous values that define the numerical outcome(s).

During the model’s learning, its parameters are continuously updated until the optimum
setting is achieved. This updating process is governed by a specific loss function, and the
objective is to adjust the parameters so that this loss function is minimized. For regression
problems, this loss function can be the mean squared error (MSE), and for classification
problems, the loss function can be the number of wrongly classified instances.

The structure of the supervised learning procedure is shown in the figure below. In the fig-
ure, the inputs from the training set are used to teach the model through one of the availa-
ble classification/regression algorithms. Afterwards, the model is implemented to predict
the output for the testing sample(s), which were not presented during the model’s train-
ing. If the predicted output is within an acceptable range of the desired output, the model
is accepted and can complete the prediction task in the future. Otherwise, the learning
process has to be repeated.

Figure 24: Supervised Learning Structure

Source: Created on behalf of IU (2023).

Classification analysis

An example dataset includes hundreds of e-mails and their attributes, where each e-mail
is labeled as either spam or not spam. Can we predict whether a newly received e-mail is
spam? Another example dataset related to red wine samples uses their physicochemical
variables as inputs, and their associated quality (poor, normal, or excellent) as outputs. If
a new wine sample is provided, along with its physicochemical variables, can we predict
its quality?

The two problems described above are typical classification problems, where the dataset
is a collection of labeled-data records in the form: {independent variables as inputs, and
the associated classes (i.e., labels) as outputs}. The task is to develop a machine learning model to relate the inputs to the outputs, and to predict the class of new inputs. In classifi-
cation, the outputs are finite and categorical, and the developed model has to assign a
single class to new inputs.

In practice, the dataset is divided into two sets, the training set and the testing set. The
training set is employed to develop the classification model, while the testing set is uti-
lized afterwards to evaluate the accuracy of the developed model. The outputs of the
model can be presented in the form of a confusion matrix, and the evaluation of a classifi-
cation model is done by precision, recall, ROC, or AUC.

In typical application settings, several classification techniques are implemented, yielding a corresponding number of models. The model that returns the highest evaluation per-
formance is selected. It is very difficult to decide which classification technique will per-
form better at the beginning of the analysis, especially if there is no prior information
about the nature of the classification task. Therefore, the experience of the data scientists
with a wide range of possible classification techniques plays an important part in finding
the best solution to the given problem.

Regression analysis

In regression problems, the task is to develop a machine learning model that predicts a
numerical value, not a class. Therefore, the desired outputs form a continuum of values. In
general, the developed model is a mathematical function that relates the outputs to the
inputs.

One example is the historical dataset of a real estate valuation. If multiple instances of
characteristics of houses in a city zone are provided, as well as the price for each of these
houses, can we predict the price of a different house if its characteristics are known?

As in the case of classification, there are many techniques that can be used to develop a
regression model for the given problem. The regression’s evaluation metrics are routinely
calculated so that the best model can be voted for. Examples of eligible measures include
MAPE, MSE, and MAE.

Supervised learning techniques

There are many applications where supervised learning is implemented, as seen in the fol-
lowing table.

Table 13: Supervised Learning Examples

Example Dataset Prediction Type

Previous home sales How much is a specific home worth? Regression

Previous loans that were paid Will this client default on a loan? Classification

Previous weeks’ visa applications How many businesspeople will apply for a Regression
visa next week?


Previous statistics of benign/malignant cancers      Is this cancer malignant?      Classification

Source: Created on behalf of IU (2023).

The traditionally applied supervised learning techniques are:

Decision tree based methods: A flowchart-like tree structure, shown in the following fig-
ure, where each internal node denotes a test on a particular variable of the dataset, each
branch denotes the outcome of the test, and each leaf node holds a class label.

Figure 25: Decision Tree Based Method

Source: Created on behalf of IU (2023).
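As a brief illustration of how such a classifier can be trained and evaluated in practice, the following sketch fits a decision tree to scikit-learn's bundled Iris dataset; the dataset, the depth limit, and the 70/30 split are illustrative assumptions rather than part of the course material.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset (features X, class labels y).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit a shallow decision tree and evaluate it on the held-out test set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.2f}")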

K-nearest neighbors method: The training set is represented as points in the Euclidean
space, and the class label for each element of the testing set is determined according to
the label of the K closest training points (i.e., nearest neighbors), as demonstrated in the
following figure.

Figure 26: K-Nearest Neighbors Method

Source: Created on behalf of IU (2023).

Naïve Bayes method: This method is based on Bayes theorem, and is designed for catego-
rical data. An entity is sorted into the class with the highest posterior probability in rela-
tion to the values of the features in the corresponding data record. The different features
in the record are assumed to be independent random variables. This simplifies the calcu-
lation of probabilities to a tractable problem. The qualifier “naïve” in the name Naïve
Bayes stems from the fact that this presumed independence cannot be taken for granted,
yet is assumed anyway.

Support Vector Machines (SVM) method: This is a binary classification method (i.e. a
method for separating the input data into two classes) that seeks to construct a linear boundary between the classes. In real-world settings, however, data points from different
classes are rarely linearly separable. The support vector method therefore addresses this
problem by projecting the data to a higher-dimensional space where a linear separation is
feasible. The SVM technique seeks to adjust the classification boundary so that the margin
is maximized in order to obtain the optimum separation between the two classes. Data
elements lying on the margin are called support vectors.

Figure 27: Support Vector Machines (SVM) Method

Source: Created on behalf of IU (2023).

Linear regression method: Linear regression is used to find a linear function that best rep-
resents a set of given sample points. Denoting the target variable as y and the independ-
ent variables as [x1, x2, …, xm], the model is described by the following equation:

y = w0 + w1·x1 + w2·x2 + … + wm·xm

Where w0 is the so-called bias, and the coefficients (w1, w2,…, wm) are the weights. The
goal is to optimize the bias and weights in such a way that the error between the model output and the given target values of the training set is minimal.
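As a minimal sketch, the following code fits such a linear model with a single independent variable using scikit-learn's LinearRegression; the data points are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data: one independent variable x and a target y.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

model = LinearRegression()
model.fit(x, y)

print("bias w0:", model.intercept_)                        # estimated bias term
print("weight w1:", model.coef_[0])                        # estimated weight
print("prediction for x = 6:", model.predict([[6.0]])[0])  # prediction for a new input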

Figure 28: Linear Regression Method

Source: Created on behalf of IU (2023).

Logistic regression method: This is considered to be an extension of linear regression analysis for cases where the dependent variable follows a binomial distribution and takes categorical values (e.g., yes/no) or discrete values (e.g., {0, 1, 2}), as shown in the following figure.

Figure 29: Logistic Regression Method

Source: Created on behalf of IU (2023).

Artificial neural network (ANN) method: ANNs were first proposed in the 1950s as a theoretical model of the computational processes in nerve cells. Compared to our current understanding of neural processes, this model can
only be seen as a rough analog, yet its application to machine learning problems has been
highly fruitful. The network is composed of many layers of computational units, so-called
neurons. The input layer is for the input values of the dataset variables, the output layer
produces the value of the target variable, and the intermediate layers are called the hid-
den layers, as demonstrated in the following figure. The construction of artificial neural
networks with numerous cascading hidden layers is called deep learning.

Figure 30: Artificial Neural Network Method

Source: Created on behalf of IU (2023).

Unsupervised Learning

If you are given a basket with some unlabeled objects, and you plan to group objects that
are the same, you will pick a random object and select any physical characteristic of it,
such as its surface shape. Afterwards, you will pick all other objects that have similar
shapes to the initial object, and group them together. Then, you will repeat the process
until all objects are clustered into groups. This process is called unsupervised learning
because you do not know the name of any of the given objects.

Unsupervised machine learning is implemented in order to handle problems with unlabeled datasets. Thus, the provided dataset consists of inputs (independent variables, xi) while the output (dependent variable, y) is not known. One reason that this may
be common is that acquiring labels can become expensive in many big data applications.
The aim of unsupervised learning is to discover the natural patterns within the given
inputs, which may result in dimensionality reduction, and/or clustering the data instances
into groups according to their relative similarity. The structure of unsupervised machine
learning is shown in the following figure.

Figure 31: Unsupervised Learning Structure

Source: Created on behalf of IU (2023).

While supervised learning tries to find a functional relationship between dependent and
independent variables, unsupervised learning aims to find intrinsic structure or patterns
in the data. Additionally, unsupervised learning techniques are used to reduce the dimen-
sionality of data while retaining important structural properties.

The cost function in an unsupervised learning model can be the minimum quantization
error, the minimum distance between similar data instances, or the maximum likelihood
estimation of the correct cluster.

Clustering analysis

Unsupervised learning is utilized in situations where the outcomes are unknown.
Thus, we can either cluster the data to reveal meaningful partitions and hierarchies, or
find association rules that relate to the involved data’s features.

For example, a core theme in marketing is obtaining insights into the customer demo-
graphic. One way to achieve this is to find so-called customer segments, i.e. groups of sim-
ilar or comparable customers. Once these segments and their relative sizes are found,
marketing, or even product design efforts, can be targeted specifically for these segments.
Since there are no pre-defined labels which could inform such a segmentation, the defini-
tion of segments has to be met entirely based on patterns in the customer features.

Clustering is used to gather data records into natural groups (i.e., clusters) of similar sam-
ples according to predefined similarity/dissimilarity metrics, resulting in extracting a set of
patterns from the given dataset. The contents of any cluster should be very similar to each

72
other, which is called high intra-cluster similarity. However, the contents of any cluster
should be very different from the contents of other clusters. This is called low inter-cluster
similarity.

The similarity/dissimilarity metric that is routinely utilized in clustering analysis is a form of distance function between each pair of data records (e.g., A and B). The value of the predefined distance function is therefore a measure of how close A and B are to each other, and a decision is made concerning whether or not to combine A and B in one cluster.

There are two commonly implemented, simple forms of the distance function, which are
the Euclidean distance and the Manhattan distance.

For two dimensional datasets (i.e., having two features), the Euclidean distance function is
given by the following equation:

d(A, B) = sqrt( (xA − xB)² + (yA − yB)² )

While the Manhattan distance function is:

d(A, B) = |xA − xB| + |yA − yB|

Where (xA,yA) and (xB,yB) are the coordinates (i.e., features) of data records A and B
respectively, d is the value that represents the distance between the two data records.
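Both distance functions are easy to compute; the following sketch uses two made-up two-dimensional data records A and B.

import numpy as np

# Two illustrative data records with two features each.
A = np.array([1.0, 2.0])
B = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((A - B) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(A - B))          # sum of absolute coordinate differences

print("Euclidean distance:", euclidean)  # 5.0
print("Manhattan distance:", manhattan)  # 7.0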

It is worth mentioning that features of a dataset whose scales have widely differing ranges should be standardized to the same scale before beginning the clustering analysis.

The clustering evaluation is usually completed by manual inspection of the results, bench-
marking on existing labels, and/or by distance measures to denote the similarity level
within a cluster and the dissimilarity level across the clusters.

The clustering analysis is applied in many fields including pattern recognition, image pro-
cessing, spatial data analysis, bio-informatics, crime analysis, medical imaging, climatol-
ogy, and robotics. One of the most famous areas for clustering applications is market segmentation, which focuses on grouping customers into clusters of different characteris-
tics (payment history, customers’ interests, etc.). Another common application is to imple-
ment clustering analysis in order to develop a recommendation system, for example to
cluster similar documents together or to recommend similar songs/movies.

Unsupervised learning techniques

Some examples of unsupervised learning applications can be seen in the following table.

Table 14: Unsupervised Learning Examples

Example dataset Discovered patterns Type

Customers profiles Are these customers similar? Clusters

Previous transactions Is a specific transaction odd? Anomaly detection

Previous purchasing      Are these products purchased together?      Association discovery

Source: Created on behalf of IU (2023).

Since clustering has been, and still is, an active area of research, there are many methods
and techniques that have been developed to determine how the grouping of data records
is performed. The basic clustering techniques are the K-means clustering method and the
agglomerative clustering method.

The K-means clustering method is an algorithm used to group given N data records into K
clusters. The algorithm is straightforward and can be explained in the following steps:

1. Decide on the number of clusters, K.
2. Select K random data records to represent the initial centroids of these clusters.
3. Calculate the distances between each data record and the defined centroids, then assign the data record to the cluster whose centroid it is closest to. The Euclidean distance d(i, c) is the most commonly employed distance measure in K-means clustering, as given in the equation below:

d(i, c) = sqrt( (x1,i − x1,c)² + (x2,i − x2,c)² + … + (xM,i − xM,c)² )

Where (x1, x2, …, xM) are the M data variables, while i and c denote the ith data record
and the cluster’s centroid respectively.

4. Recalculate the new centroid for each cluster by averaging its included data records.
5. Repeat steps (3) and (4) until there are no further changes in the calculated centroids.
6. The final clusters are formed by their included data records.
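In practice, these steps are rarely implemented by hand; libraries such as scikit-learn provide them. The following sketch clusters a handful of made-up two-dimensional records with KMeans, where the data and the choice of K = 2 are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative two-dimensional data records forming two loose groups.
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],  # records near (1, 1)
    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],  # records near (5, 5)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print("Cluster labels:", kmeans.labels_)
print("Cluster centroids:")
print(kmeans.cluster_centers_)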

The agglomerative clustering method is also known as hierarchical clustering, and is mainly applied to the data generated from a process defined by an underlying hierarchy.
The technique develops a bottom-up tree (dendrogram) of clusters that repeatedly merge
the two closest points or clusters into a bigger super cluster. Hence, the leaves of the
developed tree are the individual data records, and its root is the universe of these
records. The algorithm can be summarized as follows:

1. Assign each record of the given N data records to a unique cluster, forming N clusters.
2. Afterwards, the data records (i.e., clusters) with minimum Euclidean distance between
them are merged into a single cluster.
3. The process is repeated until we are left with one cluster, hence forming a hierarchy of
clusters.

Semi-Supervised Learning

Semi-supervised machine learning is implemented in learning problems where the output (dependent variable, y) is only given for a few instances of the inputs (independent variables, xi). Therefore, semi-supervised learning is a mix of the unsupervised and supervised learning paradigms, and combines the properties of both.

In self-training, the oldest approach to semi-supervised learning, a basic classification model is designed based on the few labeled data instances. This is called the semi-supervised classification step, and is shown in the following figure.

Figure 32: Semi-Supervised Learning (Classification Step)

Source: Created on behalf of IU (2023).

The model trained on the supervised examples is then used to label the hitherto unla-
beled data instances. From these newly labeled data points, the ones with highest confi-
dence are added to the supervised training set. Iteratively repeating this procedure finally
leads to a classification boundary that makes use of all the available information, as
shown in the following figure.

Figure 33: Semi-Supervised Learning (Clustering Step)

Source: Created on behalf of IU (2023).
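A minimal sketch of such a self-training loop is shown below; the synthetic two-dimensional data, the logistic regression base classifier, and the 0.95 confidence threshold are all illustrative choices, not a prescription from the original text.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: two clusters of 50 points each; -1 marks an unlabeled record.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.full(100, -1)
y[:3] = 0     # a few labeled examples of class 0
y[50:53] = 1  # a few labeled examples of class 1

for _ in range(10):  # iterative self-training
    labeled = y != -1
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    unlabeled = np.where(~labeled)[0]
    if len(unlabeled) == 0:
        break
    proba = clf.predict_proba(X[unlabeled])
    confident = unlabeled[proba.max(axis=1) > 0.95]  # high-confidence predictions
    if len(confident) == 0:
        break
    y[confident] = clf.predict(X[confident])  # add pseudo-labels to the training set

print("Remaining unlabeled records:", int(np.sum(y == -1)))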

The advantage is that a lot of effort and computational cost are saved, because collecting
and labelling large datasets can be very expensive. Furthermore, the patterns and similari-
ties among the data instances are discovered, which brings more insight into the dataset
structure.

A popular application for semi-supervised learning is speech analysis. Here, the task is to
identify words from audio files of utterances. While recording spoken words is easily
accomplished and data of this kind is abundant, labeling the data is a very time consum-
ing process.

SUMMARY
In this unit, an introduction to machine learning in data science is pre-
sented, giving an overview of the involved definitions and concepts.
Machine learning is an inductive process that automatically builds a pre-
diction model and extracts relevant patterns by learning the natural
structure of a given dataset.

There are three major paradigms of machine learning: supervised, unsupervised, and semi-supervised learning. Each paradigm is characterized by the type of data that it requires and the type of output that it generates, meaning that each uses different algorithms to build its learning models.

The output of the developed model is a discrete value in classification
problems, and a continuous value in regression problems. Meanwhile, if
the datasets are not labeled with an output variable, the machine learn-
ing objective is to retrieve the important patterns by applying clustering
analysis.

BACKMATTER
LIST OF REFERENCES
Baldassarre, M. (2016). Think big: Learning contexts, algorithms and data science.
Research on Education and Media, 8(2), 69—83. Retrieved from https://content.sciendo
.com/view/journals/rem/8/2/article-p69.xml

Brownlee, J. (2019, September 18). How to create an ARIMA model for time-series forecast-
ing in Python [blog post]. Retrieved from https://machinelearningmastery.com/arima-
for-time-series-forecasting-with-python/

Dalinina, R. (2017, January 10). Introduction to forecasting with ARIMA in R [blog post].
Retrieved from https://www.datascience.com/blog/introduction-to-forecasting-with-
arima-in-r-learn-data-science-tutorials

Desjardins, J. (2016, August 12). The largest companies by market cap over 15 years
[chart]. Retrieved from https://www.visualcapitalist.com/chart-largest-companies-ma
rket-cap-15-years/

Dorard, L. (n.d.). The machine learning canvas [PDF document]. Retrieved from https://ww
w.louisdorard.com/machine-learning-canvas

Giasson, F. (2017, March 10). A machine learning workflow [blog post]. Retrieved from http
://fgiasson.com/blog/index.php/category/artificial-intelligence/

Hackernoon. (2018, June 2). General vs narrow AI [blog post]. Retrieved from https://hacke
rnoon.com/general-vs-narrow-ai-3d0d02ef3e28

Haykin, S. (2012). Feed forward neural networks: An introduction [PDF document]. Retrieved from http://media.wiley.com/product_data/excerpt/19/04713491/0471349119.pdf

Helmenstine, A. M. (2017, August 12). Bayes theorem definition and examples [blog post].
Retrieved from https://www.thoughtco.com/bayes-theorem-4155845

Jogawath, A. K. (2015, September 28). Introducing Hadoop—HDFS and map reduce [blog].
Retrieved from https://ajaykumarjogawath.wordpress.com/tag/big-data/

Le Dem, J. (2016). Efficient data formats for analytics with Parquet and Arrow [presenta-
tion slides]. Retrieved from https://2016.berlinbuzzwords.de/sites/2016.berlinbuzzwo
rds.de/files/media/documents/berlin_buzzwords_2016_parquet_arrow.pdf

Malhotra, A. (2018, February 1). Tutorial on feed forward neural network: Part 1. Retrieved
from https://medium.com/@akankshamalhotra24/tutorial-on-feedforward-neural-ne
twork-part-1-659eeff574c3

MerchDope. (2019, September 29). 37 mind blowing YouTube facts, figures, and statistics:
2019 [blog post]. Retrieved from https://merchdope.com/youtube-stats/

Michel, J., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, ... Aiden,
E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science,
331(6014), 176—182. Retrieved from https://science.sciencemag.org/content/331/601
4/176

Montibeller, G., & Winterfeldt, D. (2015). Cognitive and motivational biases in decision and
risk analysis. Risk Analysis, 35(7), 1230—1251.

Nau, R. (2014). Notes on nonseasonal ARIMA models [PDF document]. Retrieved from http:
//people.duke.edu/~rnau/Notes_on_nonseasonal_ARIMA_models--Robert_Nau.pdf

PeerXP. (2017, October 17). The 6 stages of data processing cycle [blog post]. Retrieved
from https://medium.com/peerxp/the-6-stages-of-data-processing-cycle-3c2927c466f
f

Pollock, N. J., Healey, G. K., Jong, M., Valcour, J. E., & Mulay, S. (2018). Tracking progress in
suicide prevention in Indigenous communities: A challenge for public health surveil-
lance in Canada. BMC Public Health, 18(1320). Retrieved from https://bmcpublichealth
.biomedcentral.com/articles/10.1186/s12889-018-6224-9

Polson, N., & Scott, S. (2011). Data augmentation for support vector machines. Bayesian Analysis, 6(1), 1–23. Retrieved from https://projecteuclid.org/download/pdf_1/euclid.ba/1339611936

Prakash, R. (2018, June 19). 5 different types of data processing [video]. Retrieved from https://www.loginworks.com/blogs/5-different-types-of-data-processing/

Rapolu, B. (2016). Internet of aircraft things: An industry set to be transformed [article]. Retrieved from http://aviationweek.com/connected-aerospace/internet-aircraft-things-industry-set-be-transformed

Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis.
Wiesbaden: Springer Vieweg.

Saleh, B., Abe, K., Arora, R. S., & Elgammal, A. (2014). Toward automated discovery of artistic influence. Multimedia Tools and Applications, 75, 3565–3591.

Shaikh, F. (2017, January 19). Simple beginner’s guide to reinforcement learning & its implementation [blog post]. Retrieved from https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/

Statista. (2020). Volume of data/information created worldwide from 2010 to 2025 [chart].
Retrieved from https://www.statista.com/statistics/871513/worldwide-data-created/

Thakur, D. (2017). What is data transmission? Types of data transmission [article]. Retrieved from http://ecomputernotes.com/computernetworkingnotes/communication-networks/data-transmission

Tierney, B. (2012, June 13). Data science is multidisciplinary [blog post]. Retrieved from https://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html

Wenzel, F., Galy-Fajou, T., Deutsch, M., & Kloft, M. (2017). Bayesian nonlinear support vector machines for big data. Paper presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Skopje. Retrieved from https://arxiv.org/abs/1707.05532

82
LIST OF TABLES AND FIGURES
Table 1: Top Five Traded Companies (2001–2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Figure 1: The Data Science Venn Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Figure 2: The Extended Data Science Venn Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Table 2: Data Handling Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Table 3: Data Features Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Table 4: Artificial Intelligence Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Table 5: Model Development Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Table 6: Model Performance Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Figure 3: Data Science Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Figure 4: What Do Data Scientists Spend the Most Time Doing? . . . . . . . . . . . . . . . . . . . . . . . 24

Table 7: Qualitative vs. Quantitative Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

Table 8: Structured vs. Unstructured Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Figure 5: The Correlation Between Two Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Table 9: Data Transformation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Figure 6: Identification of an Organization’s Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Figure 7: Achieved Value by Data Science in “Customer”-Related Use Cases . . . . . . . . . . . 38

Figure 8: Achieved Value by Data Science in “Operational”-Related Use Cases . . . . . . . . . 38

Figure 9: Achieved Value by Data Science in “Financial Fraud”-Related Use Cases . . . . . . 39

Figure 10: Machine Learning Canvas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Figure 11: DSUC: Real Estate Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Table 10: The Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

Figure 12: Receiver Operator Characteristic (ROC) Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Figure 13: Some Commonly Utilized KPIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Table 11: The Common Cognitive Biases and Their De-biasing Techniques . . . . . . . . . . . . 47

Figure 14: Mutually Exclusive Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Figure 15: Mutually Independent Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Figure 16: Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Figure 17: The Probability Density Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Figure 18: The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Table 12: Possible Outcomes of Tossing a Coin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Figure 19: The Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Figure 20: The Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Figure 21: Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Figure 22: Drug Test Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Figure 23: Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

Figure 24: Supervised Learning Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Table 13: Supervised Learning Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

Figure 25: Decision Tree Based Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

Figure 26: K-Nearest Neighbors Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Figure 27: Support Vector Machines (SVM) Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Figure 28: Linear Regression Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

Figure 29: Logistic Regression Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

Figure 30: Artificial Neural Network Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Figure 31: Unsupervised Learning Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Table 14: Unsupervised Learning Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Figure 32: Semi-Supervised Learning (Classification Step) . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Figure 33: Semi-Supervised Learning (Clustering Step) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

IU Internationale Hochschule GmbH
IU International University of Applied Sciences
Juri-Gagarin-Ring 152
D-99084 Erfurt

Mailing Address
Albert-Proeller-Straße 15-19
D-86675 Buchdorf

[email protected]
www.iu.org

Help & Contacts (FAQ)

On myCampus you can always find answers to questions concerning your studies.
