Artificial Intelligence for Business Analytics: Algorithms, Platforms and Application Scenarios (ISBN 978-3-658-37599-7, 978-3-658-37598-0, 3-658-37599-X)
The translation was done with the help of artificial intelligence (machine
translation by the service DeepL.com). A subsequent human revision was done
primarily in terms of content.
This work is subject to copyright. All rights are reserved by the Publisher,
whether the whole or part of the material is concerned, specifically the rights of
translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or
by similar or dissimilar methodology now known or hereafter developed.
The publisher, the authors, and the editors are safe to assume that the advice and
information in this book are believed to be true and accurate at the date of
publication. Neither the publisher nor the authors or the editors give a warranty,
expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral
with regard to jurisdictional claims in published maps and institutional
affiliations.
© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023
F. Weber, Artificial Intelligence for Business Analytics
https://doi.org/10.1007/978-3-658-37599-7_1
Requirement: provide the right data to the right people at the right time
for decision support.
Volume refers to the size of the data. Large amounts of data are specified in
several terabytes and petabytes. A terabyte stores as much data as would fit on
1500 CDs or 220 DVDs, enough to store about 16 million Facebook photos.
Facebook processes up to one million photos per second [6]. A petabyte is
equivalent to 1024 terabytes. However, the definitions of big data depend on the
industry and the type of data and do not allow us to easily define a specific
threshold for big data. For example, two data sets of the same size may require
different technologies to be processed depending on the type (table vs. video
data).
Variety refers to the structural heterogeneity in a data set. Modern
technologies allow companies to use different types of structured, semi-
structured, and unstructured data. Structured data refers to the tabular data in
spreadsheets or relational databases. Text, images, audio, and video are examples
of unstructured data that lack structural order but are required for some types of
analysis. Across a continuum between fully structured and unstructured data, the
format of semi-structured data does not meet strict standards on either side.
Extensible markup language (XML) is a textual language for exchanging data on
the web and is a typical example of semi-structured data. XML documents
contain user-defined data tags that make them machine-readable.
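As a small, hypothetical illustration of how user-defined tags make an XML document machine-readable (the document and tag names below are invented, not taken from any standard), consider:

```python
import xml.etree.ElementTree as ET

# A hypothetical order record. The tags (customer, item, quantity) are
# defined by the data producer rather than fixed by a rigid schema,
# which is what makes the data semi-structured yet machine-readable.
document = """
<order id="1001">
    <customer>Jane Doe</customer>
    <item sku="A-17">
        <quantity>3</quantity>
    </item>
</order>
"""

root = ET.fromstring(document)
print(root.find("customer").text)        # tags make fields addressable
print(root.find("item/quantity").text)
```

Because the structure is self-describing, a program can address individual fields without a predefined table layout.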
Velocity refers to the speed at which data is generated and the speed at which
it is to be analyzed and processed. The proliferation of digital devices, such as
smartphones and sensors has led to an unprecedented rate of data creation and
continues to drive an ever-increasing need for real-time analytics. Even
conventional retailers generate high-frequency data; Wal-Mart, for example,
processes more than one million transactions per hour [7].
Also, new technologies, such as in-memory databases (where the data is
permanently located in the physical main memory of the computer), make it
possible not only to process larger amounts of data but also to do so in less time. In
a conventional database system, the data is disk-resident and the required data
can be temporarily stored in the main memory for access and to be processed
there, whereas in an in-memory database, the data is stored memory-resident and
only as a backup copy on the hard disk - otherwise it remains fully in the main
memory. In both cases, a given object can have copies both in memory and on
disk. The main difference, however, is that with the in-memory database, the
primary copy remains permanently in the main memory. As there has been a
trend in recent years for main memory to become cheaper, it is now possible to
move ever-larger databases into main memory. Since data can be accessed
directly in memory, much better response times and transaction throughputs can
be enabled. This is particularly important for real-time applications where
transactions must be completed within specified time limits.
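The difference between a disk-resident and a memory-resident primary copy can be sketched with SQLite, which supports both modes. This is only a minimal illustration of the concept, not a full in-memory database system:

```python
import sqlite3

# In-memory database: the primary copy of the data lives entirely in RAM.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
mem.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, 19.99), (2, 5.50)])

# A disk-resident database would instead keep its primary copy in a file,
# e.g. sqlite3.connect("transactions.db"), and page data into memory on demand.

total = mem.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]
print(total)
```

Since every read hits main memory directly, queries avoid disk I/O, which is the source of the response-time advantage described above.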
Also, the prevailing paradigm of software delivery is currently changing.
The increasing use of cloud solutions (where software and data are not hosted at
the user site) tends to enable a shorter time-to-market and the possibility to
create initial tests and prototypes with new technologies earlier.
The aforementioned changes in the availability of data, the greater volume of
data, and the availability of new software and software delivery models for
storage and processing serve as the basis for another trend: the increased use of
analytical models for the automated control of entire operational processes.
Thus, the key step that leads us to speak of business analytics rather than
business intelligence (see the elaboration in section “Distinction Between
Business Intelligence and Business Analytics”) is that of transferring decisions
from people to systems. Here are some examples:
In purely digital processes, such as omnichannel marketing, decisions have
already been transferred to the IT systems today. Customer communication is
sent directly from the system to the customers, based on the systemic
assessment of the individual customer. Examples include Amazon’s
promotional emails or Netflix’s recommendations. Based on the customer’s
data history, recommender systems optimize the communication with the
customer. But also the trading of stocks and currencies is now almost
completely automated and the algorithms of the different trading companies
compete against each other. Of course, the most successful investor here is the
one who uses the best algorithm.
Semiphysically digitized processes are processes in which analytics is used,
for example, to predict future demand and automatically reorder the necessary
goods. In this case, too, the winner in the market will be the company that
executes the processes using the best-optimized algorithms. The Internet of
Things (IoT) is another new term that describes the mappability of previously
purely physical processes through sensors and sensor data in all kinds of
everyday objects. For example, there are dairy farmers who have their cows
milked almost entirely automatically by robots. Humans are only called in
when necessary, such as when cows are found to be ill and need treatment,
which cannot be done by machines (yet?). For this purpose, a wide variety of
sensors from the barn, the environment, and the individual animals themselves
are used and evaluated.
Fully digitally controlled physical processes, such as the use of robots in the
automated production of goods or cars. These robots react to external physical
input and algorithms decide on the necessary response. They must be able to
make autonomous decisions based on algorithms, using speech and video
recognition to understand the physiological environment in which they
operate.
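The recommender systems mentioned in the first example can be sketched, in a heavily simplified form, as a co-occurrence recommender. The purchase histories and scoring rule below are invented for illustration only:

```python
from collections import Counter

# Hypothetical purchase histories; in a real system these would come from
# the customer's data history in the operational systems.
histories = {
    "alice": {"book", "ebook_reader"},
    "bob": {"book", "lamp"},
    "carol": {"book", "ebook_reader", "lamp"},
}

def recommend(customer, histories):
    """Recommend items bought by customers with overlapping histories,
    weighted by the size of the overlap."""
    own = histories[customer]
    scores = Counter()
    for other, items in histories.items():
        if other == customer:
            continue
        overlap = len(own & items)
        for item in items - own:
            scores[item] += overlap
    return [item for item, _ in scores.most_common()]

print(recommend("bob", histories))
```

Real systems such as those at Amazon or Netflix use far richer models, but the principle of scoring unseen items by similarity to the customer's own history is the same.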
In recent years, a huge number of processes have been digitally mapped,
digitized, or completely automated, and the manual decisions associated with
them have disappeared. In many ways, we are seeing today what people
expected during the “dot-com” era, which was all about the possibilities of new
automated and digitized business processes that allowed companies to compete
globally based on extremely scalable business models. Even then, new entrants
like Amazon.com were redefining the sale of books by transforming a purely
physical process (the bookstore around the corner) into a physically digitized
process (buying physical goods online). Later, Apple and Amazon began
producing physical devices to further increase the ability to consume content
(books, music, and movies) over the internet and thus the degree to which digital
value is created. Less noticed by the public, physical production processes have
become increasingly digital. Over the last decade, more and more business
processes have been digitized to the point where the nearest competitor is only
an app away (Apple iBooks vs. Google Play Books). The market-leading app is
often the one that is integrated on the leading platform, offers the best user
experience, and contains recommendations optimized for the individual based on
customer-related data analyses.
Since analytics is increasingly used in digital processes, and since these
processes can also be automated by analytics, business analytics nowadays is
much more than decision support for humans within a company. It is also about
providing data, but above all about controlling digitized processes intelligently
and automatically. Support for people is clearly moving into the background.
Based on these findings, the definition of business analytics is derived:
Most authors draw the line between business intelligence and business
analytics in relation to the temporal objective of the different kinds of
application. Thus, BI is generally assumed to have a purely ex-post and BA an
ex-ante perspective. This demarcation is certainly correct from a technical point
of view, for example, when looking purely at the algorithms used, as these are
indeed based on key performance indicators (KPIs) formed by the aggregation of
historical data. However, if we now extend the consideration to an overall
perspective and look at BI from an overall corporate point of view, detached
from the technical and operational level, BI certainly fits into a larger context of
operational decision support. BI is never an end in itself but a support tool for
decisions, carried out by people. Every KPI that is determined is basically only
used to make an assessment and decision for future changes (or not) based on it.
The KPI on sales by geographic sales region and the corresponding KPI on the
change of the same can serve several purposes. Figure 1.1 illustrates the time-
logic perspective with a timeline of data used to build predictive models or
business intelligence reports. The vertical line in the middle represents the time
when the model is built (today/now). The data used to build the models is on the
left, as this always represents historical data - logically, data from the future
cannot exist. When predictive models, which are the foundation of business
analytics, are built to predict a “future” event, the data selected to build the
predictive models will be based on a time before the date when the future event
is expected. For example, if you want to build a model to predict whether a
customer will respond to an email campaign, you start with the date the
campaign was created (when all responses were received) to identify all
participants. This is the date for the label “Definition and setting of the target
variables” in Fig. 1.1. The attributes used as inputs must be known before the
date of the mailing itself, so these values are collected to the left of the date the
target variables were collected. In other words, all the modeling data lies in the
past, but the target variable still lies in the future relative to the date at which the
attributes were collected on the timeline of the data used for modeling.
However, it is important to clarify that both business intelligence and business
analytics analyses are based on the same data and the data is historical in both
cases. The assumption is that future behavior to the right of the vertical line in
Fig. 1.1 is consistent with the past behavior. If a predictive model identifies
patterns in the past that predict (in the past) that a customer will buy a product,
the assumption is that this relationship will continue into the future – at least to a
certain probability.
Fig. 1.1 The analytics, data, and implementation perspectives of business analytics (own illustration)
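The time logic of Fig. 1.1 can be sketched in a few lines of code. The dates, attribute values, and campaign event below are invented purely for illustration:

```python
from datetime import date

# Invented customer history: (event_date, attribute_value, responded)
events = [
    (date(2022, 1, 10), 5, None),      # attribute observed before the campaign
    (date(2022, 2, 3), 8, None),       # attribute observed before the campaign
    (date(2022, 3, 1), None, True),    # campaign response (the target variable)
]

campaign_date = date(2022, 3, 1)

# Inputs must be known strictly before the target date ...
attributes = [v for d, v, _ in events if d < campaign_date and v is not None]
# ... while the target variable is observed on or after that date.
target = [r for d, _, r in events if d >= campaign_date and r is not None]

print(attributes, target)
```

The split enforces exactly the constraint described above: everything used as model input lies to the left of the vertical line, the target to its right.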
Descriptive Analytics
Descriptive analytics is known as the conventional approach to business
intelligence and aims to present or “summarize” facts and figures in an
understandable form to prepare data for communication processes or even
further analysis by a human. Two primary techniques are used: data aggregation
and “data mining” to identify past events. The goal here is to prepare historical
data in an easily communicable format for the benefit of a broad business
audience. A common example of descriptive analytics is company reports and
KPIs that simply provide an overview of a company’s operations, revenue,
finances, customers, and stakeholders. Descriptive analytics helps describe and
present data in a format that is easily understood by a variety of different
audiences. Descriptive analytics rarely attempts to examine or establish cause
and effect relationships. Because this form of analytics does not usually go
beyond a cursory examination, the validity of results is easier to achieve. Some
common methods used in descriptive analytics are observations, case studies,
and surveys. Therefore, the collection and interpretation of large amounts of data
in the big data environment can also play a role in this type of analytics, as it is
relatively irrelevant how many individual values a KPI is aggregated from.
Descriptive analytics is more suited to a historical presentation or a summary of
past data and is usually reflected in the use of pure statistical calculations. Some
common uses of descriptive analytics are:
Creation of KPIs to describe the utilization of machines, delivery times, or
waiting times of customers.
Reports on market shares or related changes.
Summary of past events from regional sales, customer churn, or marketing
campaign success (click-through rates or cost/profit calculations).
Tabular recording of social key figures, such as Facebook likes or followers.
Reporting on general trends and developments (inflation, unemployment rate).
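A descriptive KPI such as the click-through rate mentioned above is a pure aggregation of historical data, with no predictive claim attached. A minimal sketch with invented numbers:

```python
from statistics import mean

# Invented campaign log: region -> (emails_sent, clicks)
campaigns = {
    "north": (10_000, 420),
    "south": (8_000, 512),
}

# Click-through rate per region: a purely descriptive aggregation
# of historical data.
ctr = {region: clicks / sent for region, (sent, clicks) in campaigns.items()}
print(ctr)
print(mean(ctr.values()))
```

Note that, as stated above, it is irrelevant for the analytics itself whether the KPI is aggregated from a few thousand or a few billion individual values.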
Predictive Analytics
Predictive analytics and statistics have a considerable overlap, with some
statisticians arguing that predictive analytics is, at least at its core, just an
extension of statistics [8]. For their part, predictive modelers often use
algorithms and tests common to statistics. Most often, however, they do so
without considering the (final) validation, which any statistician would do to
ensure that the models are built “correctly” (validly). Nevertheless, there are
significant differences between the typical approaches in the two fields. The
table (Table 1.2) shows this clearly – maybe a bit too simplistically. Statistics is
driven by theory, while predictive analytics does not follow this requirement.
This is because many algorithms come from other fields (especially artificial
intelligence and machine learning) and these usually do not have a provable
optimal solution. But perhaps the most fundamental difference between the two
fields (summarized in the last row of the table) is: for statistics, the model is the
central element, while for predictive analytics it is the data.
Table 1.2 Comparison between statistics and predictive analytics
Prescriptive Analytics
The field of prescriptive analytics allows the user to “predict” a number of
different possible actions and can guide them to an (optimized) solution. In
short, these analytics are about advice. Prescriptive analytics attempts to quantify
the impact of future decisions in order to advise on possible outcomes before the
decisions are actually made. At its best, prescriptive analytics predicts not only
what will happen but why it will happen and recommends actions that realize the
benefits of the predictions.
These analyses go beyond descriptive and predictive analytics by
recommending one or more possible courses of action. Essentially, they forecast
multiple futures and allow companies to evaluate a range of possible outcomes
based on their actions. Prescriptive analytics uses a combination of techniques
and tools, such as algorithms, methods of artificial intelligence or machine
learning, and modeling techniques. These techniques are applied to various data
sets, including historical and transactional data, real-time data feeds, or big data.
Initiating, defining, implementing, and then using prescriptive analytics is
relatively complex and most companies do not yet use them in their day-to-day
business operations. When implemented properly, they can have a major impact
on the way decisions are made, and thus on the company’s bottom line. Larger
companies are successfully using prescriptive analytics to optimize production,
planning, and inventory in the supply chain to ensure they deliver the right
products at the right time and optimize the customer experience.
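A heavily simplified sketch of the prescriptive idea is to score several possible actions against a cost model over a few demand scenarios ("multiple futures") and recommend the cheapest one. All numbers and cost parameters below are invented for illustration:

```python
# Score each candidate order quantity against a simple cost model over
# several invented demand scenarios, then prescribe the cheapest action.
def expected_cost(order_qty, scenarios=(80, 100, 120),
                  unit_cost=2.0, holding_cost=0.5, stockout_cost=5.0):
    costs = []
    for demand in scenarios:
        leftover = max(0, order_qty - demand)      # overstock carried
        shortfall = max(0, demand - order_qty)     # lost sales
        costs.append(order_qty * unit_cost
                     + leftover * holding_cost
                     + shortfall * stockout_cost)
    return sum(costs) / len(costs)

candidates = range(0, 201, 10)
best = min(candidates, key=expected_cost)
print(best)
```

Real prescriptive systems replace the brute-force search and the hand-made cost model with optimization and machine-learning methods, but the structure - evaluate outcomes of actions, then advise - is the same.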
Data Sources
Big data is also characterized by different types of data that can be processed for
analysis. The left section “Data Sources” in the BA.TF shows which types of
data are (can be) available to the organization.
Structured data still make up the majority of data used for analysis according
to different surveys [18]. Structured data are mostly located in spreadsheets
(Microsoft Excel), tables, and relational databases that conform to a data model
that includes the properties and relationships between them. These data have
known data lengths, data types, and data constraints. Therefore, they can be
easily captured, organized, and queried because of the known structure. The
BA.TF displays structured data from sources, such as internal systems that
generate reports, operational systems that capture transactional data, and
automated systems that capture machine data, such as customer activity logs.
Unstructured data comes in many different forms that do not adhere to
traditional data models and are therefore typically not well suited for a relational
database. Thanks to the development of alternative platforms for storing and
managing such data, it is becoming increasingly common in IT systems. Unlike
traditionally structured data, such as transactional data, unstructured data can be
maintained in disparate formats. One of the most common types of unstructured
data is plain text. Unstructured text is generated and stored in a variety of forms,
including Word documents, email messages, PowerPoint presentations, survey
responses, transcripts of call center interactions, and posts or comments from
blogs and social media sites. Other types of unstructured data include images,
audio, and video files. Machine data is another category that is growing rapidly
in many organizations. For example, log files from websites, servers, networks,
and applications - especially mobile - provide a wealth of activity and
performance data. In addition, companies are increasingly collecting and
analyzing data from sensors on production equipment and other devices
connected to the Internet of Things (IoT).
In some cases, such data can be considered semi-structured - for example,
when metadata tags are added to provide information and context about the
content of the data. The boundary between unstructured and semi-structured data
is not absolute. Semi-structured data is even more widely used for analysis [18]
because while this data does not have a strict and rigid structure, it does contain
identifiable features. For example, photos and images can be tagged with time,
date, creator, and keywords to help users search and organize them. Emails
contain fixed tags, such as sender, date, time, and recipient attached to the
content. Web pages have identifiable elements that allow companies to share
information with their business partners.
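The fixed tags of an email mentioned above can be read with a standard parser. A minimal sketch using Python's stdlib email module, with an invented message:

```python
from email.parser import Parser

# An invented message: fixed tags (sender, recipient, subject) are attached
# to an otherwise unstructured free-text body.
raw = """From: [email protected]
To: [email protected]
Subject: Delivery delayed

My order has not arrived yet."""

msg = Parser().parsestr(raw)
print(msg["From"])          # identifiable feature
print(msg["Subject"])       # identifiable feature
print(msg.get_payload())    # unstructured body text
```

The headers are queryable like structured fields, while the body still requires text-analytics methods, which is exactly what places email on the semi-structured middle of the continuum.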
Data Preparation
First of all, data preparation includes the classic processes of extracting,
transforming, and loading (ETL) data and data cleansing. ETL processes
require expert knowledge and are essential as a basis for analysis. Once data is
identified as relevant, a human (for example, the team responsible for the data
warehouse or data lake section “When to Use Which Algorithm?”) extracts data
from primary sources and transforms it to support the decision objective [15].
For example, a customer-centric decision may require that records from different
sources, such as an operational transaction system and social media customer
complaints, be merged and linked by a customer identifier, such as a zip code.
Source systems may be incomplete, inaccurate, or difficult to access, so data
must be cleansed to ensure data integrity. Data may need to be transformed to be
useful for analysis, such as creating new fields to describe customer lifetime
value (contribution margin a customer brings to the company throughout the
relationship). The data can be loaded into a traditional data warehouse, a data
lake, or more specifically into Hadoop clusters (see section “When to Use Which
Algorithm?”). With a data warehouse, loading can be done in a variety of ways,
either sequentially or in parallel, through tasks such as overwriting existing data
or updating it hourly or weekly.
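A minimal ETL sketch of the steps just described, deriving a hypothetical customer-lifetime-value field; the source data, field names, and target structure are all invented:

```python
import csv
import io

# Extract: read rows from a source that stands in for an operational system.
source = io.StringIO("customer_id,year,margin\n42,2021,120.0\n42,2022,95.5\n")
rows = list(csv.DictReader(source))

# Transform: derive a hypothetical customer-lifetime-value field by
# aggregating contribution margin per customer over the relationship.
clv = {}
for row in rows:
    cid = row["customer_id"]
    clv[cid] = clv.get(cid, 0.0) + float(row["margin"])

# Load: write into a target list that stands in for the data warehouse.
warehouse = [{"customer_id": cid, "lifetime_value": v} for cid, v in clv.items()]
print(warehouse)
```

In practice the transform step also covers cleansing and the merging of records across sources, as in the customer-identifier example above.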
For the use of a data lake, the ETL processes are not required; rather, the data
can be loaded directly into the application (data loading).
Lastly, depending on the deployment scenario, streaming data is also
processed. This is data that is continuously generated by thousands of data
sources that typically send in the data sets simultaneously and in small sizes
(order of kilobytes). Streaming data includes a variety of data, such as log files
created by customers using their mobile or web applications, e-commerce
purchases, in-game player activity, information from social networks, financial
trading floors, or geospatial services, and telemetry from connected devices or
instruments in data centers.
This data must be processed sequentially and incrementally on a record-by-
record basis or over rolling time windows and used for a variety of analyses,
including correlations, aggregations, filtering, and sampling. The information
derived from such analytics provides companies with insight into many aspects
of their business and customer activities, such as service usage (for measurement
and billing purposes), server activity, website clicks, and geolocation of devices,
people, and physical assets, and enables them to respond quickly to new
situations. For example, companies can track changes in public opinion about
their brands and products by continuously analyzing social media streams and
reacting in a timely manner when needed. Streaming data processing requires
two layers: a storage layer and a processing layer. The storage layer must support
record ordering and strong consistency to enable fast, low-cost, and replayable
reading and writing of large data streams. The processing layer is responsible for
consuming data from the storage layer, performing computations on that data,
and then informing the storage layer to delete data that is no longer needed.
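The record-by-record processing over a rolling window described above can be sketched as follows; the window size and sensor readings are invented:

```python
from collections import deque

# Rolling-window aggregation over a stream, consumed record by record.
window = deque(maxlen=5)   # only the most recent records are kept

def process(value):
    """Consume one record and return the rolling mean over the window."""
    window.append(value)
    return sum(window) / len(window)

stream = [3, 5, 4, 6, 7, 50]          # the last reading spikes
means = [process(v) for v in stream]
print(means[-1])
```

The bounded deque plays the role of the storage layer (old records are dropped automatically), while process() is a toy version of the processing layer computing an incremental aggregation.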
Data Storage
Traditionally, data is loaded into a data store that is subject-oriented (modeled by
business concepts), integrated (standardized), time-variant (allows new
versions), and non-volatile (unchanged and preserved over time) [19]. Therefore,
data loading requires an established data dictionary4 and data warehouse, which
serves as a repository for verified data that the business uses for analysis. Data
related to specific applications or departments can be grouped into a data mart5
for ease of access or to restrict access. Moving and processing extremely large
amounts of data as a monolithic dataset on a singular piece of hardware (server)
is possible with today’s technology up to a certain size but it is not very
economical or practical. Therefore, storing and analyzing big data require
dividing the processing among networked computers that can communicate with
each other and coordinate actions in a distributed manner. Hadoop is an open-
source framework that enables such distributed processing of data across small
to large clusters. In this context, Hadoop is not an ETL tool but it supports ETL
processes that run in parallel with and complement the data warehouse [20]. The
results of the Hadoop cluster can be forwarded to the data warehouse or the
analyses can be run directly on the clusters.
Depending on the requirements, a company needs both a data warehouse and
a data lake, as the two concepts fulfill different requirements and use cases. For
example, a data warehouse is, first of all, just a database that is optimized for analyzing
relational data from transaction systems and business applications. The data
structure and schema are defined in advance to optimize it for fast SQL queries,
with the results typically used for operational reporting and analysis. Data is
cleansed, enriched, and transformed so that it can act as a “single source of
truth” that users can trust. A data lake differs from this because it stores
relational data from line-of-business applications and non-relational data
from mobile applications, IoT devices, and social media. The structure of the
data or schema is not defined when the data is collected. This means that all data
can be stored without any prior design or cleansing. The idea here is that data is
simply stored as it arrives, and it is initially secondary which questions the data
might be needed to answer in the future - whereas with a data warehouse and
ETL processes, this usually must be known in advance. Different types of
analyses on the data are possible: SQL queries, big data analytics, full-text
search, real-time analyses, and machine learning methods can all be used on a
data lake. For a more in-depth technical description of the two concepts, see
section “When to Use Which Algorithm?”.
Analysis
The analysis encompasses a wide range of activities that can occur at different
stages of data management and use [21]. Querying data is often the first step in
an analytics process and is a predefined and often routine call to data storage for
a specific piece of information; in contrast, ad hoc querying is unplanned and
used when data is needed. Descriptive analytics is a class of tools and statistics
used to describe data in summary form. For example, analysts may report on the
number of occurrences of various metrics such as the number of clicks or
number of people in certain age groups, or use summary statistics like means and
standard deviations to characterize data. Descriptive analytics can use
exploratory methods to try to understand data; for example, clustering can
identify affinity groups. Exploratory analytics is often helpful in identifying a
potential data element of interest for future studies or in selecting variables to
include in an analysis. Predictive analytics refers to a group of methods that use
historical data to predict the future of a particular target variable. Some of the
most popular predictive methods are regression and neural networks.
Prescriptive analytics is an emerging field that has received more attention with
the advent of big data, as more future states and a wider variety of data types can
be examined than in the past. This analysis attempts to explore different courses
of action to find the optimal one by anticipating the outcome of different
decision options [20].
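As a minimal sketch of the predictive idea - fitting a regression on historical data to predict a future value of the target variable - here is a least-squares fit by hand; the (x, y) pairs are invented:

```python
# Simple least-squares regression on historical (x, y) pairs,
# e.g. marketing spend vs. observed sales. Data invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Prediction for the next, not-yet-observed period.
print(round(intercept + slope * 6, 2))
```

The fitted line summarizes patterns in the past; the prediction for x = 6 rests on the assumption, discussed earlier, that those patterns continue into the future.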
Many of these processes have long been standard in data analytics. What
differs with big data is the greater volume and variety of data being considered,
and possibly the real-time nature of data collection and analysis. For example,
Hadoop can be used to process and even store raw data from supplier websites,
identify fraud-prone patterns and develop a predictive model in a flexible and
interactive way. The predictive model could be developed on Hadoop and then
copied into the data warehouse to find sales activity with the identified pattern. A
fraudulent supplier would then be further investigated and possibly excluded
[22]. As another example, graphical images of sale items could be analyzed to
identify tags that a consumer is most likely to use to search for an item. The
results can lead to improved labels to increase sales.
The “analytics sandbox” presented in the BA.TF is a scalable, development-
oriented platform for data scientists to explore data, combine data from internal
and external sources, develop advanced analytic models, and propose
alternatives without changing the current data state of an enterprise. The sandbox
can be a standalone platform in the Hadoop cluster or a logical partition in the
enterprise data warehouse [23]. Traditional architectures use a schema-on-write
and save-and-process paradigm, where data is first cleaned and prepared, stored,
and only then queried. Complex event processing is proactive in-process
monitoring of real-time events that enables organizations to make decisions and
respond quickly to events, such as potential threats or opportunities [24]. The
combination of real-time event processing, data warehousing, data lakes, data
marts, Hadoop clusters, and sandboxing provides a data analytics and storage
infrastructure that supports a stable environment while enabling innovation and
real-time response.
Development Cycle
The first of the two cycles, the development cycle, focuses primarily on building
the model and deriving it from historical data.
Business Understanding
Every BA project needs business objectives and domain experts who understand
decisions, KPIs, estimates, or reports that are of value to a business and define
the objectives of the project from a business perspective. The analysts involved
sometimes have this expertise themselves if they are hired in-house, but domain
experts usually have a better perspective on what aspects are important in the
day-to-day business and how the project’s results (should) impact the business.
Without this expertise, the goals that are set, the definitions, the choice of which
models to build, and the way the models and their results are evaluated can lead
to failed projects that do not address the most important business issues. One way to
understand the collaboration that leads to success in BA projects is to imagine a
three-legged stool. Each leg is critical to ensuring that the stool remains stable
and serves its purpose.
In BA projects, the three basic pillars are indispensable: (1) domain experts,
(2) data or database experts, and (3) modeling experts (data scientists). The
domain experts are required to formulate a problem comprehensively and in
such a way that the problem is useful to the business. The data or database
experts are needed to determine what data is available for modeling, how to
collect more, qualitatively clean up the existing, and how to access that data. The
modelers or data scientists are required to build the necessary models on this
data and the defined questions.
If one or more of these three cornerstones are missing, then the problem
cannot be defined properly or a purely technical view of the problem is taken
(for example, if only modelers and the database administrator define the
problems). Then it might be that excellent models are created with fantastic
accuracy on the latest and hottest algorithms, but they cannot be used because
the model does not meet the real needs of the business. Or in a more subtle way:
maybe the model does support the right kind of decision but the models are
scored in such a way that they do not address very well what is most important
to the business - the wrong model is chosen because the wrong metric is used to
evaluate the models.
On the other hand, if the database expert is not involved, then data problems
may occur. First, there may not be enough understanding of the layout of tables
in the databases to access all the fields needed for the algorithms. Second, there
may be too little understanding of the individual fields and what information they represent, even if the field names seem intuitive; in many companies, the normal state is, frankly, that names have grown cryptically and arbitrarily over the years and no documentation exists. Third,
insufficient permissions may prevent data from being used. Fourth, the database resources may not support the required types of joins, or queries may exceed the available technical capacity. And fifth, model deployment options envisioned by
the BA team may not be supported by the organization. If data scientists are not
available during this critical first phase, several obstacles may arise. First, project managers may lack an understanding of what the algorithms and models can or should do; driven by the hype around AI, they may set requirements that are impossible to implement. Second, the definition of target
variables for modeling may not be done at all or may be done poorly, hampering
modeling efforts. Third, if the data scientist does not define the layout of the data needed to build the models, a required data source may not be defined at all, or crucial key fields that the models desperately need may be missing.
Data Discovery
In contrast to the CRISP-DM model, the new best practice approach defined
here is to divide the data analysis process, also called data discovery, into two
distinct steps:
Explore
(Data exploration) - After data has been prepared, this data is “explored” to see
which parts of it help to find the answers we are looking for. Initial tests can be run and various hypotheses examined along the way. You can also think of this step as data refinement or data selection. Companies and users can
perform data exploration using a combination of automated and manual
methods. Data scientists often use automated tools such as data visualization
software for data exploration, as these tools let users quickly surface the most relevant features and dependencies within a data set. In this
step, users can identify the variables that seem most likely to yield interesting
observations.
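Automated exploration of this kind can be sketched in a few lines with pandas; the sample data and column names below are purely hypothetical and stand in for whatever prepared data set the project provides:

```python
import pandas as pd

# Hypothetical transaction sample; in practice this would be loaded
# from the data set prepared in the previous phase.
df = pd.DataFrame({
    "basket_value": [12.5, 80.0, 33.0, 95.5, 21.0, 60.0],
    "items":        [2, 9, 4, 11, 3, 7],
    "is_returning": [0, 1, 0, 1, 0, 1],
})

# Quick profile: per-variable distributions (count, mean, quartiles, ...).
profile = df.describe()

# Pairwise correlations point to variables worth exploring further.
corr = df.corr()
print(corr["is_returning"].sort_values(ascending=False))
```

A ranking like this is only a starting point for identifying variables that seem likely to yield interesting observations; it does not replace the manual inspection described above.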
Discovery
(Data discovery) - Once it is known what data is needed, it is possible to “dig
deep” into that data to identify the specific points and variables that provide
answers to the original question. It is a business-user-oriented process for
identifying patterns and outliers through, for example, visual navigation through
data or the application of automated advanced analytics. Discovery is an iterative
process that does not require extensive upfront modeling. It calls for skills in understanding data relationships and data modeling, as well as in applying data analytics and guided advanced analytics.
Data Wrangling
Data wrangling is one of those technical terms that are more or less self-
explanatory: data wrangling is the process of cleansing, structuring, and
enriching raw data into the desired format for better decision-making in less
time. Data wrangling is increasingly ubiquitous among IT organizations and is a
component of all IT initiatives. Data has become more diverse and unstructured,
requiring more time to ingest, cleanse, and structure data for actual analysis. At
the same time, business users have less time to wait for technical resources as
data is required in almost every business decision - especially for analytics-
focused organizations.
The basic idea of data wrangling is that the people who know the data and
the real-world problems behind it best are the ones who investigate and prepare
it. This means that business analysts, industry users, and managers (among
others) are the intended users of data wrangling tools. In comparison, extract-
transform-load (ETL) technologies focus on IT as the end-user. IT staff receive
requests from their business partners and implement pipelines or workflows
using ETL tools to deliver the desired data to the systems in the desired formats.
Pure business users rarely see or use ETL technologies when working with data because they are not intuitive and operate at the level of database technology rather than business operations. Before data wrangling tools were available, these users’
interaction with data was only in spreadsheets or business intelligence tools.
The process of data wrangling involves the following sequence of steps:
Pre-processing, which takes place immediately after data collection.
Standardization of the data into an understandable and usable format.
Cleaning the data of noise and of missing or erroneous elements.
Consolidation of data from different sources or data sets into a unified whole.
Matching of the data against the existing data records.
Filtering of the data through defined settings for subsequent processing.
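Assuming two small, hypothetical data sources, these steps might be sketched with pandas roughly as follows (all names and values are invented for illustration):

```python
import pandas as pd

# Raw data from two hypothetical sources with inconsistent formats.
sales_a = pd.DataFrame({"customer": ["Anna", "BOB", None],
                        "revenue":  ["1,200", "850", "430"]})
sales_b = pd.DataFrame({"customer": ["carla", "Dan"],
                        "revenue":  ["300", "nan"]})

# Standardization: unify name casing and numeric format.
for df in (sales_a, sales_b):
    df["customer"] = df["customer"].str.strip().str.title()
    df["revenue"] = pd.to_numeric(df["revenue"].str.replace(",", ""),
                                  errors="coerce")

# Consolidation: merge both sources into one data set.
sales = pd.concat([sales_a, sales_b], ignore_index=True)

# Cleaning: drop records with missing or erroneous elements.
sales = sales.dropna()

# Filtering: keep only the rows relevant for subsequent processing.
sales = sales[sales["revenue"] > 400]
print(sales)
```

The point of such a pipeline is that a business analyst can read and adapt every step, which is exactly the audience data wrangling tools target.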
Analysis
In the analysis phase, it is important to define what the solution must achieve in
relation to the problem and environment. This includes both the features and the
non-functional attributes (such as performance, usability, etc.). In the analysis
phase, the modeling technique is first selected, then the necessary quality and
validity of the model are determined, and finally, the model is implemented.
Modeling technique - In the first step of the analysis, the actual modeling
technique, algorithms, and approaches to be used must be selected. Although a toolset or tool may already have been selected in the business understanding phase, the specific implementation must be determined now. This is, of course, done with the original problem in mind. After all, whether you use
decision trees or neural networks has a lot to do with the question to be
answered, the available data, and the other framework conditions (which are set
by the two phases beforehand). A detailed list of problem types and the
algorithms that can be used in each case follows in section “Reinforcement
Learning”. Even if several techniques have to be used, the following must always be taken into account:
Modeling technique - selection and definition of the modeling technique to
be used.
Modeling assumptions - Many modeling techniques make specific
assumptions about the data, e.g., that all attributes are uniformly distributed, that no missing values are present, or that class attributes must be symbolic.
Quality assurance - Before the actual model is created, a procedure or mechanism must be defined to test its quality and validity. For example, in supervised algorithms such as
classification, it is common to use error rates as quality measures for the models.
Therefore, when applying these algorithms, one typically divides the dataset into
training and testing datasets (also called learning and validation datasets). In this
process, the model is built on the training data and then the quality is estimated
on the separate test set. It is also important at this stage to define the intended
plan for training, testing, and evaluating the models. An essential part of the plan
is to determine how the available data set will be divided into training, testing,
and validation data sets.
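A minimal sketch of such a splitting plan, assuming a hypothetical labeled data set and an invented 60/20/20 split, might look like this:

```python
import random

# Hypothetical labeled data set: (feature vector, label) pairs.
data = [([i, i % 5], i % 2) for i in range(100)]

# Shuffle before splitting so each partition is representative.
random.seed(42)
random.shuffle(data)

# 60/20/20 split into training, test, and validation sets,
# as defined in the evaluation plan.
n = len(data)
train = data[: int(0.6 * n)]
test = data[int(0.6 * n): int(0.8 * n)]
validation = data[int(0.8 * n):]
print(len(train), len(test), len(validation))
```

The ratios are a common convention, not a rule; the plan should fix them explicitly before any model is built.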
Create models - The modeling tool (section “When to Use Which
Algorithm?”) is then applied to the prepared data set to create one or more
models. This involves defining the necessary parameter settings. With most
modeling tools, there are a large number of parameters that can be adjusted. In
this process, the parameters and the selected values should be documented along
with a rationale for the choice of parameter settings. Thus, the result of this
substep includes both the model and the model description and documentation:
Models - These are the actual models created with the modeling tool.
Model descriptions - description of the resulting models, report on the
interpretation of the models, and documentation of any difficulties and
assumptions made in the creation process.
Validation
The models that have now been created must be interpreted according to the
domain knowledge (preferably with the involvement of the business users in
workshops), based on the previously defined success criteria and the test design.
First, the results of the models created must be assessed technically (in terms of
the quality criteria of the algorithms), and then the results must be discussed with
the business analysts and domain experts in the business context. However, the
assessment should not only measure the technical quality criteria but also take
into account the business goals and business success criteria as much as possible.
In most projects, a single technique is applied more than once and results are
produced using several different techniques or in several steps. The procedure of
technical evaluation (in the sense of mathematical quality assessment) can be
divided into two steps:
Model evaluation - A summary of the results of this step is provided, listing
the goodness and quality of each generated model (e.g. in terms of accuracy).
Then, the models are selected with respect to the quality criteria.
Revised parameter settings - Depending on the model evaluation, the
parameter settings must be adjusted again and tuned for the next modeling
run. In doing so, model creation and evaluation should be iterated until the
best model is found. It is important to document all changes, revisions, and
evaluations.
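The iterate-document-select loop described above can be sketched as follows, using a deliberately simple one-parameter model (a decision threshold) and invented scores:

```python
# Hypothetical example: a one-parameter model (a decision threshold)
# is evaluated over several settings; every run is documented and the
# best model is selected by accuracy.
samples = [(0.2, 0), (0.4, 0), (0.45, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

evaluation_log = []
for threshold in (0.3, 0.5, 0.8):
    predictions = [1 if score >= threshold else 0 for score, _ in samples]
    accuracy = sum(p == label
                   for p, (_, label) in zip(predictions, samples)) / len(samples)
    # Document every revised parameter setting with its evaluation.
    evaluation_log.append({"threshold": threshold, "accuracy": accuracy})

best = max(evaluation_log, key=lambda run: run["accuracy"])
print(best)
```

Keeping the full `evaluation_log`, not just the winner, is what makes the iteration auditable later.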
The previous step of this phase addressed factors such as the accuracy and
generality of the model. The second step assesses the extent to which the model
achieves the business objectives and whether there is a business reason why this
model is sufficient or perhaps inadequate. Another option is to test the model(s)
on trial applications in real-world use - if time and budget constraints permit.
The evaluation phase also includes assessing any other deliverables that have
been generated.
Deployment Cycle
The second of the two cycles, the deployment cycle, focuses primarily on the use
and productive exploitation of the previously created model and its application
to actual data. Most data scientists have little or no awareness of this other half
of the problem. Many companies are struggling with BA and AI and the various
pitfalls as shown by various studies [30, 31]. According to analysts, it takes
about 2 months to take a single predictive model from creation to production.
But why is it so difficult to scale BA and AI projects in an organization?
The productive implementation and maintenance of these projects is no easy
task and most data scientists do not see it as their job. Yet the crucial questions
are essential for the success of the project:
How do I integrate the model or project into the existing system landscape?
How can the model be easily deployed so that it can be consumed by other
applications in a scalable and secure manner?
How can the quality of the models be monitored and a new version released if needed?
How can the handover of artifacts from the data scientist to IT operations be
managed without friction? Is this separation even necessary?
Publish
To determine where to productively run the analytics model created, the
following considerations should be taken into account:
Scope - Ultimately, it is the derived information and decisions (not the raw data), and how they are acted upon, that determine what types of analytics are used and where. For example, if the goal is to optimize machine availability at
one location, analysis of the data collected there may be sufficient. In this case,
the analysis can be performed anywhere, provided that normal local operations
are not critically dependent on network latency and the availability of the
analysis results. On the other hand, if the value proposition is to optimize production across sites, which requires comparing factory efficiency, then the data collected from those sites must be analyzed at a higher level of the system architecture.
Response time and reliability - In an industrial environment, some
problems require real-time analytics, calculations, and decisions, and others can
be performed after the fact. The former almost always require analytics to be
local for reliability and performance.
Bandwidth - The total amount of data generated (from sensors, for
example), along with the data captured by the control or transaction systems, can
be enormous in many cases. This must be factored into the overall increase in
network and infrastructure utilization, depending on the location of the
deployment.
Capacity - In some cases, it may be optimal to perform analytics at a
particular level in a system architecture but the existing infrastructure may not be
able to support it, so another level is selected.
Security - The value from data transfer must be balanced with concerns
about transferring raw data outside of controlled areas and the associated costs.
It may be more efficient to perform some analysis locally and share necessary
summaries, redacted, or anonymized information with other areas. In the vast
majority of cases, this discussion leads to a decision between a local or cloud-
based location of deployment. The important thing here is to do an honest
assessment (Are your on-premises admins really as good as the security experts
at Amazon AWS or Google?).
Compliance - To illustrate how compliance can impact analytics as a design consideration, consider national security. In industries such as aerospace and defense, national security concerns and government regulations can constrain architectural decisions about data management and sharing. This affects where analytics must be deployed to meet regulatory requirements - for example, it may prevent large-scale computations from being moved to a public cloud facility to reduce costs.
Platform - When it comes to deployment, a Platform as a Service (PaaS) or
Infrastructure as a Service (IaaS) must be chosen. A PaaS may be suitable for
prototyping and businesses with lower requirements. As the business grows
and/or higher demands are made, IaaS is probably the better way to go. This
requires handling more complexity but allows scaling to be implemented much
better (and probably cheaper). There are many solutions available from the big hyperscalers (AWS, Google, Microsoft) as well as from many niche providers. An
overview of this is also provided in section “Business Analytics and Machine
Learning as a Service (Cloud Platforms)”.
Analytic Deployment
Deployment, i.e. the transfer of the application from development to productive operation, is also referred to as “DevOps”. DevOps is a portmanteau of the terms development and IT operations and refers to a methodology concerned with the interaction between development (the first cycle) and operations.
Development needs as much change as possible to meet the needs of changing
times, while change is “the enemy” for operations. Operations require stability
and thus any change is met with strong resistance. There are many basic DevOps
practices. Some of them are:
Infrastructure as Code (IaC) - IaC is the practice of using the techniques,
processes, and toolsets used in software development to manage the deployment
and configuration of systems, applications, and middleware. Most testing and
deployment failures occur when the developer environments are different from
the test and production environments. Version control of these environments provides immediate benefits in terms of consistency, time savings, error rates, and auditability.
Continuous integration (CI) - Under this practice, working copies of all developers' code are regularly merged into a shared mainline.
Automated testing - the practice of automatically running various tests, such as load, functional, integration, and unit tests, either after code has been checked in (i.e., attached to CI) or by otherwise automatically triggering one or more tests against a specific build or application.
In practice, analytic deployment can be roughly divided into two procedures (see
Fig. 1.4), as briefly outlined below.
Fig. 1.4 The analytical deployment architecture with Docker Swarm or Kubernetes
Application Integration
Application integration is often a difficult process, especially when integrating
existing legacy applications with new applications or web services. Given the
vast scope of this topic, one could literally write a book about successful
implementation. However, some of the basic requirements are always the same:
Adequate connectivity between platforms.
Business rules and data transformation logic.
The longevity of business processes.
The flexibility of business processes.
Flexibility in hardware, software, and business goals.
To meet these requirements, the application environment should have a common
interface for open communication, including the ability of the system to request
web services and to be compatible when interfacing with other platforms and
applications. The use of common software platforms (see the last chapter in this
book) enables this open communication through interfaces (APIs) that underlie
the paradigm of these platforms. Especially in the area of business analytics,
there will rarely be a homogeneous platform for critical business applications
(ERP system) and data analytics and machine learning at the same time. Rather,
integration of both platforms in both directions (i.e., data access and data
writing) will be necessary.
Test
Testing software describes methods for evaluating the functionality of a software
program. There are many different types of software testing but the two main
categories are dynamic testing and static testing. Dynamic testing is an
evaluation that is performed while the program is running. In contrast, static
testing is a review of the program code and associated documentation. Dynamic
and static methods are often used together.
In theory, testing software is a fairly simple activity. For every input, there
should be a defined and known output. Values are entered, selections or
navigations are made, and the actual result is compared to the expected result. If
they match, the test passes. If not, there may be an error. The point here is that
until now, you always knew in advance what the expected output should be.
But this book is about a kind of software where a defined output is not
always given. Importantly, in both the machine learning and analytics
application examples, acceptance criteria are not expressed in terms of an error
number, type, or severity. In fact, in most cases, they are expressed in terms of
the statistical probability of being within a certain range.
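Such a statistical acceptance criterion can be expressed directly as a test. The model, its error rate, and the acceptance range below are all hypothetical; the point is only that the assertion checks a probability range rather than an exact expected output:

```python
import random

# Hypothetical model under test: predicts a binary label correctly
# about 90 % of the time.
random.seed(0)

def predict(x):
    return x if random.random() < 0.9 else 1 - x

# Classical test: one input, one known expected output -- not applicable.
# Statistical acceptance test: accuracy must lie within an agreed range.
inputs = [random.randint(0, 1) for _ in range(1000)]
trials = [(x, predict(x)) for x in inputs]
accuracy = sum(x == y for x, y in trials) / len(trials)

assert 0.85 <= accuracy <= 0.95, f"accuracy {accuracy} outside acceptance range"
print(f"accepted: accuracy = {accuracy:.3f}")
```

In practice the acceptance range and the number of trials would be agreed with the business side, just like any other success criterion.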
Production/Operations
“Operate” covers the maintenance tasks and checkpoints after the rollout that
enable successful operation of the solution. This involves monitoring and controlling the applications (e.g., response times) and the hardware (e.g., server failures). This step is no different from the usual operation of other
software.
The main objective of operations is to ensure that IT services are delivered
effectively and efficiently while maintaining the highest quality of service. This
includes fulfilling user requests, troubleshooting service errors, resolving issues,
and performing routine tasks. Some other objectives of this phase are listed
below:
Minimizing the impact of service outages on daily business activities
Ensuring that agreed IT services are accessed only by authorized personnel and applications
Reducing incidents and problems
Supporting users in the use of the service itself
Continuous Improvement
Continuous service improvement is a method of identifying and executing
opportunities to improve IT processes and services and objectively measuring
the impact of these efforts over time. The idea here also stems from lean
manufacturing or “The Toyota Way”. It was developed in manufacturing and
industry to reduce errors, eliminate waste, increase productivity, optimize
employee engagement, and stimulate innovation. The basic concept of
continuous service improvement is rooted in the quality philosophies of
twentieth-century business consultant and expert W. Edwards Deming. The so-
called “Deming Circle” consists of a four-step cycle of plan, do, check, and act.
This circle is executed repeatedly to achieve continuous process or service
improvement.
References
1. Bracht, U.: Digitale Fabrik: Methoden und Praxisbeispiele, 2. Aufl., VDI-Buch, Geckler, D., Wenzel, S.
(Hrsg.). Springer, Berlin/Heidelberg (2018)
2. Steven, M.: Industrie 4.0: Grundlagen – Teilbereiche – Perspektiven, 1. Ed., Moderne Produktion
(Hrsg.). Kohlhammer, Stuttgart (2019)
3. Schneider, M.: Lean factory design: Gestaltungsprinzipien für die perfekte Produktion und Logistik.
Hanser, München (2016)
4. Laney, D.: 3D data management: controlling data volume, velocity and variety. META Group Res.
Note. 6(70) (2001)
6. Barbier, G., Liu, H.: Data mining in social media. In: Social Network Data Analytics, pp. 327–352.
Springer, New York (2011)
8. Amirian, P., Lang, T., van Loggerenberg, F.: Big Data in Healthcare: Extracting Knowledge from Point-
of-Care Machines. Springer, Cham (2017)
9. Gorry, G.A., Scott Morton, M.S.: A framework for management information systems. Sloan Manag.
Rev. 13, 55–70 (1971)
10. Sprague Jr., R.H.: A framework for the development of decision support systems. MIS Q. 4, 1–26
(1980)
11. Zachman, J.A.: A framework for information systems architecture. IBM Syst. J. 26(3), 276–292 (1987)
12. Sowa, J.F., Zachman, J.A.: Extending and formalizing the framework for information systems
architecture. IBM Syst. J. 31(3), 590–616 (1992)
13. Watson, H.J., Rainer Jr., R.K., Koh, C.E.: Executive information systems: a framework for
development and a survey of current practices. MIS Q. 15, 13–30 (1991)
14. Hosack, B., et al.: A look toward the future: decision support systems research is alive and well. J.
Assoc. Inf. Syst. 13(5), 315 (2012)
15. Watson, H.J., Wixom, B.H.: The current state of business intelligence. Computer. 40(9), 96 (2007)
17. IIC: The Industrial Internet of Things Volume T3: Analytics Framework, Bd. 3. IIC, Needham, MA
(2017)
18. Russom, P.: Big data analytics. TDWI Best Practices Report. Fourth Quarter. 19(4), 1–34 (2011)
19. Watson, H., Wixom, B.: The current state of business intelligence. Computer. 40, 96–99 (2007)
20. Watson, H.J.: Tutorial: big data analytics: concepts, technologies, and applications. Commun. Assoc.
Inf. Syst. 34(1), 65 (2014)
21. Kulkarni, R., S.I. Inc.: Transforming the data deluge into data-driven insights: analytics that drive
business. In: Keynote Speech Presented at the 44th Annual Decision Sciences Institute Meeting,
Baltimore (2013)
22. Awadallah, A., Graham, D.: Hadoop and the Data Warehouse: when to Use which. Copublished by
Cloudera, Inc. and Teradata Corporation, California (2011)
23. Phillips-Wren, G.E., et al.: Business analytics in the context of big data: a roadmap for research. CAIS.
37, 23 (2015)
24. Chandy, K., Schulte, W.: Event Processing: Designing IT Systems for Agile Companies. McGraw-Hill,
Inc, New York (2009)
25. Watson, H.J.: Business intelligence: past, present and future, S. 153. AMCIS 2009 Proceedings, Cancun
(2009)
26. Ballard, C., et al.: Information Governance Principles and Practices for a Big Data Landscape IBM
Redbooks. International Business Machines Corporation, New York (2014)
27. Ponsard, C., Touzani, M., Majchrowski, A.: Combining process guidance and industrial feedback for
successfully deploying big data projects. Open J. Big Data. 3, 26–41 (2017)
28. Aggarwal, C.C.: Data Mining: the Textbook. Springer, Cham (2015)
29. Angée, S., et al.: Towards an Improved ASUM-DM Process Methodology for Cross-Disciplinary
Multi-Organization Big Data & Analytics Projects. Springer, Cham (2018)
30. Veeramachaneni, K.: Why you’re not getting value from your data science. https://hbr.org/2016/12/
why-youre-not-getting-value-from-your-data-science (2016). Accessed on 12 May 2017
31. McKinsey: Global AI survey: AI proves its worth, but few scale impact. https://www.mckinsey.com/
featured-insights/artificial-intelligence/global-ai-survey-ai-proves-its-worth-but-few-scale-impact
(2019). Accessed on 21 Dec 2019
Footnotes
1 In the sense of a digital representation of reality (also known as “Digital Twin”).
3 The Industrial Internet Consortium (IIC) is an open membership organization with more than 250
members. The IIC says it was founded to accelerate the development, adoption and widespread use of
interconnected machines and devices and smart analytics. Founded in March 2014 by AT&T, Cisco,
General Electric, IBM and Intel, the IIC catalyzes and coordinates industrial internet priorities and enabling
technologies with a focus on the Internet of Things.
4 A data dictionary contains metadata, that is, information about the database. The data dictionary is very
important because it contains, for example, information about what is in the database, who is allowed to
access it, and where the database is physically located. Users of the database usually do not interact directly with the data dictionary; it is managed by the database administrators, customized by the developers of the applications that use it, and used, in the context of this book, by the analysts and data scientists.
5 A data mart is a subset of data focused on a single functional area of an organization and stored in a data warehouse or other data store. It is drawn from the full set of available data and tailored for use by a specific department, unit, or group of users in an organization (for example, marketing, sales, human resources, or finance).
6 It should be noted that data mining now also takes unstructured data into account.
© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023
F. Weber, Artificial Intelligence for Business Analytics
https://doi.org/10.1007/978-3-658-37599-7_2
2. Artificial Intelligence
Felix Weber1
(1) Chair of Business Informatics and Integrated Information Systems,
University of Duisburg-Essen, Essen, Germany
Machine Learning
Machine learning is an essential part of AI and is so popular that it is sometimes
confused with artificial intelligence (at least the two terms are often used
interchangeably).
The algorithms used in machine learning can be broadly divided into three
categories (see Fig. 2.2 and the comparison in section “Unsupervised
Learning”): supervised learning, unsupervised learning, and reinforcement
learning. Supervised learning involves feedback to indicate when a prediction is
correct or incorrect, while unsupervised learning involves no response: the
algorithm simply attempts to categorize data based on its hidden structure.
Reinforcement learning is similar to supervised learning in that it receives
feedback but not necessarily for every input or state. Below, we will explore the
ideas behind the learning models and present some key algorithms used for each
of them. Machine learning algorithms are constantly changing and evolving. However, in most cases, they tend to fall into one of the three learning models. These models adapt automatically in some way to improve their operation or behavior. In supervised learning, a dataset contains its
desired outputs (or labels) so that a function can compute an error for a given
prediction. Supervision occurs when a prediction is made and an error (actual vs.
desired) is generated to change the function and learn the mapping. In
unsupervised learning, the data set does not contain the desired output, so there
is no way to monitor the function. Instead, the function attempts to segment the
dataset into “classes” so that each class contains a portion of the dataset with
common features. Finally, in reinforcement learning, the algorithm attempts to
learn actions for a given set of states that lead to a target state. An error is not
issued after each example (as in supervised learning) but rather when a
reinforcement signal is received (e.g., the target state is reached). This behavior
is similar to human learning, where feedback is not necessarily given for all
actions but only when a reward is warranted.
Supervised Learning
Supervised learning is the simplest of the learning models.4 Learning in the
supervised model involves creating a function that is trained using a training data
set and can then be applied to new data. Here, the training dataset contains
labeled records (labels) so that the mapping to the desired result given the set
input is known in advance. The goal is to build the function so that it generalizes beyond the initial data, i.e., assigns unknown data to the correct result.
In the first phase, one divides a data set into two types of samples: training
data and test data. Both training and test data contain a test vector (the inputs)
and one or more known desired output values. The mapping function learns with
the training dataset until it reaches a certain level of performance (a metric of
how accurately the mapping function maps the training data to the associated
desired output). In supervised learning, this is done for each training sample by
using this error (actual vs. desired output) to adjust the mapping function. In the
next phase, the trained mapping function is tested against the test data. The test
data represents data that has not been used for training and whose mapping
(desired output) is known. This makes it very easy to determine a good measure
of how well the mapping function can generalize to new and unknown data [3].
To tackle a given problem with supervised learning in a generic way, several steps need to be taken [3]:
1.
Identify the type of training examples. For handwriting analysis, for example, a single handwritten character, a word, or a complete sentence can serve as one example.
2.
The second step is to assemble a training set that represents the practical application of the function. A set of input objects and corresponding results must therefore be determined, either from measurements or from experts. In the handwriting example, this means transferring the image, as a matrix of black (where there is writing) and white (where there is no writing) fields, into a mathematical vector.
3.
The third step is to determine how an input object (here, a character) is represented for the learned function. The accuracy of the learned function depends heavily on this representation. Typically, the input object is converted into a feature vector consisting of multiple features describing the object. The total number of features should be kept small because of the curse of dimensionality5 but must contain enough information to accurately predict the output.
4.
The fourth step is to identify the structure of the learned function along with
the corresponding learning algorithm.
5.
The learning algorithm is now executed on the assembled training set. Some supervised learning algorithms require the user to tune control parameters by optimizing performance on a subset called the validation set.
6. The final step is to assess the accuracy of the learned function. After the
processes of learning and parameter setting, the performance of the function
must be measured on the test set, which is different from the original
training set.
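The six steps can be sketched end to end with a deliberately simple model: a nearest-centroid classifier on invented two-dimensional feature vectors. This is an illustrative choice, not a method prescribed by the text:

```python
import random

# Steps 1-3: assemble a labeled training set of feature vectors
# (hypothetical 2-D features, as if already extracted from raw inputs).
random.seed(1)

def make_sample(label):
    center = (0.0, 0.0) if label == 0 else (5.0, 5.0)
    return ([center[0] + random.gauss(0, 1),
             center[1] + random.gauss(0, 1)], label)

data = [make_sample(i % 2) for i in range(200)]
train, test = data[:150], data[150:]

# Step 4: structure of the learned function -- here a nearest-centroid
# classifier, one of the simplest supervised models.
def fit(samples):
    centroids = {}
    for label in (0, 1):
        points = [x for x, y in samples if y == label]
        centroids[label] = [sum(c) / len(points) for c in zip(*points)]
    return centroids

def predict(centroids, x):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Step 5: run the learning algorithm on the training set.
model = fit(train)

# Step 6: assess accuracy on the held-out test set.
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The two classes here are well separated by construction, so the held-out accuracy is high; on real data, step 6 is where generalization problems first become visible.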
Unsupervised Learning
Unsupervised learning is also a relatively simple learning model.6 However, as the name implies, it lacks a supervising instance, and there is no direct way to measure the quality of the results. The goal is to build a mapping function that categorizes
the data into classes based on features hidden in the data.
As in supervised learning, one uses two phases in unsupervised learning. In
the first phase, the mapping function segments a dataset into classes. Each input
vector becomes part of a class but the algorithm cannot assign labels to these
classes.
This data is not labeled (unlike supervised learning where the labeling is
done by the user in advance), which shows that the input variables (X) do not
have equivalent output variables. Here, the algorithms have to identify the data
structures themselves [7]. Unsupervised learning can again be simplified into
two different categories of algorithms:
Clustering (section “Types of Problems in Artificial Intelligence and Their
Algorithms”): The problem to be solved with these algorithms occurs when
trying to identify the integral groupings of the data, such as grouping
customers based on their buying behavior.
Association (section “Types of Problems in Artificial Intelligence and Their
Algorithms”): The problem to be solved with these algorithms arises when
trying to find rules that describe a large part of the available data, such as
people who tend to buy both product X and product Y.
Reinforcement Learning
Reinforcement learning is a learning model with the ability to learn not only how
to map an input to output but also how to map a set of inputs to outputs with
dependencies (e.g., Markov decision processes).7 Reinforcement learning exists
in the context of states in an environment and the possible actions in a given
state. During the learning process, the algorithm randomly explores the state-
action pairs within an environment (to build a state-action pair table), then in
practice exploits the rewards of the state-action pairs to select the best action for
a given state that leads to a target state.
In this context, reinforcement learning is mostly implemented by (partially)
autonomous software programs, so-called agents. These agents interact with the
environment through discrete time steps. The agent’s ultimate goal is to
maximize rewards. At a given time t, the agent receives an observation, which
typically includes the reward of the preceding transition [10]. An action is now
selected from the available set of actions and sent to the environment. The
environment thereby moves to a new state, and the reward associated with this
transition is determined. The
reinforcement can be either positive or negative. It is the occurrence of an event
resulting from a particular behavior that increases the frequency and strength of
the behavior. In this context, an optimally acting agent must be able to consider
the long-term effects of its actions, even if the immediate reward is negative
[11]. Therefore, reinforcement learning is suitable for topics such as short- and
long-term reward trade-offs. The use of functional approximations in larger
settings and the use of examples to optimize performance are the key elements
that enhance reinforcement learning. Reinforcement learning is typically used
in situations where a model of the environment is known but no analytical
solution is available, where only a simulation model of the environment is
given, or where the only way to collect information about the environment is to
interact with it [12]. The first two situations can be classified as planning
problems, while the last one is a true learning problem.
To create intelligent programs (the agents), reinforcement learning generally
goes through the following steps:
1.
The input state is monitored by the agent.
2.
The decision function is used to make the agent perform an action.
3.
After performing the action, the agent receives a reward or reinforcement
(positive or negative) from the environment.
4.
The information (state action) about the reward is stored.
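As a sketch, the four steps can be implemented as tabular Q-learning on a tiny, hypothetical corridor environment. The environment, the reward scheme, and all parameter values are invented for illustration; the book does not prescribe this particular setup.

```python
import random

random.seed(1)
N_STATES, GOAL = 5, 4           # a tiny 1-D corridor; reaching state 4 pays +1
ACTIONS = (-1, +1)              # move left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(200):            # 200 training episodes
    s = 0
    while s != GOAL:
        # steps 1/2: observe the state, pick an action (epsilon-greedy decision)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0   # step 3: reward from the environment
        # step 4: store the (state, action) information in the Q-table
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

policy = [max(ACTIONS, key=lambda act: Q[(st, act)]) for st in range(N_STATES - 1)]
print(policy)  # the learned policy should move right, toward the goal
```

Early episodes explore randomly to fill the state-action table; later the stored rewards are exploited, which is the explore/exploit pattern described above.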
The three types of machine learning are each used in different situations and
each involves different algorithms. Selected problems and algorithms can be
found in section “Reinforcement Learning”.
Neural Networks
Since neural networks are the basis of most innovations in the field of artificial
intelligence and machine learning (self-driving cars, chatbots like Siri, etc.), they
are explained in a bit more detail below.
Neural networks are a set of algorithms loosely designed after the human
brain and fundamentally designed to recognize patterns. They interpret sensory
data through a form of machine perception, labeling, or clustering of raw data.
The patterns they recognize are numerical, contained in vectors into which all
real-world data, be it images, sound, text, or time series, must be translated.
The basic building blocks of neural networks are neurons. These form the
smallest basic unit of a neural network. A neuron takes input, computes with it,
and generates an output. This is what a 2-input neuron looks like (see Fig. 2.3).
Fig. 2.3 Representation of a neuron and the underlying mathematical processes
The activation function is used to turn an unbounded input into an output that
has a predictable shape. A commonly used activation function is the sigmoid
function, see Fig. 2.4.
Fig. 2.4 Sigmoid function
The sigmoid function only returns numbers in the range (0,1). You can
think of this as compression: (−∞,+∞) is squeezed into (0,1) - large negative
numbers map close to 0 and large positive numbers close to 1.
If the activation function now yields a value close to 1, the neuron is
considered to be activated and “fires”, i.e. it passes on its value. This is because a neural network
is nothing more than a set of neurons that are connected to each other. This is
what a simple neural network might look like, Fig. 2.5:
This network has two inputs, a hidden layer with two neurons (h1 and h2)
and an output layer with one neuron (o1). A hidden layer is any layer between
the input (first) layer and the output (last) layer. In larger practical networks,
there can be dozens or even hundreds of hidden layers!
Crucially, the inputs to o1 are the outputs of h1 and h2 - it is precisely this
connection that turns individual neurons into a neural network.
The neural network is now used in two ways. During learning (training) or
normal use (after training has taken place), patterns of information are fed into
the network via the input layer, triggering the layers of hidden units, which in
turn reach the output units. This interconnected design is called a feedforward
network.
Each neuron receives input signals from the neurons on the left (figuratively)
and the inputs are multiplied by the weights of the connections. Each unit sums
all the inputs received in this way (for the simplest type of network) and, when
the sum exceeds a certain threshold (value of the activation function), the neuron
“fires” and triggers the following neurons (on the right).
For a neural network to learn, some kind of feedback must be involved.
Figuratively speaking, neural networks learn in much the same way as small
children do: by being told what they did right or wrong. This feedback is
typically provided by a process called backpropagation. In this process, the
output of the network is compared to
the output it should produce for a correct result. The discrepancy (difference)
between the two states is used to change the weights (w) of the connections
between the units in the network, working from the output units to the hidden
units to the input units-that is, backward (hence the word backpropagation).
Over time, backpropagation causes the network to adapt (learn) and reduce the
difference between actual and intended output to the point where the two match
exactly, so that the network computes things exactly as expected.
Once the network has been trained with enough learning examples, it reaches
a point where it can be used with a completely new set of inputs. This is because
the neural network now allows the generalization of the results learned from the
learning phase and applies to new situations (data).
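The described 2-2-1 network (two inputs, hidden neurons h1 and h2, output o1) with sigmoid activation and backpropagation can be sketched in plain Python. The training task (a logical AND), the learning rate, and the random initial weights are illustrative assumptions, not taken from the book.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# weights for the 2-2-1 network from the text: inputs -> h1, h2 -> o1
w = [random.uniform(-1, 1) for _ in range(6)]  # w0..w3 hidden, w4..w5 output
b = [0.0, 0.0, 0.0]                            # biases for h1, h2, o1

def forward(x1, x2):
    h1 = sigmoid(w[0] * x1 + w[1] * x2 + b[0])
    h2 = sigmoid(w[2] * x1 + w[3] * x2 + b[1])
    o1 = sigmoid(w[4] * h1 + w[5] * h2 + b[2])
    return h1, h2, o1

# toy training data: the logical AND of the two inputs
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
lr = 1.0

def mse():
    return sum((forward(x1, x2)[2] - t) ** 2 for (x1, x2), t in data) / len(data)

loss_before = mse()
for _ in range(5000):
    for (x1, x2), t in data:
        h1, h2, o1 = forward(x1, x2)
        # backpropagation: the output error is pushed backward through the weights
        d_o = (o1 - t) * o1 * (1 - o1)
        d_h1 = d_o * w[4] * h1 * (1 - h1)
        d_h2 = d_o * w[5] * h2 * (1 - h2)
        w[4] -= lr * d_o * h1;  w[5] -= lr * d_o * h2;  b[2] -= lr * d_o
        w[0] -= lr * d_h1 * x1; w[1] -= lr * d_h1 * x2; b[0] -= lr * d_h1
        w[2] -= lr * d_h2 * x1; w[3] -= lr * d_h2 * x2; b[1] -= lr * d_h2
loss_after = mse()
print(f"loss: {loss_before:.3f} -> {loss_after:.4f}")
```

The falling loss shows the mechanism described above: the discrepancy at the output is propagated backward and used to adjust the connection weights until actual and intended output nearly match.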
Fig. 2.9 Formal representation of the support and confidence of the Apriori algorithm
Fig. 2.10 Visualization of the Apriori algorithm
The advantages of this algorithm are that it is specifically designed for use in
large datasets and is, therefore, less resource-intensive and faster than similar
association analysis algorithms.9 On the other hand, no constraints can be
imposed on the algorithm beyond the subjectively chosen thresholds for
support and confidence - which is why even trivial or uninteresting rules are
included in the result [16].
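A minimal sketch of the support and confidence measures behind Apriori, computed on an invented basket data set (the transactions and the support threshold of 0.4 are assumptions for illustration):

```python
# toy transaction data: each set is one customer's basket (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """conf(A -> B) = support(A and B together) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# first Apriori pass: keep only frequent single items (support threshold 0.4)
items = sorted({i for t in transactions for i in t})
frequent = [i for i in items if support({i}) >= 0.4]
print("frequent items:", frequent)
print("support({bread, milk}) =", support({"bread", "milk"}))
print("conf(bread -> milk) =", confidence({"bread"}, {"milk"}))
```

Apriori then extends only the frequent itemsets step by step, which is what makes it efficient on large datasets; the subjective thresholds decide which rules survive.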
Clustering
Cluster analysis is concerned with organizing data into groups with similar
characteristics. Ideally, the data within a group are closely matched, while the
groups themselves are very different. In other words, the object distances
within a cluster (“intra-cluster”) are small, while at the same time the distances
between the clusters (“inter-cluster”) are large.
Market segmentation is one of the main applications of cluster analysis.
Rather than marketing generically to everyone, there is a consensus that it is
more beneficial to focus on specific segments, such as with targeted product
offerings. There is an entire industry devoted to market segmentation.
Segmentation has been used to find groups of similar customers to select test
markets for promotional offers, to try to understand the key attributes of the
segments, and to track the movement of customers from different segments over
time to understand the dynamics of customer behavior.
We have seen how cluster analysis can be used to refine predictive analysis
when dealing with large and complex data sets. A parallel example of this would
be that a company has thousands of products or hundreds of stores, and
strategies are to be developed to manage these products and stores. Since the
aim is not to create a hundred or even a thousand strategies, the products and
stores need to be grouped and a manageable number of strategies developed,
where each strategy then applies only to a group of products or stores. An unusual example of
cluster analysis was that of the US Army, which wanted to reduce the number
of different uniform sizes and so analyzed many body measurements and
derived a sizing system in which individuals were assigned to particular size
groups/clusters.
Cluster analysis is probably the most widely used class of predictive analytic
methods with applications in a whole range of fields, such as crime pattern
analysis, medical research, education, archaeology, astronomy, or industry.
Clustering is indeed ubiquitous.
K-means clustering - is the most famous clustering algorithm. The reasons
for this are obvious: it is easy to understand and implement. The diagram below
(see Fig. 2.11) serves as an illustration. First, the number of groups or classes
to be used is selected and their corresponding midpoints are initialized
randomly. To determine the number of classes to use, one should briefly look at
the data and try to identify distinct groupings [17]. The midpoints are vectors
of the same length as the data point vectors. Each data point is then
categorized by calculating the distance between that point and the center of
each group and assigning the point to the closest group.
Fig. 2.11 Steps of K-means clustering
Based on these classified points, the group center can be recalculated using
the mean of all vectors. These steps are repeated for a fixed number of iterations
or until the centers show very little change between successive iterations. The
group centers can also be randomly initialized several times and then the run
with the best results is selected. K-means is very fast because it only computes
the distances between the data points and the group centers, which requires
very few computational operations. It thus has linear complexity O(n) [17].
K-means also has some
disadvantages. First, one has to choose in advance how many groups/classes
there are. In addition, the random initialization causes different runs of the
algorithm to produce different clustering results. The results may therefore
not be stable and can be hard to reproduce.
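The assignment and update steps of K-means described above can be sketched as follows; the two toy point clouds and the choice of k = 2 are illustrative assumptions:

```python
import random

random.seed(0)
# two well-separated toy point clouds (illustrative data, not from the book)
points = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(30)] + \
         [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(30)]

def kmeans(points, k, iterations=20):
    centers = random.sample(points, k)        # random initial midpoints
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # update step: recompute each center as the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

centers = kmeans(points, k=2)
print(sorted(centers))
```

Running the sketch with a different seed would illustrate the instability mentioned above: the random initialization can lead to different final centers.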
Mean shift clustering - is a clustering algorithm that iteratively assigns data
points to clusters by shifting points in the direction of the mode (see graphical
representation in Fig. 2.12). The mode can be understood as the highest density
of data points (in the region, in the context of the mean-shift). Therefore, it is
also referred to as the mode search algorithm. The mean-shift algorithm has
applications in the field of image processing and computer vision. Given a set of
data points, the algorithm iteratively assigns each data point to the nearest cluster
centroid. The direction to the next cluster centroid is determined by where most
of the nearby points are located. Thus, at each iteration, each data point moves
closer to where most of the points are located, which leads or will lead to the
cluster centroid. When the algorithm stops, each point is assigned to a cluster.
Fig. 2.12 Visualization of the mean-shift algorithm according to [18]
Unlike the popular K-means algorithm, mean shifting does not require prior
specification of the number of clusters. The number of clusters is determined by
the algorithm with respect to the data.
Density-based clustering - is basically based on the mean shift algorithm
but has some advantages. This type of clustering starts with any data point that is
not visited. The neighborhood of the point is extracted using a distance
threshold: all points within this distance are neighborhood points. If there is a
sufficient number of points in the neighborhood, the clustering process begins
and the current data point is considered the first point in the new cluster. If not, it
is called noise and may later become part of a cluster. In both cases, the
point is labeled as “visited” [19]. For the first point of the new cluster, the
points lying in its neighborhood also become part of this cluster. This process
of assigning all points in the neighborhood to the same cluster is repeated for
each new point added to the cluster group.
From the plot Fig. 2.13, it could be seen that there is a correlation between
the life expectancy of women and the number of doctors in the population. This
is probably true and one could say it is quite simple: provide more doctors in the
population and life expectancy increases (left side of the figure). But the reality
is that you would have to look at other factors, like the possibility that doctors in
rural areas might have less training or experience. Or maybe they do not have
access to medical facilities like trauma centers. Adding these additional factors
would mean adding further independent variables to the regression analysis,
creating a model (right side of the figure) for multiple regression analysis.
The output differs depending on how many variables are present - but it is
essentially the same type of output found in a simple linear regression.
Optimization
Optimization algorithms help to minimize or maximize an objective function
E(x) (also called error function). This objective function represents a
mathematical function that depends on the internal parameters of the model used
in calculating the target values (Y) from the set of predictors (X) (Evans, 2017).
Optimization algorithms generally fall into two main categories:
First-order optimization algorithms - These algorithms minimize or
maximize an objective function E(x) based on its gradient values with respect to
the parameters. The most widely used first-order optimization algorithm is
gradient descent: the first-order derivative indicates whether the function is
descending or ascending at a given point - basically, a line tangent to a point on
its surface.
Stochastic gradient descent (SGD) is the simplest optimization algorithm to
find parameters that minimize the given cost function. Ideally, the cost
function should be convex so that gradient descent converges to the global
minimum. For demonstration purposes, imagine the graphical representation (see
Fig. 2.14) of a hypothetical cost function.
Fig. 2.14 Stochastic gradient descent (SGD) of a cost function
It is started by defining some random initial values for the parameters. The
goal of the optimization algorithm is now to find the parameter values that
correspond to the minimum value of the cost function. Specifically, the gradient
descent begins by computing derivatives for each of the parameters. These
gradients then give a numerical fit to each parameter to minimize the cost
function. This process continues until the local/global minimum is reached.
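The described procedure can be sketched for a simple convex cost function. The function J(w) = (w - 3)^2, the initial value, and the learning rate are illustrative assumptions (plain gradient descent rather than its stochastic variant, for brevity):

```python
# illustrative only: gradient descent on the convex cost J(w) = (w - 3)**2
def cost(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)   # first-order derivative of the cost

w = -10.0          # some initial value for the parameter
lr = 0.1           # learning rate (step size)
for _ in range(100):
    w -= lr * gradient(w)    # step against the gradient

print(round(w, 4))  # converges toward the minimum at w = 3
```

Each step multiplies the remaining distance to the minimum by a constant factor (here 0.8), so the parameter approaches the minimum of the cost function exactly as the text describes.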
Second-order optimization algorithms - Second-order methods use the
second-order derivative to minimize or maximize the objective function. The
Hessian matrix is the matrix of second-order partial derivatives.10 Since the
calculation of the second derivative is costly, second-order methods are not often used.
The second-order derivative tells whether the first derivative is decreasing or
increasing - indicating the curvature of the function. Although the second
derivative is costly to find and calculate, the advantage of a second-order
optimization technique is that the curvature of the surface is not neglected or
ignored.
With good data visualization, outliers can be detected to a certain extent but
with very large and multidimensional data sets, this is limited due to the volume
of data and the difficulty of visualizing these large data sets. Therefore, it is
necessary that outlier detection algorithms automatically search and find unusual
values.
The two main applications of outlier analysis reflect the two main reasons for
such analysis. The first application is in the preliminary stage of building a
model. Here, this analysis is used to identify errors in the data so that they can be
corrected or ignored, or to identify unusual values that are valid and need to be
accounted for in the model being built. Such outliers can affect the type of
algorithm used in the analysis, as some algorithms are more sensitive to outliers
than others. The second application area is the detection of outliers or anomalies,
which is a key component in, for example, fraud analysis looking for unusual
activity. This purpose is analogously applicable in a variety of areas: in
production process control, fault detection, intrusion detection, fraud detection,
system monitoring, and event detection. One of the biggest potentials definitely
comes from applying the algorithms in fraud detection on the largest possible
data sets (transaction data in the financial sector, for example). The volumes of
data here are extremely large and can even be analyzed in real-time. The result
of the outlier analysis is usually complemented by business rules that
encapsulate the business logic. Example rules would be: “If the credit card usage
is not in location A (country, region, etc.), then investigate this transaction”, or:
“If an insurance claim is repeated x times, then this must be checked manually.”
Inter-quartile range test - The inter-quartile range test (IQR), also known
as the Tukey test [21], named after its author John Tukey, is a simple yet robust
test for identifying numerical outliers. It is the computational basis behind the
construction of the so-called “box plots” (see Fig. 2.16) produced by the test to
identify outliers. In simple mathematical terms, the formula for the inter-quartile
range test is IQR = Q3 - Q1.
The IQR can also be taken as a measure of the distribution of values because
statistics first assumes that values are grouped around a central value. The IQR
indicates how distributed the “mean” values are. The value can also be used to
see if some of the values are “too far” from the central value. These points that
are too far away are called “outliers” because they are “outside” the range where
these would normally be expected. The IQR is the length of the box in the box-
and-whisker plot (box plot or box graph) [21]. An outlier is any value that is
more than one and a half times the length of the box from either end of the box.
That is if a data point is below Q1–1.5 × IQR or above Q3 + 1.5 × IQR, it is
considered too far from the central values to be appropriate.
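The test can be sketched in a few lines; the sample data is invented, and Python's statistics.quantiles is used to obtain Q1 and Q3 (its default quartile method may differ slightly from other statistics packages):

```python
import statistics

def iqr_outliers(values):
    """Tukey's inter-quartile range test: flag points outside the 1.5*IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]   # 95 is an obvious outlier
print(iqr_outliers(data))  # → [95]
```

Because the fences are built from the quartiles rather than the mean, the outlier 95 itself has no influence on where the fences lie, which is exactly the robustness argument made in the text.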
Questions
Why is the width for the outliers one and a half times that of the box?
Why does this value show the difference between “acceptable” and
“unacceptable” values? When John Tukey invented the box-and-whisker
plot [21] to show these values in 1977, he chose 1.5 × IQR as the
demarcation line for outliers. This worked well, so this value has been
used ever since. If you look deeper into statistics, you will find that this
measure of reasonableness for bell-shaped data means that usually, only
about one percent of the data will ever be outliers.
The preference matrix (see Fig. 2.19) can be represented in terms of item
vectors. The similarity between item I1 and item I2 is calculated as cos(I1,I2).
The matrix can also be represented as user vectors (right side of the figure).
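The cosine similarity of two item vectors can be sketched as follows; the rating vectors I1 and I2 are invented for illustration (each entry is one user's rating of the item):

```python
import math

def cosine(u, v):
    """cos(u, v) = dot(u, v) / (|u| * |v|) - similarity of two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# hypothetical preference-matrix rows: items rated by four users
I1 = [5, 3, 0, 4]
I2 = [4, 3, 1, 5]
print(round(cosine(I1, I2), 3))
```

A value near 1 means the items were rated almost identically across users; the same function applied to the matrix columns yields user-to-user similarity instead.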
For example, in the table, one can see that in the task of searching for
unusual values or outliers identified as outlier detection, algorithms such as the
variance test or the interquartile range test can be used. If the task is to build a
predictive model where one variable is to be predicted from the data of other
variables (the prediction, model-building task), as in churn analysis or target
marketing, then examples of groups of relevant algorithms include decision
trees, neural networks, and regression models. Within these three groups, there
are many different algorithms, and the decision of which algorithm to use
depends on many data-specific reasons. Therefore, it may be advisable to test
with several algorithms and then make a detailed decision based on the results
(goodness or correctness of the results). One can also make additional
considerations, such as how easy it is to use the results of the analysis in a
business process.
Knowledge of the individual algorithms is clearly an advantage but not
mandatory to benefit from predictive analysis. As mentioned, today’s
frameworks and services (section “SQL”) make it very easy to try out and test
more than one algorithm from the set of relevant algorithms that yield the best
results. Moreover, in many cases, the algorithm does not need to be understood
in every detail in order to use it in a project. This is of course a debatable point
and any knowledge is clearly beneficial. Also, the kind of data (numeric,
categorical) one has or expects as input and output is a crucial aspect (Table 2.3).
Table 2.3 Overview of the classes of applications, variables, and algorithms
One possible analogy to describe this: although not many people know how
airplanes manage to fly, people still use them regularly. Although the analogy is
not perfect, it is intended to show that we have a certain confidence in flying
simply from the observation that the process has a high success rate. In this
discussion, it is important to remember that a major challenge in the analysis is
first identifying, obtaining, reviewing, and preparing the data for analysis.
Finding the best model is certainly important but it is not the biggest challenge.
The second major challenge is the transition from analysis to implementation
and integration of the analyses into business processes (see the process model
introduced above). So, to decide which algorithm should be used and when,
one can simply apply all the appropriate algorithms to the data and choose the best
one. This is a logical and reasonable approach but it raises the question: what is
“best” and how is it measured? The answer to this question varies by algorithm
group. For example, for association analysis, the choice of algorithms falls on
Apriori and Apriori Lite - the latter being a subset as it is limited to finding
individual pre- and post-rules. The choice, therefore, has more to do with rule
requirements and performance, as Apriori Lite will be faster than generic Apriori
but again is limited in terms of rules extracted from the data. For cluster analysis,
the concept of what is “best” is harder to define. For ABC analysis, you
cannot really say that different values of A, B, or C are better or worse. It is up to
the user to choose what is best for them - there is no concept of optimal model
fit. The k-means algorithm does not necessarily provide better cluster analysis
than “Kohonen self-organizing maps” (Kohonen SOMs) and vice versa.
However, k-means is easier to understand, hence its popularity. Although
Kohonen SOMs are complex, they are more flexible in use because there is a
less enforced assignment of records to a cluster (in the sense that k-means must
have k-clusters, while Kohonen SOMs do not specify the number of clusters).
There are quality metrics for clusters but these metrics are more indicators than
definitive computations. The best approach to evaluating algorithms is to try
both k-means and Kohonen SOMs with different numbers of clusters to examine
the solutions and decide which is best for the application.
The concept of the best algorithm also varies within each subset of
classification analysis. However, there are basically two types of predictions
from the classification model: numerical or categorical. For numerical
predictions, the most common measure is the average squared error per data
point, the mean squared error (MSE). In estimation theory, this value indicates
how much a point scatters around the value being estimated, and thus the MSE
is a key quality criterion for estimators. In regression analysis, the MSE is
interpreted as the expected squared distance of an estimator from the true
value. The associated root mean squared error (RMSE) is its square root, i.e.
the square root of the sum of the squared errors divided by their number. The
RMSE is used by
regression analysis, which provides numerical predictions, including statistical
measures of goodness of fit, such as R-squared, analysis of variance (ANOVA),
and the F-value. Categorical predictions are generally evaluated using what is
called confusion matrices, which are essentially designed to show how many
times each category was predicted correctly and how many times incorrectly.
Based on the matrix, there are then model quality measures, such as sensitivity
or true positive rate and specificity or true negative rate. For binary classification
models, we can plot and compare model performance in gain and lift diagrams.
For time series analysis, the same quality measures apply as for numerical
predictions in classification analysis, except that the analysis is over time periods.
For the outlier tests, the variance test and the inter-quartile range (IQR) test are
used to look for overall outliers in the data set. The variance test is trivial but it is
affected by the outliers themselves. Therefore, the more popular IQR test is used
because it takes into account the median and quartile as a measure to identify an
outlier and thus is not influenced by the actual outliers themselves. Anomaly
detection algorithms, in contrast, are used to find local outliers in the data set.
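A confusion matrix and the derived sensitivity and specificity, as described above, can be sketched for an invented binary prediction:

```python
# illustrative confusion-matrix metrics for a binary classifier (toy labels)
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

The four counts show how many times each category was predicted correctly and incorrectly; sensitivity and specificity then summarize the matrix into the model quality measures named in the text.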
From the above, the following rules can be derived to help in the selection of algorithms:
When looking for associations in the data and
– if multiple assignments of elements are desired, then one uses Apriori,
– if only single rules are desired, then one uses Apriori Lite,
– if the performance of Apriori is too slow, you should switch to Apriori Lite
Sampling.
When searching for clusters or segments in the data and when
– the cluster sizes are user-defined, then ABC analysis can be used,
– the desired number of clusters is known, then k-means is used,
– the number of clusters is unknown, one uses Kohonen SOMs.
If the data is to be classified, the target variable is numeric, and there is only
one independent numeric variable, and if
– a linear relationship is to be taken into account, bivariate linear regression is
used,
– a non-linear relationship is to be considered, one uses a bivariate
exponential, geometric, or natural logarithmic regression.
If there is more than one independent numerical variable, one uses multiple
linear and non-linear regressions for linear and non-linear models.
If one is looking for classification data and the variables are categorical or a
mixture of categorical and numeric and if
– the output of decision tree rules is desired, then one uses either C4.5,
CHAID, or CNR and decides according to the best result,
– the output of the probability of a result is preferred, then one uses logistic
regression,
– the model quality is in the foreground and the understanding of the model is
less important, then one uses neural networks and decides according to the
best result.
If time-series data is to be predicted and the data
– is constant or stationary, then one uses single exponential smoothing,
– represents a trend, then double exponential smoothing is used,
– is seasonal, then one uses triple exponential smoothing.
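Single exponential smoothing, the first of the three variants above, can be sketched as follows; the toy series and the smoothing factor alpha are illustrative assumptions:

```python
def single_exponential_smoothing(series, alpha):
    """S_t = alpha * x_t + (1 - alpha) * S_{t-1}; suited to stationary series."""
    smoothed = [series[0]]                 # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# a roughly constant (stationary) toy series with some noise
series = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
forecast = single_exponential_smoothing(series, alpha=0.3)
print([round(s, 2) for s in forecast])
```

Double and triple exponential smoothing extend this recursion with additional trend and seasonality terms, which is why they are preferred for trending and seasonal series respectively.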
References
1. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach Prentice Hall Series in Artificial
Intelligence, vol. xxviii, p. 932. Prentice Hall, Englewood Cliffs (1995)
[zbMATH]
2. Watson, H.J., Rainer Jr., R.K., Koh, C.E.: Executive information systems: a framework for
development and a survey of current practices. MIS Q. 15, 13–30 (1991)
[Crossref]
3. Goodfellow, I., et al.: Deep Learning, vol. 1. MIT Press, Cambridge (2016)
[zbMATH]
4. Amirian, P., Lang, T., van Loggerenberg, F.: Big Data in Healthcare: Extracting Knowledge from Point-
of-Care Machines. Springer, Cham (2017)
[Crossref]
5. Zachman, J.A.: A framework for information systems architecture. IBM Syst. J. 26(3), 276–292 (1987)
[Crossref]
6. Sowa, J.F., Zachman, J.A.: Extending and formalizing the framework for information systems
architecture. IBM Syst. J. 31(3), 590–616 (1992)
[Crossref]
7. Witten, I.H., et al.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan
Kaufmann, Cambridge (2016)
[zbMATH]
8. Gorry, G.A., Scott Morton, M.S.: A framework for management information systems. Sloan Manag.
Rev. 13, 55–70 (1971)
9. Sprague Jr., R.H.: A framework for the development of decision support systems. MIS Q. 4, 1–26
(1980)
[Crossref]
10. Robert, C., Moy, C., Wang, C.-X.: Reinforcement learning approaches and evaluation criteria for
opportunistic spectrum access. In: 2014 IEEE International Conference on Communications (ICC),
IEEE (2014)
11. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)
[zbMATH]
12. Michalski, R.S., Carbonell, J.G., Mitchell, T.M.: Machine Learning: An Artificial Intelligence
Approach. Springer Science & Business Media, Berlin (2013)
[zbMATH]
13. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th
International Conference on Very Large Data Bases, VLDB (1994)
15. Bollinger, T.: Assoziationsregeln – Analyse eines Data Mining Verfahrens. Informatik-Spektrum. 19(5),
257–261 (1996)
[Crossref]
16. Decker, R.: Empirischer Vergleich alternativer Ansätze zur Verbundanalyse im Marketing.
Proceedingsband zur KSFE. 5, 99–110 (2001)
17. Dhanachandra, N., Manglem, K., Chanu, Y.J.: Image segmentation using K-means clustering algorithm
and subtractive clustering algorithm. Procedia Comput. Sci. 54, 764–771 (2015)
[Crossref]
18. Kim, N., et al.: Load profile extraction by mean-shift clustering with sample Pearson correlation
coefficient distance. Energies. 11, 2397 (2018)
[Crossref]
19. Larcheveque, J.-M.H.D., et al.: Semantic clustering. Google Patents (2016)
20. Chatterjee, S., Hadi, A.S.: Regression Analysis by Example. Wiley, New York (2015)
[zbMATH]
21. Tukey, J.W.: Comparing individual means in the analysis of variance. Biometrics. 5(2), 99–114 (1949)
[MathSciNet][Crossref]
23. Pazzani, M.J., Billsus, D.: Content-based recommendation systems. In: The Adaptive Web, pp. 325–
341. Springer, Heidelberg (2007)
[Crossref]
24. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering.
IEEE Internet Comput. 7, 76–80 (2003)
[Crossref]
25. Gunawardana, A., Meek, C.: A unified approach to building hybrid recommender systems. RecSys. 9,
117–124 (2009)
26. Liu, N.N., Zhao, M., Yang, Q.: Probabilistic latent preference analysis for collaborative filtering. In:
Proceedings of the 18th ACM Conference on Information and Knowledge Management, ACM (2009)
27. Gong, S.: A collaborative filtering recommendation algorithm based on user clustering and item
clustering. JSW. 5(7), 745–752 (2010)
[Crossref]
28. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adap. Inter.
12(4), 331–370 (2002)
[Crossref][zbMATH]
29. Zhao, X., Zhang, W., Wang, J.: Interactive collaborative filtering. In: Proceedings of the 22nd ACM
International Conference on Information and Knowledge Management, ACM (2013)
Footnotes
1 See the writings of Alan Turing in this regard. The mathematician and computer scientist is considered
one of the most influential theorists of early computer development and computer science.
6 For more detailed information on this learning model, see [5] or [6].
8 A survey of the various developments around the Apriori algorithm can be found in [13] or [14].
10 For a comprehensive and scientifically sound presentation of second-order algorithms, see the
Stanford University lecture notes by Prof. Ye. Available here: https://web.stanford.edu/class/msande311/lecture13.pdf.
11 See: [24].
12 See: [25].
13 See: [26].
© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023
F. Weber, Artificial Intelligence for Business Analytics
https://doi.org/10.1007/978-3-658-37599-7_3
3. AI and BA Platforms
Felix Weber1
(1) Chair of Business Informatics and Integrated Information Systems,
University of Duisburg-Essen, Essen, Germany
Data Warehouse
A data warehouse (DWH) resides in a database; which database forms the basis
hardly matters, from Redshift and BigQuery to MySQL and Postgres. A
data warehouse is a central repository for data collected from one or more data
sources. With the help of data warehouses, it is possible to manage data and
perform quick analyses on large data sets to uncover hidden patterns. DWHs
store current and historical data and are used to create analytics reports for data
consumers across the enterprise. Examples of reports can range from annual
financial reports to hourly revenue analysis trends.
Schema-On-Write
Schema-on-write has been the standard for the process of data storage in
relational databases and thus also for data warehouses for many years. Before
data is written to the database, the structure of this data is strictly defined and
metadata is stored and updated. Irrelevant data is discarded, data types, lengths,
and positions are described in advance. The schema, i.e. the columns, rows,
tables, and relationships, is defined in advance and during construction for the
specific purpose for which the database will serve. Then the data is put into the
predefined structure and stored. Thus, the data must all first be cleansed,
transformed, and adapted to this structure before it can be stored in a process.
This is commonly referred to as ETL (extract, transform, load). This is why the
paradigm here is also called “schema-on-write”, since the data structure is already
defined when the data is written and stored.
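The write-time steps described above can be sketched with Python's built-in sqlite3 module. The table and field names here are invented for illustration, not taken from the book:

```python
# A minimal schema-on-write sketch: the schema exists BEFORE any data
# arrives, and data is cleansed and coerced to fit it (ETL).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Define the schema in advance (schema-on-write).
cur.execute("""
    CREATE TABLE sales (
        order_id  INTEGER PRIMARY KEY,
        customer  TEXT NOT NULL,
        amount    REAL NOT NULL
    )
""")

# 2. Extract: raw records, possibly messy.
raw_records = [
    {"order_id": 1, "customer": "  Alice ", "amount": "19.99", "note": "x"},
    {"order_id": 2, "customer": "Bob",      "amount": "5.00",  "note": "y"},
]

# 3. Transform: cleanse, coerce types, discard irrelevant fields ("note").
rows = [(r["order_id"], r["customer"].strip(), float(r["amount"]))
        for r in raw_records]

# 4. Load: only data matching the predefined structure is stored.
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
conn.commit()

total = cur.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

Any record that cannot be coerced into the predefined columns would fail at write time, which is exactly the point of this paradigm.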
Data Lake
Data lake is a relatively new term attributed to James Dixon, the CTO of Pentaho
[1]. Essentially, a data lake is a large storage location for raw or lightly
processed data in its original format. The data lake stores data in a flat structure,
usually as files. Data in the data lake is associated with a unique ID and tagged
with metadata. When a query arises, the data lake can be queried for relevant
data and this smaller data set can then be analyzed to answer the question.
Hadoop, Google Cloud Storage, Azure Storage, and the Amazon S3 platform
can be used to build data lake repositories. Data lakes do not require much
planning - typically there is no schema and no ETL process. Thanks to the
falling cost of data storage both on-premises in your own data centers and the
cloud and the abundance of virtual services, a data lake can be set up quickly.
Even before anyone knows what questions they want to ask in the future, data
from a variety of sources and in a variety of formats can be stored immediately
in the data lake. However, because data lakes contain a variety of data formats
and large amounts of data, querying is much more difficult. Traditional BI tools
usually do not support a data lake at all; generating insights then requires a
transformation step first. This makes the enterprise data lake a playground for people
with advanced data skills (data scientists and experienced developers) but less
accessible to business users.
The paradigm for storing data in a data lake is called schema-on-read. This
describes the concept that you do not need to know in advance what you will do
with data in the future. Data of all types, sizes, shapes and structures can all be
“unthinkingly thrown” into the data lake and other Hadoop data storage systems.
While some metadata, data about that data, needs to be stored so that you still
know what is in it at the end, you do not need to know how the content will be
structured. It is quite possible that the data stored for one purpose will even be
used for a completely different purpose than originally intended. The data is
stored without first deciding what information is important, what should be used
as a unique identifier, or what part of the data needs to be summed and
aggregated to be useful. Therefore, the data is stored in its original granular
form, with nothing thrown away because it is now considered unimportant,
nothing is aggregated into a composite, and there is no key information or
dependencies (indexes). In fact, no structural information is defined at all when
the data is stored. When someone is ready to use that data, they define at that
time which parts are essential for their purpose. All that is defined in advance is
where to find the information that is important for that purpose and what parts of
the data set can be ignored. This is why this paradigm is also called “schema-on-read”,
because the schema is defined at the time of reading and using the data, not
at the time of writing and storing it.
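A minimal schema-on-read sketch in Python follows; the file layout and field names are illustrative assumptions:

```python
# Schema-on-read: heterogeneous raw records are stored as-is, and a
# structure is imposed only when a question is asked.
import json
import pathlib
import tempfile

lake = pathlib.Path(tempfile.mkdtemp())

# Ingest: anything can be "thrown in" without upfront modeling.
(lake / "clicks_001.json").write_text(json.dumps(
    {"user": "alice", "page": "/home", "ts": 1}))
(lake / "sensor_007.json").write_text(json.dumps(
    {"device": 7, "temp_c": 21.5}))

# Read: the schema lives in the query, not in the store. Here the
# "schema" is simply: keep records that carry a given field.
def read_with_schema(path, required_field):
    records = []
    for f in sorted(path.glob("*.json")):
        rec = json.loads(f.read_text())
        if required_field in rec:   # which parts matter is decided NOW
            records.append(rec)
    return records

clicks = read_with_schema(lake, "user")
print(len(clicks))
```

The same files could later be read with a completely different "schema" (for example, all records with a `device` field) without rewriting anything in the store.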
Data lakes and data warehouses are both used for storing large amounts of data
but each approach has its applications. Typically, a data warehouse is a relational
database housed on on-premises servers or in the cloud, although the on-premises
variant has become less common due to the amounts of data involved. The data stored in a data
warehouse is extracted from various online transaction processing (OLTP)
applications to support business analytics queries and data marts for specific
internal business groups, such as sales or marketing organizational units. Data
warehouses are useful when there is a large amount of data from operational
systems that need to be readily available for analysis. Because the data in a data
lake is often uncurated and can come from sources outside the company’s
operational systems, data lakes are not well suited for the average business
analytics user. Nevertheless, they have the advantage that even without a
predefined reason and process, data is kept for use cases that will be created in
the future. Table 3.1 shows an overview that can be used as a basis for decision-
making.
Table 3.1 Comparison between data warehouse and data lake
Fig. 3.1 Comparison between traditional data processing and stream processing
Batch versus streaming at a glance:
- Scope of data: Batch queries or processes all or most of the data in the dataset; streaming queries or processes data within a rolling time window or only the most recent record.
- Data sets: Batch works on large amounts of data; streaming works on individual records or micro-batches consisting of a few records.
- Performance: Batch tolerates latencies of minutes to hours; streaming requires latencies on the order of seconds or milliseconds.
- Analyses: Batch supports complex analyses; streaming supports simple response functions, aggregates, and rolling key figures.
An evaluation framework for deciding between batch and streaming can thus be built on the dimensions of data scope, data sets, performance, and analyses
Many companies are now building on hybrid models by combining the two
approaches and creating a real-time (streaming) layer and a batch layer
simultaneously or in series. Data is first processed by a streaming data platform
to provide real-time insights and is then loaded into a data store where it can be
transformed and used for a variety of batch processing use cases.
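A toy version of such a hybrid pipeline can be sketched in a few lines of Python; the window size and the metrics are arbitrary choices for illustration:

```python
# Each event first updates a rolling window (streaming layer, for a
# low-latency metric) and is then appended to a store for later batch
# analysis over the full history.
from collections import deque

WINDOW = 3
window = deque(maxlen=WINDOW)   # streaming layer: last few records only
batch_store = []                # batch layer: the complete history

def ingest(value):
    window.append(value)
    rolling_avg = sum(window) / len(window)   # simple rolling key figure
    batch_store.append(value)                 # kept for complex analyses
    return rolling_avg

for v in [10, 20, 30, 40]:
    latest = ingest(v)

print(latest)            # rolling average over the last 3 records: 30.0
print(sum(batch_store))  # batch query over all data: 100
```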
Traditional DBMSs move data from disk to main memory into a cache or
buffer pool when it is accessed. Moving data to the main memory makes re-
accessing the data more efficient but the constant need to move data can cause
performance problems. Since the data in an in-memory database management
system (IMDBMS) is already in memory and does not need to be moved,
application and query performance can be greatly improved. To ensure the
persistence of data in an IMDBMS, it must be
periodically moved from memory to persistent, nonvolatile storage. This is
important because data stored in memory would not survive a failure (main
memory cannot store data in the event of a power failure). There are several
ways to achieve this data persistence. One way is transaction logging combined
with periodic snapshots of the in-memory database that are written to nonvolatile
storage media (hard disks). If the system fails and needs to be restarted, the
database can then be restored and rolled forward to the last completed transaction. Another way
to maintain data persistence is to create additional copies of the database on
nonvolatile media. At the hardware level, there is also the option of using
nonvolatile RAM (NVRAM), such as battery-backed RAM or ferroelectric RAM
(FeRAM), which can retain data when powered off. Hybrid
IMDBMSs that store data on both hard disks and memory chips are also
conceivable and available.
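The snapshot mechanism can be illustrated with a toy Python sketch. The key names and file layout are invented, and real IMDBMS products are far more sophisticated (write-ahead logs, incremental checkpoints):

```python
# A dict stands in for the in-memory database; a snapshot on disk lets
# the state survive a simulated "power failure".
import os
import pickle
import tempfile

snapshot_file = os.path.join(tempfile.mkdtemp(), "snapshot.pkl")
db = {}                                   # the in-memory database

db["balance:alice"] = 100
db["balance:bob"] = 50

# Periodic snapshot to nonvolatile storage.
with open(snapshot_file, "wb") as f:
    pickle.dump(db, f)

db["balance:alice"] = 999                 # change after the snapshot...
db = None                                 # ...then simulate a power failure

# Recovery: restore the last completed snapshot.
with open(snapshot_file, "rb") as f:
    db = pickle.load(f)

print(db["balance:alice"])  # 100 - the post-snapshot change is lost
```

The example also shows why snapshots alone are not enough: everything written after the last snapshot is lost, which is what the transaction log is for.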
In-memory database systems have a wide application but are mainly used for
real-time applications that require high performance. Use cases for IMDBMS are
applications with real-time data management requirements, such as
telecommunications, finance, defense, for decision support and optimization.
Applications that require real-time data access, including call center applications,
travel and reservation applications, and streaming applications, are also good
candidates for use with an IMDBMS.
Apache Hadoop
Hadoop describes itself as an “open-source distributed processing framework”
that enables data processing and storage for big data applications in clustered
systems. It is at the center of a growing ecosystem of big data technologies used
primarily to support advanced analytics - particularly predictive analytics, data
mining, and machine learning applications. Hadoop can handle various forms of
structured and unstructured data and provides users with more flexibility to
collect, process, and analyze data than relational databases and data warehouses.
Hadoop‘s primary focus here is on analytic applications, and its ability to
process and store different types of data makes it a particularly good choice for
big data analytics applications. Big data environments typically include not only
big data but also various types of structured transactional data to semi-structured
and unstructured forms of information, such as clickstream records, web server,
and mobile application logs, social media posts, customer emails, and sensor
data from the Internet of Things (IoT). Originally known only as Apache
Hadoop, the technology is being developed as part of an open-source project
within the Apache Software Foundation (ASF). The commercial distribution of
Hadoop is currently offered by four primary big data platform providers:
Amazon Web Services (AWS), Cloudera, Hortonworks, and MapR
Technologies. In addition, Google, Microsoft, and other vendors offer cloud-
based managed services based on Hadoop and related technologies.
Hadoop runs on clusters of standard servers, and it supports thousands of
hardware nodes for massive amounts of data. Hadoop uses a namesake
distributed file system that enables fast data access across nodes in a cluster, as
well as fault-tolerant features to allow applications to continue running when
individual nodes fail. Consequently, Hadoop became a foundational data
management platform for big data analytics applications after its emergence in
the mid-2000s.
Hadoop was developed by the computer scientists Doug Cutting and Mike
Cafarella [2], initially to support processing for the open-source search
engine Nutch and its associated web crawler. After Google published technical papers in
2003 and 2004 detailing the Google File System (GFS) and MapReduce
programming framework, Cutting and Cafarella modified earlier technology
plans and developed a Java-based MapReduce implementation and file system
modeled on Google. In early 2006, these elements were spun off from Nutch and
became a separate Apache subproject that Cutting named Hadoop after his son’s
stuffed elephant. At the same time, Cutting was contracted by internet service
provider Yahoo, which became Hadoop‘s first production user later in 2006.
The core components in the first iteration of Hadoop were MapReduce, the
Hadoop Distributed File System (HDFS), and Hadoop Common, as well as a set
of common tools and libraries. As its name implies, MapReduce uses the “map
and reduce” paradigm to split processing jobs into multiple tasks that execute on
the cluster nodes where the data is stored, and then combine what the tasks
produce into a cohesive set of results. MapReduce initially acted both as
Hadoop’s processing engine and as the cluster resource manager, which tied
the system directly to HDFS and restricted users to running MapReduce batch
applications (Table 3.3).
Table 3.3 Overview of Hadoop and adjacent technologies
The Hadoop File System is just one part of the core components of the Hadoop
platform
This changed with the release of Hadoop 2.0, which became generally
available in 2013 in version 2.2.0. It introduced Apache Hadoop YARN, a new
cluster resource management and job scheduling technology that took over these
functions from MapReduce. YARN - short for Yet Another Resource Negotiator
but typically just referred to by the acronym - ended the strict reliance on
MapReduce and opened up Hadoop to other processing engines and various
applications besides batch jobs.
Simply put, Hadoop has two main components. The first component,
Hadoop Distributed File System, helps to share the data, put it on different
nodes, replicate and manage it. The second component, MapReduce, processes
the data from each node in parallel and computes the results of the job.
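The interplay of the two components can be illustrated with the classic word-count example in plain Python. The shuffle step, which Hadoop performs transparently across cluster nodes, is simulated locally here:

```python
# Word count in the "map and reduce" paradigm. Real Hadoop distributes
# the map tasks to the nodes where the data is stored; here the three
# phases run sequentially to show the data flow only.
from collections import defaultdict

documents = ["big data on hadoop", "hadoop stores big data"]

# Map: each input record is turned into (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: pairs are grouped by key (done by the framework in Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: each group is combined into a cohesive result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["hadoop"], counts["big"])  # 2 2
```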
With YARN, the capabilities that a Hadoop cluster can deliver have greatly
expanded - to include stream processing and real-time analytics applications that
can run alongside processing engines like Apache Spark and Apache Flink. For
example, some manufacturers are using real-time data that feeds into predictive
maintenance applications in Hadoop to detect equipment failures before they
occur (predictive maintenance). With fraud detection, website personalization,
and customer satisfaction scoring, other real-time use cases are known and can
be implemented on Hadoop.
Because Hadoop can process and store such a wide range of data, it allows
organizations to set up data lakes as sprawling repositories for inbound
information streams. A Hadoop data lake often stores raw data in such a way that
data scientists and other analysts can access the full data sets as needed. Data
lakes generally serve different purposes than traditional data warehouses, which
contain cleansed transactional data. In some cases, however, companies consider
their Hadoop data lakes to be modern data warehouses. Either way, the growing
role of big data analytics in business decisions has made effective data
governance and data security processes a priority when deploying data lakes.
Python
Python is a programming language that, unlike other programming languages
such as C, Fortran, or Java, makes it easier to solve domain problems rather than
dealing with the complexity of how a computer works. Python achieves this goal
by having the following attributes:
Python is a high-level language, which means it abstracts away the underlying
computer-related technical details. For example, Python does not make its
users think too much about managing computer memory or declaring
variables correctly and uses safe assumptions about what the programmer is
trying to convey. In addition, a high-level language can be expressed in a way
that resembles familiar prose or mathematical equations. Python is well suited
for beginners because of its ease of understanding and similarity to well-
known programming languages, such as Java.
Python is a general-purpose language, which means that it can be used for all
problems - rather than specializing in a particular area, such as statistical
analysis. For example, Python can be used to implement artificial intelligence
methods as well as statistical analysis.
Python is an interpreted language, which means that evaluating code to get
results can be done immediately, rather than having to go through a time-
consuming compiled and executed cycle.
Python has a standard library and numerous third-party libraries that provide a
variety of existing codebases and examples for problem-solving.
Python has a huge following, which means programmers can quickly find
solutions and sample code for problems using Google and Stackoverflow.
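These attributes show up even in a tiny standard-library example; the values are made up for illustration:

```python
# No memory management, no type declarations, and expressions that read
# almost like prose - the traits listed above in a few lines.
import statistics

temperatures = [21.0, 23.5, 19.8, 22.1]          # no declarations needed
above_average = [t for t in temperatures
                 if t > statistics.mean(temperatures)]
print(sorted(above_average))
```

Because Python is interpreted, this can be typed into a REPL and evaluated immediately, without a compile step.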
In addition, Python has a rich ecosystem for scientific inquiry in the form of
many proven and popular open-source packages, including:
numpy, a Python package for scientific computing,
matplotlib, a plotting library that produces publication quality figures,
cartopy, a library of cartographic tools for Python,
netcdf4-python, a Python interface to the netCDF-C library.
R
The “R Foundation” describes R as “a language and environment for statistical
computing and graphics.” Open-source R was developed by Ross Ihaka and
Robert Gentleman at the University of Auckland in New Zealand in the 1990s as
a statistical platform for their students and has been extended over the decades
with thousands of custom libraries.
R is data analysis software: Data scientists, statisticians, and analysts - anyone
who needs to understand data can use R for statistical analysis, data
visualization, and predictive modeling.
R is a programming language: As an object-oriented language developed by
statisticians, R provides objects, operators, and functions that allow users to
explore, model, and visualize data.
R is an environment for statistical analysis: Standard statistical methods are
easy to implement in R, and since much of the cutting-edge research in
statistics and predictive modeling is done in R, newly developed techniques
are often available in R first.
R is an open-source software project: R is free and has a high standard of
quality and numerical accuracy thanks to years of testing and tinkering by
users and developers. R’s open interfaces allow integration with other
applications and systems.
R has a large community: The R project leadership has grown to more than 20
leading statisticians and computer scientists from around the world, and
thousands of contributors have created add-on packages. With two million
users, R has a vibrant online community.
R is not only used by academic users, but many large companies also use the R
programming language, including Uber, Google, Airbnb, Facebook, and so on.
The primary application of R remains statistics, visualization, and machine
learning. Also, R has a rich ecosystem for scientific research in the form of many
proven and popular open-source packages, including:
1. sqldf allows you to perform SQL queries on R data frames. If you have basic SQL knowledge, you can use sqldf to process data very quickly.
2. forecast facilitates the fitting of time series models.
3. plyr provides a handful of functions that split a data structure into groups, apply a function to each group, and return the results in a data structure.
4. stringr offers a number of string operators.
5. Database drivers (e.g. RMongo, RSQLite, RMySQL) allow R to access the database you are using, saving you time and effort in copying and pasting.
6. lubridate facilitates working with dates and times.
7. ggplot2 makes it easy to create elegant charts.
8. qcc is a library for statistical quality control.
9. reshape2, as the name suggests, restructures data: it converts data from wide format to long format and vice versa.
10. randomForest is a machine learning package that enables supervised or unsupervised learning.
Unlike Python, however, R is hardly suitable for applications and
implementations beyond its statistical core.
SQL
SQL (Structured Query Language) is a standard database language used to
create and maintain relational databases and to retrieve data from them. SQL was created in the 1970s
and has become a very important tool in any data scientist’s toolbox because it is
crucial for accessing, updating, inserting, manipulating, and modifying data.
SQL is used in communicating with relational databases to retrieve the records
from the databases.
Easy to learn and use - Unlike other programming languages that require a
high level of conceptual understanding and knowledge of the steps required to
perform a task, SQL is known for its simplicity through the use of declarative
statements. It uses a simple language structure with English words that are
easy to understand. If you are a beginner in programming and data science,
SQL is one of the best languages to start with. The short syntax makes it
possible to query data and gain insights from it. As a data scientist, you
definitely need to learn SQL as it is easy to master and required for most
projects.
Easy to understand and visualize - SQL helps to explore the dataset
sufficiently, visualize, identify the structure and learn what the dataset actually
looks like. This helps to identify if there are any missing values, outliers,
NULL values and identify the format of the dataset for further use. By slicing,
filtering, aggregating, and sorting, SQL allows you to deal with the dataset.
Integrates with other languages - As SQL is powerful in terms of data
access, query, and manipulation, it is limited in some aspects like visualization
or use of algorithms. SQL integrates well with other scripting languages like
R and Python.
Big data management - Most of the time, data science deals with large
amounts of data stored in relational databases. Working with such data sets
requires high-level solutions to manage them differently from the usual
spreadsheets. As the volume of data sets increases, it becomes difficult to use
traditional spreadsheets, for example. The best solution for handling large
volumes of data is SQL. SQL is capable of managing such data sets.
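A small example using Python's built-in sqlite3 module shows this declarative style of filtering, aggregating, and sorting; the table and column names are invented:

```python
# Declarative SQL: we state WHAT we want, not HOW to compute it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("north", 10.0), ("north", 15.0), ("south", 7.0)])

# One readable statement aggregates per region and sorts the result.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('north', 25.0), ('south', 7.0)]
```

The same query runs unchanged against a table with three rows or three billion rows; scaling it is the database engine's job, not the analyst's.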
Scala
Scala is a general-purpose, high-level, multi-paradigm programming language. It
is a pure object-oriented programming language that also supports the functional
programming approach. There is no primitive data, as everything is an object in
Scala. Scala is designed to express common programming patterns in a refined,
concise, and type-safe manner. Scala programs can be converted to bytecodes
and can run on the JVM (Java Virtual Machine). Scala is short for
“scalable language”, which emphasizes the focus of its initiators. It can also
target JavaScript runtimes. Scala is heavily influenced by Java and some other
programming languages. Scala offers many reasons why it is popular among
programmers:
Easy: Scala is a high-level language that is closer to other popular
programming languages like Java, C, or C++. So it becomes very easy for
anyone to learn Scala. For Java programmers, Scala is easier to learn.
Feature richness: Scala incorporates the features of several languages such as
C, C++, or Java, making the language more useful, scalable, and productive.
Integration with Java: The source code of Scala is designed in such a way
that the compiler can interpret the Java classes. Also, the associated compiler
can use the frameworks, Java libraries, and tools, etc. After compilation, Scala
programs can run on the Java Virtual Machine (JVM).
Web-based & desktop application development: For web applications, Scala
provides support by compiling to JavaScript; for desktop applications, it
can be compiled to JVM bytecode.
Used by large companies: Most of the largest technology companies use
Scala. The reason is that it is highly scalable and can be used in the backend.
Julia
Julia was created in 2009 and introduced to the public in 2012. Julia is intended
to address the shortcomings of Python and other scientific computing and data
processing languages and applications.
Julia is compiled and not interpreted. For faster runtime performance, Julia
is compiled just-in-time (JIT) using the LLVM compiler framework. At its best,
Julia can approach or even match the speed of C.
Julia is interactive. Julia contains a REPL (read-eval-print loop) or
interactive command line, similar to Python. Fast, one-time scripts and
commands can be entered and executed directly.
Julia has a simple syntax. Julia’s syntax is similar to Python’s.
Julia combines the advantages of dynamic typing and static typing. You
can specify types for variables, such as “unsigned 32-bit integer”. But you can
also create hierarchies of types to allow general cases for handling variables of
certain types - for example, to write a function that accepts integers without
specifying the length or sign of the integer.
Julia can call Python, C, and Fortran libraries. Julia can work directly with
external libraries written in C and Fortran. It is also possible to work with
Python code via the PyCall library and even exchange data between Python
and Julia.
Julia supports metaprogramming. Julia programs can generate other Julia
programs and even modify their code, in a way reminiscent of languages like
Lisp.
Julia has a full-featured debugger. Julia 1.1 introduced a debugging suite
that runs code in a local REPL and allows you to step through the results,
inspect variables, and add breakpoints in the code.
AI Frameworks
Open and free open source software for AI allows anyone to get on the AI
bandwagon without spending a lot of time and large resources building the
infrastructure. The term open-source software refers to a tool with a source code
that is available for free over the internet. For a company that has just launched
its first ML initiative, using open source tools can be a great way to practice data
science for free before opting for enterprise-level tools like Microsoft Azure or
Amazon Machine Learning. But the benefits of using open source tools do not
end at availability. Generally, such projects have a large community of
developers and data scientists interested in sharing datasets and pre-trained
models. For example, instead of building image recognition from scratch, one
can use classification models trained on ImageNet’s1 data or create their own
from these datasets. With open-source ML tools, one can also use transfer
learning, i.e., solve machine learning problems by applying knowledge gained
from working on a problem from a related or even distant domain. For example,
one can transfer some capacities from the model that has learned to recognize
cars to the model that aims to recognize trucks.
Depending on the task to work with, pre-trained models and open datasets
may not be as accurate as custom ones but they save a lot of effort and time and
do not require you to collect datasets yourself first. According to Andrew Ng,
former chief scientist at Baidu and professor at Stanford, the concept of reusing
open-source models and datasets will be the second biggest driver of commercial
ML success after supervised learning [3].
Among many active and less popular open-source tools, five are selected and
presented below.
Tensorflow
Originally developed by Google for internal use, TensorFlow [4] was released
under an Apache 2.0 open source license in 2015. The library continues to be
used by Google for a number of services, such as speech recognition, photo
search, and automatic replies for Gmail inboxes. Google’s reputation and the
dataflow graph paradigm used to create models have attracted a large number of
contributors to TensorFlow. This has led to public access with detailed
documentation and tutorials that provide an easy entry into the world of neural
network applications. TensorFlow is a Python tool for both deep neural network
exploration and complex mathematical computation, and can even support
reinforcement learning. TensorFlow’s uniqueness also lies in its dataflow graph
structures, which consist of nodes (mathematical operations) and edges
(numerical arrays or tensors).
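This graph idea can be illustrated with a toy evaluator in plain Python. This is a conceptual analogy only, not the TensorFlow API:

```python
# A toy dataflow graph: nodes are operations, edges carry values, and
# the graph is built first and only evaluated when a result is needed.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Evaluate all incoming edges, then apply this node's operation.
        return self.op(*(n.eval() for n in self.inputs))

class Const(Node):
    def __init__(self, value):
        self.value = value

    def eval(self):
        return self.value

# Build the graph for (a * b) + c first, evaluate it afterwards.
a, b, c = Const(2.0), Const(3.0), Const(1.0)
graph = Node(lambda x, y: x + y, Node(lambda x, y: x * y, a, b), c)
print(graph.eval())  # 7.0
```

Separating graph construction from execution is what lets a framework like TensorFlow optimize, parallelize, or differentiate the computation before running it.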
Datasets and models - The flexibility of TensorFlow is based on the ability
to use it for both research and recurrent machine learning tasks. Thus, one can
use the low-level API called TensorFlow Core. TensorFlow Core allows you to
gain full control over the models and train them with your dataset. However,
there are also public and official pre-trained models to develop higher-level APIs
based on TensorFlow Core. Some of the most popular models include MNIST,2 a
traditional dataset that helps identify handwritten digits on an image, or
Medicare Data, a dataset from Google that is used to predict charges for medical
services, among other things.
Audience - For someone looking to use machine learning for the first time,
the variety of features in TensorFlow can be a bit overwhelming. Some even
argue that the library does not flatten the machine learning curve but makes
it even steeper. TensorFlow is a low-level library that requires extensive code
writing skills and a good understanding of data science specifics to successfully
work with the product. Therefore, Tensorflow may not be the first choice if the
data science team is IT-centric: there are simpler alternatives for that, which we
will discuss.
Use cases - Given the complexity of TensorFlow, use cases mostly involve
solutions from large companies with access to machine learning specialists. For
example, UK online supermarket Ocado [5] used TensorFlow to prioritize emails
arriving at its customer center and improve demand forecasting. Global
insurance company Axa [6] also uses the library to predict major claims among
its customers.
Theano
Theano is a low-level scientific computing library based on Python that is used
for deep learning tasks related to defining, optimizing, and evaluating
mathematical expressions. Although it has impressive computational power,
users complain about an inaccessible interface and unhelpful error messages. For
these reasons, Theano is mainly used in combination with more user-friendly
wrappers such as Keras, Lasagne, and Blocks - three high-level frameworks for
rapid prototyping and model testing.
Datasets and models - There are public models for Theano but any
framework used beyond that also has many tutorials and pre-trained datasets to
choose from. Keras, for example, stores available models and detailed tutorials
in its documentation.
Audience - Using Lasagne or Keras as a high-level wrapper on top of Theano,
you in turn have access to a variety of tutorials and pre-trained datasets. In
addition, Keras is considered one of the easiest libraries to start with in the early
stages of deep learning exploration. Since TensorFlow was designed as a
replacement for Theano, a large part of its fanbase has left. But there are still
many advantages that many data scientists find compelling enough to work with
Theano. The simplicity and maturity of Theano alone are important points to
consider when making this decision.
Use cases - Theano is considered an industry standard for deep learning
research and development and was originally developed for implementing
state-of-the-art deep learning algorithms. However, since people are unlikely to use
Theano directly, its many uses expand as it is used as a foundation for other
libraries: digital and image recognition, object localization, and even chatbots.
Torch
Torch is often referred to as the easiest deep learning tool for beginners. It has a
simple scripting language, Lua, and a helpful community that offers an
impressive selection of tutorials and packages for almost any deep learning
purpose. Although the underlying language, Lua, is less common, Torch itself
is widely used - Facebook, Google, and Twitter are known to use it
in their AI projects.
Datasets and models - A list of popular datasets loaded for use in Torch can
be found on the GitHub cheatsheet page.3 In addition, Facebook has released
official code for implementing Deep Residual Networks (ResNets) with pre-
trained models with instructions for fine-tuning your datasets.4
Target audience - Regardless of the differences and similarities, the choice
will always depend on the underlying language, because experienced Lua
developers will always be scarcer than Python developers. However, Lua is
significantly easier to read, which is reflected in Torch's simple syntax.
Torch's active contributors swear by Lua, making it a framework of choice for
beginners and for those looking to expand their toolset.
Use cases - Facebook used Torch to create DeepText, a tool that categorizes
the text posts shared on the site minute by minute and enables more
personalized content targeting. Twitter was able to use Torch to recommend posts based on an
algorithmic timeline (instead of reverse chronological order).
Scikit-Learn
Scikit-learn is a framework designed for supervised and unsupervised machine
learning algorithms. As one of the components of the Python scientific
ecosystem, it builds on NumPy and SciPy libraries, each responsible for lower-
level data science tasks. While NumPy sits on top of Python and deals with
numerical computation, SciPy covers more specific numerical routines, such as
optimization and interpolation. Scikit-learn was developed specifically for
machine learning.
Datasets and models - The library already contains some standard datasets
for classification and regression. This is useful for beginners, although the
datasets are too small to represent real-world situations. However, the diabetes
dataset5 for measuring disease progression or the iris dataset for pattern
recognition are good for illustrating and learning the behavior of machine
learning algorithms in scikit. In addition, the library provides information on
loading datasets from external sources, includes example generators for tasks,
such as multiclass classification and decomposition, and provides
recommendations for using common datasets.
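The built-in datasets just mentioned can be loaded and used in a few lines. A minimal sketch, assuming scikit-learn is installed; the model and split choices here are arbitrary, not a recommendation:

```python
# Minimal sketch: load the built-in iris dataset and fit a classifier.
# The classifier and split parameters are arbitrary choices for illustration.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)       # fraction of correct predictions
```

Swapping in the diabetes dataset and a regressor works the same way, which is why these small datasets are useful for learning the behavior of the algorithms even though they do not represent real-world scale.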
Audience - Although Scikit-learn is a robust library, it emphasizes usability
and documentation. Given its simplicity and numerous well-described examples,
it is an accessible tool for non-experts to quickly apply machine learning
algorithms. According to testimonials from some software houses, Scikit-learn
is well suited for production settings characterized by limited time and human
resources.
Use cases - Scikit-learn has been used by a variety of successful tech
companies, such as Spotify, Evernote, or Booking.com for product
recommendations and customer service.
Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations, and
text. Jupyter Notebook is maintained by the staff of the Jupyter6 project. Jupyter
Notebooks are a spin-off project from the IPython project, which used to have an
IPython Notebook project itself. The name Jupyter derives from the core
programming languages the tool supports: Julia, Python, and R.
Jupyter ships with the IPython kernel, which can be used to write programs in
Python - but there are currently over 100 other kernels that can be used.
Originally developed for data science applications in Python, R, and Julia,
Jupyter Notebook is suitable for all kinds of projects in a variety of ways:
Data visualizations - Most people’s first exposure to Jupyter Notebook is
through a data visualization, a shared notebook that involves rendering a
dataset as a graph. Jupyter Notebook not only lets you create visualizations but
also share them and make interactive changes to the shared code and dataset.
Code sharing - Cloud services like GitHub and Pastebin offer ways to share
code but are largely non-interactive. With a Jupyter Notebook, you can view
code, execute it, and view the results directly in your web browser.
Live code interactions - Jupyter Notebook code is not static; it can be edited
and re-executed incrementally in real-time, with feedback directly in the
browser. Notebooks can also embed user controls (e.g., sliders or text entry
fields) that can be used as input points for code.
Documenting code examples - If you have a piece of code and want to
explain it line by line with live feedback all the way, you can embed it in a
Jupyter Notebook. The best part is that the code remains fully functional - you
can add interactivity along with the explanation and display and narrate at the
same time.
Code is usually not just code. In the field of business analytics in particular,
code is part of a thought process, a discussion, even an experiment. This is
true for data analytics above all, but also for almost any other application. With Jupyter, you
can create a “notation book” that shows the work: the code, the data, the results,
along with the explanations and reflections. Data means little if you cannot
turn it into insights, and that requires being able to explore, share, and discuss
it. Likewise, an analysis means little if others cannot explore and try out its
results. Jupyter is a tool
for exploring, sharing, and discussing. A notebook is easy to share. One can save
the notebook and send it as an attachment so someone else can open it with
Jupyter. One can upload the notebook to a GitHub repository and have others
read it there - GitHub automatically renders the notebook to a static web page.
GitHub users can download (clone) a copy of the notebook and any
supporting files so they can extend your work. You can view the results, change
the code, and see what happens.
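Under the hood, an .ipynb file is a plain JSON document, which is also why GitHub can render it as a static page. A minimal sketch of assembling one with only the standard library; the cell structure follows the nbformat 4 schema, and the cell contents are invented:

```python
import json

# A notebook is a JSON document: a list of cells plus format metadata
# (nbformat 4 schema). Markdown narrative and code live side by side.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Analysis notes\n", "Narrative next to the code."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],
         "source": ["print('hello from a shareable notebook')"]},
    ],
}

serialized = json.dumps(notebook, indent=1)
# Writing `serialized` to e.g. analysis.ipynb yields a file Jupyter can open.
```

Because the format is just JSON, any tool (version control, diff viewers, static renderers) can process a shared notebook without running Jupyter itself.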
Amazon AWS
Amazon Web Services (AWS) is a subsidiary of Amazon that provides on-
demand cloud computing platforms for individuals, businesses, and public
institutions on a subscription basis. The technology allows users to have virtual
computer clusters that are accessible over the internet. AWS virtual computers
emulate most of the attributes of a real computer, including hardware (CPU(s) &
GPU(s), main memory, hard drive/SSD storage); a choice of operating systems;
networking; and pre-installed application software, such as web servers,
databases, CRM, etc. Each AWS system also virtualizes its console I/O
(keyboard, display, and mouse) so that AWS subscribers can connect to their
AWS system using any web browser. The browser acts as a window to the
virtual computer, allowing subscribers to log in, configure, and use their virtual
systems just as they would with a real physical computer. It also provides a
variety of services, utilities, and features that can be used independently of an
entire server.
AWS technology is deployed on server farms around the world and operated
by the Amazon subsidiary. Fees are based on a combination of usage, customer-
selected hardware/OS/software/networking features, required availability,
redundancy, security, and service options. Subscribers can pay for a single AWS
virtual machine, a dedicated physical machine, or clusters of both. As part of the
subscription agreement,8 Amazon provides security for the subscribers’ system.
AWS operates from many global geographic regions, including 6 in North
America.9
In 2017, AWS included more than 90 services covering a broad spectrum,
including compute, storage, networking, databases, analytics, application
services, provisioning, management, mobile devices, developer tools, and tools
for IoT and blockchain. Among the most popular are Amazon Elastic Compute
Cloud (EC2) and Amazon Simple Storage Service (S3). Most services are not
directly accessible to end users but provide functionality via APIs for developers
to use in their applications.
The AWS services relevant for BA and AI comprise services at different
levels of abstraction and thus each with a different target group. The portfolio
has become so extensive that only the most relevant services are presented here,
with an attempt to provide an even more comprehensive overview at the end of
the chapter.
Amazon’s machine learning services are available in two ways: predictive
analytics with Amazon ML and the SageMaker tool for data scientists.
Amazon SageMaker - Build, train, and deploy custom machine learning models
Amazon Elastic Inference - Acceleration of deep learning inference
Amazon Forecast - Increase forecast accuracy using machine learning
Amazon Lex - Create voice and text chatbots
Amazon Personalize - Integrate real-time recommendations into existing applications
Amazon Polly - Turn text into natural-sounding speech
Amazon Rekognition - Analyze images and videos
Amazon SageMaker Ground Truth - Create accurate ML training datasets
Amazon Textract - Extract text and data from documents
Amazon Translate - Natural-sounding, fluent translations
Amazon Transcribe - Automatic speech recognition
AWS Deep Learning AMIs - Deep learning on Amazon EC2
AWS Deep Learning Containers - Docker images for deep learning
AWS DeepLens - Video camera enabled for deep learning
AWS DeepRacer - Autonomous 1:18-scale racing car steered by ML
AWS Inferentia - Machine learning inference chip
Apache MXNet on AWS - Scalable, open-source deep learning framework
TensorFlow on AWS - Open-source machine intelligence library
PostgreSQL - This service provides the same core management and patching
features described for the other open-source databases, at about the same
price. Because of the availability of its add-on modules, PostgreSQL is
popular with developers who create geospatial, statistical, and machine
learning backends, as they can move existing application code into RDS with
little modification.
For example, you can use Amazon Forecast to create the following forecasts:
Demand for retail products, such as demand for products sold on a website or
in a particular store or location (for sales planning or even reordering
purposes).
Supply chain demand, including the number of raw materials, services, or
other inputs needed to produce the products.
Resource requirements, such as the number of call center agents, contract
workers, IT staff, and/or amount of energy needed to meet the demand.
Operational metrics, such as web traffic, AWS usage, or IoT sensor usage.
Key business figures, such as cash flow, revenue, profits, and expenses, by
region or service
Amazon Forecast is a fully managed service, so there are no servers to
provision and no machine learning models to create, train, or deploy. Users only
pay for what they use, and there are no minimum fees or upfront commitments.
Amazon Forecast greatly simplifies the creation of machine learning models.
In addition to providing a set of predefined algorithms, Forecast provides an
AutoML option for model training. AutoML automates complex machine
learning tasks, such as algorithm selection, hyperparameter setting, iterative
modeling, and model evaluation. Developers without machine learning
experience can import training data into one or more Amazon Forecast datasets,
train predictors, and generate forecasts using the Amazon Forecast APIs, the
AWS Command Line Interface (AWS CLI), or the Amazon Forecast console.
Amazon Forecast offers the following additional benefits over homegrown
models:
Accuracy - Amazon Forecast uses deep neural networks and traditional
statistical methods for forecasting. Given many related time series, forecasts
made using Amazon Forecast's deep learning algorithms, such as DeepAR+13
and NPTS,14 are typically more accurate than forecasts made using traditional
methods, such as exponential smoothing.
Ease of use - The Amazon Forecast console can be used to look up and
visualize forecasts for any time series at different granularities; accuracy
metrics for the forecasts can also be viewed there.
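Exponential smoothing, the traditional baseline mentioned above, is simple enough to sketch in a few lines of plain Python; managed services like Amazon Forecast aim to beat exactly this kind of method with deep learning. The demand numbers below are invented:

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each smoothed value is a weighted
    blend of the latest observation and the previous smoothed value."""
    smoothed = [series[0]]  # initialize with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Invented weekly demand figures for one product.
demand = [100, 120, 110, 130, 125]
forecast = exponential_smoothing(demand, alpha=0.5)
# The last smoothed value serves as the one-step-ahead forecast.
next_step = forecast[-1]
```

The smoothing factor alpha controls how strongly recent observations dominate; methods like DeepAR+ go further by learning such dynamics jointly across many related time series.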
Amazon Personalize (Analytics Service)
Amazon Personalize15 (see Fig. 3.4 for an overview) is a machine learning
service that makes it easy for developers to use personalization in applications
and provide customized recommendations to customers. It reflects the vast
experience Amazon has in building personalization systems. For example,
Amazon Personalize can be used in a variety of scenarios, including
recommendations for users based on their preferences and behaviors,
personalized re-ranking of results, and personalized content for emails and
notifications. Amazon Personalize does not require extensive machine learning
experience. Pre-defined solution variants (a trained Amazon Personalize
recommendation model) can be created, trained, and deployed using the AWS
Console or programmatically using the AWS SDK. All the developer needs to do
is the following:
1. Format input data and upload it to an Amazon S3 bucket, or send real-time event data;
2. select a training recipe (algorithm) to be applied to the data;
3. train a solution variant;
4. provision the solution via an interface;
5. integrate it into existing applications.
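Step 1 above can be sketched with the standard library alone. Amazon Personalize expects interaction data as a CSV with USER_ID, ITEM_ID, and TIMESTAMP columns (per the Personalize interactions dataset schema); the user and item values below are invented:

```python
import csv
import io

# Sketch of step 1: format raw interaction events as the CSV that an
# Amazon Personalize interactions dataset expects. The required columns
# are USER_ID, ITEM_ID, and TIMESTAMP (Unix epoch seconds).
events = [
    ("user_1", "item_42", 1672531200),
    ("user_1", "item_7", 1672534800),
    ("user_2", "item_42", 1672538400),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["USER_ID", "ITEM_ID", "TIMESTAMP"])
writer.writerows(events)
interactions_csv = buffer.getvalue()
# This string would then be uploaded to an S3 bucket (step 1) before
# selecting a recipe and training a solution variant (steps 2 and 3).
```

The subsequent steps (recipe selection, training, provisioning) are carried out through the AWS Console or SDK, as described above.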
Amazon Personalize can capture live user events to provide real-time
personalization. Amazon Personalize can combine real-time user activity data
with existing user profiles and item information to recommend the most relevant
items based on the user’s current session and activity.
Google Hub (Alpha) - Find, share, and deploy AI components in the Google Cloud
Google Cloud AutoML (Beta) - Easily train high-quality custom ML models
Google Cloud TPU - Train and run ML models faster than ever before
Google Cloud Machine Learning Engine - Create first-class models and make them available in production
Google Cloud Talent Solution - Hiring new employees with AI support
Google Dialogflow Enterprise Edition - Implement dialog-oriented communication across devices and platforms
Google Cloud Natural Language - Extract data from unstructured text
Google Cloud Speech-to-Text - ML-assisted conversion of speech to text
Google Cloud Text-to-Speech - ML-assisted conversion of text to speech
Google Cloud Translation - Dynamic translation between languages
Google Cloud Vision - Extract information from images using machine learning
Google Cloud Video Intelligence - Extract metadata from videos
Google Cloud Inference API (Alpha) - Quickly run large correlations in typed time-series datasets
Google Firebase Predictions (Beta) - Intelligently segment users based on predicted behavior
Google Cloud Deep Learning VM Image - Preconfigured VMs for deep learning applications
IBM Watson
IBM offers a single machine learning platform for both experienced data
scientists and those new to the industry. Technically, the system offers two
approaches: an automated and a manual implementation (with the latter intended
for proven experts). Similar to the outdated Google Prediction API or Amazon
ML, IBM’s Watson Studio has a model builder reminiscent of a fully automated
data processing and model creation interface that requires little to no training to
get started with data processing, model creation, and deployment to production.
The automated part can solve three main types of tasks: binary classification,
multiclass classification, and regression. One can either choose a fully automated
approach or manually select the ML method to use. Currently, IBM has ten
methods to cover these three groups of tasks:
Logistic regression
Decision tree classification
Random forest classification
Gradient-boosted tree classification
Naive Bayes classification
Linear regression
Decision tree regression
Random forest regression
Gradient-boosted tree regression
Isotonic regression
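Most of the ten methods listed have direct open-source counterparts, which is useful for benchmarking the automated results. A minimal sketch of one of them, gradient-boosted tree classification, using scikit-learn on synthetic data; the library choice is ours for illustration, not IBM's:

```python
# Sketch of one of the ten listed methods (gradient-boosted trees) on a
# synthetic binary classification task standing in for real business data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)   # held-out classification accuracy
```

Binary classification, as here, is one of the three task types the automated part of Watson Studio covers; multiclass classification and regression follow the same pattern with different estimators.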
Separately, IBM provides a deep neural network training workflow with a
flow editor interface similar to the one used in Azure ML Studio. If you are
looking for more advanced features, IBM ML offers Jupyter-style notebooks
for manually programming models using popular frameworks like TensorFlow,
Scikit-learn, PyTorch, and others.
Microsoft Azure
When it comes to machine learning as a service (MLaaS) platforms, Microsoft’s
Azure also seems to have a versatile toolset in the MLaaS market. It covers the
majority of ML-related tasks, offers two different products for building custom
models, and has a solid set of APIs for those who do not want to attack data
science with their bare hands. Microsoft Azure, formerly known as Windows
Azure, is Microsoft’s public cloud computing platform. It offers a range of cloud
services, including services for computing, analytics, storage, and networking. In
addition to traditional cloud offerings, such as virtual machines, object storage,
and content delivery networks (CDNs), Azure also offers services based on
proprietary Microsoft technologies. Azure also offers cloud-hosted versions of
popular Microsoft enterprise solutions, such as Active Directory and SQL
Server.
Azure Cosmos DB - Globally distributed database with support for multiple data models at any scale
Azure SQL Database - Managed relational SQL database as a database-as-a-service (DBaaS) solution
Azure Database for MySQL - Managed MySQL database service for app developers
Azure Database for PostgreSQL - Managed PostgreSQL database service for app developers
Azure Database for MariaDB - Managed MariaDB database service for app developers
SQL Server on Virtual Machines - Host SQL Server enterprise applications in the cloud
Azure Database Migration Service - Easier migration of on-premises databases to the cloud
Azure Cache for Redis - High application performance thanks to high throughput and low-latency data access
SQL Server Stretch Database - Dynamically stretch on-premises SQL Server databases to Azure
Microsoft Azure also offers a range of data services, some of which overlap in
functionality.
Azure Cosmos DB
Azure Cosmos DB is a cloud database that supports multiple ways to store and
process data; as such, it is classified as a multi-model database. In multi-model
databases, different database modules are natively supported and made
accessible through common APIs. Thus, one is not limited to a single data
model, as is the case with dedicated graph, key-value, or document stores, for
example.
Azure Cosmos DB grew in part out of Microsoft Research’s work to improve
data development methods for large-scale applications. This work began in 2010
as “Project Florence”18 and was commercialized by Microsoft in 2015 with the
release of Azure DocumentDB. Azure Cosmos DB, which became generally
available in May 2017, is the next generation of that document-oriented
database and moves beyond a purely document-oriented NoSQL data model:
Cosmos DB also supports key-value, graph, and geospatial data, among others.
Azure Cosmos DB is available only as a cloud service and features
support for global data distribution, i.e., data partitioning across multiple Azure
cloud regions or zones (meaning geographically separated Microsoft data
centers). Azure Cosmos DB uses containers called “collections” to store data.
Without explicit programming, Azure Cosmos DB’s global distribution
paradigm places data closer to users’ physical locations. The database also
provides an advanced, tunable consistency model that aims to ease the difficult
tradeoffs between throughput, latency, and consistency that data architects
otherwise have to make in distributed systems, as described in the oft-cited
CAP theorem.19
As with document DB, Azure Cosmos DB allows data developers to work
with flexible data schemas that are easier to create and update than the more
common relational schemas.
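The schema flexibility described here can be illustrated with plain JSON documents: two items in the same logical collection need not share the same fields. A small sketch; the document contents are invented:

```python
import json

# Two documents in the same logical collection. The second adds a nested
# field the first lacks -- no schema migration is needed, in contrast to a
# relational ALTER TABLE. (Document contents are invented.)
products = [
    {"id": "p1", "name": "T-shirt", "price": 9.99},
    {"id": "p2", "name": "Dress", "price": 49.99,
     "attributes": {"sleeve": "none", "pattern": "stripes"}},
]

collection = json.dumps(products)   # serialize the whole collection
restored = json.loads(collection)   # both shapes survive the round trip
```

This is the sense in which flexible schemas are "easier to create and update": new fields can simply appear on new documents as the application evolves.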
In addition to data and ML services, Azure offers a wide range of other services.
SAP Cloud Platform can also be integrated with other applications and can
be used by non-SAP customers. SAP Cloud Platform can be deployed on any of
the three major public cloud infrastructure providers: Amazon Web Services
(AWS), Microsoft Azure, and Google.
Fig. 3.6 Overview of the integration with the SAP Data Hub
SCP also structures its ML service offerings into four categories. PAL (the
Predictive Analysis Library) is special because it is included directly in the
SAP HANA database platform.
Build or Buy?
From a vendor perspective, the aforementioned managed ML services are
positioned for organizations that are in the process of building their data science
teams or whose teams are primarily comprised of data analysts, BI specialists, or
software engineers (who may be transitioning to data science). However, even
small to mid-sized data science teams can gain value from evaluating these
machine learning as a service (MLaaS) and data storage as a service (DSaaS)
offerings. Because these providers can produce and access so much data, they
can build and train their machine learning models in-house - these pre-built
models can quickly provide higher performance. It also makes sense to use these
MLaaS offerings as a basis for comparison with in-house models.
For example, let us say that the BA team is tasked with automating the
labeling of fashion products in the online store. This essentially means that a
product, such as a dress, is used as input to automatically and accurately
determine attributes such as sleeve length (none), neckline (scoop neck),
length (mini dress), pattern (stripes), color (green, yellow, red), etc. These
attributes are then used to either personalize the marketing campaign or improve
the search function.
Using machine learning - specifically computer vision - to handle this
product attribute classification process makes sense because of the (assumed)
large, diverse product catalog. Access to these attributes helps the company with
some important initiatives:
Personalization around content and product recommendations
Improving discoverability and search in the user experience
Forecasting/planning for inventory management
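A rough sketch of such an attribute classifier, with random vectors standing in for image embeddings from a computer vision model and several attributes predicted at once. Everything here (features, labels, dimensions) is synthetic and illustrative, not the company's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Random vectors stand in for image embeddings a vision model would
# produce; each product gets several binary attributes predicted at once
# (e.g., sleeveless? striped? mini-length?). All data is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))      # 200 products, 16-dim "embeddings"
y = (X[:, :3] > 0).astype(int)      # 3 toy binary attributes

clf = MultiOutputClassifier(LogisticRegression())
clf.fit(X, y)
predictions = clf.predict(X)        # shape (200, 3): one column per attribute
```

The predicted attribute columns could then feed the initiatives above: personalization, search facets, and demand planning per attribute.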
The following presents a framework specifically for evaluating MLaaS
products. Note how the attributes of transparency, ease of use, flexibility,
cost, and performance span every part of the workflow. The
evaluation framework and its questions help understand whether using an
MLaaS product is the right approach for the task and the team - from the data
that is injected into the MLaaS offering to the actual results of the analytics
applications. To prevent these systems from becoming expensive black boxes,
one should also ask how the evaluation is actually done and whether the
results can be updated and consumed automatically by downstream operational
systems (ERP, CRM, etc.).
The evaluation framework is based on the business analytics model for
artificial intelligence (BAM.AI) procedure model presented above and is divided
into the two main topics “Development” (section “Development Cycle”) and
“Deployment” (section “Deployment Cycle”) but extends these to include the
decision perspective of data management in section “Data Management”. This
covers all relevant aspects. For all three main topics, the points of cost,
performance (capability), transparency and traceability, usability, and flexibility
must be evaluated (see Table 3.13).
Table 3.13 Evaluation framework under BAM.AI
References
1. Dixon, J.: Pentaho, Hadoop, and Data Lakes. https://jamesdixon.wordpress.com/2010/10/14/pentaho-
hadoop-and-data-lakes/ (2010). Accessed on 1 Mar 2016
2. Cafarella, M., Lorica, B., Cutting, D.: The Next 10 Years of Apache Hadoop. https://www.oreilly.com/
ideas/the-next-10-years-of-apache-hadoop (2016). Accessed on 18 Mar 2019
3. Ruder, S.: Highlights of NIPS 2016: Adversarial Learning, Meta-Learning, and More. http://ruder.io/
highlights-nips-2016/index.html#thenutsandboltsofmachinelearning (2016). Accessed on 23 May 2019
4. Pattanayak, S., John, S.: Pro Deep Learning with TensorFlow. Springer, Berlin (2017)
5. Google: Ocado: Delivering Big Results by Learning from Big Data. https://cloud.google.com/
customers/ocado/ (2018). Accessed on 12 Feb 2019
6. Sato, K.: Using Machine Learning for Insurance Pricing Optimization. https://cloud.google.com/blog/
products/gcp/using-machine-learning-for-insurance-pricing-optimization (2017). Accessed on 1 Feb
2019
7. Google Inc.: Introducing Google App Engine + Our New Blog. http://googleappengine.blogspot.com/
2008/04/introducing-google-app-engine-our-new.html (2008). Accessed on 22 Jan 2018
8. Brewer, E.: Towards robust distributed systems. 19th ACM Symposium on Principles of Distributed
Computing (PODC). Invited talk (2000)
9. Brewer, E.: CAP twelve years later: how the "rules" have changed. Computer. 45(2), 23–29 (2012)
10. Hartz, M.: What Is SAP Data Hub? And Answers to Other Frequently Asked Questions – SAP HANA
(2017)
11. Plattner, H., Leukert, B.: The in-Memory Revolution: how SAP HANA Enables Business of the Future.
Springer, Berlin (2015)
12. Prassol, P.: In-memory-platform SAP HANA als big data-Anwendungsplattform. In: Fasel, D., Meier,
A. (eds.) Big Data: Grundlagen, Systeme und Nutzungspotenziale, pp. 195–209. Springer, Wiesbaden
(2016)
13. Prassol, P.: SAP HANA als Anwendungsplattform für Real-Time Business. HMD Praxis der
Wirtschaftsinformatik. 52(3), 358–372 (2015)
Footnotes
1 ImageNet is a free image database which is used for research projects. Each image is additionally
assigned to a noun. See: http://www.image-net.org
2 See: http://yann.lecun.com/exdb/mnist/.
3 See: https://github.com/torch/torch7/wiki/Cheatsheet
4 See: https://github.com/facebook/fb.resnet.torch
5 See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
6 See: https://jupyter.org
8 https://aws.amazon.com/de/agreement/
9 https://aws.amazon.com/de/about-aws/global-infrastructure/
10 See: https://aws.amazon.com/de/s3/faqs/.
11 See: https://aws.amazon.com/de/snowball/.
12 More here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
15 See: https://aws.amazon.com/de/personalize/.
18 See: https://azure.microsoft.com/de-de/blog/dear-documentdb-customers-welcome-to-azure-cosmos-
db/.
19 The theorem states that networked and distributed systems can only guarantee two of the following
three properties: consistency, availability, and partition tolerance. See also [8, 9].
20 See: https://blogs.sap.com/2012/06/01/sap-netweaver-cloud-the-road-forward/.
The level of detail of segmentation can range from classic segment marketing to
niche marketing to individual marketing. The recommended approach for
segmenting markets is often described in the literature with the model
“Segmenting, Targeting, Positioning” (STP) [33]. This approach divides the
process of market segmentation into three main steps that must be performed in
chronological order. The first step is the actual segmentation, i.e. the division of
the overall market into individual segments by using appropriate segmentation
variables. These segments optimally represent buyer groups that are as clearly
distinguishable as possible, each of which is to be addressed with a range of
services or a marketing mix that is specially tailored to them. To assess the
opportunities in each submarket, a supplier must now evaluate the attractiveness
of the segments and, based on this evaluation, determine which segments to
serve. The third step is to develop a positioning concept for each target market to
establish a sustainable, competitive position that will be signaled to the targeted
consumers. Although segmentation in service markets and retail markets is
sometimes treated separately, there are hardly any segmentation criteria or
approaches specific to these two areas. Rather, researchers assume that the
concepts developed for B2C physical goods markets can also be used for
consumer services and retail [33]. Several advantages can be derived from
market segmentation. The most important advantage is that decision-makers can
target a smaller market with higher precision. In this way, resources can be used
more wisely and efficiently. In addition, market segmentation leads to closer
relationships between customers and the company. Moreover, the results of
market segmentation can be used by decision-makers to determine the
appropriate competitive strategy (i.e., differentiation, low-cost, or focus
strategy).
Project Structure
In light of the work outlined above, the following research questions are
proposed:
(Objective 1) Is it possible to collect enough data from the web to build
store clusters? It is not clear whether it is possible to rely on freely available
data from the web to build store clusters for the case study company. Large
market research companies have business models based on the fact that retail
companies currently rely solely on their data to make such decisions. We want
to investigate whether this dependency is justified or whether more
independence and thus potential cost savings and competitive advantage are
possible.
(Objective 2) Can the neural gas algorithm generate suitable store
clusters with high intraclass and low interclass similarity? As
mentioned earlier, many algorithms and methods for clustering already exist
in the literature. Our resulting clusters need to be tested against standard
cluster quality measures to assess the utility of the proposed method. The
ability to adapt to changing data inputs and be robust to noise is not enough to
justify the use of the algorithm in practice.
(Objective 3) Are the resulting store clusters suitable for segmenting
marketing activities and thus for use as a basis for marketing automation
in stationary retail? The resulting clusters must be tested not only for
mathematical-theoretical evaluation but also for practical applicability and
usefulness within current marketing practice.
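To make Objective 2 concrete, the core of the neural gas algorithm (the rank-based update of Martinetz and Schulten) fits in a few lines of NumPy. This sketch omits the annealing of the learning rate and neighborhood range that a full implementation would use, and the two-blob toy data merely stands in for store feature vectors:

```python
import numpy as np

def neural_gas(data, n_units=2, epochs=30, eps=0.3, lam=0.3, seed=0):
    """Minimal neural gas sketch: for each presented sample, every unit
    moves toward it, weighted by exp(-rank / lambda) of its distance rank
    (rank 0 = closest). No annealing of eps/lam, unlike a full version."""
    rng = np.random.default_rng(seed)
    units = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    for _ in range(epochs):
        for x in rng.permutation(data):            # shuffle each epoch
            dists = np.linalg.norm(units - x, axis=1)
            ranks = np.argsort(np.argsort(dists))  # 0 = closest unit
            units += eps * np.exp(-ranks / lam)[:, None] * (x - units)
    return units

# Toy data: two well-separated blobs standing in for store feature vectors.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (20, 2)),
                  rng.normal(5, 0.1, (20, 2))])
units = neural_gas(data, n_units=2)   # one unit should settle near each blob
```

The resulting units act as cluster prototypes; the intraclass/interclass similarity measures mentioned in Objective 2 can then be computed from the distances of the stores to their nearest unit.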
The following gives an overview of the data sources used. A distinction can be
made here between primary and secondary statistical information.
The use of secondary statistical information sources usually has the advantage
that the data is already available, since it was originally collected for a purpose
other than pricing. The collection of primary statistical information, on the
other hand, requires separate processes to be set up or purchased as an
external service.
Directly available data comes from multiple sources that provide processed
and cleaned data that is readily available and usually identifiable by geo-
coordinates or name tags (cities or other geographic markers). The standard
source in current retail marketing practice is to obtain data from large marketing
and market research firms, such as The Nielsen Company. By default, data is
only available at the 5-digit zip code area and municipality level. The internal
ERP system already provides the locations and rudimentary information about
the operational stores. This includes name, location, type, sales data, and
assortment. The German federal statistical office provides a collection of
information gathered from the most recent census in Germany, which took
place in 2011. Each resident is assigned to an address and thus to a grid cell with a side
length of 1 km. Members of the armed forces, police, and foreign service
working abroad and their dependents are not taken into account in this
evaluation. Foreigners are persons with foreign citizenship. In the classification
of nationality, a distinction is made between persons with German and foreign
nationality. Persons with German nationality are considered Germans, regardless
of the existence of other nationalities. The sex of each person was recorded as
‘male’ or ‘female’ in the 2011 census. No further specifications are planned, as
this corresponds to the information provided by the population registration
offices on May 9, 2011. The average age of the population (in years) is the ratio
of the sum of the ages in years of all residents to the total population per
grid cell. The average household size is the ratio of the number of all persons
living in private households to the total number of private households per km2
and is based on data from the 2011 census. Second homes are also taken into
account in this context. A private household consists of at least one person. This
is based on the “concept of shared living”. All persons living together in the
same dwelling, regardless of their residential status (main/secondary residence),
are considered members of the same private household, so that there is one
private household per occupied dwelling. Persons in collective and institutional
accommodation are not included here but only persons with a self-managed
household. The vacancy rate (dwellings) is the ratio of vacant dwellings to all
occupied and vacant dwellings per grid cell, expressed as a percentage. It does
not include the following dwellings: vacation and recreational dwellings,
diplomatic/foreign armed forces dwellings, and commercial dwellings. The
calculation is made for dwellings in residential buildings (excluding hostels) and
is based on data from the last census in Germany (2011). The living area is the
floor area of the entire apartment in m2. The apartment also includes rooms
outside the actual dwelling enclosure (e.g. attics) as well as basement and attic
rooms developed for residential purposes. For the purpose of determining the usable
floor area, the rooms are counted as follows:
Full: the floor areas of rooms/parts of rooms with a clear height of at least 2
meters.
Half: the floor areas of rooms/parts of rooms with a clear height of at least 1
meter but less than 2 meters, as well as unheatable conservatories, swimming
pools, and similar rooms closed on all sides.
A quarter as a rule, but not more than half: the areas of balconies, loggias,
roof gardens, and terraces.
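These counting rules translate directly into a small helper function; a sketch (the room encoding and the fixed quarter weighting for outdoor areas are modeling assumptions):

```python
def usable_floor_area(rooms):
    """Usable floor area in m2 according to the counting rules above.

    Each room is (area_m2, clear_height_m, kind), where kind is
    'indoor', 'unheated' (conservatories, pools, ...) or
    'outdoor' (balconies, loggias, roof gardens, terraces).
    """
    total = 0.0
    for area, height, kind in rooms:
        if kind == 'outdoor':
            total += area * 0.25  # normally a quarter, at most half
        elif kind == 'unheated' or 1.0 <= height < 2.0:
            total += area * 0.5   # counted half
        elif height >= 2.0:
            total += area         # counted in full
        # parts below 1 m clear height are not counted
    return total

# 20 m2 living room (2.5 m), 10 m2 attic slope (1.5 m), 8 m2 balcony
print(usable_floor_area([(20, 2.5, 'indoor'), (10, 1.5, 'indoor'),
                         (8, 0, 'outdoor')]))  # -> 27.0
```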
Average living space per inhabitant is the ratio of the total living space of
occupied dwellings in m2 to the total number of persons in occupied dwellings
per grid cell and is based on data from the 2011 census, excluding
diplomatic/foreign armed forces housing, holiday and leisure housing, and
commercial housing. The calculation is made for dwellings in residential
buildings (excluding hostels). The median income of a society or group is the
income level at which the number of lower-income households (or individuals)
equals the number of higher-income households. The ratings of mostly
anonymous web users are collected via the APIs of Twitter, Google Places, and
Facebook. Here, a numerical rating and a textual review of each place are
available. It is important to note that these reviews are pre-selected and censored,
as all major platforms use automatic spam detection methods to delete reviews
that are likely to be spam. Based on the collected review data, the sentiments are
calculated and stored on an individual and aggregated level for each place. From
various event and news platforms, such as Eventbrite, Eventim, or local
newspaper announcements, information about local events, with their dates,
locations, and categories, is scraped and matched to the grid map.
The same process is also provided for the competitive data. This includes
local shops, restaurants, and other types of business. In this process, all the
information available on the various platforms is combined into a single data set.
The datasets are then matched against the list of stores within geographic
distance. From a number of different German real estate platforms, such as
immowelt or immobilienscout24, the average rental price and the property
quality are determined for each grid cell. The street names within the grid cell
are used to calculate an overall mean value for these two key figures.
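Matching scraped records to stores "within geographic distance" is typically done with the haversine formula; a minimal sketch (the 2 km radius and the sample coordinates are illustrative assumptions):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def match_to_stores(records, stores, radius_km=2.0):
    """Attach each scraped record to every store within radius_km."""
    matches = {name: [] for name, _, _ in stores}
    for rec_name, rlat, rlon in records:
        for store_name, slat, slon in stores:
            if haversine_km(rlat, rlon, slat, slon) <= radius_km:
                matches[store_name].append(rec_name)
    return matches

stores = [("store_a", 51.4556, 7.0116)]        # example city-centre store
records = [("competitor_x", 51.4510, 7.0130),  # ~0.5 km away
           ("competitor_y", 51.5136, 7.4653)]  # ~30 km away
print(match_to_stores(records, stores))  # -> {'store_a': ['competitor_x']}
```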
The grid. All these data were collected and then incorporated into the core
concept of a raster-based map of Germany at the granularity level of 1 km2 (see
Fig. 4.6 for a visualization).
Fig. 4.6 shows the visualization of a random sample area. Each of
the fields in this grid contains the following data: grid ID, geographic location
(latitude and longitude), population, number of foreigners, gender percentage,
average age, household size, vacancy rate, living area, average living area,
average household income, average rent, and property quality. The stores,
competitors, and events are then grouped by their location into a single grid ID.
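Snapping a record to its grid cell can be approximated by scaling coordinates to kilometers and truncating; a simplified sketch (the official census grid uses an equal-area projection, so the flat-earth approximation here is an assumption for illustration only):

```python
from collections import defaultdict
from math import cos, floor, radians

KM_PER_DEG_LAT = 111.32  # approximate kilometers per degree of latitude

def grid_id(lat, lon, cell_km=1.0):
    """Snap a coordinate to a cell_km x cell_km raster cell
    (flat-earth approximation of the 1 km census grid)."""
    km_per_deg_lon = KM_PER_DEG_LAT * cos(radians(lat))
    row = floor(lat * KM_PER_DEG_LAT / cell_km)
    col = floor(lon * km_per_deg_lon / cell_km)
    return f"1km_{row}_{col}"

# Group stores, competitors, and events by their grid ID
cells = defaultdict(list)
for name, lat, lon in [("store_a", 51.4556, 7.0116),
                       ("event_1", 51.4557, 7.0117)]:
    cells[grid_id(lat, lon)].append(name)
print(dict(cells))  # both points fall into the same 1 km cell
```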
Implementation
In order to collect data from a wide variety of websites, several scraping bots
were implemented (the overall architecture is outlined in Fig. 4.7). Web services
are the de facto standard for integrating data and services. However, there are
data integration scenarios that web services cannot fully address. Some internet
databases and tools do not support web services, and existing web services do
not meet user data requirements. As a result, web data scraping, one of the oldest
web content extraction techniques, is still able to provide a valuable service for a
variety of applications ranging from simple extraction automata to online meta-
servers. Since much of the data needed in this project was not directly available,
simply because there is no API, most of the data were collected through scraping
bots (independent computer programs that collect data from a specific website or a
set of similar websites). For each data source, a single bot is set up in the Python
programming language that automatically searches for and
extracts the required data. In the next step, the data is cleaned and validated.
Then, the data is exported to the central SAP HANA in-memory database. The
same procedure applies to the data sources for which a publicly available API
exists. All bots are set up to check their sources regularly for possible updates and data
changes. With the native streaming server, this process could also be set up in
real-time as events (data) come into the system [51]. Due to the nature of the
relatively stable dataset used here, we decided against this idea for the first
prototype.
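A scraping bot of this kind boils down to fetching a page and extracting the fields of interest from its markup; a minimal sketch using only the Python standard library (the tag and class names are invented placeholders, since each real bot is tailored to its source site):

```python
from html.parser import HTMLParser

class RatingBot(HTMLParser):
    """Minimal scraping bot: extracts numerical ratings from markup like
    <span class="rating">4.5</span>. The tag/class names are illustrative;
    a real bot is adapted to the structure of each source website."""

    def __init__(self):
        super().__init__()
        self._in_rating = False
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        # Flag is set only while inside a rating element
        self._in_rating = tag == "span" and ("class", "rating") in attrs

    def handle_data(self, data):
        if self._in_rating:
            self.ratings.append(float(data))
            self._in_rating = False

# In production the page would be fetched over HTTP; here it is inlined
page = '<div><span class="rating">4.5</span><span class="rating">3.0</span></div>'
bot = RatingBot()
bot.feed(page)
print(bot.ratings)  # -> [4.5, 3.0]
```

The cleaned values would then be validated and exported to the central database, as described above.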
Based on the available data from the central database, the growing neural gas
network is built. In this phase, newly available data is added to the existing
network and the clusters are re-evaluated.
The central database serves not only as a data repository but also for
sentiment analysis of the collected review data sets.
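The sentiment step can be illustrated with a simple lexicon-based scorer that stores scores on the individual and the aggregated level; the word lists and the plain averaging are assumptions for the sketch, not a method prescribed by the study:

```python
# Tiny illustrative sentiment lexicons (a real system would use a trained model)
POSITIVE = {"great", "good", "friendly", "clean"}
NEGATIVE = {"bad", "rude", "dirty", "slow"}

def sentiment(text):
    """Crude lexicon score in [-1, 1]: (pos - neg) / matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def aggregate(reviews_by_place):
    """Keep sentiment on the individual and the aggregated (per-place) level."""
    out = {}
    for place, reviews in reviews_by_place.items():
        scores = [sentiment(r) for r in reviews]
        out[place] = {"individual": scores,
                      "mean": sum(scores) / len(scores)}
    return out

result = aggregate({"store_a": ["great friendly staff",
                                "slow and rude service"]})
print(result["store_a"]["mean"])  # -> 0.0
```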
Growing neural gas algorithm and approach. According to [42], the GNG
algorithm is represented by the following pseudocode:
1. Start with two units a and b at random positions wa and wb in Rn. In
addition to the positions, connection and connection-age information must be
stored, and each neuron carries an error counter representing its cumulative
error.
2. Generate an input vector ξ according to P(ξ).
3. Find the nearest unit s1 and the second-nearest unit s2.
4. Increase the age of all connections emanating from s1 and add the squared
distance to the error counter of s1: Δerror(s1) = ||ws1 − ξ||².
5. Move s1 and its direct topological neighbors in the direction of ξ:
Δws1 = εb(ξ − ws1) and Δwn = εn(ξ − wn), where εb, εn ∈ (0, 1) are
parameters controlling the movement.
6. If s1 and s2 are connected, reset the age of their connection to zero;
otherwise, create a connection between them. Delete all connections with an age
higher than the predefined maximum age amax, as well as all neurons that are
left without connections.
7. Every λ-th iteration, locate the neuron q with the largest error counter and
its topological neighbor f with the largest error counter. Insert a new neuron r
between them with wr = (wq + wf)/2, create connections between r and q and
between r and f, and delete the original connection between q and f. Reduce the
error counters of q and f by multiplying them with α, and initialize the error
counter of the new neuron with the value of f.
8. Reduce the error counters of all neurons by multiplying them with δ, and
continue with step 2.
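The steps above can be condensed into a compact Python sketch (all parameter defaults are invented for the example, not tuned values from the prototype; the removal of isolated neurons is skipped here to keep node indices stable):

```python
import random
from math import dist

def gng(data, max_nodes=10, lam=25, eps_b=0.2, eps_n=0.006,
        a_max=50, alpha=0.5, delta=0.995, iters=2500, seed=0):
    """Didactic growing neural gas following the pseudocode (Fritzke [42])."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[rng.random() for _ in range(dim)] for _ in range(2)]  # step 1
    err = [0.0, 0.0]
    age = {(0, 1): 0}  # connection (i, j) with i < j -> age

    def neighbors(i):
        return [b if a == i else a for a, b in age if i in (a, b)]

    for t in range(1, iters + 1):
        xi = rng.choice(data)                                   # step 2
        s1, s2 = sorted(range(len(w)),
                        key=lambda i: dist(w[i], xi))[:2]       # step 3
        for e in age:                                           # step 4
            if s1 in e:
                age[e] += 1
        err[s1] += dist(w[s1], xi) ** 2
        w[s1] = [wi + eps_b * (x - wi) for wi, x in zip(w[s1], xi)]  # step 5
        for n in neighbors(s1):
            w[n] = [wi + eps_n * (x - wi) for wi, x in zip(w[n], xi)]
        age[tuple(sorted((s1, s2)))] = 0                        # step 6
        for e in [e for e, a in age.items() if a > a_max]:
            del age[e]
        if t % lam == 0 and len(w) < max_nodes:                 # step 7
            q = max(range(len(w)), key=lambda i: err[i])
            nb = neighbors(q)
            if nb:
                f = max(nb, key=lambda i: err[i])
                w.append([(a + b) / 2 for a, b in zip(w[q], w[f])])
                r = len(w) - 1
                del age[tuple(sorted((q, f)))]
                age[(q, r)] = age[(f, r)] = 0
                err[q] *= alpha
                err[f] *= alpha
                err.append(err[f])
        err = [e * delta for e in err]                          # step 8
    return w

# Two synthetic point clouds; the network grows beyond its initial two units
rng = random.Random(1)
data = [(rng.gauss(0, 0.2), rng.gauss(0, 0.2)) for _ in range(100)] + \
       [(rng.gauss(5, 0.2), rng.gauss(5, 0.2)) for _ in range(100)]
nodes = gng(data)
print(len(nodes))
```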
Results
Regarding the first objective question, “Is it possible to collect enough
data from the web to form store clusters?” (objective 1), a comparison with
related literature is useful. The majority of the literature on new retail store
location decisions makes a clear distinction between internal and external
(trade area) variables. In the present dataset, internal data are disregarded,
while the external dataset is more extensive than in much of the existing
literature [52, 53]. The less considered internal data can be added from internal
resources in an extension of the existing environment. This is where the
advantages of the GNG algorithm come into play, as it allows smooth
adaptation to the extended feature vectors.
To answer the second objective question, “Can the neural gas algorithm
produce suitable clusters that have high intraclass and low interclass similarity?”
(objective 2), a mathematical test is performed. The resulting clusters are
visualized in Fig. 4.8.
Fig. 4.8 Two-dimensional representation of the data set and the resulting clusters: the cumulative and
global error, the order and size of the network, and a higher-level visualization of the resulting clusters
A total of nine different clusters out of 3761 stores seems small at first glance,
but for retail marketing practice this is quite feasible, as most changes within the
marketing mix also require physical changes in the store (change of assortment
or placement of products).
Thus, in addition to meeting the mathematical quality criteria of the GNG, the
clusters are also usable in the stationary retail sector.
To answer the third objective question, “Are the resulting store clusters suitable
to segment marketing activities and thus to be used as a basis for marketing
automation in brick-and-mortar retail?” (objective 3), we follow a guideline
developed by Kesting and Rennhak [33].
Criteria for segmentation must meet certain conditions. The literature [54,
55] generally sets six requirements for them, which aim, among other things, to
ensure a meaningful market division (see Table 4.2).
Table 4.2 Segmentation requirements
As the majority of the selected variables are related to economic (micro and
macro) factors, they are linked to the future purchasing behavior of customers
within the segments. Since most of the factors come from research institutes or
the official German census office, the requirement is met that the factors must be
measurable and recordable with existing market research methods. The segments
to be targeted are both accessible and reachable, although it is undeniable that
targeting them requires additional implementation effort. Since the segments
have not yet been applied in practice in an empirical test, the assessment of
cost-effectiveness is still outstanding; empirical testing, including tests over a
longer period, remains subject to further investigation.
This case study presented an innovative prototype for the use of artificial
intelligence in a field that is not known for the widespread use of such
technologies [56]. The application of GNG to a core retail marketing task is
unique. In particular, the results show that the use of this algorithm can provide
useful results in a noisy and dynamic environment. Based on the resulting
clusters, a marketing automation approach is possible. Since these clusters
adapt dynamically to external and internal changes, marketing
activities can be adjusted on the fly. As a basis, we suggest starting with simple
A/B tests to change parameters within the marketing mix (price, product,
promotion, or location) and derive tailored marketing activities.
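Such a cluster-level A/B test can be evaluated with a standard two-proportion z-test; a minimal sketch (the conversion counts and the 5% significance level are illustrative assumptions):

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)        # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))          # standard normal CDF
    return z, 2 * (1 - phi)

# Stores of one cluster split into variant A (old promotion) and variant B
# (new promotion); the counts are invented for illustration
z, p = two_proportion_z(conv_a=120, n_a=1000, conv_b=160, n_b=1000)
print(p < 0.05)  # True: the variants differ at the 5% level
```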
References
1. Hallowell, R.: The relationships of customer satisfaction, customer loyalty, and profitability: an
empirical study. Int. J. Serv. Ind. Manag. 7(4), 27–42 (1996)
2. Homburg, C., Koschate, N., Hoyer, W.D.: Do satisfied customers really pay more? A study of the
relationship between customer satisfaction and willingness to pay. J. Mark. 69(2), 84–96 (2005)
3. Francioni, B., Savelli, E., Cioppi, M.: Store satisfaction and store loyalty: the moderating role of store
atmosphere. J. Retail. Consum. Serv. 43, 333–341 (2018)
4. Kumar, V., Anand, A., Song, H.: Future of retailer profitability: an organizing framework. J. Retail.
93(1), 96–119 (2017)
5. Anderson, E.W.: Customer satisfaction and price tolerance. Mark. Lett. 7(3), 265–274 (1996)
6. Renker, C., Maiwald, F.: Vorteilsstrategien des stationären Einzelhandels im Wettbewerb mit dem
Online-Handel. In: Binckebanck, L., Elste, R. (eds.) Digitalisierung im Vertrieb: Strategien zum
Einsatz neuer Technologien in Vertriebsorganisationen, pp. 85–104. Springer, Wiesbaden (2016)
8. IFH: Catch me if you can – Wie der stationäre Handel seine Kunden einfangen kann. https://www.
cisco.com/c/dam/m/digital/de_emear/1260500/IFH_Kurzstudie_EH_digital_Web.pdf (2017).
Accessed on 23 July 2018
10. Anders, G.: Inside Amazon’s Idea Machine: How Bezos Decodes Customers. https://www.forbes.com/
sites/georgeanders/2012/04/04/inside-amazon/#73807ee56199 (2012). Accessed on 20 May 2018
11. Constantinides, E., Romero, C.L., Boria, M.A.G.: Social media: a new frontier for retailers? Eur. Retail
Res. 22, 1–28 (2008)
12. Piotrowicz, W., Cuthbertson, R.: Introduction to the special issue information technology in retail:
toward omnichannel retailing. Int. J. Electron. Commer. 18(4), 5–16 (2014)
13. Evanschitzky, H., et al.: Consumer trial, continuous use, and economic benefits of a retail service
innovation: the case of the personal shopping assistant. J. Prod. Innov. Manag. 32(3), 459–475 (2015)
14. Oliver, R.L.: Effect of expectation and disconfirmation on postexposure product evaluations: an
alternative interpretation. J. Appl. Psychol. 62(4), 480 (1977)
16. Simon, A., et al.: Safety and usability evaluation of a web-based insulin self-titration system for
patients with type 2 diabetes mellitus. Artif. Intell. Med. 59(1), 23–31 (2013)
17. Fornell, C., et al.: The American customer satisfaction index: nature, purpose, and findings. J. Mark.
60(4), 7–18 (1996)
20. Woesner, I.: Retail Omnichannel Commerce – Model Company. https://www.brainbi.dev (2016).
Accessed on 1 July 2017
21. Plattner, H., Leukert, B.: The In-Memory Revolution: How SAP HANA Enables Business of the Future.
Springer, Berlin (2015)
22. Schütte, R., Vetter, T.: Analyse des Digitalisierungspotentials von Handelsunternehmen. In: Handel 4.0,
pp. 75–113. Springer, Berlin (2017)
23. Meffert, H., Burmann, C., Kirchgeorg, M.: Marketing: Grundlagen marktorientierter
Unternehmensführung. Konzepte – Instrumente – Praxisbeispiele, 12th edn., pp. 357–768. Springer
Fachmedien, Wiesbaden (2015)
24. Daurer, S., Molitor, D., Spann, M.: Digitalisierung und Konvergenz von Online- und Offline-Welt. Z.
Betriebswirtsch. 82(4), 3–23 (2012)
25. Weber, F., Schütte, R.: A domain-oriented analysis of the impact of machine learning – the case of
retailing. Big Data Cogn. Comput. 3(1), 11 (2019)
26. Kari, M., Weber, F., Schütte, R.: Datengetriebene Entscheidungsfindung aus strategischer und
operativer Perspektive im Handel. HMD Praxis der Wirtschaftsinformatik. Springer, Berlin (2019)
27. Schöler, K.: Das Marktgebiet im Einzelhandel: Determinanten, Erklärungsmodelle u.
Gestaltungsmöglichkeiten d. räumlichen Absatzes. Duncker & Humblot, Berlin (1981)
28. Schröder, H.: Handelsmarketing: Methoden und Instrumente im Einzelhandel, 1st edn. Redline
Wirtschaft, München (2002)
29. Wedel, M., Kamakura, W.A.: Market Segmentation: Conceptual and Methodological Foundations,
vol. 8. Springer Science & Business Media, New York (2012)
30. Doyle, P., Saunders, J.: Multiproduct advertising budgeting. Mark. Sci. 9(2), 97–113 (1990)
31. Smith, W.R.: Product differentiation and market segmentation as alternative marketing strategies. J.
Mark. 21(1), 3–8 (1956)
32. Weinstein, A.: Market Segmentation: Using Niche Marketing to Exploit New Markets. Probus
Publishing, Chicago (1987)
33. Kesting, T., Rennhak, C.: Marktsegmentierung in der deutschen Unternehmenspraxis. Springer,
Wiesbaden (2008)
34. Huang, J.-J., Tzeng, G.-H., Ong, C.-S.: Marketing segmentation using support vector clustering. Expert
Syst. Appl. 32(2), 313–317 (2007)
35. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, pp. 26–78. Morgan Kaufmann
Publishers Inc., San Francisco (2001)
36. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In:
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland
(1967)
37. Kohonen, T.: Self-Organization and Associative Memory, vol. 8. Springer Science & Business
Media, New York (2012)
38. Mitsyn, S., Ososkov, G.: The growing neural gas and clustering of large amounts of data. Opt. Mem.
Neural Netw. 20(4), 260–270 (2011)
39. Cottrell, M., et al.: Batch and median neural gas. Neural Netw. 19(6–7), 762–771 (2006)
40. Brescia, M., et al.: The detection of globular clusters in galaxies as a data mining problem. Mon. Not.
R. Astron. Soc. 421(2), 1155–1165 (2012)
41. Martinetz, T., Schulten, K.: A “Neural-Gas” Network Learns Topologies. MIT Press, Cambridge (1991)
42. Fritzke, B.: A growing neural gas network learns topologies. In: Advances in Neural Information
Processing Systems (1995)
43. Chaudhary, V., Ahlawat, A.K., Bhatia, R.: Growing neural networks using soft competitive learning.
Int. J. Comput. Appl. (0975–8887) 21(1) (2011)
44. Xinjian, Q., Guojian, C., Zheng, W.: An overview of some classical growing neural networks and new
developments. In: 2010 2nd International Conference on Education Technology and Computer (2010)
45. Angora, G., et al.: Neural gas based classification of globular clusters. In: International Conference on
Data Analytics and Management in Data Intensive Domains. Springer, Berlin (2017)
46. Ghesmoune, M., Lebbah, M., Azzag, H.: A new growing neural gas for clustering data streams. Neural
Netw. 78, 36–50 (2016)
47. Watson, H., Wixom, B.: The current state of business intelligence. Computer 40, 96–99 (2007)
48. Awadallah, A., Graham, D.: Hadoop and the Data Warehouse: When to Use Which. Copublished by
Cloudera, Inc. and Teradata Corporation, California (2011)
49. Hartmann, M.: Preismanagement im Einzelhandel, 1st edn. Dt. Univ.-Verl., Wiesbaden (2006)
50. Weber, F.: Preispolitik. In: Preispolitik im digitalen Zeitalter, pp. 1–12. Springer Gabler, Wiesbaden
(2020)
51. Weber, F.: Streaming analytics – real-time customer satisfaction in brick-and-mortar retailing. In:
Cybernetics and Automation Control Theory Methods in Intelligent Algorithms. Springer, Cham (2019)
52. Mendes, A.B., Themido, I.H.: Multi-outlet retail site location assessment. Int. Trans. Oper. Res. 11(1),
1–18 (2004)
53. Themido, I.H., Quintino, A., Leitão, J.: Modelling the retail sales of gasoline in a Portuguese
metropolitan area. Int. Trans. Oper. Res. 5(2), 89–102 (1998)
54. Meffert, H., Burmann, C., Kirchgeorg, M.: Marketing: Grundlagen marktorientierter
Unternehmensführung. Konzepte, Instrumente, Praxisbeispiele, 9th edn. Gabler, Wiesbaden (2000)
55. Freter, H.: Marktsegmentierung (Informationen für Marketing-Entscheidungen). DBW, Stuttgart (1983)
56. Weber, F., Schütte, R.: State-of-the-art and adoption of artificial intelligence in retailing. Digital Policy
Regul. Gov. 21(3), 264–279 (2019)