
IFS 306: KNOWLEDGE MANAGEMENT AND INFORMATION RETRIEVAL

CHAPTER ONE

KNOWLEDGE MANAGEMENT

What is knowledge?

Knowledge refers to information combined with experience, context, interpretation, and reflection.

It is a high-value form of information that is ready for application to decisions and
actions within organizations. Knowledge is increasingly being viewed as a
commodity or an intellectual asset, yet it possesses some contradictory characteristics
that are radically different from those of other valuable commodities. In today's rapidly
changing business environment and knowledge economy, the ability to manage
knowledge is becoming ever more crucial. The power of knowledge is increasingly
recognized as a new strategic tool in growing organizations.

Today, knowledge is considered a key resource for an organization. The creation
and diffusion of knowledge have become ever more important factors in
competitiveness.

Understanding Knowledge, Information and Data


Data = Unorganized facts
Information = Data + Context
Knowledge = Information + Judgment

Knowledge: know-how, understanding, experience, insight, intuition and
contextualized information
Information: contextualized, categorized, calculated and condensed data
Data: facts and figures which relay something specific, but which are not
organized in any way

Data, Information and Knowledge


Data
➢ Data represents unorganized and unrefined facts. Typically, data is static in
nature.
➢ It can correspond to a set of discrete facts about events. Data is a
precondition to information.
➢ An organization from time to time has to decide on the nature and amount
of data that is necessary for generating the required information.
Information
➢ Information can be viewed as a collection of processed data that makes
decision making easier.
➢ Information generally has meaning and purpose.
Knowledge
➢ Knowledge indicates human understanding of a subject matter that has been
attained in the course of appropriate study and familiarity.
➢ Knowledge is usually based on learning, thinking, and proper perception of
the problem area.
➢ Knowledge is not information and information is not data.
➢ Knowledge is derived from information in the same way information is
derived from data.
We can view knowledge as an understanding of information based on its perceived
importance or relevance to a problem area. It incorporates an individual's
reasoning processes, which help them draw meaningful conclusions.

Understanding Knowledge
Knowledge can be defined as the "understanding obtained through the process of
experience or appropriate study". Knowledge can also be an accumulation of facts,
procedural rules, or heuristics. A fact is generally a statement representing a truth
about a subject matter or domain; a procedural rule is a rule that describes a
sequence of actions; a heuristic is a rule of thumb based on years of experience.
Intelligence implies the capability to acquire and apply appropriate knowledge;
memory indicates the ability to store and retrieve relevant experience at will;
learning represents the skill of acquiring knowledge through instruction or study.
Experience relates to the understanding that we develop through our past actions.
Knowledge can develop over time through successful experience, and experience
can lead to expertise.
Common sense refers to the natural and mostly unreflective opinions of humans.

Human thinking and learning provide a strong background for understanding
knowledge and expertise. The interdisciplinary study of human intelligence is
cognitive psychology, which tries to identify the cognitive structures and
processes that closely relate to skilled performance within an area of
operation. The two major components of cognitive psychology are:
Experimental psychology: This studies the cognitive processes that constitute
human intelligence.
Artificial Intelligence (AI): This studies the cognition of computer-based intelligent
systems.
The process of eliciting and representing experts’ knowledge usually involves a
knowledge developer and some human experts (domain experts). In order to
gather the knowledge from human experts, the developer usually interviews the
experts and asks for information regarding a specific area of expertise. It is almost
impossible for humans to provide completely accurate reports of their own mental
processes. Research in cognitive psychology helps provide a better
understanding of what constitutes knowledge, how knowledge is elicited, and how
it should be represented in a corporate knowledge base. Hence, cognitive
psychology contributes a great deal to the area of knowledge management.

Types of Knowledge

Knowledge can be broadly classified into two types

1. Tacit knowledge

2. Explicit knowledge

Tacit Knowledge – The word tacit means understood and implied without being
stated. Tacit knowledge is unique and cannot be expressed clearly. The cognitive
skills of an employee are a classic example of tacit knowledge. Tacit knowledge is
personal and varies depending upon the education, attitude and perception of
the individual. It is impossible to articulate because tacit knowledge may sometimes
even be subconscious. Tacit knowledge is also subjective in character, and it is
exhibited by the individual automatically, without them even realizing it.

Explicit Knowledge – The word explicit means stated clearly and in detail without
any room for confusion. Explicit knowledge is easy to articulate and it is not
subjective. It is also not unique and does not differ between individuals. It is
impersonal, and explicit knowledge is easy to share with others. Today,
knowledge has become a significant foundation of competitive advantage; it has
been recognized as one of the most important assets, and knowledge management
is imperative to any organization's success.

What is Knowledge Management (KM)?

Different authors’ views:


1. Knowledge management is a system that affords control, dissemination, and
usage of information.
2. Knowledge Management is a set of processes used to effectively use a
knowledge system to locate the knowledge required by one or more people to
perform their assigned tasks.
3. Knowledge Management is the explicit and systematic management of vital
knowledge - and its associated processes of creation, organisation, diffusion, use
and exploitation - in pursuit of business objectives. Another one states that:
Knowledge Management is the process whereby a firm manages the know-how of
its employees about its products, services, organizational systems and intellectual
property. Specifically, knowledge management embodies the strategies and
processes that a firm employs to identify, capture and leverage the knowledge
contained within its corporate memory. It is applied to the basic
activities of planning and implementing tasks in a systematic and efficient manner.
Therefore, organizations with efficient communication linkages have higher
information flow, knowledge sharing, cooperation, problem-solving, creativity,
efficiency and productivity, and hence a more efficient knowledge management
process. Companies built on such well-developed networks will produce
measurable business results, such as faster learning, quicker response to client
needs, better problem-solving, less rework and duplication of effort, new ideas and
more innovation. They will invariably enjoy higher sales, more profits, and superior
market value.

What is a Knowledge Management Principle?

The dictionary definition of a principle is a 'fundamental truth or law as a basis of
reasoning or action'. Furthermore, principles have, at least, four distinct
characteristics:

• They are timeless. They will be just as relevant in 50 years’ time as they are now.
• They are changeless. Whereas knowledge will change over time, principles do not
change.
• They are universal. That is to say, they can be applied anywhere.
• They are scalable. That is, the same principles can apply to individuals, teams,
organizations, inter-organizations, and even globally. So one can say that
principles are 'the heart of the matter', the fundamental source.

Knowledge management principles need to be embedded in the organization and
embodied in the people.

KM does not appear to possess the qualities of a discipline. If anything, KM qualifies
as an emerging field of study. Those involved in the emerging field of KM are still
vexed today by the lack of a single, comprehensive definition, an authoritative body
of knowledge, proven theories, and a generalized conceptual framework. There are
a couple of reasons for this. First, there is little consensus regarding what
knowledge actually is. Some regard knowledge as being virtually synonymous with
information, while others incorporate concepts such as experience, know-how,
know-what, understanding, and values. At the risk of generalisation, the former
approach tends to be more common in IT dominated circles while the latter is more
prevalent in business management literature. Second, KM has a wide range of
contributors from different fields and industries, which further perpetuates
different understandings of what the term actually means. Nevertheless, the
disciplines regarded as the greatest contributors to, or users of, KM are:
computer science, business, management, library and information science,
engineering, psychology, multidisciplinary science, energy and fuels, social sciences,
operations research and management science, and planning and development.

Knowledge Representation and Reasoning (KR², KR&R)

Knowledge Representation and reasoning (KR², KR&R) is the field of artificial
intelligence (AI) dedicated to representing information about the world in a form
that a computer system can utilize to solve complex tasks such as diagnosing a
medical condition or having a dialog in a natural language. Knowledge
representation incorporates findings from psychology about how humans solve
problems and represent knowledge in order to design formalisms that will make
complex systems easier to design and build. Knowledge representation and
reasoning also incorporates findings from logic to automate various kinds of
reasoning, such as the application of rules or the relations of sets and subsets.

Examples of knowledge representation formalisms include semantic nets,
systems architecture, frames, rules, and ontologies. Examples of automated
reasoning engines include inference engines, theorem provers, and classifiers.
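To make these formalisms more concrete, the short Python sketch below (not part of any particular KR system; the frame names and slots are invented) shows a frame-style representation in which slot values are inherited along an Is-A link, much as in a semantic net.

```python
# Minimal frame-style knowledge representation sketch (illustrative only).
# Frames have named slots; unknown slots are looked up along the Is-A chain.

class Frame:
    def __init__(self, name, isa=None, **slots):
        self.name = name          # frame identifier
        self.isa = isa            # parent frame (Is-A link), or None
        self.slots = dict(slots)  # local slot values

    def get(self, slot):
        """Return a slot value, inheriting from parent frames if needed."""
        if slot in self.slots:
            return self.slots[slot]
        if self.isa is not None:
            return self.isa.get(slot)
        return None

# A tiny semantic-net-like hierarchy: Restaurant IS-A Business.
business = Frame("Business", opening_hours="9-5")
restaurant = Frame("Restaurant", isa=business, serves="food")

print(restaurant.get("serves"))         # food (local slot)
print(restaurant.get("opening_hours"))  # 9-5  (inherited from Business)
```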

History

The earliest work in computerized knowledge representation was focused on
general problem-solvers such as the General Problem Solver (GPS) system
developed by Allen Newell and Herbert A. Simon in 1959. These systems featured
data structures for planning and decomposition. The system would begin with a
goal. It would then decompose that goal into sub-goals and then set out to
construct strategies that could accomplish each subgoal.

In these early days of AI, general search algorithms such as A* were also
developed. However, the amorphous problem definitions for systems such as GPS
meant that they worked only for very constrained toy domains (e.g. the "blocks
world"). In order to tackle non-toy problems, AI researchers such as Ed Feigenbaum
and Frederick Hayes-Roth realized that it was necessary to focus systems on more
constrained problems. These efforts led to the cognitive revolution in psychology
and to the phase of AI focused on knowledge representation that resulted in expert
systems in the 1970s and 80s, production systems and frame languages. Rather
than general problem solvers, AI changed its focus to expert systems that could
match human competence on a specific task, such as medical diagnosis.

Expert systems gave us the terminology still in use today where AI systems are
divided into a knowledge base, with facts about the world and rules, and an
inference engine, which applies the rules to the knowledge base in order to
answer questions and solve problems. In these early systems the knowledge base
tended to be a fairly flat structure, essentially assertions about the values of
variables used by the rules. In addition to expert systems, other researchers
developed the concept of frame-based languages in the mid-1980s. A frame is
similar to an object class: It is an abstract description of a category describing things
in the world, problems, and potential solutions. Frames were originally used on
systems geared toward human interaction, e.g. understanding natural language
and the social settings in which various default expectations such as ordering food
in a restaurant narrow the search space and allow the system to choose
appropriate responses to dynamic situations.
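The division into a knowledge base and an inference engine can be illustrated with a small forward-chaining sketch in Python. The facts and IF-THEN rules below are invented for illustration; the engine simply keeps applying rules whose conditions are satisfied until no new facts can be derived.

```python
# Minimal forward-chaining inference engine (illustrative sketch).
# Knowledge base: a set of facts plus IF-THEN rules.
facts = {"has_fever", "has_cough"}

# Each rule: (set of required facts, fact to conclude)
rules = [
    ({"has_fever", "has_cough"}, "suspect_flu"),
    ({"suspect_flu"}, "recommend_rest"),
]

# Inference engine: keep applying rules until no new facts are derived.
changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)  # now also contains 'suspect_flu' and 'recommend_rest'
```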

It was not long before the frame communities and the rule-based researchers
realized that there was a synergy between their approaches. Frames were good for
representing the real world, described as classes, subclasses, slots (data values)
with various constraints on possible values. Rules were good for representing and
utilizing complex logic such as the process to make a medical diagnosis. Integrated
systems were developed that combined frames and rules. One of the most
powerful and well known was the 1983 Knowledge Engineering Environment
(KEE) from Intellicorp. KEE had a complete rule engine with forward and backward
chaining. It also had a complete frame-based knowledge base with triggers, slots
(data values), inheritance, and message passing. Although message passing
originated in the object-oriented community rather than AI it was quickly embraced
by AI researchers as well in environments such as KEE and in the operating systems
for Lisp machines from Symbolics, Xerox, and Texas Instruments.

The integration of frames, rules, and object-oriented programming was


significantly driven by commercial ventures such as KEE and Symbolics spun off
from various research projects. At the same time as this was occurring, there was
another strain of research that was less commercially focused and was driven by
mathematical logic and automated theorem proving. One of the most influential
languages in this research was the KL-ONE language of the mid-'80s. KL-ONE was a
frame language that had a rigorous semantics, formal definitions for concepts such
as an Is-A relation. KL-ONE and languages that were influenced by it such as Loom
had an automated reasoning engine that was based on formal logic rather than on
IF-THEN rules. This reasoner is called the classifier. A classifier can analyze a set of
declarations and infer new assertions, for example, redefine a class to be a subclass
or superclass of some other class that wasn't formally specified. In this way the
classifier can function as an inference engine, deducing new facts from an existing
knowledge base. The classifier can also provide consistency checking on a
knowledge base (which in the case of KL-ONE languages is also referred to as an
Ontology).

Another area of knowledge representation research was the problem of common-sense
reasoning. One of the first realizations from trying to make software
that can function with human natural language was that humans regularly draw on
an extensive foundation of knowledge about the real world that we simply take for
granted but that is not at all obvious to an artificial agent.

The starting point for knowledge representation is the knowledge representation
hypothesis first formalized by Brian C. Smith in 1985: "Any mechanically embodied
intelligent process will be comprised of structural ingredients that a) we as
external observers naturally take to represent a propositional account of the
knowledge that the overall process exhibits, and b) independent of such external
semantic attribution, play a formal but causal and essential role in engendering
the behavior that manifests that knowledge."

Currently, one of the most active areas of knowledge representation research is the set of
projects associated with the Semantic Web. The Semantic Web seeks to add a layer
of semantics (meaning) on top of the current Internet. Rather than indexing web
sites and pages via keywords, the Semantic Web creates large ontologies of
concepts. Searching for a concept will be more effective than traditional text only
searches. Frame languages and automatic classification play a big part in the vision
for the future Semantic Web. The automatic classification gives developers
technology to provide order on a constantly evolving network of knowledge.
Defining ontologies that are static and incapable of evolving on the fly would be
very limiting for Internet-based systems. The classifier technology provides the
ability to deal with the dynamic environment of the Internet.

Recent projects funded primarily by the Defense Advanced Research Projects
Agency (DARPA) have integrated frame languages and classifiers with markup
languages based on XML. In this way the Semantic Web integrates concepts from
knowledge representation and reasoning with markup languages. The Resource
Description Framework (RDF) provides the basic capability to define classes,
subclasses, and properties of knowledge-based objects on the Internet, with
features such as Is-A relations and object properties. The Web Ontology Language
(OWL) adds additional levels of semantics and enables integration with automatic
classification reasoners.
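As a hedged illustration of what RDF classes, subclasses (Is-A relations), and properties look like in practice, the snippet below uses the rdflib Python library (assumed to be installed separately) with an invented example.org namespace.

```python
# Sketch of RDF/RDFS modelling with rdflib (assumes: pip install rdflib).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Classes and an Is-A (subclass) relation
g.add((EX.Dog, RDF.type, RDFS.Class))
g.add((EX.Animal, RDF.type, RDFS.Class))
g.add((EX.Dog, RDFS.subClassOf, EX.Animal))

# An instance with an object property
g.add((EX.rex, RDF.type, EX.Dog))
g.add((EX.rex, EX.petName, Literal("Rex")))

print(g.serialize(format="turtle"))
```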

Overview

Knowledge representation is a field of artificial intelligence that focuses on
designing computer representations that capture information about the world that
can be used for solving complex problems.

The justification for knowledge representation is that conventional procedural
code is not the best formalism to use to solve complex problems. Knowledge
representation makes complex software easier to define and maintain than
procedural code and can be used in expert systems. For example, talking to experts
in terms of business rules rather than code lessens the semantic gap between users
and developers and makes development of complex systems more practical.

Knowledge representation goes hand in hand with automated reasoning because
one of the main purposes of explicitly representing knowledge is to be able to
reason about that knowledge, to make inferences, assert new knowledge, etc.
Virtually all knowledge representation languages have a reasoning or inference
engine as part of the system.

A key trade-off in the design of a knowledge representation formalism is that
between expressivity and practicality. The ultimate knowledge representation
formalism in terms of expressive power and compactness is First Order Logic (FOL).
There is no more powerful formalism than that used by mathematicians to define
general propositions about the world. However, FOL has two drawbacks as a
knowledge representation formalism: ease of use and practicality of
implementation. First order logic can be intimidating even for many software
developers. Languages that do not have the complete formal power of FOL can still
provide close to the same expressive power with a user interface that is more
practical for the average developer to understand. The issue of practicality of
implementation is that FOL in some ways is too expressive. With FOL it is possible
to create statements (e.g. quantification over infinite sets) that would cause a
system to never terminate if it attempted to verify them.

Thus, a subset of FOL can be both easier to use and more practical to implement.
This was a driving motivation behind rule-based expert systems. IF-THEN rules
provide a subset of FOL but a very useful one that is also very intuitive. The history
of most of the early AI knowledge representation formalisms, from databases to
semantic nets to theorem provers and production systems, can be viewed as various
design decisions on whether to emphasize expressive power or computability and
efficiency.

In a key 1993 paper on the topic, Randall Davis of MIT outlined five distinct roles to
analyze a knowledge representation framework:

• A knowledge representation (KR) is most fundamentally a surrogate, a
substitute for the thing itself, used to enable an entity to determine
consequences by thinking rather than acting, i.e., by reasoning about the
world rather than taking action in it.
• It is a set of ontological commitments, i.e., an answer to the question: In what
terms should I think about the world? What does it consist of?
• It is a fragmentary theory of intelligent reasoning, expressed in terms of
three components: (i) the representation's fundamental conception of
intelligent reasoning; (ii) the set of inferences the representation sanctions;
and (iii) the set of inferences it recommends.
• It is a medium for pragmatically (dealing with things sensibly and realistically)
efficient computation, i.e., the computational environment in which thinking
is accomplished. One contribution to this pragmatic efficiency is supplied by
the guidance a representation provides for organizing information so as to
facilitate making the recommended inferences.
• It is a medium of human expression, i.e., a language in which we say things
about the world.

Knowledge representation and reasoning are a key enabling technology for the
Semantic Web. Languages based on the Frame model with automatic classification
provide a layer of semantics on top of the existing Internet. Rather than searching
via text strings as is typical today, it will be possible to define logical queries and
find pages that map to those queries. The automated reasoning component in
these systems is an engine known as the classifier. Classifiers focus on the
subsumption relations in a knowledge base rather than rules. A classifier can infer
new classes and dynamically change the ontology as new information becomes
available. This capability is ideal for the ever-changing and evolving information
space of the Internet.

Characteristics of Knowledge Representation

In 1985, Ron Brachman categorized the core issues for knowledge representation
as follows:

• Primitives. What is the underlying framework used to represent knowledge?
Semantic networks were one of the first knowledge representation
primitives, along with data structures and algorithms for general fast search. In
this area, there is a strong overlap with research in data structures and
algorithms in computer science. In early systems, the Lisp programming
language, which was modeled after the lambda calculus, was often used as
a form of functional knowledge representation. Frames and Rules were the
next kind of primitive. Frame languages had various mechanisms for
expressing and enforcing constraints on frame data. All data in frames are
stored in slots. Slots are analogous to relations in entity-relation modeling
and to object properties in object-oriented modeling. Another technique
for primitives is to define languages that are modeled after First Order Logic
(FOL). The most well known example is Prolog, but there are also many
special purpose theorem proving environments. These environments can
validate logical models and can deduce new theories from existing models.
Essentially they automate the process a logician would go through in
analyzing a model. Theorem proving technology had some specific practical
applications in the areas of software engineering. For example, it is possible
to prove that a software program rigidly adheres to a formal logical
specification.
• Meta-representation. This is also known as the issue of reflection in
computer science. It refers to the capability of a formalism to have access to
information about its own state. An example would be the meta-object
protocol in Smalltalk and CLOS that gives developers run time access to the
class objects and enables them to dynamically redefine the structure of the
knowledge base even at run time. Meta-representation means the
knowledge representation language is itself expressed in that language. For
example, in most Frame based environments all frames would be instances
of a frame class. That class object can be inspected at run time, so that the
object can understand and even change its internal structure or the structure
of other parts of the model. In rule-based environments, the rules were also
usually instances of rule classes. Part of the meta protocol for rules were the
meta rules that prioritized rule firing.
• Incompleteness. Traditional logic requires additional axioms and constraints
to deal with the real world as opposed to the world of mathematics. Also, it
is often useful to associate degrees of confidence with a statement. I.e., not
simply say "Socrates is Human" but rather "Socrates is Human with
confidence 50%". This was one of the early innovations from expert systems
research that migrated to some commercial tools: the ability to associate
certainty factors with rules and conclusions (a minimal sketch of this idea
follows after this list). Later research in this area is known as fuzzy logic.
• Definitions and universals vs. facts and defaults. Universals are general
statements about the world such as "All humans are mortal". Facts are
specific examples of universals such as "Socrates is a human and therefore
mortal". In logical terms definitions and universals are about universal
quantification while facts and defaults are about existential quantifications.
All forms of knowledge representation must deal with this aspect and most
do so with some variant of set theory, modeling universals as sets and
subsets and definitions as elements in those sets.
• Non-monotonic reasoning. Non-monotonic reasoning allows various kinds
of hypothetical reasoning. The system associates facts asserted with the
rules and facts used to justify them and as those facts change, there are
updates on the dependent knowledge as well. In rule-based systems this
capability is known as a truth maintenance system.
• Expressive adequacy. The standard that Brachman and most AI researchers
use to measure expressive adequacy is usually First Order Logic (FOL).
Theoretical limitations mean that a full implementation of FOL is not
practical. Researchers should be clear about how expressive (how much of
full FOL expressive power) they intend their representation to be.
• Reasoning efficiency. This refers to the run-time efficiency of the system: the
ability of the knowledge base to be updated and of the reasoner to develop
new inferences in a reasonable period of time. In some ways, this is the flip
side of expressive adequacy. In general, the more powerful a representation,
the more it has expressive adequacy, the less efficient its automated
reasoning engine will be. Efficiency was often an issue, especially for early
applications of knowledge representation technology. They were usually
implemented in interpreted environments such as Lisp, which were slow
compared to more traditional platforms of the time.
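As a minimal, invented Python sketch of two of the issues above: universals can be modelled as sets with facts as membership tests, and certainty factors can be attached to individual assertions.

```python
# Universals as sets, facts as membership tests (illustrative sketch).
humans = {"socrates", "plato"}
mortals = set(humans)          # universal: all humans are mortal

print("socrates" in mortals)   # True: the fact follows from the universal

# Certainty factors: attach a confidence value to individual assertions.
beliefs = {("socrates", "is_human"): 0.5}   # "Socrates is Human with confidence 50%"
for (entity, predicate), cf in beliefs.items():
    print(f"{entity} {predicate} with confidence {cf:.0%}")
```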

Ontologies can of course be written down in a wide variety of languages and
notations (e.g., logic, LISP, etc.); the essential information is not the form of that
language but the content, i.e., the set of concepts offered as a way of thinking
about the world. Simply put, the important part is notions like connections and
components, not the choice between writing them as predicates or LISP constructs.

The commitment made in selecting one or another ontology can produce a sharply
different view of the task at hand. Consider the difference that arises in selecting
the lumped element view of a circuit rather than the electrodynamic view of the
same device. As a second example, medical diagnosis viewed in terms of rules (e.g.,
MYCIN) looks substantially different from the same task viewed in terms of frames
(e.g., INTERNIST). Where MYCIN sees the medical world as made up of empirical
associations connecting symptom to disease, INTERNIST sees a set of prototypes,
in particular prototypical diseases, to be matched against the case at hand.

Knowledge Engineering

Knowledge engineering deals with knowledge acquisition, representation,
validation, inferencing, explanation, and maintenance. Knowledge engineering
involves working with human experts to codify, and make explicit, the rules (or
other procedures) that they use to solve real problems. A
knowledge engineer is responsible for formally applying AI methods directly to
difficult applications normally requiring expertise, and for building
complex computer programs that can reason.

The knowledge engineering process includes five major activities:

1. Knowledge acquisition. Knowledge acquisition involves the acquisition of
knowledge from human experts, books, documents, sensors, or computer files.
Knowledge acquisition is the process of extracting, structuring, and organizing
knowledge from one or more sources, and its transfer to the knowledge base and
sometimes to the inference engine. This process has been identified by many
researchers and practitioners as a major bottleneck. Acquisition is actually done
throughout the entire development process.

2. Knowledge validation. The knowledge is validated and verified (for example, by
using test cases) until its quality is acceptable. Test case results are usually shown
to the expert to verify the accuracy of the Expert System.
3. Knowledge representation. The acquired knowledge is organized into a
knowledge representation. This activity involves preparation of a knowledge map
and encoding the knowledge in the knowledge base.
4. Inferencing. This activity involves the design of software to enable the computer
to make inferences based on the knowledge and the specifics of a problem. Then
the system can provide advice to a nonexpert user.
5. Explanation and justification. This involves the design and programming of an
explanation capability to answer questions like why a specific piece of information
is needed or how a certain conclusion was obtained.

The most common method for eliciting knowledge from an expert is through
interviews. The knowledge engineer interviews one or more experts and develops
a vocabulary and an understanding of the problem domain. Then, he attempts to
identify an appropriate knowledge representation and inferencing (reasoning)
approach. The interviewing may take place over several weeks or even years.

There are several automatic and semiautomatic knowledge acquisition methods,
especially ones for inducing rules directly from databases and text (for example,
Knowledge Seeker from Angoss Software Systems).

CHAPTER TWO

EXTRACT TRANSFORM LOAD (ETL) PROCESS

In computing, extract, transform, load (ETL) is a three-phase process where data is
extracted, transformed (cleaned, sanitized, scrubbed) and loaded into an output
data container. The data can be collated from one or more sources and it can also
be output to one or more destinations. ETL processing is typically executed
using software applications but it can also be done manually by system operators.
ETL software typically automates the entire process and can be run manually or on
recurring schedules, either as single jobs or aggregated into a batch of jobs.

Conventional ETL diagram (Ralph, 2004)

A properly designed ETL system extracts data from source systems, enforces
data type and data validity standards, and ensures the data conforms structurally to the
requirements of the output. Some ETL systems can also deliver data in a
presentation-ready format so that application developers can build applications
and end users can make decisions.

The ETL process became a popular concept in the 1970s and is often used in data
warehousing. ETL systems commonly integrate data from multiple applications
(systems), typically developed and supported by different vendors or hosted on
separate computer hardware. The separate systems containing the original data
are frequently managed and operated by different stakeholders. For example, a
cost accounting system may combine data from payroll, sales, and purchasing.

Why is ETL important?

Organizations today have both structured and unstructured data from various
sources including:

• Customer data from online payment and customer relationship
management (CRM) systems
• Inventory and operations data from vendor systems
• Sensor data from Internet of Things (IoT) devices
• Marketing data from social media and customer feedback
• Employee data from internal human resources systems

By applying the process of extract, transform, and load (ETL), individual raw
datasets can be prepared in a format and structure that is more consumable for
analytics purposes, resulting in more meaningful insights. For example, online
retailers can analyze data from points of sale to forecast demand and manage
inventory. Marketing teams can integrate CRM data with customer feedback on
social media to study consumer behavior.

How does ETL benefit business intelligence?

Extract, transform, and load (ETL) improves business intelligence and analytics by
making the process more reliable, accurate, detailed, and efficient.

a) Historical context

ETL gives deep historical context to the organization’s data. An enterprise can
combine legacy data with data from new platforms and applications. You can view
older datasets alongside more recent information, which gives you a long-term
view of data.

b) Consolidated data view

ETL provides a consolidated view of data for in-depth analysis and reporting.
Managing multiple datasets demands time and coordination and can result in
inefficiencies and delays. ETL combines databases and various forms of data into a
single, unified view. The data integration process improves the data quality and
saves the time required to move, categorize, or standardize data. This makes it
easier to analyze, visualize, and make sense of large datasets.

c) Accurate data analysis

ETL gives more accurate data analysis to meet compliance and regulatory
standards. You can integrate ETL tools with data quality tools to profile, audit, and
clean data, ensuring that the data is trustworthy.

d) Task automation

ETL automates repeatable data processing tasks for efficient analysis. ETL tools
automate the data migration process, and you can set them up to integrate data
changes periodically or even at runtime. As a result, data engineers can spend more
time innovating and less time managing tedious tasks like moving and formatting
data.

How has ETL evolved?

Extract, transform, and load (ETL) originated with the emergence of relational
databases that stored data in the form of tables for analysis. Early ETL tools
attempted to convert data from transactional data formats to relational data
formats for analysis.

Traditional ETL

Raw data was typically stored in transactional databases that supported many read
and write requests but did not lend themselves well to analytics. You can think of it as a row in
a spreadsheet. For example, in an ecommerce system, the transactional database
stored the purchased item, customer details, and order details in one transaction.
Over the year, it contained a long list of transactions with repeat entries for the
same customer who purchased multiple items during the year. Given the data
duplication, it became cumbersome to analyze the most popular items or purchase
trends in that year.

To overcome this issue, ETL tools automatically converted this transactional data
into relational data with interconnected tables. Analysts could use queries to
identify relationships between the tables, in addition to patterns and trends.

Modern ETL

As ETL technology evolved, both data types and data sources increased
exponentially. Cloud technology emerged to create vast databases (also called data
sinks). Such data sinks can receive data from multiple sources and have underlying
hardware resources that can scale over time. ETL tools have also become more
sophisticated and can work with modern data sinks. They can convert data from
legacy data formats to modern data formats. Examples of modern databases
follow.

Data warehouses

A data warehouse is a central repository that can store multiple databases. Within
each database, you can organize your data into tables and columns that describe
the data types in the table. The data warehouse software works across multiple
types of storage hardware—such as solid state drives (SSDs), hard drives, and other
cloud storage—to optimize your data processing.

Data lakes

With a data lake, you can store your structured and unstructured data in one
centralized repository and at any scale. You can store data as is without having to
first structure it based on questions you might have in the future. Data lakes also
allow you to run different types of analytics on your data, like SQL queries, big data
analytics, full-text search, real-time analytics, and machine learning (ML) to guide
better decisions.
How does ETL work?

Extract, transform, and load (ETL) works by moving data from the source system to
the destination system at periodic intervals. The ETL process works in three steps:

1. Extract the relevant data from the source database
2. Transform the data so that it is better suited for analytics
3. Load the data into the target database
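A minimal end-to-end sketch of these three steps in Python follows. The CSV file name, its columns, the currency conversion, and the SQLite target are all assumptions made for illustration, not features of any particular ETL product.

```python
# Minimal ETL sketch: extract from a CSV source, transform, load into SQLite.
# File name 'sales.csv' and its columns (order_id, amount, currency) are assumed.
import csv
import sqlite3

# 1. Extract: read raw rows from the source file into a staging list.
with open("sales.csv", newline="") as f:
    staged = list(csv.DictReader(f))

# 2. Transform: clean types and standardise currency to a single unit.
EUR_TO_USD = 1.1  # assumed fixed rate for the example
transformed = []
for row in staged:
    amount = float(row["amount"] or 0)
    if row["currency"] == "EUR":
        amount *= EUR_TO_USD
    transformed.append((row["order_id"], round(amount, 2)))

# 3. Load: write the conformed rows into the target table.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount_usd REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
con.commit()
con.close()
```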

What is data extraction?

In data extraction, extract, transform, and load (ETL) tools extract or copy raw data
from multiple sources and store it in a staging area. A staging area (or landing zone)
is an intermediate storage area for temporarily storing extracted data. Data staging
areas are often transient, meaning their contents are erased after data extraction
is complete. However, the staging area might also retain a data archive for
troubleshooting purposes.

How frequently the system sends data from the data source to the target data store
depends on the underlying change data capture mechanism. Data extraction
commonly happens in one of the three following ways.

a) Update notification

In update notification, the source system notifies you when a data record changes.
You can then run the extraction process for that change. Most databases and web
applications provide update mechanisms to support this data integration method.

b) Incremental extraction

Some data sources can't provide update notifications but can identify and extract
data that has been modified over a given time period. In this case, the system
checks for changes at periodic intervals, such as once a week, once a month, or at
the end of a campaign. You only need to extract data that has changed.

c) Full extraction

Some systems can't identify data changes or give notifications, so reloading all data
is the only option. This extraction method requires you to keep a copy of the last
extract to check which records are new. Because this approach involves high data
transfer volumes, we recommend you use it only for small tables.
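The incremental approach can be sketched as follows in Python, assuming the source table has an updated_at column and the last extraction time (the watermark) is stored locally. This only illustrates the idea of change data capture, not a specific tool's API.

```python
# Sketch: incremental extraction using a stored "last extract" watermark.
# Assumes a source table orders(order_id, updated_at, ...) in SQLite.
import sqlite3

WATERMARK_FILE = "last_extract.txt"

def load_watermark():
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01 00:00:00"   # first run behaves like a full extraction

def incremental_extract(con):
    since = load_watermark()
    rows = con.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (since,)
    ).fetchall()
    # Remember the new high-water mark for the next run.
    latest = con.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    if latest:
        with open(WATERMARK_FILE, "w") as f:
            f.write(latest)
    return rows

con = sqlite3.connect("source.db")
changed_rows = incremental_extract(con)   # only rows modified since the last run
```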

What is data transformation?

In data transformation, extract, transform, and load (ETL) tools transform and
consolidate the raw data in the staging area to prepare it for the target data
warehouse. The data transformation phase can involve the following types of data
changes.

a) Basic data transformation

Basic transformations improve data quality by removing errors, emptying data
fields, or simplifying data. Examples of these transformations follow.

b) Data cleansing

Data cleansing removes errors and maps source data to the target data format. For
example, you can map empty data fields to the number 0, map the data value
“Parent” to “P,” or map “Child” to “C.”

c) Data deduplication

Deduplication in data cleansing identifies and removes duplicate records.

d) Data format revision

Format revision converts data, such as character sets, measurement units, and
date/time values, into a consistent format. For example, a food company might
have different recipe databases with ingredients measured in kilograms and
pounds. ETL will convert everything to pounds.
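A short Python sketch of these basic transformations follows; the field names, code mappings, and the kilogram-to-pound factor are chosen purely for illustration.

```python
# Basic transformations: cleansing, code mapping, format revision, deduplication.
KG_TO_LB = 2.20462  # unit conversion used in the format-revision example

def clean_record(rec):
    out = dict(rec)
    # Data cleansing: map empty fields to 0 and long codes to short ones.
    out["quantity"] = int(rec.get("quantity") or 0)
    out["role"] = {"Parent": "P", "Child": "C"}.get(rec.get("role"), rec.get("role"))
    # Format revision: convert kilograms to pounds so all weights use one unit.
    if rec.get("weight_unit") == "kg":
        out["weight"] = round(float(rec["weight"]) * KG_TO_LB, 2)
        out["weight_unit"] = "lb"
    return out

records = [
    {"quantity": "", "role": "Parent", "weight": "2.5", "weight_unit": "kg"},
    {"quantity": "3", "role": "Child",  "weight": "4.0", "weight_unit": "lb"},
]
cleaned = [clean_record(r) for r in records]

# Data deduplication: drop exact duplicate records.
deduped = [dict(t) for t in {tuple(sorted(r.items())) for r in cleaned}]
```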

Advanced data transformation

Advanced transformations use business rules to optimize the data for easier
analysis. Examples of these transformations follow.

a) Derivation

Derivation applies business rules to your data to calculate new values from existing
values. For example, you can convert revenue to profit by subtracting expenses, or
calculate the total cost of a purchase by multiplying the price of each item by the
number of items ordered.

b) Joining

In data preparation, joining links the same data from different data sources. For
example, you can find the total purchase cost of one item by adding the purchase
value from different vendors and storing only the final total in the target system.

c) Splitting

You can divide a column or data attribute into multiple columns in the target
system. For example, if the data source saves the customer name as “Jane John
Doe,” you can split it into a first, middle, and last name.

d) Summarization

Summarization improves data quality by reducing a large number of data values
into a smaller dataset. For example, customer order invoice values can have many
different small amounts. You can summarize the data by adding them up over a
given period to build a customer lifetime value (CLV) metric.

e) Encryption

You can protect sensitive data to comply with data laws or data privacy by adding
encryption before the data streams to the target database.
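The Python sketch below, using invented order data, illustrates derivation, splitting, and summarization; joining and encryption are omitted to keep it short.

```python
# Advanced transformations: derivation, splitting, and summarization.
from collections import defaultdict

orders = [
    {"customer": "Jane John Doe", "price": 20.0, "items": 2, "expenses": 5.0},
    {"customer": "Jane John Doe", "price": 10.0, "items": 1, "expenses": 2.0},
]

clv = defaultdict(float)  # summarization target: customer lifetime value
transformed = []
for o in orders:
    total = o["price"] * o["items"]                  # derivation: total cost
    profit = total - o["expenses"]                   # derivation: profit
    first, middle, last = o["customer"].split(" ")   # splitting: name parts
    clv[o["customer"]] += total                      # summarization: running CLV
    transformed.append({"first": first, "middle": middle, "last": last,
                        "total": total, "profit": profit})

print(transformed)
print(dict(clv))   # e.g. {'Jane John Doe': 50.0}
```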

What is data loading?

In data loading, extract transform, and load (ETL) tools move the transformed data
from the staging area into the target data warehouse. For most organizations that
use ETL, the process is automated, well defined, continual, and batch driven. Two
methods for loading data follow.

a) Full load

In full load, the entire data from the source is transformed and moved to the data
warehouse. The full load usually takes place the first time you load data from a
source system into the data warehouse.

b) Incremental load

In incremental load, the ETL tool loads the delta (or difference) between target and
source systems at regular intervals. It stores the last extract date so that only
records added after this date are loaded. There are two ways to implement
incremental load.

i) Streaming incremental load

If you have small data volumes, you can stream continual changes over data
pipelines to the target data warehouse. When the speed of data increases to
millions of events per second, you can use event stream processing to monitor and
process the data streams to make more-timely decisions.

ii) Batch incremental load

If you have large data volumes, you can collect data changes into batches and load
them periodically. During this set period of time, no actions can happen to either the
source or target system as the data is synchronized.
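A batch incremental load can be sketched in Python as follows. The source and warehouse tables, the delta query, and the stored last-load date are assumptions made for illustration.

```python
# Sketch: batch incremental load of only the delta since the last load date.
import sqlite3

source = sqlite3.connect("source.db")      # assumed source with orders(..., updated_at)
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("""CREATE TABLE IF NOT EXISTS dim_orders
                     (order_id TEXT, amount REAL, updated_at TEXT)""")
warehouse.execute("CREATE TABLE IF NOT EXISTS etl_state (last_load TEXT)")

row = warehouse.execute("SELECT last_load FROM etl_state").fetchone()
last_load = row[0] if row else "1970-01-01 00:00:00"

# Extract only records added or changed since the last load (the delta).
delta = source.execute(
    "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_load,)).fetchall()

warehouse.executemany("INSERT INTO dim_orders VALUES (?, ?, ?)", delta)
if delta:
    newest = max(r[2] for r in delta)       # remember the new last-load date
    warehouse.execute("DELETE FROM etl_state")
    warehouse.execute("INSERT INTO etl_state VALUES (?)", (newest,))
warehouse.commit()
```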

Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps:

1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create
aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules. Also, in case
of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive

Challenges

ETL processes can involve considerable complexity, and significant operational
problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the
expectations of designers at the time validation and transformation rules are
specified. Data profiling of a source during data analysis can identify the data
conditions that must be managed by transform rules specifications, leading to an
amendment of validation rules explicitly and implicitly implemented in the ETL
process.

Data warehouses are typically assembled from a variety of data sources with
different formats and purposes. As such, ETL is a key process to bring all the data
together in a standard, homogeneous environment.

Design analysis should establish the scalability of an ETL system across the lifetime
of its usage — including understanding the volumes of data that must be processed
within service level agreements. The time available to extract from source systems
may change, which may mean the same amount of data may have to be processed
in less time. Some ETL systems have to scale to process terabytes of data to update
data warehouses with tens of terabytes of data. Increasing volumes of data may
require designs that can scale from daily batch to multiple-day micro batch to
integration with message queues or real-time change-data-capture for continuous
transformation and update.

Performance

ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour
(or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard
drives, multiple gigabit-network connections, and much memory.

In real life, the slowest part of an ETL process usually occurs in the database load
phase. Databases may perform slowly because they have to take care of
concurrency, integrity maintenance, and indices. Thus, for better performance, it
may make sense to employ:

• Direct path extract method or bulk unload whenever possible (instead of
querying the database) to reduce the load on the source system while getting
a high-speed extract
• Most of the transformation processing outside of the database
• Bulk load operations whenever possible

Still, even using bulk operations, database access is usually the bottleneck in the
ETL process. Some common methods used to increase performance are listed below
(a short sketch of two of them follows the list):

• Partition tables (and indices): try to keep partitions similar in size (watch for
null values that can skew the partitioning)
• Do all validation in the ETL layer before the load: disable integrity checking
(disable constraint ...) in the target database tables during the load
• Disable triggers (disable trigger ...) in the target database tables during the
load: simulate their effect as a separate step
• Generate IDs in the ETL layer (not in the database)
• Drop the indices (on a table or partition) before the load - and recreate them
after the load (SQL: drop index ...; create index ...)
• Use parallel bulk load when possible — works well when the table is
partitioned or there are no indices (Note: attempting to do parallel loads into
the same table (partition) usually causes locks — if not on the data rows,
then on indices)
• If a requirement exists to do insertions, updates, or deletions, find out which
rows should be processed in which way in the ETL layer, and then process
these three operations in the database separately; you often can do bulk
load for inserts, but updates and deletes commonly go through an API (using
SQL)
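The sketch below illustrates two of these methods (dropping an index around a bulk load, and generating IDs in the ETL layer) using SQLite from Python, with assumed table and index names.

```python
# Sketch: drop an index before a bulk load, recreate it afterwards,
# and generate surrogate IDs in the ETL layer rather than in the database.
import sqlite3
from itertools import count

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS facts (id INTEGER, value TEXT)")
con.execute("CREATE INDEX IF NOT EXISTS idx_facts_value ON facts(value)")

rows = [f"value-{i}" for i in range(100_000)]     # assumed incoming batch
next_id = count(start=1)                          # IDs generated in the ETL layer

con.execute("DROP INDEX IF EXISTS idx_facts_value")   # cheaper inserts without the index
con.executemany("INSERT INTO facts VALUES (?, ?)",
                ((next(next_id), v) for v in rows))    # bulk load in one call
con.execute("CREATE INDEX idx_facts_value ON facts(value)")  # rebuild once at the end
con.commit()
con.close()
```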

Whether to do certain operations in the database or outside may involve a trade-
off. For example, removing duplicates using distinct may be slow in the database;
thus, it makes sense to do it outside. On the other hand, if using distinct significantly
(say, 100-fold) decreases the number of rows to be extracted, then it makes sense to
remove duplications as early as possible in the database before unloading data.

A common source of problems in ETL is a large number of dependencies among ETL
jobs. For example, job "B" cannot start while job "A" is not finished. One can usually
achieve better performance by visualizing all processes on a graph, trying to
reduce the graph while making maximum use of parallelism, and making "chains" of
consecutive processing as short as possible. Again, partitioning of big tables and
their indices can really help.

Another common issue occurs when the data are spread among several databases,
and processing is done in those databases sequentially. Sometimes database
replication may be involved as a method of copying data between databases — it
can significantly slow down the whole process. The common solution is to reduce
the processing graph to only three layers:

• Sources
• Central ETL layer
• Targets

This approach allows processing to take maximum advantage of parallelism. For
example, if you need to load data into two databases, you can run the loads in
parallel (instead of loading into the first — and then replicating into the second).

Sometimes processing must take place sequentially. For example, dimensional
(reference) data are needed before one can get and validate the rows for main
"fact" tables.

Parallel processing

A recent development in ETL software is the implementation of parallel processing.
It has enabled a number of methods to improve overall performance of ETL when
dealing with large volumes of data.

ETL applications implement three main types of parallelism:

• Data: By splitting a single sequential file into smaller data files to provide
parallel access
• Pipeline: allowing the simultaneous running of several components on the
same data stream, e.g. looking up a value on record 1 at the same time as
adding two fields on record 2
• Component: The simultaneous running of multiple processes on different
data streams in the same job, e.g. sorting one input file while removing
duplicates on another file

All three types of parallelism usually operate combined in a single job or task.
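A small sketch of component parallelism using Python's standard process pool follows; the two data streams are invented, and pipeline or data parallelism would be structured similarly with queues or file splitting.

```python
# Component parallelism sketch: run two independent steps on different
# data streams at the same time (sorting one stream while deduplicating another).
from concurrent.futures import ProcessPoolExecutor

def sort_stream(values):
    return sorted(values)

def dedupe_stream(values):
    return list(dict.fromkeys(values))   # preserves order, drops duplicates

if __name__ == "__main__":
    stream_a = [5, 3, 9, 1]
    stream_b = ["x", "y", "x", "z"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        sorted_a = pool.submit(sort_stream, stream_a)
        unique_b = pool.submit(dedupe_stream, stream_b)
        print(sorted_a.result(), unique_b.result())
```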

An additional difficulty comes with making sure that the data being uploaded is
relatively consistent. Because multiple source databases may have different update
cycles (some may be updated every few minutes, while others may take days or
weeks), an ETL system may be required to hold back certain data until all sources
are synchronized. Likewise, where a warehouse may have to be reconciled to the
contents in a source system or with the general ledger, establishing synchronization
and reconciliation points becomes necessary.

Rerunnability, recoverability

Data warehousing procedures usually subdivide a big ETL process into smaller
pieces running sequentially or in parallel. To keep track of data flows, it makes
sense to tag each data row with "row_id", and tag each piece of the process with
"run_id". In case of a failure, having these IDs helps to roll back and rerun the failed
piece.

Best practice also calls for checkpoints, which are states when certain phases of the
process are completed. Once at a checkpoint, it is a good idea to write everything
to disk, clean out some temporary files, log the state, etc.
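A minimal Python sketch of this tagging and checkpointing idea follows; the checkpoint file name and its JSON layout are assumptions for illustration.

```python
# Sketch: tag rows with row_id/run_id and write a checkpoint after each phase,
# so a failed run can be diagnosed and rerun from the last completed phase.
import json
import uuid

run_id = uuid.uuid4().hex                      # one identifier per ETL run
rows = [{"row_id": i, "run_id": run_id, "value": v}
        for i, v in enumerate(["a", "b", "c"], start=1)]

def checkpoint(phase, rows_done, path="etl_checkpoint.json"):
    """Record the last completed phase and enough state to resume."""
    with open(path, "w") as f:
        json.dump({"run_id": run_id, "phase": phase, "rows_done": rows_done}, f)

# After each phase completes, persist a checkpoint.
checkpoint("extract", len(rows))
# ... transform ...
checkpoint("transform", len(rows))
```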

Virtual ETL

Data virtualization can be seen as an advancement of ETL processing. The application of
data virtualization to ETL allows solving the most common ETL tasks of data
migration and application integration for multiple dispersed data sources. Virtual
ETL operates with the abstracted representation of the objects or entities gathered
from the variety of relational, semi-structured, and unstructured data sources. ETL
tools can leverage object-oriented modeling and work with entities'
representations persistently stored in a centrally located hub-and-spoke
architecture. Such a collection that contains representations of the entities or
objects gathered from the data sources for ETL processing is called a metadata
repository and it can reside in memory or be made persistent. By using a persistent
metadata repository, ETL tools can transition from one-time projects to persistent
middleware, performing data harmonization and data profiling consistently and in
near-real time.

Data virtualization uses a software abstraction layer to create an integrated data
view without physically extracting, transforming, or loading the data. Organizations
use this functionality as a virtual unified data repository without the expense and
complexity of building and managing separate platforms for source and target.
While you can use data virtualization alongside extract, transform, and load (ETL),
it is increasingly seen as an alternative to ETL and other physical data integration
methods. For example, you can use AWS Glue Elastic Views to quickly create a
virtual table—a materialized view—from multiple different source data stores.

What is AWS Glue?

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes
it easier to discover, prepare, and combine data for analytics, machine learning
(ML), and application development. AWS Glue has the capabilities you need for
data integration so that you can start analyzing and using your data in minutes
instead of months.

• Business users can find and access data by using the AWS Glue Data Catalog
• Data engineers can visually create, run, and monitor ETL workflows with AWS
Glue Studio
• Data scientists can clean data without writing code using AWS Glue
DataBrew


Dealing with keys

Unique keys play an important part in all relational databases, as they tie
everything together. A unique key is a column that identifies a given entity,
whereas a foreign key is a column in another table that refers to a primary key. Keys
can comprise several columns, in which case they are composite keys. In many
cases, the primary key is an auto-generated integer that has no meaning for the
business entity being represented, but solely exists for the purpose of the relational
database - commonly referred to as a surrogate key.

As there is usually more than one data source getting loaded into the warehouse,
the keys are an important concern to be addressed. For example: customers might
be represented in several data sources, with their Social Security Number as the
primary key in one source, their phone number in another, and a surrogate in the
third. Yet a data warehouse may require the consolidation of all the customer
information into one dimension.

A recommended way to deal with the concern involves adding a warehouse
surrogate key, which is used as a foreign key from the fact table.

Usually, updates occur to a dimension's source data, which obviously must be
reflected in the data warehouse.

If the primary key of the source data is required for reporting, the dimension
already contains that piece of information for each row. If the source data uses a
surrogate key, the warehouse must keep track of it even though it is never used in
queries or reports; it is done by creating a lookup table that contains the warehouse
surrogate key and the originating key. This way, the dimension is not polluted with
surrogates from various source systems, while the ability to update is preserved.

The lookup table is used in different ways depending on the nature of the source
data. Three ways are described as follows (find the others):

Type 1
The dimension row is simply updated to match the current state of the
source system; the warehouse does not capture history; the lookup table is
used to identify the dimension row to update or overwrite
Type 2
A new dimension row is added with the new state of the source system; a
new surrogate key is assigned; source key is no longer unique in the lookup
table

Fully logged

A new dimension row is added with the new state of the source system, while the
previous dimension row is updated to reflect that it is no longer active and to record
the time of deactivation.
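The Python sketch below, with invented customer data, shows a lookup table from source key to warehouse surrogate key together with the difference between a Type 1 overwrite and a Type 2 new-row update (the fully logged variant would additionally record a deactivation time).

```python
# Sketch: surrogate-key lookup plus Type 1 (overwrite) and Type 2 (add row) updates.
from itertools import count

surrogate_seq = count(start=1)
lookup = {}      # (source_system, source_key) -> warehouse surrogate key
dimension = {}   # surrogate key -> dimension row

def upsert(source_system, source_key, attrs, scd_type=1):
    key = (source_system, source_key)
    if key not in lookup:
        sk = next(surrogate_seq)
        lookup[key] = sk
        dimension[sk] = dict(attrs, current=True)
        return sk
    sk = lookup[key]
    if scd_type == 1:                 # Type 1: overwrite, no history kept
        dimension[sk].update(attrs)
    else:                             # Type 2: keep history, add a new row
        dimension[sk]["current"] = False
        sk = next(surrogate_seq)
        lookup[key] = sk              # simplified: lookup keeps only the newest row
        dimension[sk] = dict(attrs, current=True)
    return sk

upsert("crm", "SSN-123", {"city": "Lagos"})
upsert("crm", "SSN-123", {"city": "Abuja"}, scd_type=2)   # history preserved
print(dimension)
```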

Tools

An established ETL framework may improve connectivity and scalability. A good ETL
tool must be able to communicate with the many different relational databases
and read the various file formats used throughout an organization. ETL tools have
started to migrate into Enterprise Application Integration, or even Enterprise
Service Bus, systems that now cover much more than just the extraction,
transformation, and loading of data. Many ETL vendors now have data profiling,
data quality, and metadata capabilities. A common use case for ETL tools includes
converting CSV files to formats readable by relational databases. A typical
translation of millions of records is facilitated by ETL tools that enable users to input
csv-like data feeds/files and import them into a database with as little code as
possible.

ETL tools are typically used by a broad range of professionals, from students in
computer science looking to quickly import large data sets to database architects
in charge of company account management, and have become a convenient
tool that can be relied on to get maximum performance. ETL tools in most cases
contain a GUI that helps users conveniently transform data, using a visual data
mapper, as opposed to writing large programs to parse files and modify data types.

While ETL tools have traditionally been for developers and IT staff, research firm
Gartner wrote that the new trend is to provide these capabilities to business users
so they can themselves create connections and data integrations when needed,
rather than going to the IT staff. Gartner refers to these non-technical users as
Citizen Integrators.

ETL vs. ELT

What is ELT?

Extract, load, and transform (ELT) is an extension of extract, transform, and load
(ETL) that reverses the order of operations. You can load data directly into the
target system before processing it. The intermediate staging area is not required
because the target data warehouse has data mapping capabilities within it. ELT has
become more popular with the adoption of cloud infrastructure, which gives target
databases the processing power they need for transformations.

ETL compared to ELT

ELT works well for high-volume, unstructured datasets that require frequent
loading. It is also ideal for big data because the planning for analytics can be done
after data extraction and storage. It leaves the bulk of transformations for the
analytics stage and focuses on loading minimally processed raw data into the data
warehouse.

The ETL process requires more definition at the beginning. Analytics needs to be
involved from the start to define target data types, structures, and relationships.
Data scientists mainly use ETL to load legacy databases into the warehouse, while
ELT has become the norm today.

Because ELT loads the extracted data into the target system first, the architecture of the analytics pipeline must also consider where to cleanse and enrich data, as well as how to conform dimensions.

Ralph Kimball and Joe Caserta's book The Data Warehouse ETL Toolkit, (Wiley,
2004), which is used as a textbook for courses teaching ETL processes in data
warehousing, addressed this issue.

Cloud-based data warehouses like Amazon Redshift, Google BigQuery, and Snowflake Computing have been able to provide highly scalable computing power. This lets businesses forgo preload transformations and replicate raw data into their data warehouses, where it can be transformed as needed using SQL.
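
The following sketch contrasts this with the ETL example earlier: raw records are loaded first and only then transformed inside the database with SQL, which is the essence of ELT. It uses sqlite3 purely as a stand-in for a cloud warehouse, and the table and column names are illustrative assumptions.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

    # Load: replicate raw, minimally processed records straight into a staging table.
    conn.execute("CREATE TABLE raw_sales (sale_id TEXT, amount TEXT, sold_at TEXT)")
    conn.executemany(
        "INSERT INTO raw_sales VALUES (?, ?, ?)",
        [("s1", "100.50", "2023-01-05"), ("s2", "80.00", "2023-02-10")],
    )

    # Transform: do the typing and shaping later, inside the warehouse, using SQL.
    conn.execute(
        """
        CREATE TABLE sales AS
        SELECT sale_id,
               CAST(amount AS REAL)       AS amount,
               strftime('%Y-%m', sold_at) AS sale_month
        FROM raw_sales
        """
    )

    print(conn.execute("SELECT sale_month, SUM(amount) FROM sales GROUP BY sale_month").fetchall())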

After having used ELT, data may be processed further and stored in a data mart.

There are pros and cons to each approach. Most data integration tools skew
towards ETL, while ELT is popular in database and data warehouse appliances.

Similarly, it is possible to perform TEL (Transform, Extract, Load) where data is first
transformed on a blockchain (as a way of recording changes to data, e.g., token
burning) before extracting and loading into another data store.

CHAPTER THREE

AN OVERVIEW OF DATA MINING

Data mining has become a valuable tool for businesses looking to get ahead in the
information economy. In this chapter, we’ll look at what data mining is, discuss
applications of data mining and cover some data mining techniques. We’ll also walk
you step by step through the data mining process.

What is data mining?

Data mining is the process of extracting patterns and other useful information from
large data sets. It’s sometimes known as knowledge discovery in data or KDD.
Thanks to the rise of big data and advancements in data warehousing technologies,
the use of data mining techniques has grown in recent decades, turning raw data
into valuable knowledge that companies can use.

Although technology has advanced to handle substantial datasets, executives still face automation and scalability challenges.

Data mining has improved corporate decision-making through clever data analytics. Data mining techniques can be broadly classified into two categories:

• Defining the target dataset
• Forecasting outcomes using machine learning methods

These tactics are used to organise and filter data – providing the most important
information, from fraud detection to user behaviours, bottlenecks, and even
security breaches.

Getting into the realm of data mining has never been easier, and collecting meaningful insights has never been faster – especially when combining data mining with data analytics and visualisation tools like Apache Spark. Artificial intelligence advancements are accelerating the adoption of data mining techniques across industries.

What is data mining used for?

The following are some of the applications of data mining:

• To achieve a corporate goal
• To answer business or research questions
• To contribute to problem-solving
• To aid in the accurate prediction of outcomes
• To analyse and predict trends and anomalies
• To inform forecasts
• To identify gaps and mistakes in processes, such as supply chain bottlenecks
or incorrect data entry

What are the benefits of data mining?

The benefits of data mining are many and varied. We live and operate in a data-
driven society, so gaining as many insights as possible is critical. In this complex
information age, data mining gives us the tools to solve challenges and issues. The
following are some of the benefits of data mining:

• It assists businesses in gathering reliable data
• It assists organisations in making well-informed decisions
• It is a time- and cost-effective solution when compared to other data
applications
• It enables organisations to make cost-effective production and operational
changes
• It aids in the detection of credit issues and fraud
• It enables data scientists to quickly evaluate massive amounts of data. Data
scientists can then use the data to spot fraud, create risk models, and
improve product safety
• It enables data scientists to create behaviour and trend forecasts and
uncover hidden patterns.

Examples of data mining

The following are common applications of the data mining process:

Retail

Retailers analyse purchase patterns to establish product categories and determine where they should be placed in aisles and on shelves. Data mining can also be used to determine which deals are most popular with customers or to boost sales in the checkout line.

Marketing

Data mining is being used to sift through ever-larger databases and improve market
segmentation. It is possible to predict consumer behaviour by analysing the
associations between criteria such as customer age, gender, tastes, and more, in
order to design tailored loyalty schemes.

In marketing, data mining predicts which consumers are most likely to unsubscribe
from a product, what they typically search for online, and what should be included
in a mailing list to increase response rates. It can play a valuable role in any digital
marketing strategy.

Media

Certain networks use real-time data mining to gauge their online television (IPTV)
and radio viewership. These systems capture and analyse anonymous data from
channel views, broadcasts, and programmes on the fly.

Data mining enables networks to provide personalised recommendations to radio and television listeners and viewers, as well as providing real-time data on customer interests and behaviour. Networks also acquire vital information for their marketers, who can use this information to better target their future customers.

Medicine

Data mining allows for more precise diagnosis. It is possible to provide more
effective therapies when all the patient’s information is available – such as medical
records, physical examinations, and treatment patterns. It also allows for more
effective, efficient, and cost-effective administration of health resources by
detecting risks, predicting illnesses in specific segments of the population, and
forecasting hospital admission length.

Data mining in medicine also has the benefit of detecting anomalies, as well as
developing better relationships with patients through a better understanding of
their requirements.

Banking

Data mining is used by banks to better understand market risk. It is often used to
analyse transactions, purchasing trends, and client financial data. Data mining also
enables banks to gain a better understanding of our online tastes and behaviours
in order to improve the return on their marketing initiatives, analyse financial
product effectiveness, and ensure regulatory compliance.

The data mining process

The CRISP-DM (Cross-Industry Standard Process for Data Mining) is the most widely
used data-mining framework. The CRISP-DM procedure is divided into six stages:
Business Understanding, Data Understanding, Data Preparation, Modelling,
Evaluation and Deployment.

These phases are tackled in sequential order, but the process is iterative: any models and understanding developed during the process are designed to be enhanced by subsequent knowledge gathered throughout the process.

1. Business Understanding

The first stage of CRISP-DM is to obtain a thorough understanding of the business and to determine the organisation’s specific needs or goals. Understanding a business means determining the issues the company wishes to address – for example, a company may want to boost response rates for various marketing efforts.

One of the first responsibilities in the Business Understanding phase is to dig down
to a more specific definition of the problem. The query could be narrowed to
determine which client subsets are most likely to make repeat purchases, or how
much they are willing to spend.

2. Data Understanding

Following the definition of the organisation’s goals, data scientists start discovering
what exists in the current data. A corporation may have information about a client’s
(or potential client’s) name, address, and other contact information. They may also
have records of previous purchases.

There may be information about client interests or family makeup, depending on the source of the data. All of this data may be very useful in future campaigns.

3. Data Preparation

Once we have a firm grasp on what data exists and what data does not, the data is
prepared and processed in a way that makes it valuable. The data preparation
procedure is lengthy and accounts for roughly 80% of the project’s time.

The creation of a data dictionary is the first step in the data preparation process.
The data is separated into chunks, and the elements of metadata are described in
human-readable terms so that they are understandable, especially to someone who
isn’t a data scientist.

Data analytics is the next part of the data preparation process and involves finding
and developing new data points that may be calculated from existing inputs.
Helpful profiles can be created using business analytics, which can subsequently be
used for predictive modelling and to develop well-targeted marketing campaigns.
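
As a small, hedged sketch of deriving new data points from existing inputs during data preparation, the snippet below computes an average order value and a recency field from raw purchase records; the field names and values are assumptions for the example.

    from datetime import date

    # Raw inputs as they might arrive from a source system (illustrative fields).
    customer = {
        "customer_id": "C-42",
        "total_spend": 1250.0,
        "order_count": 5,
        "last_purchase": date(2023, 11, 20),
    }

    # Derived data points calculated from the existing inputs.
    customer["avg_order_value"] = customer["total_spend"] / customer["order_count"]
    customer["days_since_last_purchase"] = (date(2024, 1, 15) - customer["last_purchase"]).days

    print(customer["avg_order_value"], customer["days_since_last_purchase"])  # 250.0 56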

4. Modelling

The information gathered during data preparation is then used to develop various
behavioural models. For example – in the case of marketing campaigns, modelling
involves the creation of “training data” representative of the ideal customer.

These consumer profiles are then used as models for scaling campaign success
through modelling. Modelling often involves the use of artificial intelligence.

5. Evaluation

It’s critical to provide clear visual reporting as information is processed, to really understand results on a cognitive level. Graphical presentation techniques are becoming increasingly important for not just comprehending but also recognising trends.

By itself, a stream of data may not appear to be significant, but when displayed on
a graph, trends can be quickly discerned. There are a variety of useful tools that will
rapidly generate visual reports, such as bar charts and scatter plots.
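
A minimal sketch of such a visual report is shown below, assuming matplotlib is available; the monthly figures are invented purely for illustration.

    import matplotlib.pyplot as plt

    # Illustrative monthly sales figures; on their own the numbers are hard to read,
    # but a simple bar chart makes the upward trend obvious at a glance.
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 128, 160, 185, 210]

    plt.bar(months, sales)
    plt.title("Monthly sales")
    plt.xlabel("Month")
    plt.ylabel("Units sold")
    plt.savefig("monthly_sales.png")   # or plt.show() in an interactive session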

6. Deployment

CRISP-DM is iterative by definition. Each stage not only informs the next one but
also the one before it. New information is applied to previous phases as it is
learned, and the models are informed and re-informed by each step of the process.

New data points emerge when the data is prepared; these improve when more
models are developed and assessed. The results of “final” deployments can be
transformed into new models for testing and assessment in the future.

Different data mining techniques

When moving through the six CRISP-DM stages, data scientists rely on a variety of
techniques. These include:

Tracking patterns

Learning to discover patterns in your data sets is one of the most basic data mining
techniques. This is frequently an identification of a periodic anomaly in the data or
the ebb and flow of a particular variable through time.

For example, you may find that sales of a particular product increase immediately
before the holidays, or that warmer weather sends more visitors to your website.
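
A hedged sketch of this kind of pattern tracking is shown below: it aggregates daily sales by month using only the standard library to expose a seasonal peak. The figures are invented for illustration.

    from collections import defaultdict

    # (date, units_sold) records; values are invented for illustration.
    daily_sales = [
        ("2023-10-03", 40), ("2023-10-17", 45),
        ("2023-11-05", 60), ("2023-11-21", 72),
        ("2023-12-02", 150), ("2023-12-18", 165),   # pre-holiday spike
    ]

    monthly_totals = defaultdict(int)
    for day, units in daily_sales:
        monthly_totals[day[:7]] += units            # group by "YYYY-MM"

    for month, total in sorted(monthly_totals.items()):
        print(month, total)   # December stands out, suggesting a seasonal pattern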

Forecasting

Prediction is one of the most important data mining methods, as it’s used to
forecast the types of data you’ll see in the future. In many circumstances, simply
noticing and understanding previous patterns is sufficient to provide a reasonable
prediction of what will occur in the future. For example, you may look at a
consumer’s credit history and previous transactions to see if they’re a credit risk in
the future.
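
A minimal, hedged sketch of that credit-risk style prediction follows, assuming scikit-learn is installed; the features and training data are fabricated solely to show the shape of the code.

    from sklearn.linear_model import LogisticRegression

    # Each row: [late_payments_last_year, credit_utilisation]; label 1 = defaulted before.
    X = [[0, 0.2], [1, 0.4], [4, 0.9], [3, 0.8], [0, 0.1], [5, 0.95]]
    y = [0, 0, 1, 1, 0, 1]

    model = LogisticRegression()
    model.fit(X, y)

    # Probability that a new applicant with 2 late payments and 70% utilisation is a credit risk.
    print(model.predict_proba([[2, 0.7]])[0][1])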

Association

Association determines links between different variables. In this situation, you’ll look for certain events that are linked with one another; for example, you might discover that when your consumers buy one thing, they frequently buy another, related item. This is commonly used to populate “people also bought” sections on online stores.
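
A small sketch of this co-occurrence idea, using only the standard library, is shown below; the baskets are invented, and a production system would typically use a dedicated association-rule algorithm such as Apriori.

    from collections import Counter
    from itertools import combinations

    # Each basket is the set of items bought together in one transaction (illustrative data).
    baskets = [
        {"laptop", "mouse"},
        {"laptop", "mouse", "laptop bag"},
        {"phone", "charger"},
        {"laptop", "laptop bag"},
    ]

    # Count how often each pair of items appears in the same basket.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # The most frequent pairs are candidates for a "people also bought" section.
    print(pair_counts.most_common(2))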

Classification

Classification is an advanced data mining technique that requires you to group together diverse attributes into discernible groups, which you can then use to draw additional conclusions or perform a specific job.

You might be able to designate individual consumers as “low,” “medium,” or “high” credit risks based on data about their financial backgrounds and buying history. These classifications might then be used to learn even more about those clients.
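
The sketch below illustrates that kind of risk classification with a decision tree, assuming scikit-learn is available; the financial features and risk labels are fabricated for the example.

    from sklearn.tree import DecisionTreeClassifier

    # Each row: [annual_income_in_thousands, missed_payments]; labels are risk bands.
    X = [[90, 0], [60, 1], [35, 4], [120, 0], [40, 3], [55, 2]]
    y = ["low", "low", "high", "low", "high", "medium"]

    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X, y)

    # Assign a risk band to a new customer earning 50k with 2 missed payments.
    print(clf.predict([[50, 2]])[0])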

Clustering

Clustering is similar to classification, in that it involves putting together groups of data based on their commonalities. For example, you may group different demographics of your audience into distinct categories based on their discretionary income or how frequently they purchase at your store.
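
A hedged sketch of clustering customers by discretionary income and purchase frequency follows, assuming scikit-learn is installed; the data points are invented.

    from sklearn.cluster import KMeans

    # Each row: [monthly_discretionary_income, purchases_per_month] (illustrative values).
    customers = [[200, 1], [220, 2], [800, 8], [850, 9], [500, 4], [520, 5]]

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(customers)

    # Customers with similar spending behaviour end up in the same cluster.
    for customer, label in zip(customers, labels):
        print(customer, "-> cluster", label)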

Outlier detection

In many circumstances, simply finding the overall pattern will not provide you with
a complete picture of your data. You must also be able to spot anomalies,
sometimes known as outliers, in your data.

If, for example, your customers are nearly all male but there’s a significant rise in
female customers during one week in July, you’ll want to research the spike and
figure out what caused it so that you can either reproduce it or better understand
your audience.
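
A minimal sketch of spotting such a spike with a simple z-score test, using only the standard library, is shown below; the weekly figures and the threshold of two standard deviations are illustrative choices.

    from statistics import mean, stdev

    # Weekly counts of female customers (illustrative); one week in July spikes sharply.
    weekly_counts = [14, 16, 15, 13, 17, 15, 62, 16, 14]

    avg = mean(weekly_counts)
    spread = stdev(weekly_counts)

    # Flag any week more than 2 standard deviations from the mean as an outlier.
    outliers = [(week, count) for week, count in enumerate(weekly_counts, start=1)
                if abs(count - avg) > 2 * spread]
    print(outliers)   # the spike week is reported for further investigation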

Regression

Regression is a form of planning and modelling used to estimate the likely value of a particular variable, given the values of other variables. You may use it, for example, to forecast a price based on other criteria such as availability, consumer demand, and competition. The main goal of regression is to help you figure out the relationship between several variables in a data set.
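
A small, hedged sketch of that price-forecasting idea with linear regression follows, assuming scikit-learn is installed; the features and observations are invented for illustration.

    from sklearn.linear_model import LinearRegression

    # Each row: [units_available, consumer_demand_index, competitor_price]; target is our price.
    X = [[100, 70, 19.0], [80, 85, 21.0], [60, 90, 22.5], [120, 60, 18.0], [90, 80, 20.0]]
    y = [20.0, 22.5, 24.0, 18.5, 21.5]

    model = LinearRegression()
    model.fit(X, y)

    # Forecast a price for a new combination of availability, demand and competition,
    # and inspect the fitted coefficients to see how each variable relates to price.
    print(model.predict([[85, 75, 20.5]])[0])
    print(model.coef_, model.intercept_)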

How to get started with data mining

Data mining and data science are best learned by doing, so start studying data as
soon as possible. However, you’ll also need to study the theory to develop a solid
statistical and machine learning foundation to understand what you’re doing and
glean valuable insights from the noise of data.

• Learn R and Python. These are the most popular languages for data mining.
• Take a course. A course will go more in-depth on the points summarised
here. FutureLearn offers courses on data mining, such as Data Mining with
Weka.
• Learn data mining software suites such as KNIME, SAS and MATLAB.
• Participate in data mining competitions, such as Bitgrit and Kaggle.
• Interact with other data scientists via groups and social networks. Browse
the Reddit data mining thread, and attend conferences such as ICDM.

Notes on Text and Web Mining and Digital Libraries follow
