DATA MINING TERMS &
CONCEPTS
DBMS
• A database system is the traditional way of storing and
retrieving data.
• The major task of database system is to perform query
processing.
• These systems are generally referred to as online
transaction processing (OLTP) systems.
• They are used for the day-to-day operations of an
organization.
Data Warehouse
• A data warehouse is a place where a huge amount of data
is stored.
• It is meant for users or knowledge workers in the role of
data analysis and decision making.
• These systems are referred to as online analytical
processing (OLAP) systems.
DBMS and Data Warehouse Difference
OLTP and OLAP
• OLTP
Transaction-oriented applications.
Mainly concerned with entry, storage, and retrieval of data.
Designed for day-to-day operations such as purchasing,
inventory, payroll, accounting, etc.
Basically supports DML operations.
Users of OLTP
Almost all industries including:
Airlines
Supermarkets
Banking
Insurance
Etc.
• Data captured in OLTP systems is usually stored in
commercial relational databases.
• For example, the database of a supermarket store consists of the
following tables to store data about its
transactions, products, inventory, employees, etc.:
• Transactions
• ProductName
• EmployeeDetails
• InventorySupplies
• Suppliers
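The DML operations that an OLTP system supports can be sketched with Python's built-in sqlite3 module. This is only an illustration: the columns of the Transactions table and the sample values are assumptions, since the slides name the table but not its layout.

```python
import sqlite3

# In-memory database standing in for the supermarket's OLTP store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Transactions ("
    "id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER, amount REAL)"
)

insert_sql = "INSERT INTO Transactions (product, quantity, amount) VALUES (?, ?, ?)"

# INSERT: record two sales as they happen (hypothetical values).
conn.execute(insert_sql, ("milk", 2, 3.0))
conn.execute(insert_sql, ("bread", 1, 1.5))

# UPDATE: correct the quantity and amount of the first sale.
conn.execute("UPDATE Transactions SET quantity = 3, amount = 4.5 WHERE id = 1")

# DELETE: void the second sale.
conn.execute("DELETE FROM Transactions WHERE id = 2")
conn.commit()

# SELECT: retrieve what remains.
oltp_rows = conn.execute(
    "SELECT id, product, quantity, amount FROM Transactions"
).fetchall()
print(oltp_rows)
```

Each statement touches a single row, which is typical of the short read, write, and delete operations described for OLTP.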
Advantages of OLTP
• Simplicity
• Efficiency
• Allows users to read, write, and delete data quickly
• Fast query processing
• Responds to user actions immediately and also supports transaction
processing on demand.
Challenges
• Security
• It requires concurrency control (locking) and
recovery mechanisms.
• The data content of an OLTP system is not suitable for decision
making.
• A typical OLTP system manages the current data within the
enterprise or organization. This data is too detailed and too
operational to support decision making directly.
Answer
The supermarket store is deciding on introducing a new
product. The key issues under debate are: “Which product should
they introduce?” and “Should it be specific to a few
customer segments?”
The supermarket store is looking at offering a discount
on its year-end sale. The questions here are: “How much
discount should they offer?” and “Should different discounts
be given to different customer segments?”
Answer: OLAP
• OLAP differs from a traditional database in the way the
data is conceptualized and stored.
• OLAP data are held in the dimensional
form rather than the relational form.
• The lifeblood of OLAP is the multidimensional data
model.
• The multidimensional data model views
the data in the form of a data cube.
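A data cube can be pictured as a mapping from one value per dimension to a measure. The sketch below assumes three hypothetical dimensions (product, region, quarter) and made-up unit counts, and shows two common cube operations: aggregating away dimensions (roll-up) and fixing one dimension to a value (slice).

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter) -> units sold.
DIMENSIONS = ("product", "region", "quarter")
cube = {
    ("milk", "north", "Q1"): 10,
    ("milk", "south", "Q1"): 7,
    ("milk", "north", "Q2"): 12,
    ("bread", "north", "Q1"): 5,
    ("bread", "south", "Q2"): 9,
}

def roll_up(cube, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for cell, units in cube.items():
        totals[tuple(cell[i] for i in positions)] += units
    return dict(totals)

def slice_cube(cube, dimension, value):
    """Keep only the cells where `dimension` equals `value`."""
    position = DIMENSIONS.index(dimension)
    return {cell: units for cell, units in cube.items() if cell[position] == value}

by_product = roll_up(cube, ["product"])
q1_only = slice_cube(cube, "quarter", "Q1")
print(by_product)
print(q1_only)
```

A question such as "which product sells best?" reads directly off `by_product`, which is the kind of analysis the relational OLTP tables were not designed for.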
Distributed Data Store (Distributed
Database)
• A distributed data store is a computer network where
information is stored on more than one node, often in a
replicated fashion. The term is usually used to refer to
a distributed database, where users store information on a
number of nodes.
Multidimensional Schema
• Multidimensional Schema is especially designed to model
data warehouse systems.
• The schemas are designed to address the unique needs
of very large databases designed for the analytical
purpose (OLAP).
• Two main types of schemas used are:
• Star Schema
• Snowflake Schema
Star Schema
• In a star schema for a data warehouse, the center of
the star is one fact table, which is surrounded by a number of
associated dimension tables.
• It is known as star schema as its structure resembles a
star.
• The Star Schema data model is the simplest type of Data
Warehouse schema.
Star schema Example
Characteristics of star schema
• Every dimension in a star schema is represented by
only one dimension table.
• Each dimension table contains a set of attributes.
• Each dimension table is joined to the fact table using a
foreign key.
• The dimension tables are not joined to each other.
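These characteristics can be sketched with sqlite3. The table names (fact_sales, dim_product, dim_store), their columns, and the rows are assumptions chosen for illustration. The fact table holds one foreign key per dimension, and the dimension tables are never joined to each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension, each with its own set of attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

-- The fact table at the center references every dimension by foreign key.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER
);

INSERT INTO dim_product VALUES (1, 'milk', 'dairy'), (2, 'bread', 'bakery');
INSERT INTO dim_store   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 7), (2, 1, 5);
""")

# Each dimension is reached with a single join from the fact table.
star_rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store   AS s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
print(star_rows)
```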
Snowflake Schema
Characteristics of Snowflake Schema
• It uses smaller disk space.
• It is easier to implement when a dimension is added to the
schema.
• Because of the multiple tables, query performance is reduced.
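The trade-off can be sketched with sqlite3, assuming the usual meaning of a snowflake schema, in which a dimension table is normalized into further tables. Table names, columns, and rows are hypothetical. The category name is stored once in its own table, which saves space, but reaching it from the fact table now takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: categories live in their own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER
);

INSERT INTO dim_category VALUES (1, 'dairy'), (2, 'bakery');
INSERT INTO dim_product  VALUES (1, 'milk', 1), (2, 'cheese', 1), (3, 'bread', 2);
INSERT INTO fact_sales   VALUES (1, 10), (2, 4), (3, 5);
""")

# Two joins are needed to get from a sale to its category name.
snowflake_rows = conn.execute("""
    SELECT c.category, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(snowflake_rows)
```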
Difference
ETL
• ETL is a process in Data Warehousing and it stands for
Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from
various data source systems, transforms it in the staging
area, and then finally, loads it into the Data Warehouse
system.
Extraction
• The first step of the ETL process is extraction.
• In this step, data is extracted from various source systems
into the staging area. The sources can be in various formats,
such as relational databases, NoSQL, XML, and flat files.
• It is important to extract the data from the source
systems into the staging area first and not
directly into the data warehouse, because the extracted
data is in various formats and may also be corrupted.
Transformation
• In this step, a set of rules or functions is applied to the
extracted data to convert it into a single standard format. It may
involve the following processes/tasks:
• Filtering – loading only certain attributes into the data warehouse.
• Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States, and America into USA, etc.
• Joining – joining multiple attributes into one.
• Splitting – splitting a single attribute into multiple attributes.
• Sorting – sorting tuples on the basis of some attribute (generally the
key attribute).
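The five transformation tasks can be sketched in plain Python. The staged records, the field names, and the default value for missing ages are all assumptions made for illustration.

```python
# Hypothetical extracted records, as they might sit in the staging area.
staged = [
    {"id": 3, "first": "Ann", "last": "Lee", "country": "U.S.A",
     "dob": "1990-05-17", "age": None, "fax": "n/a"},
    {"id": 1, "first": "Bob", "last": "Roy", "country": "America",
     "dob": "1985-11-02", "age": 38, "fax": "n/a"},
    {"id": 2, "first": "Cy", "last": "Kim", "country": "United States",
     "dob": "2001-01-30", "age": 23, "fax": "n/a"},
]

COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
DEFAULT_AGE = 0  # assumed default used to fill NULL values

def transform(record):
    # Splitting: one date attribute becomes separate year and month attributes.
    year, month, _day = record["dob"].split("-")
    # Filtering: only the attributes listed here are kept, so "fax" is dropped.
    return {
        "id": record["id"],
        # Joining: first and last name are merged into one attribute.
        "name": f'{record["first"]} {record["last"]}',
        # Cleaning: map country spellings to one value, fill missing ages.
        "country": COUNTRY_MAP.get(record["country"], record["country"]),
        "age": DEFAULT_AGE if record["age"] is None else record["age"],
        "birth_year": int(year),
        "birth_month": int(month),
    }

# Sorting: order the tuples by the key attribute.
transformed = sorted((transform(r) for r in staged), key=lambda r: r["id"])
for row in transformed:
    print(row)
```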
Loading
• In this step, the transformed data is finally loaded into the
data warehouse.
• In some systems the data warehouse is updated by loading
very frequently; in others, loading is done at
longer but regular intervals.
• The rate and period of loading depend solely on the
requirements and vary from system to system.
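A periodic load can be sketched with sqlite3 standing in for the warehouse. The customer table and the two batches are hypothetical, and using `INSERT OR REPLACE` so that a row loaded again overwrites its earlier version is one possible design choice, not the only one.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)

def load(conn, batch):
    """Load one batch of transformed rows; a row with an existing id is replaced."""
    conn.executemany(
        "INSERT OR REPLACE INTO customer (id, name, country) "
        "VALUES (:id, :name, :country)",
        batch,
    )
    conn.commit()

# First scheduled load.
load(warehouse, [{"id": 1, "name": "Bob Roy", "country": "USA"}])

# A later load repeats row 1 and adds row 2.
load(warehouse, [
    {"id": 1, "name": "Bob Roy", "country": "USA"},
    {"id": 2, "name": "Cy Kim", "country": "USA"},
])

loaded = warehouse.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(loaded)
```

How often `load` is called, for example every few minutes or once a night, is the rate and period that the requirements decide.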
Pipelining
Data mining
• Data mining has been defined as the non-trivial extraction
of implicit, previously unknown, and potentially useful
information from large data sets or databases.
Knowledge Discovery
• Knowledge discovery is the process of finding novel,
interesting, and useful patterns in data.
• Data mining is one step within knowledge discovery. In
practice, the term data mining is often used as a synonym for
Knowledge Discovery in Databases (KDD).
Information Retrieval
• Automatic retrieval of all relevant documents while at the
same time retrieving as few of the non-relevant as
possible.
• It has the primary goals of indexing text and searching for
useful documents in a collection.
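Both goals can be sketched in a few lines: an inverted index for indexing and searching text, and precision and recall as the standard measures of how many retrieved documents are relevant and how many relevant documents are retrieved. The document collection and the relevance judgment are made up for illustration.

```python
from collections import defaultdict

# Hypothetical document collection.
docs = {
    1: "data mining extracts patterns from data",
    2: "a data warehouse stores historical data",
    3: "cooking recipes for busy people",
}

# Indexing: map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return the ids of documents that contain every term of the query."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = search("data")
relevant = {1}  # assumed judgment: only document 1 answers the user's need
precision, recall = precision_recall(retrieved, relevant)
print(retrieved, precision, recall)
```

Here every relevant document is found (recall 1.0), but half of what is returned is not relevant (precision 0.5), which is the tension the first bullet describes.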
Triplet
• Data is an expression of feedback; a statement (rightly or
wrongly so) about an observation.
• Information is contextualized data.
• Knowledge is a phenomenon that implies our ability to
use the information for reasoning and decision making,
i.e., it is the basis of what you can, will, would, should or
might do with information.
Information Extraction
• Information Extraction has the goal of transforming a
collection of documents, usually with the help of an IR
system, into information that is more readily digested and
analyzed.
Knowledge Representation
• Knowledge representation is the presentation of
knowledge to the user for visualization in terms of trees,
tables, rules, graphs, charts, matrices, etc.
Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from
a set of low-level concepts to higher-level, more general
concepts.
• Depending on the type of the ordering relation we
distinguish several types of concept hierarchies.
Set-Grouping Hierarchy
• Concept hierarchies may also be defined by discretizing
or grouping values for a given dimension or attribute,
resulting in a set-grouping hierarchy.
Schema Hierarchy
• A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema
hierarchy.
Different user view point
• There may be more than one concept hierarchy for a
given attribute or dimension, based on different user
viewpoints.
• For instance, a user may prefer to organize price by
defining ranges for inexpensive, moderately_priced, and
expensive.
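Such a user-defined grouping can be sketched as a function that maps each price to its range. The two thresholds are assumptions; another user could choose different ones and obtain a different hierarchy for the same attribute.

```python
def price_band(price, low=10.0, high=50.0):
    """Map a numeric price to a higher-level concept.

    `low` and `high` are hypothetical cut-off values set by the user.
    """
    if price < low:
        return "inexpensive"
    if price < high:
        return "moderately_priced"
    return "expensive"

bands = [price_band(p) for p in (4.99, 25.0, 120.0)]
print(bands)
```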
Schema hierarchy
• Relating concept generality.
• The ordering reflects the generality of the attribute values,
e.g. street < city < state < country.
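The ordering street < city < state < country can be used to summarize records at a more general level. The sketch below counts hypothetical customer addresses at a chosen level of the hierarchy; the records themselves are made up.

```python
from collections import Counter

# Attributes ordered from most specific to most general.
SCHEMA_HIERARCHY = ("street", "city", "state", "country")

# Hypothetical customer addresses.
customers = [
    {"street": "MG Road", "city": "Pune", "state": "Maharashtra", "country": "India"},
    {"street": "FC Road", "city": "Pune", "state": "Maharashtra", "country": "India"},
    {"street": "Marine Drive", "city": "Mumbai", "state": "Maharashtra", "country": "India"},
]

def roll_up(records, level):
    """Count records per value of `level`, keeping all more general attributes."""
    kept = SCHEMA_HIERARCHY[SCHEMA_HIERARCHY.index(level):]
    return Counter(tuple(record[name] for name in kept) for record in records)

by_city = roll_up(customers, "city")
by_state = roll_up(customers, "state")
print(by_city)
print(by_state)
```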
Set-grouping hierarchy
• The ordering relation is the subset relation (⊆). Applies to
set values.
• Example:
• {13, ..., 39} = young; {13, ..., 19} = teenage;
• {13, ..., 19} ⊆ {13, ..., 39} ⇒ teenage < young
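The example can be checked directly with Python sets, where `<=` tests the subset relation. The helper `groups_for` is a hypothetical addition that lists the concepts an age belongs to, from most specific to most general.

```python
# The two groups from the example, as sets of ages.
GROUPS = {
    "teenage": set(range(13, 20)),  # {13, ..., 19}
    "young": set(range(13, 40)),    # {13, ..., 39}
}

# teenage is a subset of young, so teenage < young in the hierarchy.
teenage_below_young = GROUPS["teenage"] <= GROUPS["young"]

def groups_for(age):
    """Return the groups containing `age`, smallest (most specific) first."""
    matching = [name for name, members in GROUPS.items() if age in members]
    return sorted(matching, key=lambda name: len(GROUPS[name]))

print(teenage_below_young, groups_for(15), groups_for(30))
```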
Operation-derived hierarchy
• Produced by applying an operation (encoding, decoding,
information extraction).
• Example hierarchy: user-name < department < university <
education
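One way to read this example is as the decoding of an e-mail address. That reading is an assumption, as are the sample address and the interpretation of the final part (here "edu") as the education level. Under those assumptions, the decoding operation can be sketched as:

```python
def decode_email(address):
    """Split an address of the assumed form user@department.university.tld
    into hierarchy levels, ordered from most specific to most general."""
    user, domain = address.split("@")
    department, university, top_level = domain.split(".")
    return [user, department, university, top_level]

# Hypothetical address; "example" stands in for a university name.
levels = decode_email("alice@cs.example.edu")
print(" < ".join(levels))
```

The hierarchy is not stored anywhere; it is produced by applying the operation to the encoded value.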
Rule-based hierarchy
• Using rules to define the partial order.
• For example, the rule "if antecedent then consequent" defines the
order antecedent < consequent.