
SRINIVAS UNIVERSITY

INSTITUTE OF ENGINEERING AND TECHNOLOGY
MUKKA, MANGALURU

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

NOTES

DATA WAREHOUSE AND DATA MINING


SUBJECT CODE: 22SCS651

COMPILED BY:
Mrs. RAMEESA K, Assistant Professor

2024-25

UNIT - I

Introduction to Data Warehouse:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source
A and source B may have different ways of identifying a product, but in a data warehouse,
there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data is kept. For example, a
transaction system may hold only the most recent address of a customer, whereas a data warehouse
can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a
data warehouse should never be altered.

Data Warehouse Design Process:


A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
o The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that
must be solved are clear and well understood.
o The bottom-up approach starts with experiments and prototypes. This is useful in the
early stage of business modelling and technology development. It allows an organization
to move forward at considerably less expense and to evaluate the benefits of the
technology before making significant commitments.
o In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.

The warehouse design process consists of the following steps:

 Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process
is organizational and involves multiple complex object collections, a data warehouse
model should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
 Choose the grain of the business process. The grain is the fundamental, atomic level of
data to be represented in the fact table for this process, for example, individual
transactions, individual daily snapshots, and so on.
 Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
 Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.

Differences between operational database systems and data warehouses

Characteristics of Data Warehouse

Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore,
data warehouses typically provide a concise and straightforward view around a particular
subject, such as customer, product, or sales, instead of the global organization's ongoing
operations. This is done by excluding data that are not useful concerning the subject and
including all data needed by the users to understand the subject.

Integrated

A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among
different data sources.

Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from
3 months, 6 months, 12 months, or even older periods from a data warehouse. This contrasts
with a transaction system, where often only the most current data is kept.

Non-Volatile

The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data
should not change.

A Three-Tier Data Warehouse Architecture:

Tier-1:

The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation
(e.g., to merge similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse. The data are extracted using application
program interfaces known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server. Examples of gateways
include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding,
Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a
metadata repository, which stores information about the data warehouse and its contents.
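As a rough illustration of these back-end extract, clean, and load steps (an addition to the notes, not part of them), the following Python sketch uses sqlite3 as a stand-in for a gateway connection such as ODBC or JDBC; the database, table, and column names are invented for the example.

import sqlite3

# In-memory stand-ins for an operational source and the warehouse
# (a real deployment would connect through a gateway such as ODBC or JDBC).
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.executescript("""
CREATE TABLE orders (order_id INTEGER, product TEXT, amount REAL, order_date TEXT);
INSERT INTO orders VALUES (1, ' laptop ', 1200.0, '2024-01-05'),
                          (2, 'Laptop',   1150.0, '2024-01-06');
""")
warehouse.execute(
    "CREATE TABLE sales_staging (order_id INTEGER, product TEXT, "
    "dollars_sold REAL, time_key TEXT)"
)

# Extract: the client program generates SQL that is executed at the source server.
rows = source.execute("SELECT order_id, product, amount, order_date FROM orders").fetchall()

# Transform: a trivial cleaning step that unifies product naming across sources.
cleaned = [(oid, name.strip().upper(), amount, date) for (oid, name, amount, date) in rows]

# Load/refresh: append the cleaned rows into the warehouse.
warehouse.executemany("INSERT INTO sales_staging VALUES (?, ?, ?, ?)", cleaned)
warehouse.commit()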

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
 A ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
 A MOLAP model is a special-purpose server that directly implements multidimensional
data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Multidimensional Model
A multidimensional model views data in the form of a data-cube. A data cube enables data to
be modelled and viewed in multiple dimensions. It is defined by dimensions and facts. The
dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for
the dimensions time, item, and location. These dimensions allow the store to keep track of
things, for example, monthly sales of items and the locations at which the items were sold. Each
dimension has a table related to it, called a dimensional table, which describes the dimension
further. For example, a dimensional table for an item may contain the attributes item name,
brand, and type. A multidimensional data model is organized around a central theme, for
example, sales. This theme is represented by a fact table. Facts are numerical measures. The
fact table contains the names of the facts or measures of the related dimensional tables.
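As a small illustration of the cube idea (added here, with made-up rows and column names), a pandas pivot table can present the same sales facts along the time and item dimensions while aggregating the measure over location:

import pandas as pd

# Hypothetical fact rows: each sale is measured by dollars_sold and described
# by the dimensions time, item, and location.
sales = pd.DataFrame({
    "time":         ["Q1", "Q1", "Q2", "Q2"],
    "item":         ["PC", "Phone", "PC", "Phone"],
    "location":     ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [1200, 800, 1500, 950],
})

# One face of the data cube: dollars_sold by time x item, summed over location.
cube_face = sales.pivot_table(index="time", columns="item",
                              values="dollars_sold", aggfunc="sum")
print(cube_face)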

Data Warehouse Models:


There are three data warehouse models.
1. Enterprise warehouse:

 An enterprise warehouse collects all of the information about subjects spanning the entire organization.
 It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
 It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
 An enterprise data warehouse may be implemented on traditional mainframes,
computer super servers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.
2. Data mart:
 A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a
marketing data mart may confine its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized.
 Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is
more likely to be measured in weeks rather than months or years. However, it may
involve complex integration in the long run if its design and planning were not
enterprise-wide.
 Depending on the source of data, data marts can be categorized as independent or
dependent. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.
3. Virtual warehouse:

 A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized.
 A virtual warehouse is easy to build but requires excess capacity on operational
database servers.

Meta Data Repository:

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

o A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data
mart locations and contents.
o Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
o The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
o The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
o Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and scheduling
of refresh, update, and replication cycles.
o Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

Schema Design:
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let's look at each of these schema types.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
Star schema:
A star schema for All Electronics sales is shown in Figure. Sales are considered along four
dimensions, namely, time, item, branch, and location. The schema contains a central fact table
for sales that contains keys to each of the four dimensions, along with two measures: dollars
sold and units sold. To minimize the size of the fact table, dimension identifiers (such as time
key and item key) are system-generated identifiers. Notice that in the star schema, each
dimension is represented by only one table, and each table contains a set of attributes. For
example, the location dimension table contains the attribute set {location key, street, city,
province or state, country}. This constraint may introduce some redundancy.

For example, “Vancouver” and “Victoria” are both cities in the Canadian province of British
Columbia. Entries for such cities in the location dimension table will create redundancy among
the attributes province or state and country, that is, (..., Vancouver, British Columbia, Canada)
and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table
may form either a hierarchy (total order) or a lattice (partial order).
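One possible way to declare such a star schema is sketched below with SQLite-flavoured SQL issued from Python; the key and attribute names follow the description above, but the exact column lists are only illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Central fact table: one row per sale, a key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER,
    item_key     INTEGER,
    branch_key   INTEGER,
    location_key INTEGER,
    dollars_sold REAL,
    units_sold   INTEGER
);

-- One denormalized dimension table per dimension (only location is shown).
CREATE TABLE location_dim (
    location_key      INTEGER PRIMARY KEY,
    street            TEXT,
    city              TEXT,
    province_or_state TEXT,
    country           TEXT
);
""")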

Snowflake schema:
A snowflake schema for All Electronics sales is given in Figure. Here, the sales fact table is
identical to that of the star schema in Figure. The main difference between the two schemas is
in the definition of dimension tables.

The single dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables. For example, the item dimension table now contains
the attributes item key, item name, brand, type, and supplier key, where supplier key is linked
to the supplier dimension table, containing supplier key and supplier type information.
Similarly, the single dimension table for location in the star schema can be normalized into two
new tables: location and city.

Notice that further normalization can be performed on province or state and country in the
snowflake schema.

Fact constellation:
A fact constellation schema is shown in Figure. This schema specifies two fact tables, sales
and shipping. The sales table definition is identical to that of the star schema. The shipping
table has five dimensions, or keys: item key, time key, shipper key, from location, and to
location, and two measures: dollars cost and units shipped.

A fact constellation schema allows dimension tables to be shared between fact tables. For
example, the dimension tables for time, item, and location are shared between both the sales
and shipping fact tables. In data warehousing, there is a distinction between a data warehouse
and a data mart.

A data warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data
warehouses, the fact constellation schema is commonly used, since it can model multiple,
interrelated subjects. A data mart, on the other hand, is a department subset of the data
warehouse that focuses on selected subjects, and thus its scope is department wide. For data
marts, the star or snowflake schema are commonly used, since both are geared toward modeling
single subjects, although the star schema is more popular and efficient.

Measures: Their Categorization and Computation:


“How are measures computed?” To answer this question, we first study how measures can be
categorized. Note that a multidimensional point in the data cube space can be defined by a set
of dimension-value pairs, for example, ⟨time = "Q1", location = "Vancouver", item =
"computer"⟩. A data cube measure is a numerical function that can be evaluated at each point
in the data cube space. A measure value is computed for a given point by aggregating the data
corresponding to the respective dimension-value pairs defining the given point.

Measures can be organized into three categories (i.e., distributive, algebraic, holistic), based
on the kind of aggregate functions used.

Distributive: An aggregate function is distributive if it can be computed in a distributed manner
as follows. Suppose the data are partitioned into n sets. We apply the function to each partition,
resulting in n aggregate values. If the result derived by applying the function to the n aggregate
values is the same as that derived by applying the function to the entire data set (without
partitioning), the function can be computed in a distributed manner. For example, count() can
be computed for a data cube by first partitioning the cube into a set of sub cubes, computing
count() for each sub cube, and then summing up the counts obtained for each sub cube. Hence,
count() is a distributive aggregate function. For the same reason, sum(), min(), and max() are
distributive aggregate functions. A measure is distributive if it is obtained by applying a
distributive aggregate function. Distributive measures can be computed efficiently because
they can be computed in a distributive manner.
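A tiny numeric sketch (added for illustration, with made-up values) of why sum() and max() are distributive: applying the function to each partition and then to the partial results gives the same answer as applying it to the entire data set.

data = [4, 8, 15, 16, 23, 42]

# Partition the data into n = 2 sets.
partitions = [data[:3], data[3:]]

# Apply the aggregate to each partition, then to the partial aggregates.
partial_sums = [sum(p) for p in partitions]
distributed_result = sum(partial_sums)

# Same result as applying the aggregate to the entire data set.
assert distributed_result == sum(data)                 # sum() is distributive
assert max(max(p) for p in partitions) == max(data)    # so is max()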

OLAP (Online Analytical Processing):


• OLAP is an approach to answering multi-dimensional analytical (MDA) queries
swiftly.
• OLAP is part of the broader category of business intelligence, which also
encompasses relational databases, report writing, and data mining.
• OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.

OLAP consists of three basic analytical operations:


 Consolidation (Roll-Up)

 Drill-Down
 Slicing And Dicing

• Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions. For example, all sales offices are rolled up to the sales
department or sales division to anticipate sales trends.
• The drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by individual products that make up a region’s sales.
• Slicing and dicing is a feature whereby users can take out (slicing) a specific set of
data of the OLAP cube and view (dicing) the slices from different viewpoints.
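These three operations can be mimicked on a toy, made-up sales table in Python/pandas (the column names are illustrative, not taken from the notes):

import pandas as pd

sales = pd.DataFrame({
    "region":       ["West", "West", "East", "East"],
    "office":       ["Vancouver", "Victoria", "Toronto", "Ottawa"],
    "product":      ["PC", "Phone", "PC", "Phone"],
    "dollars_sold": [1200, 700, 900, 650],
})

# Roll-up (consolidation): aggregate offices up to the region level.
rollup = sales.groupby("region")["dollars_sold"].sum()

# Drill-down: go back to the finer office/product detail within one region.
drilldown = sales[sales["region"] == "West"].groupby(["office", "product"])["dollars_sold"].sum()

# Slice: fix one dimension value (product = "PC") and look at that sub-cube.
pc_slice = sales[sales["product"] == "PC"]

print(rollup, drilldown, pc_slice, sep="\n\n")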

Types of OLAP:
1. Relational OLAP (ROLAP):
• ROLAP works directly with relational databases. The base data and the dimension
tables are stored as relational tables and new tables are created to hold the aggregated
information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement (see the sketch after this list). ROLAP tools do not use pre-calculated data cubes but instead pose the query
to the standard relational database and its tables in order to bring back the data required
to answer the question.
• ROLAP tools feature the ability to ask any question because the methodology is not limited
to the contents of a cube. ROLAP also has the ability to drill down to the lowest
level of detail in the database.
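For instance, assuming dimension tables such as time_dim and location_dim alongside a sales_fact table like the one sketched earlier, a ROLAP slice and roll-up is simply an aggregate query with an extra WHERE clause (a sketch, not a prescribed implementation):

# Slicing the cube on location = 'Vancouver' and rolling up over time
# amounts to an aggregate query with an added WHERE clause.
query = """
SELECT t.quarter, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim t     ON f.time_key = t.time_key
JOIN   location_dim l ON f.location_key = l.location_key
WHERE  l.city = 'Vancouver'        -- the slice
GROUP  BY t.quarter                -- the roll-up
"""
# rows = conn.execute(query).fetchall()   # using a connection as sketched earlier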

Benefits:
• It is compatible with data warehouses and OLTP systems.

• The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a result, ROLAP does not limit the amount of data that can be stored.

Limitations:
• SQL functionality is constrained.
• It’s difficult to keep aggregate tables up to date.

2. Multidimensional OLAP (MOLAP):


• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores data in an optimized multi-dimensional array, rather than
in a relational database. Therefore, it requires the pre-computation and storage of
information in the cube - the operation known as processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The
data cube contains all the possible answers to a given range of questions (a sketch of this array-based storage follows this list).
• MOLAP tools have a very fast response time and the ability to quickly write back data
into the data set.
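A rough sketch of the MOLAP storage idea (added for illustration, with made-up dimension sizes): the measure is held in a dense multidimensional array indexed by the dimensions, so rolling a dimension up reduces to summing over its axis and a cell lookup is direct indexing.

import numpy as np

# dollars_sold stored as a dense cube whose axes are (time, item, location):
# 4 quarters x 3 items x 2 cities, filled with made-up values.
cube = np.arange(24, dtype=float).reshape(4, 3, 2)

# Pre-computable aggregates: rolling up a dimension is a sum over its axis.
by_time_item = cube.sum(axis=2)        # aggregate away location
by_time      = cube.sum(axis=(1, 2))   # aggregate away item and location

# A cell lookup (time index 1, item index 1, location index 0) is direct indexing.
q2_value = cube[1, 1, 0]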

Benefits:

• Suitable for slicing and dicing operations.


• Outperforms ROLAP when data is dense.
• Capable of performing complex calculations.

Limitations:
• It is difficult to change the dimensions without re-aggregating.
• Since all calculations are performed when the cube is built, a large amount of data cannot be
stored in the cube itself.

3. Hybrid OLAP (HOLAP):

• There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except
that a database will divide data between relational and specialized storage.
• For example, for some vendors, a HOLAP database will use relational tables to hold the
larger quantities of detailed data, and use specialized storage for at least some aspects of the
smaller quantities of more-aggregate or less-detailed data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities
of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources.

Benefits:

• HOLAP combines the benefits of MOLAP and ROLAP.

• Provide quick access at all aggregation levels.

Limitations

• Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely
complex.

• There is a greater likelihood of overlap, particularly in their functionalities.

DATA MINING:

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer: data mining would have been more appropriately named knowledge
mining, which emphasizes mining knowledge from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning, statistics, and database systems.

The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are

 Automatic discovery of patterns


 Prediction of likely outcomes
 Creation of actionable information
 Focus on large datasets and databases

The Scope of Data Mining


Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides.

Given databases of sufficient size and quality, data mining technology can generate new
business opportunities by providing these capabilities:

Automated prediction of trends and behaviours

Data mining automates the process of finding predictive information in large databases.
Questions that traditionally required extensive hands-on analysis can now be answered directly
from the data — quickly.

A typical example of a predictive problem is targeted marketing. Data mining uses data on past
promotional mailings to identify the targets most likely to maximize return on investment in
future mailings. Other predictive problems include forecasting bankruptcy and
other forms of default, and identifying segments of a population likely to respond similarly to
given events.

Automated discovery of previously unknown patterns.


Data mining tools sweep through databases and identify previously hidden patterns in one step.
An example of pattern discovery is the analysis of retail sales data to identify seemingly
unrelated products that are often purchased together. Other pattern discovery problems include
detecting fraudulent credit card transactions and identifying anomalous data that could
represent data entry keying errors.

Data Mining Functionalities:


We have observed various types of databases and information repositories on which
data mining can be performed. Let us now examine the kinds of data patterns that can be mined.
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories: descriptive and
predictive. Descriptive mining tasks characterize the general properties of the data in the
database. Predictive mining tasks perform inference on the current data in order to make
predictions.

In some cases, users may have no idea regarding what kinds of patterns in their data may be
interesting, and hence may like to search for several different kinds of patterns in parallel. Thus,
it is important to have a data mining system that can mine multiple kinds of patterns to
accommodate different user expectations or applications. Furthermore, data mining systems
should be able to discover patterns at various granularities (i.e., different levels of abstraction).
Data mining systems should also allow users to specify hints to guide or focus the search for
interesting patterns. Because some patterns may not hold for all of the data in the database, a
measure of certainty or “trustworthiness” is usually associated with each discovered pattern.

Data mining functionalities, and the kinds of patterns they can discover, are described below.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are
many kinds of frequent patterns, including item sets, subsequences, and substructures.

A frequent item set typically refers to a set of items that frequently appear together in a
transactional data set, such as milk and bread. A frequently occurring subsequence, such as the
pattern that customers tend to purchase first a PC, followed by a digital camera, and then a
memory card, is a (frequent) sequential pattern. A substructure can refer to different structural
forms, such as graphs, trees, or lattices, which may be combined with item sets or
subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and correlations
within the data.

Data mining involves six common classes of tasks:

 Anomaly detection (Outlier/change/deviation detection) – The identification of unusual
data records that might be interesting, or of data errors that require further
investigation.
 Association rule learning (Dependency modelling) – Searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can determine
which products are frequently bought together and use this information for marketing
purposes. This is sometimes referred to as market basket analysis (see the sketch after
this list).
 Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
 Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
 Regression – attempts to find a function which models the data with the least error.
 Summarization – providing a more compact representation of the data set, including
visualization and report generation.
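As a minimal, made-up illustration of the association rule (market basket) task mentioned above (the transactions and support threshold are invented), frequent item pairs can be found by simple co-occurrence counting:

from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

# Count how often each pair of items is bought together (its support count).
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least half of the baskets are treated as "frequent" here.
min_support = len(transactions) / 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # {('bread', 'milk'): 3, ('beer', 'bread'): 2}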

Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base:

This is the domain knowledge that is used to guide the search or evaluate the interestingness of
resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes
or attribute values into different levels of abstraction. Knowledge such as user beliefs, which
can be used to assess a pattern’s interestingness based on its unexpectedness, may also be
included. Other examples of domain knowledge are additional interestingness constraints or
thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional modules
for tasks such as characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data mining
modules so as to focus the search toward interesting patterns. It may use interestingness
thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may
be integrated with the mining module, depending on the implementation of the data mining
method used. For efficient data mining, it is highly recommended to push the evaluation of
pattern interestingness as deep as possible into the mining process, so as to confine the search
to only the interesting patterns.

4. User interface:

This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

Classification of Data Mining Systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including
database systems, statistics, machine learning, visualization, and information science.
Moreover, depending on the data mining approach used, techniques from other disciplines may
be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation,
inductive logic programming, or high-performance computing. Depending on the kinds of data
to be mined or on the given data mining application, the data mining system may also integrate
techniques from spatial data analysis, information retrieval, pattern recognition, image analysis,
signal processing, computer graphics, Web technology, economics, business, bioinformatics,
or psychology. Because of the diversity of disciplines contributing to data mining, data mining
research is expected to generate a large variety of data mining systems. Therefore, it is
necessary to provide a clear classification of data mining systems, which may help potential
users distinguish between such systems and identify those that best match their needs.

Data mining systems can be categorized according to various criteria, as follows:

Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications
involved), each of which may require its own data mining technique. Data mining systems can
therefore be classified accordingly.

For instance, if classifying according to data models, we may have a relational, transactional,
object-relational, or data warehouse mining system. If classifying according to the special
types of data handled, we may have a spatial, time-series, text, stream data, multimedia data
mining system, or a World Wide Web mining system.

Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive
data mining system usually provides multiple and/or integrated data mining functionalities.

Moreover, data mining systems can be distinguished based on the granularity or levels of
abstraction of the knowledge mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels
(considering several levels of abstraction). An advanced data mining system should facilitate
the discovery of knowledge at multiple levels of abstraction.

Data mining systems can also be categorized as those that mine data regularities (commonly
occurring patterns) versus those that mine data irregularities (such as exceptions, or outliers).
In general, concept description, association and correlation analysis, classification, prediction,
and clustering mine data regularities, rejecting outliers as noise. These methods may also help
detect outliers.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data
mining techniques employed. These techniques can be described according to the degree of
user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-
driven systems) or the methods of data analysis employed (e.g., database-oriented or data
warehouse–oriented techniques, machine learning, statistics, visualization, pattern recognition,
neural networks, and so on). A sophisticated data mining system will often adopt multiple data
mining techniques or work out an effective, integrated technique that combines the merits of a
few individual approaches.

Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they adapt. For
example, data mining systems may be tailored specifically for finance, telecommunications,
DNA, stock markets, e-mail, and so on. Different applications often require the integration of
application-specific methods. Therefore, a generic, all-purpose data mining system may not fit
domain-specific mining tasks.

Data Mining Process:

Data Mining is a process of discovering various models, summaries, and derived values from
a given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:
1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application domain. Hence,
domain-specific knowledge and experience are usually necessary in order to come up with a
meaningful problem statement. Unfortunately, many application studies tend to focus on the
data-mining technique at the expense of a clear problem statement. In this step, a modeler
usually specifies a set of variables for the unknown dependency and, if possible, a general
form of this dependency as an initial hypothesis. There may be several hypotheses formulated
for a single problem at this stage.
The first step requires the combined expertise of an application domain and a data-mining
model. In practice, it usually means a close interaction between the data-mining expert and the
application expert. In successful data-mining applications, this cooperation does not stop in the
initial phase; it continues during the entire data-mining process.
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are two
distinct possibilities. The first is when the data-generation process is under the control of an
expert (modeler): this approach is known as a designed experiment.

The second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications.

Typically, the sampling distribution is completely unknown after data are collected, or it is
partially and implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a priori knowledge
can be very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later for
testing and applying a model come from the same, unknown, sampling distribution. If this is
not the case, the estimated model cannot be successfully used in a final application of the
results.

Data Mining Task Primitives

A data mining task can be specified in the form of a data mining query, which is input to the
data mining system. A data mining query is defined in terms of data mining task primitives.
These primitives allow the user to interactively communicate with the data mining system
during discovery to direct the mining process or examine the findings from different angles or
depths. The data mining primitives specify the following,

1. Set of task-relevant data to be mined.


2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

Integration of a data mining system with a database system
The data mining system is integrated with a database or data warehouse system so that it can
perform its tasks effectively. A data mining system operates in an environment that requires it
to communicate with other information systems, such as a database system. The possible
integration schemes are as follows:
No coupling − No coupling means that a data mining system will not use any function of a
database or data warehouse system. It may retrieve data from a particular source (such as a file
system), process the data using some data mining algorithms, and then store the mining results
in another file.
Loose Coupling − In loose coupling, the data mining system uses some services of a database
or data warehouse system. The data is fetched from a data repository handled by these systems,
data mining approaches are used to process the data, and the processed data is then saved either
in a file or in a designated area of a database or data warehouse. Loose coupling is better than
no coupling because it can fetch a portion of the data stored in databases by using query
processing or other system facilities (a sketch of this scheme follows the descriptions below).
Semitight Coupling − In semitight coupling, efficient execution of a few essential data mining
primitives can be provided in the database/data warehouse system. These primitives can include
sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of
some important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling means that the data mining system is smoothly integrated
into the database/data warehouse system. The data mining subsystem is treated as one
functional component of the information system.
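A minimal sketch of the loose coupling scheme referred to above, using a tiny in-memory stand-in for the database and an invented sales_fact table: the task-relevant data is fetched through the DBMS, mined outside it, and the results are saved back to a designated table.

import sqlite3
from collections import Counter

# Tiny in-memory stand-in for a database handled by a DBMS.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sales_fact (item_key INTEGER, dollars_sold REAL);
INSERT INTO sales_fact VALUES (1, 100), (1, 250), (2, 80), (3, 40), (1, 60);
""")

# 1. Fetch the task-relevant data using the database's query processing.
item_keys = [row[0] for row in db.execute("SELECT item_key FROM sales_fact")]

# 2. Mine the fetched data outside the DBMS (here, a trivial frequency summary).
top_items = Counter(item_keys).most_common(3)

# 3. Save the results in a designated area of the database.
db.execute("CREATE TABLE mining_results (item_key INTEGER, freq INTEGER)")
db.executemany("INSERT INTO mining_results VALUES (?, ?)", top_items)
db.commit()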

Major Issues in Data Mining:


Mining different kinds of knowledge in databases. - Different users have different needs and
may be interested in different kinds of knowledge. Therefore, data mining must cover a broad
range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.
Data mining query languages and ad hoc data mining. - A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once patterns are discovered, they
need to be expressed in high-level languages or visual representations. These representations
should be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noise and incomplete objects while mining data regularities. Without such methods, the
accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns that
merely represent common knowledge or lack novelty are not interesting or useful.
Efficiency and scalability of data mining algorithms. - In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions, which are processed in parallel; the results from the partitions are then
merged. Incremental algorithms incorporate database updates without having to mine the
entire data again from scratch.
Data preprocessing: Data preprocessing is converting raw data into clean, well-defined data
sets that allow businesses to conduct data mining, analyze the data, and process it for business
activities. It is important for businesses to preprocess their data correctly, as they use various
forms of input to collect raw data, which can affect its quality. Preprocessing is an important
step, because raw data can be inconsistent or incomplete in its formatting. Effectively
preprocessing raw data increases its accuracy, which in turn increases the quality and
reliability of the projects that use it.

Importance of data preprocessing


Preprocessing data is an important step for data analysis. The following are some benefits of
preprocessing data:
1. It improves accuracy and reliability. Preprocessing data removes missing or
inconsistent data values resulting from human or computer error, which can improve
the accuracy and quality of a dataset, making it more reliable.
2. It makes data consistent. When collecting data, it's possible to have data duplicates, and
discarding them during preprocessing can ensure the data values for analysis are
consistent, which helps produce accurate results.

3. It increases the data's algorithm readability. Preprocessing enhances the data's quality
and makes it easier for machine learning algorithms to read, use, and interpret it.
Data Cleaning

Data cleaning is an essential step in the data mining process and is crucial to the construction
of a model, yet it is a step that is frequently overlooked. The major problem in quality
information management is data quality: problems with data quality can happen at any place
in an information system, and data cleansing offers a solution to these issues.
Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly
formatted, duplicated, or insufficient data from a dataset. Even if results and algorithms appear
to be correct, they are unreliable if the data is inaccurate. There are numerous ways for data to
be duplicated or incorrectly labeled when merging multiple data sources.
In general, data cleaning lowers errors and raises the quality of the data. Although it can be a
time-consuming and laborious operation, fixing data mistakes and removing incorrect
information must be done. Data mining itself, as a method for finding useful information in
data, is a crucial aid for cleaning up data. Data quality mining is a methodology that uses data
mining methods to find and fix data quality issues in sizable databases: data mining
automatically extracts intrinsic and hidden information from large data sets, and data cleansing
can be accomplished using a variety of data mining approaches.
To arrive at a precise final analysis, it is crucial to understand and improve the quality of the
data. To identify key patterns, the data must first be prepared and explored (this preparation is
sometimes understood as exploratory data mining). Before doing business analysis and gaining
insights, data cleaning in data mining enables the user to identify erroneous or missing data.
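As a small, hypothetical example of routine cleaning steps (the records and the choices below are invented, not prescribed by the notes), duplicated records, inconsistent formatting, and missing values can be handled as follows:

import pandas as pd

customers = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", None],
    "age":  [34, 34, None, 29],
    "city": ["Mangaluru", "Mangaluru", "Udupi", "Udupi"],
})

# Standardize formatting so duplicates can actually be detected.
customers["name"] = customers["name"].str.strip().str.title()

# Remove duplicated records and handle missing values.
customers = customers.drop_duplicates()
customers["age"] = customers["age"].fillna(customers["age"].median())
customers = customers.dropna(subset=["name"])   # discard rows with no usable key
print(customers)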
Data Integration
Data integration is the process of combining data from multiple sources into a cohesive and
consistent view. This process involves identifying and accessing the different data sources,
mapping the data to a common format, and reconciling any inconsistencies or discrepancies
between the sources. The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more complete and
accurate understanding of the data.
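A minimal sketch of such integration, assuming two invented sources that identify the same product differently: the identifiers are mapped to a common scheme and the records are merged into one consistent view.

import pandas as pd

# Two hypothetical sources that identify the same products differently.
source_a = pd.DataFrame({"prod_id": [1, 2], "price": [999, 499]})
source_b = pd.DataFrame({"product_code": ["P-1", "P-2"], "units": [10, 25]})

# Map source B's identifiers onto source A's scheme (schema reconciliation).
source_b["prod_id"] = source_b["product_code"].str.replace("P-", "", regex=False).astype(int)

# Integrate the two sources into a single consistent view.
integrated = source_a.merge(source_b[["prod_id", "units"]], on="prod_id")
print(integrated)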

Data Transformation:
In data transformation, the data are transformed or consolidated into forms appropriate for
mining.
Data transformation can involve the following:
1. Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
2. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the
data at multiple granularities.
3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level concepts, like city or country.
4. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0.0 to 1.0 (see the sketch after this list).
5. Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
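As a small sketch of the normalization step in the list above (the values and target range are made up), min-max scaling maps each value v to (v - min) / (max - min), which falls within 0.0 to 1.0:

values = [200, 300, 400, 600, 1000]   # e.g. daily sales amounts (invented)

lo, hi = min(values), max(values)
# Min-max normalization: scale each value into the range 0.0 to 1.0.
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)   # [0.0, 0.125, 0.25, 0.5, 1.0]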
Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that
is much smaller in volume, yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.

Strategies for data reduction include the following:


Data cube aggregation: where aggregation operations are applied to the data in the
construction of a data cube.
Attribute subset selection: where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Dimensionality reduction: where encoding mechanisms are used to reduce the dataset size.
Numerosity reduction: where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data) or nonparametric methods such as clustering, sampling, and the use
of histograms.

Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity
reduction that is very useful for the automatic generation of concept hierarchies. Discretization
and concept hierarchy generation are powerful tools for data mining, in that they allow the
mining of data at multiple levels of abstraction.
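A hedged sketch of discretization with a one-level concept hierarchy (the ages, bin count, and labels are invented): raw values are replaced by equal-width ranges, which are then mapped to higher-level concepts.

ages = [13, 22, 25, 34, 41, 47, 58, 63, 70]

# Equal-width discretization into 3 bins over the observed range.
lo, hi, bins = min(ages), max(ages), 3
width = (hi - lo) / bins
labels = ["young", "middle-aged", "senior"]   # one level of a concept hierarchy

def to_concept(age):
    # Index of the bin the age falls into (clamped so hi lands in the last bin).
    idx = min(int((age - lo) / width), bins - 1)
    return labels[idx]

print([(a, to_concept(a)) for a in ages])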
