DWDM Material
Database models:
A flat file database is a database that stores information in a single file or table. In a text file, every line contains one record, and the fields either have a fixed length or are separated by commas, whitespace, tabs or some other character. In a flat file database there is no structural relationship among the records, nor can it contain multiple tables.
Advantages:
A flat file database is simple to understand, easy to set up, and needs no special database software.
Hierarchical database:
In a hierarchical database, the entity type is the main table, rows of a table represent the records and columns represent the attributes. A typical example is a CUSTOMER record as the parent with two children (CHCKACCT & SAVEACCT).
Advantages:
In a hierarchical database, access to information is fast because of the predefined paths. This increases the performance of the database.
The relationships among different entities are easy to understand.
Disadvantages:
The hierarchical database model lacks flexibility. If a new relationship is to be established between two entities, a new and possibly redundant database structure has to be built.
Maintenance of data is inefficient in a hierarchical model. Any change in the relationships may require manual reorganization of the data.
This model is also inefficient for non-hierarchical accesses.
Network database (1970s – 1990s):
The inventor of the network model is Charles Bachman. Unlike the hierarchical database model, a network database allows multiple parent and child relationships, i.e., it maintains many-to-many relationships. A network database is basically a graph structure. The network database model was created to achieve three main objectives:
To represent complex data relationships more effectively.
To improve the performance of the database.
To implement a database standard.
In a network database a relationship is referred to as a set. Each set comprises two types of records: an owner record, which is the same as the parent type in the hierarchical model, and a member record, which is similar to the child type record in the hierarchical database model.
Advantages:
The network database model makes data access easy and efficient, since an application can access an owner record and all the member records within a set.
This model is conceptually easy to design.
This model ensures data integrity because no member can exist without an owner, so the user must create the owner entry first and then the member records.
The network model also ensures the data independence because the application works independently of the data.
Disadvantages:
The model lacks structural independence, which means that any change in the database structure also requires the application program to be modified before it can access the data.
A user-friendly database management system is difficult to build on the network model.
Implementation of network database:
Network database is implemented in:
Digital Equipment Corporation DBMS-10
Digital Equipment Corporation DBMS-20
RDM Embedded
Turbo IMAGE
Univac DMS-1100 etc.
Relational database (1980s – present):
The relational database model was proposed by E.F. Codd. After the hierarchical and network models, the birth of this model was a huge step forward. It allows entities to be related through a common attribute, so in order to relate two tables (entities) they simply need to share a common attribute. Tables contain primary keys and foreign keys, and a foreign key in one table refers to the primary key of another; this is what forms the relationship and makes the model extremely flexible.
Thus, using a relational database, ample information can be stored in small tables. Access to the data is also very efficient: the user only has to enter a query, and the application returns the requested information.
Relational databases are established using a computer language, Structured Query Language (SQL). This language forms the basis of all the
database applications available today, from Access to Oracle.
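As an illustration, here is a minimal sketch of relating two tables through a common attribute and querying them with SQL, using Python's built-in sqlite3 module; the customer/orders tables and their columns are purely hypothetical.

import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 2500.0), (11, 1, 400.0), (12, 2, 999.0)])

# The common attribute cust_id relates the two tables.
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customer c JOIN orders o ON c.cust_id = o.cust_id
    GROUP BY c.name
""")
print(cur.fetchall())   # [('Asha', 2900.0), ('Ravi', 999.0)]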
Advantages:
Relational databases support mathematical set operations like union, intersection, difference and Cartesian product. They also support select, project, relational join and division operations.
Relational databases use normalization, which helps to achieve data independence more easily.
Security control can also be implemented more effectively by imposing an authorization control on the sensitive attributes present in a
table.
Relational database uses a language which is easy and human readable.
Disadvantages:
The response to a query becomes time-consuming and inefficient as the number of tables among which relationships are established increases.
Implementations of the relational database include:
Oracle
Microsoft
IBM
MySQL
PostgreSQL
SQLite
Object-oriented database (1990s – present):
An object-oriented database management system is a database system in which the data or information is represented in the form of objects, much as in an object-oriented programming language. Furthermore, object-oriented DBMSs also facilitate the user by offering transaction support, a query language, and indexing options. These database systems can also handle data efficiently over multiple servers.
Unlike a relational database, an object-oriented database works within the framework of real programming languages like Java or C++.
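As a rough illustration of persisting objects directly, here is a minimal sketch using Python's standard shelve module; a real object-oriented DBMS offers far more (queries, indexing, transactions), so this only conveys the idea that objects need no mapping to rows and columns.

import shelve

class Customer:
    # An ordinary application object; no translation to tables is needed.
    def __init__(self, cust_id, name, accounts):
        self.cust_id = cust_id
        self.name = name
        self.accounts = accounts   # e.g. ["CHCKACCT", "SAVEACCT"]

with shelve.open("customers_demo") as db:
    db["1"] = Customer(1, "Asha", ["CHCKACCT", "SAVEACCT"])   # store the object as-is

with shelve.open("customers_demo") as db:
    c = db["1"]
    print(c.name, c.accounts)   # the stored object comes back as an object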
Advantages:
If there are complex (many-to-many) relationships between the entities, the object-oriented database handles them much faster than any
of the above discussed database models.
Navigation through the data is much easier.
Objects do not require assembly or disassembly hence saving the coding and execution time.
Disadvantages:
Lower efficiency when the data and relationships are simple.
Data can only be accessed via a specific language using a particular API, which is not the case with relational databases.
Object-relational database (1990s – present):
Defined in simple terms, an object-relational database management system presents an object-oriented user view on top of an already implemented relational database management system. When software interacts with this modified database management system, it customarily operates as though the data were stored as objects.
The basic working of this database management system is that it translates the useful data into organized tables, distributed in rows and columns, and from then onwards it manages the data the same way as a relational database system. Similarly, when the data is to be accessed by the user, it is translated back from the tabular form into objects.
Advantages:
Data remains encapsulated in object-relational database.
Concept of inheritance and polymorphism can also be implemented in this database.
Disadvantages:
Object relational database is complex.
Proponents of the relational approach believe the simplicity and purity of the relational model are lost.
It is costly as well.
Data Warehouse
According to The Data Warehouse Institute, a data warehouse is the foundation for a successful BI program. The concept of data warehousing is pretty easy
to understand—to create a central location and permanent storage space for the various data sources needed to support a company’s analysis, reporting
and other BI functions.
But a data warehouse also costs money — big money. The problem is when big money is involved it’s tough to justify spending it on any project, especially
when you can’t really quantify the benefits upfront. When it comes to a data warehouse, it’s not easy to know what the benefits are until it’s up and running.
According to BI-Insider.com, here are the key benefits of a data warehouse once it’s launched.
By providing data from various sources, managers and executives will no longer need to make business decisions based on limited data or their gut. In
addition, “data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial
management, and sales.”
Since business users can quickly access critical data from a number of sources—all in one place—they can rapidly make informed decisions on key
initiatives. They won’t waste precious time retrieving data from multiple sources.
Not only that but the business execs can query the data themselves with little or no support from IT—saving more time and more money. That means the
business users won’t have to wait until IT gets around to generating the reports, and those hardworking folks in IT can do what they do best—keep the
business running.
A data warehouse stores large amounts of historical data so you can analyze different time periods and trends in order to make future predictions. Such data
typically cannot be stored in a transactional database or used to generate reports from a transactional system.
Finally, the piece de resistance—return on investment. Companies that have implemented data warehouses and complementary BI systems have generated
more revenue and saved more money than companies that haven’t invested in BI systems and data warehouses.
And that should be reason enough for senior management to jump on the data warehouse bandwagon.
Data Mining
What is the meaning of data mining? Data mining is the process by which companies turn raw data into useful information. In order to develop effective marketing strategies, and thereby decrease costs and increase sales, businesses use software to look for patterns in large batches of data. Data mining depends heavily on data warehousing, computer processing, and effective data collection. These processes are useful for building machine learning models that power applications like search engine technology and programs that recommend websites.
Data mining is used in various fields like research, business, marketing, sales, product development, education, and healthcare. When used appropriately, data mining provides a major advantage over competing establishments by providing more information about customers and by helping to develop better and more effective marketing strategies, which raise revenue and lower costs. In order to achieve excellent results from data mining, a number of tools and techniques are required.
• Data cleansing and preparation- in this step data is transformed into a form suitable for further processing and analysis, for example by identifying and removing errors and handling missing data.
• Artificial intelligence (AI)- these systems perform analytical activities associated with human intelligence, such as reasoning, planning, learning, and problem-solving.
• Association rule learning- also known as market basket analysis, these tools search the dataset for relationships between variables, such as determining which products customers tend to purchase together.
• Clustering- a process in which the dataset is partitioned into sets of related items called clusters, which help users understand the structure or natural groups in the data.
• Classification- this technique assigns the items in the dataset to target classes, with the goal of predicting the class of each case in the data.
• Data analytics- data analytics is the process of evaluating digital information and converting it into information useful for
business.
• Data warehousing- a large collection of data that is of foundational importance to most large-scale data mining efforts and is used for decision making in organizations.
• Machine learning- a computer programming technique that uses statistical probabilities to give the computer the capacity to ‘learn’ without being explicitly programmed.
• Regression- a technique used to predict a range of numeric values, such as sales, the price of a stock, or temperatures, based on a particular dataset.
Concepts of customers include, for example, big spenders and budget spenders, characterized by how they purchase items.
The classes and concepts must be described in clear and concise terms; this is known as a "class / concept description".
How To Find Such Descriptions ?
There are three ways to find class / concept description.
Data Characterization – This is summarizing the data of target class based on features.
Data Discrimination – This compares the target class with one or more comparative classes called the contrasting classes.
The third option is to use both data characterization and data discrimination.
What Are The Methods Of Data Characterization ?
There are simple methods to characterize the data. One is simple summaries based on statistical measures. Second is the roll-up operation on OLAP data
cube that also summarizes the data. The third method is to use the attribute oriented induction technique.
The output of characterization can be presented in various forms. For example, Pie Charts, Bar Charts, Curves, Multi-dimensional data cubes and multi-
dimensional tables.
The resulting descriptions can also be presented as generalized relations or in rule form called characteristic rules.
Data Characterization Example
Q1. Produce a description summarizing the characteristics of customers who spend Rs 3000 in the last 6 months.
The result could be a general profile of such customers, for example 40-50 years old, employed, with a good credit rating, and so on.
Data Discrimination Example
Q2. Compare the features of software products whose sales went up by 10% in the last year with those of software products whose sales went down by 30%.
The resultant description is similar to that of data characterization; however, it also includes comparative measures that distinguish the target class from the contrasting class.
Frequent Sub-sequence
A customer buying first a PC, then a camera, and then a memory card is an example of a frequent sub-sequence.
Frequent Sub-structure
The sub-structure means different structural forms such as graphs, trees, and lattices, that are combined with item-sets and/or sub-sequences.
Association Analysis
Association analysis identifies relationships between items that are bought together during transactions. Consider the following example.
Example: A store wants to know which items are purchased together, so it creates rules such as
buys(X, "computer") => buys(X, "computer accessories") [confidence = 50%]
where X is the customer. A confidence or certainty of 50% means that if X buys a computer, there is a 50% chance that he will also buy computer accessories.
Support is the percentage of all transactions in which the computer and the computer accessories were bought together.
Consider another rule like above.
age(X, "20 ... 29") And income(X, "20K ... 29K") => buys (X, "CD Player") [support = 2%, confidence = 60%]
The second rule is an example of a multi-dimensional rule, where more than one attribute or dimension (age, income and buys) is involved.
Therefore, frequent item-set mining is the simplest form of frequent pattern mining.
Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed and used, including association, classification,
clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata, and it helps to classify data into different classes (a minimal code sketch of classification on sample data follows the sub-classifications below). Data mining frameworks themselves can also be classified, as follows:
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example discrimination, classification, clustering, characterization, etc. Some frameworks are comprehensive and offer several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms,
visualization, statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account, the level of user interaction involved in the data mining procedure, such as query-driven
systems, autonomous systems, or interactive exploratory systems.
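As referenced above, here is a minimal, hedged sketch of classification on made-up data using scikit-learn's DecisionTreeClassifier; the feature names and values are purely illustrative.

from sklearn.tree import DecisionTreeClassifier

# Illustrative training data: [age, income_in_thousands] -> buys_computer (1 = yes, 0 = no)
X_train = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the class of new, unseen records.
print(model.predict([[40, 70], [22, 25]]))   # e.g. [1 0]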
2. Clustering:
Clustering is a division of data into groups of related objects. Describing the data by a few clusters loses certain fine details but achieves simplification: it models data by its clusters. Historically, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an outstanding role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that clustering analysis is a data mining technique for identifying similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
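A minimal clustering sketch on made-up two-dimensional points, assuming scikit-learn's KMeans is an acceptable stand-in for the clustering algorithm:

from sklearn.cluster import KMeans

# Illustrative points forming two natural groups, around (1, 1) and (8, 8).
points = [[1, 2], [1, 1], [2, 1], [8, 8], [9, 8], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(points)

print(kmeans.labels_)           # cluster assignment of each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the centre of each discovered group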
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to predict the value of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
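A minimal regression sketch with scikit-learn's LinearRegression on made-up data; the underlying relationship (roughly y = 2x + 1) is an assumption chosen only for illustration.

from sklearn.linear_model import LinearRegression

# Illustrative data roughly following y = 2x + 1.
X = [[1], [2], [3], [4], [5]]
y = [3.1, 4.9, 7.2, 9.0, 11.1]

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_, reg.intercept_)   # estimated slope and intercept
print(reg.predict([[6]]))          # predicted value for x = 6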
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to find sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have various data, for example a list of grocery items that you have been buying for the last six months, and it calculates the percentage of items being purchased together.
o Lift:
This measure gauges how much more often items A and B occur together than expected if they were independent; it is the confidence of the rule divided by the overall support of item B.
Lift(A => B) = Confidence(A => B) / Support(B)
o Support:
This measure gauges how often items A and B are purchased together, relative to the overall dataset.
Support(A => B) = (transactions containing both A and B) / (all transactions)
o Confidence:
This measure gauges how often item B is purchased when item A is purchased as well.
Confidence(A => B) = (transactions containing both A and B) / (transactions containing A)
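A minimal sketch computing the three measures above from a made-up list of transactions (item names are illustrative):

# Illustrative transactions from a grocery store.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: bread => milk
sup_rule = support({"bread", "milk"})        # support of the rule     = 0.4
conf_rule = sup_rule / support({"bread"})    # confidence of the rule  = 0.67
lift_rule = conf_rule / support({"milk"})    # lift of the rule        = 0.83

print(sup_rule, conf_rule, lift_rule)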
5. Outlier detection:
This type of data mining technique relates to observing data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas like network intrusion identification, credit or debit card fraud detection, and detecting outliers in wireless sensor network data.
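A minimal outlier-detection sketch using a simple z-score rule; the 2-standard-deviation threshold is an arbitrary choice for illustration.

import statistics

# Illustrative measurements with one obvious outlier.
values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]

mean = statistics.mean(values)
std = statistics.stdev(values)

# Flag points that lie more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / std > 2]
print(outliers)   # [25.0]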
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data in order to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Data mining is widely used in diverse areas. There are a number of commercial data mining system available today and yet there are many
challenges in this field. In this tutorial, we will discuss the applications and the trend of data mining.
• Design and construction of data warehouses for multidimensional data analysis and data mining.
Retail Industry
Data mining has great application in the retail industry because this industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Customer Retention.
• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become the major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection −
• Association and correlation analysis, aggregation to help select and build discriminating attributes.
• Data Types − The data mining system may handle formatted text, record-based data, and relational data. The data could also be in ASCII
text, relational database data or data warehouse data. Therefore, we should check what exact format the data mining system can handle.
• System Issues − We must consider the compatibility of a data mining system with different operating systems. One data mining system
may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow
XML data as input.
• Data Sources − Data sources refer to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files while others work on multiple relational sources. The data mining system should also support ODBC connections or OLE DB for ODBC connections.
• Data Mining functions and methodologies − There are some data mining systems that provide only one data mining function, such as classification, while others provide multiple data mining functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.
• Coupling data mining with databases or data warehouse systems − Data mining systems need to be coupled with a database or a data
warehouse system. The coupled components are integrated into a uniform information processing environment. Here are the types of
coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
• Scalability − There are two scalability issues in data mining −
o Row (Database size) Scalability − A data mining system is considered row scalable when, if the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the same query.
o Column (Dimension) Scalability − A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple
levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle the noise and incomplete objects while mining
the data regularities. If the data cleaning methods are not there then the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered may be uninteresting because they represent common knowledge or lack novelty; evaluating the interestingness of discovered patterns is therefore an important issue.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore mining knowledge from them adds challenges to data mining.
• An attribute is an object’s property or characteristic, for example a person’s hair colour or air humidity.
• An attribute set defines an object. The object is also referred to as a record, an instance or an entity.
Different types of attributes or data types:
1. Nominal Attribute:
Nominal attributes provide only enough information to distinguish one object from another, such as a student roll number or the sex of a person.
2. Ordinal Attribute:
The ordinal attribute value provides sufficient information to order the objects, such as rankings, grades or height.
3. Binary Attribute:
These take only the values 0 and 1, where 0 indicates the absence of a feature and 1 indicates its presence.
4. Numeric attribute: It is quantitative, such that the quantity can be measured and represented as integer or real values. Numeric attributes are of two types.
Interval-scaled attribute:
It is measured on a scale of equal-size units; the values of these attributes have order and can be compared, such as temperature in C or F.
• Accuracy:
There are many possible reasons for flawed or inaccurate data here, e.g. having incorrect attribute values, which could be due to human or computer errors.
• Completeness:
Incomplete data can occur for various reasons; attributes of interest, such as customer information for sales and transaction data, may not always be available.
• Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent input field formats. Duplicate tuples also need data cleaning.
• Timeliness:
Timeliness also affects the quality of the data. At the end of the month, several sales representatives fail to file their sales records on time, and several corrections and adjustments flow in after the end of the month. Data stored in the database is therefore incomplete for a time after each month.
• Believability:
It is reflective of how much users trust the data.
• Interpretability:
It is a reflection of how easy the users can understand the data.
*** Statistical Descriptions of Data: see the separate PDF for this topic ***
Data Visualization
Pros
• It can be accessed quickly by a wider audience.
Cons
• It can misrepresent information if an incorrect visual representation is made.
Here the total length of the red line gives the Manhattan distance between the two points.
3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of their intersection divided by the size of their union; the Jaccard distance is one minus the Jaccard index.
Figure – Jaccard Index
4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan Distance Measure. In an N-dimensional space, a point is represented
as,
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance of order p between P1 and P2 is given as:
D(P1, P2) = (|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p)
For p = 1 it reduces to the Manhattan distance, and for p = 2 to the Euclidean distance.
5. Cosine similarity:
cos(theta) = (A · B) / (||A|| ||B||)
Here theta gives the angle between the two vectors, and A and B are n-dimensional vectors.
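A minimal sketch computing these measures for two illustrative points/vectors in plain Python (no special libraries assumed):

import math

p1 = [1.0, 2.0, 3.0]
p2 = [4.0, 0.0, 3.0]

manhattan = sum(abs(a - b) for a, b in zip(p1, p2))                  # order-1 Minkowski
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))     # order-2 Minkowski

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

dot = sum(a * b for a, b in zip(p1, p2))
cosine_similarity = dot / (math.sqrt(sum(a * a for a in p1)) * math.sqrt(sum(b * b for b in p2)))

set_a, set_b = {"milk", "bread", "butter"}, {"milk", "bread", "jam"}
jaccard_index = len(set_a & set_b) / len(set_a | set_b)

print(manhattan, euclidean, minkowski(p1, p2, 3), cosine_similarity, jaccard_index)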
• For example, the model will treat America and america as different classes or values, even though they represent the same value, or red, yellow and red-yellow as different classes, even though one class can be included in the other two. These structural errors make our model inefficient and give poor-quality results.
3. Managing Unwanted outliers
Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. Generally, we should not remove outliers unless we have a legitimate reason to remove them. Sometimes removing them improves performance, sometimes not, so one must have a good reason to remove an outlier, such as suspicious measurements that are unlikely to be part of the real data.
4. Handling missing data
Missing data is a deceptively tricky issue in machine learning. We cannot simply ignore or remove missing observations; they must be handled carefully, as they can be an indication of something important. The two most common ways to deal with missing data are:
1. Dropping observations with missing values.
Dropping missing values is sub-optimal because when you drop observations, you drop information.
• The fact that the value was missing may be informative in itself.
• Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
2. Imputing the missing values from past observations.
Imputing missing values is sub-optimal because the value was originally missing and you filled it in, which always leads to a loss of information, no matter how sophisticated your imputation method is.
• Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the
patterns already provided by other features.
Both of these approaches are sub-optimal: dropping an observation means dropping information, thereby reducing the data, while imputing values fills in values that were not present in the actual dataset, which also leads to a loss of information.
Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s
like trying to squeeze in a piece from somewhere else in the puzzle.
So, missing data is almost always informative and an indication of something important, and we must make our algorithm aware of missing data by flagging it. By using this technique of flagging and filling, you essentially allow the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
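A minimal flag-and-fill sketch with pandas on a made-up column; the column name and the fill constant are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000, None, 45000]})

# Flag missingness so the algorithm can learn from it...
df["income_missing"] = df["income"].isnull().astype(int)

# ...then fill the missing values (here with 0; the flag carries the "was missing" signal).
df["income"] = df["income"].fillna(0)

print(df)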
There are mainly 2 major approaches for data integration – one is “tight coupling approach” and another is “loose coupling
approach”.
Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical location through the process of ETL –
Extraction, Transformation and Loading.
Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms it in a way the source database can
understand and then sends the query directly to the source databases to obtain the result.
• And the data only remains in the actual source databases.
2. Attribute Subset Selection:
Suppose there are the following attributes in the data set, of which a few are redundant. Stepwise forward selection adds the best remaining attribute at each step (a code sketch of this procedure follows the steps below):
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
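As referenced above, a minimal sketch of greedy forward selection, assuming scikit-learn and a made-up dataset in which the columns stand for X1..X6; it scores candidate subsets by cross-validation and keeps the best attribute at each step.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                        # columns stand for X1..X6
y = (X[:, 0] + X[:, 1] + X[:, 4] > 0).astype(int)    # assume only X1, X2, X5 actually matter

selected, remaining = [], list(range(6))
for _ in range(3):                                   # select three attributes, as in the steps above
    scores = {c: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [c]], y, cv=3).mean()
              for c in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print("selected so far:", [f"X{i + 1}" for i in selected])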
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (e.g. Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique used.
• Lossless Compression –
Encoding techniques such as run-length encoding allow a simple but modest reduction in data size (a minimal run-length encoding sketch follows this list). Lossless data compression uses algorithms that restore the precise original data from the compressed data.
• Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format uses lossy compression, but we can still find meaning equivalent to the original image. In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from.
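As referenced above, a minimal run-length encoding sketch; being lossless, the original sequence can be restored exactly from the compressed form.

def rle_encode(data):
    # Collapse runs of identical symbols into (symbol, count) pairs.
    encoded = []
    for symbol in data:
        if encoded and encoded[-1][0] == symbol:
            encoded[-1][1] += 1
        else:
            encoded.append([symbol, 1])
    return encoded

def rle_decode(encoded):
    return "".join(symbol * count for symbol, count in encoded)

compressed = rle_encode("AAAABBBCCD")
print(compressed)                 # [['A', 4], ['B', 3], ['C', 2], ['D', 1]]
print(rle_decode(compressed))     # 'AAAABBBCCD' - exactly the original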
4. Numerosity Reduction:
In this reduction technique the actual data is replaced with a mathematical model or a smaller representation of the data; it is then only necessary to store the model parameters. Alternatively, non-parametric methods such as clustering, histograms and sampling can be used.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide attributes of a continuous nature into data with intervals. Many constant values of the attributes are replaced by labels of small intervals, so that mining results can be shown in a concise and easily understandable way.
• Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attribute values and repeat this method recursively to the end, the process is known as top-down discretization, also called splitting.
• Bottom-up discretization –
If you first consider all the constant values as split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization, also called merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) to high-level concepts
(categorical variables such as middle age or Senior).
For numeric data following techniques can be followed:
• Binning –
Binning is the process of converting numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user (a minimal binning sketch follows this list).
• Histogram analysis –
Like binning, histogram analysis is used to partition the values of the attribute X into disjoint ranges called brackets. There are several partitioning rules:
1. Equal Frequency partitioning: Partitioning the values based on their number of occurrences in the data set.
2. Equal Width Partitioning: Partitioning the values into bins of a fixed width based on the number of bins, e.g. a set of values ranging from 0-20.
3. Clustering: Grouping the similar data together.
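As referenced above, a minimal binning sketch with pandas showing equal-width bins (pd.cut) and equal-frequency bins (pd.qcut) on made-up ages; the labels are illustrative.

import pandas as pd

ages = pd.Series([3, 7, 15, 21, 22, 30, 43, 45, 52, 61, 70])

# Equal-width binning: the value range is split into 3 intervals of equal width.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each bin receives (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))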
• Min-Max Normalization:
• This transforms the original data linearly.
• Suppose that min_A is the minimum and max_A is the maximum of an attribute A, and [new_min, new_max] is the new range.
We have the formula:
v' = ((v - min_A) / (max_A - min_A)) * (new_max - new_min) + new_min
• Where v is the value you want to map into the new range.
• v' is the new value you get after normalizing the old value.
Solved example:
Suppose the minimum and maximum values for the attribute profit (P) are Rs. 10,000 and Rs. 100,000, and we want to map the profit into the range [0, 1]. Using min-max normalization, the value Rs. 20,000 for the attribute profit is mapped to:
v' = ((20,000 - 10,000) / (100,000 - 10,000)) * (1 - 0) + 0 = 0.111
For example:
In z-score normalization, v' = (v - mean_A) / std_A. Let the mean of the attribute P be 60,000 and its standard deviation be 10,000. Using z-score normalization, a value of 85,000 for P is transformed to:
v' = (85,000 - 60,000) / 10,000 = 2.5
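A minimal sketch reproducing both normalizations in Python, using the values from the worked examples above:

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(min_max_normalize(20_000, 10_000, 100_000))   # 0.111...
print(z_score_normalize(85_000, 60_000, 10_000))    # 2.5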
Data discretization techniques can be used to divide the range of a continuous attribute into intervals. Numerous continuous attribute values are replaced by small interval labels.
Top-down discretization
If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then
repeats this recursively on the resulting intervals, then it is called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood
values to form intervals, then it is called bottom-up discretization or merging.
Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of the attribute values, known as
a concept hierarchy.
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of
abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives.
Data mining on a reduced data set means fewer input/output operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than
during mining.
1 Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning is an unsupervised discretization technique.
2 Histogram Analysis
Because histogram analysis does not use class information, it is an unsupervised discretization technique. Histograms partition the values of an attribute into disjoint ranges called buckets.
3 Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
• Tuning Production Strategies − The product strategies can be well tuned by repositioning the products and managing the product
portfolios by comparing the sales quarterly or yearly.
• Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.
• Operations Analysis − Data warehousing also helps in customer relationship management, and making environmental corrections. The
information also allows us to analyze business operations.
• Query-driven Approach
• Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and integrators on top of multiple
heterogeneous databases. These integrators are also known as mediators.
• Now these queries are mapped and sent to the local query processor.
• The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
• Query-driven approach needs complex integration and filtering processes.
• This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow update-driven approach rather than the traditional approach
discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
• The data is copied, processed, integrated, annotated, summarized and restructured in semantic data store in advance.
• Query processing does not require an interface to process data at local sources.
• Data Transformation − Involves converting the data from legacy format to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
The general idea of this approach is to materialize certain expensive computations that are frequently queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total-sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping on part alone, and so on.
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many cases because not every cell in each
dimension may have corresponding data in the database.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to make the best use of the
precomputed results stored in the data cube.
This model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme,
like sales and transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measure (such as
Rs_sold) and keys to each of the related dimensional tables.
Dimensions are the perspectives with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing the relationship between dimensions.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table, and the 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In this example, this is the total sales,
or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
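A minimal sketch of the cuboid idea with pandas on made-up sales records: grouping by all dimensions gives the base cuboid, grouping by fewer dimensions gives higher-level cuboids, and aggregating over everything gives the apex cuboid.

import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["PC", "TV", "PC", "TV"],
    "location": ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [605, 825, 680, 952],
})

base_cuboid = sales.groupby(["time", "item", "location"])["dollars_sold"].sum()  # lowest level of summarization
item_cuboid = sales.groupby(["item"])["dollars_sold"].sum()                      # a 1-D cuboid
apex_cuboid = sales["dollars_sold"].sum()                                        # highest level of summarization

print(base_cuboid, item_cuboid, apex_cuboid, sep="\n\n")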
OLAP is an acronym for Online Analytical Processing. Before we start explaining the basics of OLAP, you might wonder why OLAP can be handy for your organization. OLAP is a multi-dimensional database technology that permits quick data analysis on many data records. This analysis provides relevant information aimed at better decision taking, storytelling and planning. In summary, OLAP is a software technology that allows organizations to perform multidimensional analysis of collected data. It provides the capability for complex calculations, trend analysis and data modeling with one goal: understanding your business better.
When learning about OLAP, there is no getting away from the terms dimensions, cubes, measures and hierarchies. Here are some definitions that will make it easier to
understand their relevance.
1. Cubes
OLAP tools use multidimensional database structures, called cubes. An OLAP Cube, or a data cube, is a multidimensional data set that allows fast analysis of data, according
to the multiple dimensions you set up. You can compare a cube with a multidimensional spreadsheet: you can collect data from users and store that data in a transparent way
and calculate when needed. In order to form a cube you need dimensions.
2. Dimensions
Dimensions are lists of related items used to organize your data into similar categories, such as products, time and/or regions. Dimensions are the basis for the data structure of an OLAP data cube. For example, the months and quarters may make up your Year dimension. You can compare dimensions with the business parameters that you normally see in the rows and columns of a report. A model can consist of multiple dimensions.
In practice, the number of dimensions needs to be limited to roughly 12 in order to remain workable for end users and the calculation engine. Depending on the technology used, the number of dimensions can be higher without an impact on performance.
3. Measures
Each cube must have at least one measure, but in reality we see that cubes often contain multiple measures. An OLAP measure is a numeric value by which the dimensions are detailed or aggregated. It gives you information about the quantities you’re interested in. Do you have difficulties defining your OLAP measures? Ask yourself the question ‘how much…?’ and your answer will be your OLAP measure. Measures can be financial or non-financial, for example COAs, KPIs, FTEs, volumes, etc.
4. Hierarchies
Hierarchies are the subcategories of your dimensions. They have multiple levels and allow you to drill down or drill up your data. What is drilling, you may ask? Drilling allows
you to analyze your data at different levels of granularity. (example : total volume , volume by product group, volume by packaging by product group, volume by KSU by
packaging by product group ).
Conclusion
OLAP is a common technology behind many Business Intelligence and CPM applications and is still most relevant today. Using OLAP can help your organization with your
analyses, forecasting and planning. In short, it should contribute to better decision-making and eventually lead to more profit.
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP and statistical databases and OLTP.
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses
relational or extended-relational DBMS.
Multidimensional OLAP
MOLAP servers use array-based multidimensional storage engines to provide multidimensional views of data, rather than storing the data in a relational database.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed data, while the aggregations are stored separately in a MOLAP store.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in multidimensional data.
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions from the data cube are added.
• It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows
how slice works.
• Here Slice is performed for the dimension "time" using the criterion time = "Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data. Consider the
following diagram that shows the pivot operation.
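A minimal sketch of roll-up, slice, dice and pivot using pandas on made-up sales data; the dimension values and the city-to-country hierarchy are illustrative.

import pandas as pd

sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":     ["Chicago", "Toronto", "Chicago", "Toronto", "Vancouver", "Vancouver"],
    "country":  ["USA", "Canada", "USA", "Canada", "Canada", "Canada"],
    "item":     ["PC", "PC", "TV", "TV", "PC", "TV"],
    "dollars_sold": [440, 395, 605, 825, 555, 680],
})

# Roll-up: climb the location hierarchy from city to country.
roll_up = sales.groupby(["quarter", "country"])["dollars_sold"].sum()

# Slice: fix one dimension (time = "Q1") to obtain a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "PC")]

# Pivot: rotate the data axes of the view.
pivot = sales.pivot_table(index="item", columns="quarter", values="dollars_sold", aggfunc="sum")

print(roll_up, slice_q1, dice, pivot, sep="\n\n")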
OLAP vs OLTP
OLAP systems are used by knowledge workers such as executives, managers and analysts, whereas OLTP systems are used by clerks, DBAs, or database professionals.
Data warehouse design takes a different approach from view materialization in industry. It sees data warehouses as database systems with particular needs, such as answering management-related queries. The target of the design becomes how the records from multiple data sources should be extracted, transformed, and loaded (ETL) to be organized in a database as the data warehouse.
1. "top-down" approach
2. "bottom-up" approach
Developing new data marts from the data warehouse is very easy.
Data marts include the lowest-grain data and, if needed, aggregated data too. Instead of a normalized database for the data warehouse, a denormalized dimensional database is adopted to meet the data delivery requirements of data warehouses. Using this method, to use the set of data marts as the enterprise data warehouse, the data marts should be built with conformed dimensions in mind, meaning that common objects are represented the same way in different data marts. The conformed dimensions connect the data marts to form a data warehouse, which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data warehouse for a single subject,
takes far less time and effort than developing an enterprise-wide data warehouse. Also, the risk of failure is even less. This method is inherently
incremental. This method allows the project team to learn and grow.
Advantages of bottom-up design
It simply involves developing new data marts and then integrating them with the other data marts.
The locations of the data warehouse and the data marts are reversed in the bottom-up design approach.
Top-down design vs. bottom-up design:
• Top-down breaks the vast problem into smaller subproblems; bottom-up solves the essential low-level problems and integrates them into a higher one.
• Top-down is inherently architected, not a union of several data marts; bottom-up is inherently incremental and can schedule essential data marts first.
• Top-down may see quick results if implemented with iterations; bottom-up has less risk of failure, a favorable return on investment, and proof of techniques.
Simply defined, a data warehouse is a system that pulls together data from many different sources within an organization. On top of this
system, business users can create reports from complex queries that answer questions about business operations to improve business efficiency,
make better decisions, and even introduce competitive advantages.
It’s important to understand that a data warehouse is definitely different than a traditional database. Sure, data warehouses and databases are both
relational data systems, but they were definitely built to serve different purposes. A data warehouse is built to store large quantities of historical data
and enable fast, complex queries across all the data, typically using Online Analytical Processing (OLAP). A database was built to store current
transactions and enable fast access to specific transactions for ongoing business processes, known as Online Transaction Processing (OLTP).
So, data warehousing allows you to aggregate data, from various sources. This data, typically structured, can come from Online Transaction
Processing (OLTP) data such as invoices and financial transactions, Enterprise Resource Planning (ERP) data, and Customer Relationship
Management (CRM) data. Finally, data warehousing focuses on data relevant for business analysis, organizes and optimizes it to enable efficient
analysis.
Furthermore, there are some great benefits of leveraging a data warehouse architecture for your requirements. These include:
• Collects historical data from multiple periods and multiple data sources from across the organization, allowing strategic analysis
• Provides an easy interface for business analysts and data ready for analysis
Stepping back, data warehouse use-cases focus on providing high-level reporting and analysis that lead to more informed business decisions. Use-
cases include:
• Carrying out data mining to gain new insights from the information held in many large databases
Data modernization
Data warehouses are constantly evolving to support new technologies and business requirements—and remain relevant when it comes to big data
and analytics. This means that if you’re leveraging older data storage systems, you might be running into problems supporting new and advanced
data analytics solutions. And if you're trying to run a modern data operation solely on the back of a database, that can create a whole host of
issues.
This is a big point bridging from the previous note. It’s absolutely key to understand just how much is possible with big data platforms like Hadoop.
But can your current infrastructure support it? One big use-case for data warehouse design is the integration with big data systems like Hadoop. For
example, Panoply makes it easy to combine your Hadoop HDFS data into your Panoply data warehouse, giving you instant cloud access to your
HDFS data without any ETL or ELT process. HDFS, or the Hadoop Distributed File System, is an open source data storage software framework.
Panoply’s end-to-end data management solution is able to load Hadoop data into your Panoply data warehouse with only a few clicks, giving your
analysts and scientists instant access.
Data integration
One key aspect in working with data is the ability to integrate with other key systems. For example, maybe you have a lot of data in your data
warehouse architecture. How are you integrating it with data visualization? Or, do you have integration with things like reporting and analytics?
A great use-case for data warehousing is to integrate with amazing data services ranging from everything like business intelligence (BI), to data
visualization (Tableau). For example, you can quickly integrate Amazon Kinesis Firehose reporting and analysis into your data warehouse with the
Panoply Amazon Kinesis Firehose integration.
The future of data warehousing revolves around your ability to integrate and work with data. Leading data warehousing systems will allow you to
leverage integration as a great way to get the most of your data without requiring a complicated data infrastructure. Working with next-generation
data warehousing revolves around simplicity, more data delivery capabilities, and working to advance the business.
This might be a lot to take in, but it doesn’t have to be hard to get started. If you have data requirements that are being complicated by your current
systems, look at a data warehouse as a real-world option to make things easier. Remember, the future will be driven by your ability to
work with and house data, and this can’t be done with traditional databases. The organizations that capture the full benefits of data will be the
ones that can deeply understand the market and evolving customer requirements.
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based
characterization was first proposed in 1989 (KDD ’89 workshop), a few years before the introduction of
the data cube approach.
The data cube approach can be considered a data warehouse-based, precomputation-oriented,
materialized approach.
It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational
database query-oriented, generalization-based, on-line data analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on online aggregation
versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.
How is it done?
• Collect the task-relevant data (initial relation) using a relational database query.
• Perform generalization by attribute removal or attribute generalization.
• Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
• Reduce the size of the generalized data set.
• Present the results interactively to users.
Data focusing:
• Analyzing task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal:
• To remove attribute A if there is a large set of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher-level concepts are expressed in terms of other attributes.
Attribute-generalization:
• If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then
select an operator and generalize A.
Attribute-threshold control:
• If the number of distinct values of a generalized attribute is still larger than the attribute generalization threshold (typically 2 to 8), further generalization is performed; the threshold may be left at a default or specified by the user.
InitialRel:
• It is nothing but query processing of task-relevant data and deriving the initial relation.
PreGen:
• It is based on the analysis of the number of distinct values in each attribute and to determine the
generalization plan for each attribute: removal? or how high to generalize?
PrimeGen:
• It is based on the PreGen plan and performing the generalization to the right level to derive a “prime
generalized relation” and also accumulating the counts.
Presentation:
• User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
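As a rough sketch of these steps in code (the attribute names, concept-hierarchy mappings, and data below are made up for illustration and are not the University data used in the example that follows), attribute removal, attribute generalization, and merging of identical generalized tuples with accumulated counts might look like this in Python:

from collections import Counter

# Hypothetical task-relevant data (initial relation).
initial_relation = [
    {"name": "A. Rao",  "birth_place": "Hyderabad", "gpa": 3.7},
    {"name": "B. Khan", "birth_place": "Mumbai",    "gpa": 3.4},
    {"name": "C. Lee",  "birth_place": "Seoul",     "gpa": 3.8},
    {"name": "D. Devi", "birth_place": "Hyderabad", "gpa": 3.9},
]

# A tiny concept hierarchy: city -> country.
city_to_country = {"Hyderabad": "India", "Mumbai": "India", "Seoul": "South Korea"}

def generalize_gpa(gpa):
    # A tiny concept hierarchy for GPA: numeric value -> range label.
    return "excellent" if gpa >= 3.5 else "good"

def aoi(relation):
    generalized = []
    for row in relation:
        generalized.append((
            # Attribute removal: 'name' is dropped (many distinct values,
            # no generalization operator defined on it).
            city_to_country[row["birth_place"]],   # attribute generalization
            generalize_gpa(row["gpa"]),            # attribute generalization
        ))
    # Merge identical generalized tuples and accumulate their counts.
    return Counter(generalized)

print(aoi(initial_relation))
# Counter({('India', 'excellent'): 2, ('India', 'good'): 1, ('South Korea', 'excellent'): 1})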
Example
Let's say there is a University database that is to be characterized. Its corresponding DMQL query would
be:
use University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
InitialRel:
PreGen
• Now, we have generalized these results by removing a few attributes and retaining the important attributes.
• We have also generalized a few attributes, renaming them "Country" rather than "Birth_Place", "Age_Range"
rather than "Birth_Date", "City" rather than "Residence", and so on, as per the table given below.
PrimeGen
• Based on the PreGen plan we've performed generalization to the right level to derive a “prime
generalized relation” and also we've accumulated the counts.
Final Results
• Now we have analyzed and concluded our final generalized results as shown below.
Presentation Of Results
Generalized relation:
• Relations where some or all attributes are generalized, with counts or other aggregation values
accumulated.
Cross-tabulation:
• Mapping the generalized results into cross-tabulation (cross-tab) form.
Visualization techniques:
• Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
• Mapping the generalized results into characteristic rules with the associated quantitative information.
Summary
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization – based
characterization was first proposed in 1989 (KDD ‘89 workshop) a few years before the introduction of
the data cube approach.
Example
Prediction cubes for identification of interesting cube subspaces
Suppose a company has a customer table with the attributes time (with two granularity
levels: month and year), location (with two granularity levels: state and country), gender, salary, and one
class-label attribute: valued_customer. A manager wants to analyze the decision process of whether a
customer is highly valued with respect to time and location. In particular, he is interested in the
question “Are there times at and locations in which the value of a customer depended greatly on the
customer’s gender?” Notice that he believes time and location play a role in predicting valued customers,
but at what granularity levels do they depend on gender for this task? For example, is performing analysis
using {month, country} better than {year, state}?
Consider a data table D (e.g., the customer table). Let X be the set of attributes for which no concept hierarchy has been
defined (e.g., gender, salary). Let Y be the class-label attribute (e.g., valued_customer), and Z be the set of multilevel
attributes, that is, attributes for which concept hierarchies have been defined (e.g., time, location). Let V be the set of
attributes for which we would like to define their predictiveness. In our example, this set is {gender}. The predictiveness of V on a data subset
can be quantified by the difference in accuracy between the model built on that subset using X to predict Y
and the model built on that subset using X − V (e.g., {salary}) to predict Y. The intuition is that, if the
difference is large, V must play an important role in the prediction of class label Y.
Given a set of attributes, V, and a learning algorithm, the prediction cube at a given granularity (e.g., ⟨year, state⟩) is
a d-dimensional array, in which the value in each cell (e.g., [2010, Illinois]) is the predictiveness
of V evaluated on the subset defined by the cell (e.g., the records in the customer table with time in 2010
and location in Illinois).
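A minimal sketch of this predictiveness measure for a single cell, assuming a small pandas DataFrame stands in for the customer table and taking scikit-learn's decision tree as an arbitrary choice of learning algorithm:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def predictiveness(df, X_attrs, V_attrs, y_attr, cv=3):
    # Accuracy of a model built on X minus accuracy of a model built on X - V.
    y = df[y_attr]
    X_full = pd.get_dummies(df[X_attrs])
    X_minus_V = pd.get_dummies(df[[a for a in X_attrs if a not in V_attrs]])
    acc_full = cross_val_score(DecisionTreeClassifier(), X_full, y, cv=cv).mean()
    acc_minus_V = cross_val_score(DecisionTreeClassifier(), X_minus_V, y, cv=cv).mean()
    return acc_full - acc_minus_V

# Tiny synthetic stand-in for the customer table (hypothetical columns and values).
customers = pd.DataFrame({
    "year":   [2010] * 6 + [2011] * 6,
    "state":  ["Illinois"] * 12,
    "gender": ["M", "F"] * 6,
    "salary": [40, 60, 45, 65, 50, 70, 42, 58, 47, 63, 52, 68],
    "valued_customer": ["no", "yes"] * 6,
})

# Predictiveness of V = {gender} on the subset defined by the cell [2010, Illinois].
cell = customers[(customers["year"] == 2010) & (customers["state"] == "Illinois")]
print(predictiveness(cell, ["gender", "salary"], ["gender"], "valued_customer"))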
Supporting OLAP roll-up and drill-down operations on a prediction cube is a computational challenge
requiring the materialization of cell values at many different granularities. For simplicity, we can consider
only full materialization. A naïve way to fully materialize a prediction cube is to exhaustively build models
and evaluate them for each cell and granularity. This method is very expensive if the base data set is large.
An ensemble method called Probability-Based Ensemble (PBE) was developed as a more feasible
alternative. It requires model construction for only the finest-grained cells. OLAP-style bottom-up
aggregation is then used to generate the values of the coarser-grained cells.
The prediction of a predictive model can be seen as finding a class label that maximizes a scoring function.
The PBE method was developed to approximately make the scoring function of any predictive model
distributively decomposable. In our discussion of data cube measures in Section 4.2.4, we showed that
distributive and algebraic measures can be computed efficiently. Therefore, if the scoring function used is
distributively or algebraically decomposable, prediction cubes can also be computed with efficiency. In
this way, the PBE method reduces prediction cube computation to data cube computation.
For example, previous studies have shown that the naïve Bayes classifier has an algebraically
decomposable scoring function, and the kernel density–based classifier has a distributively decomposable
scoring function. Therefore, either of these could be used to implement prediction cubes efficiently. The
PBE method presents a novel approach to multidimensional data mining in cube space.
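As a rough illustration of why a distributive score keeps this cheap (this sketch shows only the aggregation idea, not the actual PBE algorithm), per-class counts computed once for the finest-grained cells can be rolled up by simple summation to serve coarser-grained cells:

from collections import Counter, defaultdict

# Class counts per finest-grained cell (month, state); the numbers are invented.
fine_cells = {
    ("2010-01", "Illinois"): Counter({"yes": 30, "no": 70}),
    ("2010-02", "Illinois"): Counter({"yes": 45, "no": 55}),
    ("2010-01", "Ohio"):     Counter({"yes": 10, "no": 90}),
}

def roll_up(cells, key_fn):
    # Distributive aggregation: coarser counts are sums of finer counts.
    coarse = defaultdict(Counter)
    for key, counts in cells.items():
        coarse[key_fn(key)] += counts
    return coarse

# Roll up (month, state) -> (year, state) without revisiting the base data.
year_state = roll_up(fine_cells, lambda k: (k[0][:4], k[1]))
print(year_state[("2010", "Illinois")])   # Counter({'no': 125, 'yes': 75})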
There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends −
• Classification
• Prediction
Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For
example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction
model to predict the expenditures in dollars of potential customers on computer equipment, given their income and
occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which
are safe.
• A marketing manager at a company needs to analyze a customer with a given profile and determine whether that customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky
or safe for loan application data and yes or no for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company.
In this example we need to predict a numeric value, so the data analysis task is an example of numeric
prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered
value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
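As a small illustration, a numeric predictor can be fitted with ordinary least-squares regression; the incomes, occupation codes, and expenditures below are invented purely for the example:

from sklearn.linear_model import LinearRegression

# Hypothetical training data: [income in $1000s, occupation code] -> spend in $.
X = [[45, 0], [60, 1], [75, 1], [90, 2], [120, 2]]
y = [300, 520, 640, 810, 1050]

predictor = LinearRegression().fit(X, y)
print(predictor.predict([[80, 1]]))   # predicted expenditure for a new customer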
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the randomness or impurity in
data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called Entropy Reduction. Building a
decision tree is all about discovering attributes that return the highest data gain.
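Both measures are easy to write down directly; a small sketch with made-up class labels:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels: -sum(p_i * log2(p_i)).
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    # Entropy of the parent set minus the weighted entropy of its split groups.
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

parent = ["yes", "yes", "yes", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no"]]   # a perfect split on some attribute
print(entropy(parent))                          # about 0.971
print(information_gain(parent, split))          # about 0.971, since each group has entropy 0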
In short, a decision tree is just like a flow chart diagram with the terminal nodes showing decisions. Starting with the
dataset, we can measure the entropy to find a way to segment the set until the data belongs to the same class.
It provides us with a framework to measure the values of outcomes and the probabilities of achieving them.
It helps us make the best decisions based on existing data and best speculations.
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an extensive
collection of records into smaller sets of records by applying a sequence of simple decision rules. A decision tree
model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, or
mutually exclusive classes. The attributes can be any type of variable (nominal, ordinal, binary, or
quantitative), whereas the classes must be of a qualitative type, such as categorical, ordinal, or binary. In brief,
given the data of attributes together with its class, a decision tree creates a set of rules that can be used to identify the
class. One rule is applied after another, resulting in a hierarchy of segments within a segment. The hierarchy is
known as the tree, and each segment is called a node. With each progressive division, the members of the
resulting sets become more and more similar to each other. Hence, the algorithm used to build a decision tree is
referred to as recursive partitioning. The algorithm is known as CART (Classification and Regression Trees).
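A minimal sketch of recursive partitioning in practice, using scikit-learn's CART-style DecisionTreeClassifier on invented loan-applicant data:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan applicants: [income in $1000s, years employed] -> class label.
X = [[25, 1], [40, 3], [60, 8], [80, 10], [30, 2], [90, 12]]
y = ["risky", "risky", "safe", "safe", "risky", "safe"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["income", "years_employed"]))
print(tree.predict([[55, 6]]))   # classify a new applicant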
Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%), which leads to $8 million profit, and
the probability of a bad economy is 0.4 (40%), which leads to $6 million profit.
Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to $4 million profit, and
the probability of a bad economy is 0.4 (40%), which leads to $2 million profit.
The management team needs to take a data-driven decision on whether to expand or not, based on the given data.
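Working the numbers through as a simple expected-value calculation (assuming the quoted profits are gross of the expansion cost):

# Expand: pay $3M, then 60% chance of $8M profit and 40% chance of $6M profit.
ev_expand = 0.6 * 8 + 0.4 * 6 - 3        # = 4.2 ($ million)

# Do not expand: no cost, 60% chance of $4M and 40% chance of $2M profit.
ev_no_expand = 0.6 * 4 + 0.4 * 2         # = 3.2 ($ million)

print("Expand" if ev_expand > ev_no_expand else "Do not expand")   # Expand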
Initially, D is the entire set of training tuples and their associated class labels (the input training data).
Missing values in the data do not influence the process of building a decision tree to any considerable extent.
A decision tree model is automatic and simple to explain to the technical team as well as to stakeholders.
Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
Baye's Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior Probability [P(H/X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis. According to Bayes' Theorem, P(H/X) = P(X/H) P(H) / P(X).
In a Bayesian belief network, each node represents a random variable and each arc represents a probabilistic dependence; the arcs allow
the representation of causal knowledge. For example, lung cancer is influenced by a person's
family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable
PositiveXray is independent of whether the patient has a family history of lung cancer or that the patient is a smoker,
given that we know the patient has lung cancer.
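A tiny numeric illustration of Bayes' theorem in the spirit of the lung cancer example; all of the probabilities below are invented:

# Hypothetical numbers: H = "patient has lung cancer", X = "positive X-ray".
p_h = 0.01              # prior P(H)
p_x_given_h = 0.90      # likelihood P(X|H)
p_x_given_not_h = 0.05  # false-positive rate P(X|not H)

# Total probability of a positive X-ray, then Bayes' rule for the posterior.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # about 0.154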
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form
−
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
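A rule such as R1 can be applied to a tuple mechanically; a small sketch with a hypothetical attribute-value layout:

# R1: IF age = youth AND student = yes THEN buys_computer = yes
rule_R1 = {"antecedent": {"age": "youth", "student": "yes"},
           "consequent": ("buys_computer", "yes")}

def apply_rule(rule, tuple_):
    # Return the class prediction if the antecedent is satisfied, else None.
    if all(tuple_.get(attr) == value for attr, value in rule["antecedent"].items()):
        return rule["consequent"]
    return None

print(apply_rule(rule_R1, {"age": "youth", "student": "yes", "income": "high"}))
# ('buys_computer', 'yes')
print(apply_rule(rule_R1, {"age": "senior", "student": "yes"}))   # None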
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Sequential covering algorithm (rules are learned one at a time, and the tuples covered by each rule are removed):
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: a set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
Rule_set = Rule_set + Rule; // add the new rule to the rule set
until termination condition;
end for
return Rule_set;
Rule Pruning
Rules are pruned for the following reasons −
• The assessment of quality is made on the original set of training data. The rule may perform well on training data
but less well on subsequent data. That's why rule pruning is required.
• The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as
assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for
the pruned version of R, then we prune R.
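In code, the measure and the pruning test look like the following sketch (the pos/neg counts are invented):

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg) for the tuples covered by rule R.
    return (pos - neg) / (pos + neg)

# Keep the pruned rule only if its value on the pruning set is higher.
original = foil_prune(pos=80, neg=20)   # 0.60
pruned = foil_prune(pos=78, neg=12)     # about 0.73
print(pruned > original)                # True here, so the pruned version of R is preferred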
Some practical ways to improve the accuracy of a classification model:
1 - Cross-validation: Separate your training dataset into groups, always hold one group out for evaluation, and change the held-out group in each
execution. Then you will know which data trains a more accurate model.
2 - Cross-dataset: The same as cross-validation, but using different datasets.
3 - Tuning your model: This basically means changing the parameters you use to train your classification model (the best settings depend on which
classification algorithm you are using).
4 - Improve, or use (if you're not already using) a normalization process: Discover which techniques (changing the geometry, colors, etc.) will
provide more consistent data for training.
5 - Understand the problem you're treating better, and try to implement other methods to solve the same problem. There is always more
than one way to solve the same problem, and you may not be using the best approach.
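As an example of point 1, k-fold cross-validation with scikit-learn on a built-in dataset; the choice of classifier here is arbitrary:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for evaluation.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores.mean(), scores.std())   # average accuracy and its spread across folds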
::: Classification by Back Propagation
The backpropagation (BP) algorithm learns the classification model by training a multilayer feed-forward
neural network. The generic architecture of the neural network for BP is shown in the following diagrams, with
one input layer, some hidden layers, and one output layer. Each layer contains some units or perceptrons.
Each unit might be linked to others by weighted connections. The values of the weights are initialized before
the training. The number of units in each layer, number of hidden layers, and the connections will be
empirically defined at the very start.
The training tuples are assigned to the input layer. Each unit in the input layer calculates its result with
a certain function and the input attributes from the training tuple, and this output then serves as the input
for the hidden layer; the calculation happens layer by layer. As a consequence, the output of
the network consists of the outputs of the units in the output layer.
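A minimal NumPy sketch of a feed-forward network trained with backpropagation (one hidden layer of sigmoid units learning XOR; this is an illustrative architecture, not the one in the missing diagrams):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Training tuples (XOR) assigned to the input layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights and biases initialized before training: input->hidden and hidden->output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for _ in range(20000):
    # Forward pass: each layer's output serves as the next layer's input.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: propagate the error from the output layer back to the hidden layer.
    delta_out = (output - y) * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # Gradient-descent updates of weights and biases.
    W2 -= lr * hidden.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ delta_hidden
    b1 -= lr * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(output.ravel(), 2))   # should approach [0, 1, 1, 0]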
::: Classification by K-Nearest Neighbours (KNN)
Given a training set of classified data points and another set of data points (also called testing data), we allocate these points to a group by analyzing the
training set. Note that the unclassified points are marked as ‘White’.
Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified
point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point
close to a cluster of points classified as ‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the second point (5.5, 4.5)
should be classified as ‘Red’.
Algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[]. This means each element of this array represents a
tuple (x, y).
2. for i=0 to m:
3. Calculate Euclidean distance d(arr[i], p).
4. Make set S of K smallest distances obtained. Each of these distances corresponds to an already classified
data point.
5. Return the majority label among S.
K can be kept as an odd number so that we can calculate a clear majority in the case where only two groups are
possible (e.g. Red/Blue). With increasing K, we get smoother, more defined boundaries across different
classifications. Also, the accuracy of the above classifier increases as we increase the number of data points in
the training set.
Example Program
Assume 0 and 1 as the two classifiers (groups).
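A small stand-in program that follows the algorithm above, with k = 3 and made-up training points for the two groups:

from collections import Counter
from math import dist

# Training samples: (x, y) point -> group label (0 or 1).
train = [((1.0, 1.5), 0), ((2.0, 1.0), 0), ((1.5, 2.5), 0),
         ((5.0, 5.0), 1), ((6.0, 4.5), 1), ((5.5, 6.0), 1)]

def knn_classify(p, train, k=3):
    # Sort the training points by Euclidean distance to p and keep the k nearest.
    nearest = sorted(train, key=lambda item: dist(item[0], p))[:k]
    # Return the majority label among the k nearest neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print("The value classified to unknown point is", knn_classify((2.5, 2.0), train))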
Output:
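The value classified to unknown point is 0 (for the sketch above)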
7. Cluster Analysis
::: Basic Concepts and issues in clustering
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster
and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign
the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful
features that distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition
will represent a cluster, and k ≤ n. This means that the method classifies the data into k groups, which satisfy the following
requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to
another, as in the sketch below.
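k-means is the best-known partitioning method built on this iterative relocation idea; a minimal sketch with scikit-learn on made-up two-dimensional points:

from sklearn.cluster import KMeans

# Made-up 2-D objects to be partitioned into k = 2 clusters.
points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment of each object
print(km.cluster_centers_)  # final cluster centroids after iterative relocation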
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods
on the basis of how the hierarchical decomposition is formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It
keeps merging the objects or groups that are close to one another. It keeps doing so until all of the groups are
merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In
continuous iterations, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination
condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density
in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, a neighborhood of a given radius
has to contain at least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates
the clusters by clustering the density function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics, taking
outlier or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A constraint
refers to the user expectation or the properties of desired clustering results. Constraints provide us with an interactive
way of communication with the clustering process. Constraints can be specified by the user or the application requirement.
Types of variables in cluster analysis:
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
First of all, let us know what types of data structures are widely used in cluster analysis.
We shall know the types of data that often occur in cluster analysis and how to preprocess them for such analysis.
Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries,
and so on.
Main memory-based clustering algorithms typically operate on either of the following two data structures.
Types of data structures in cluster analysis are
Data Matrix
This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age,
height, weight, gender, race and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p
variables)
The Data Matrix is often called a two-mode matrix since the rows and columns of this represent the different entities.
Dissimilarity Matrix
This stores a collection of proximities that are available for all pairs of the n objects. It is often represented by an n-by-n
table, where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a non-negative
number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they
differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the matrix shown in the figure.
This is also called a one-mode matrix, since its rows and columns represent the same entity.
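Both structures are straightforward to build in code; a sketch with SciPy that derives the dissimilarity matrix from a small made-up data matrix, using Euclidean distance for d(i, j):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (rows) by p = 2 variables (columns).
data_matrix = np.array([[1.0, 2.0],
                        [2.0, 2.0],
                        [8.0, 9.0],
                        [9.0, 9.0]])

# Dissimilarity matrix: n-by-n, d(i, j) = Euclidean distance, with d(i, i) = 0.
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(np.round(dissimilarity_matrix, 2))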
Interval-Scaled Variables
Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and
weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure.
In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on
the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing
measurements attempts to give all variables an equal weight.
This is especially useful when given no prior knowledge of the data. However, in some applications, users may
intentionally want to give more weight to a certain set of variables than to others.
For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.
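A small sketch of such standardization, computing z-scores with the mean absolute deviation (one common, outlier-tolerant choice) on made-up height and weight measurements:

import numpy as np

# Made-up measurements: height in metres and weight in kilograms.
data = np.array([[1.60, 55.0],
                 [1.75, 80.0],
                 [1.90, 95.0]])

# Standardize each variable f: z = (x - mean_f) / s_f,
# where s_f is the mean absolute deviation of variable f.
mean = data.mean(axis=0)
mad = np.abs(data - mean).mean(axis=0)
z_scores = (data - mean) / mad

print(np.round(z_scores, 2))   # both variables now carry comparable weight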
Binary Variables
For example, generally, gender variables can take 2 variables male and female.
Let p=a+b+c+d
Nominal (Categorical) Variables
A nominal variable is a generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
The dissimilarity between two objects i and j can be computed based on simple matching:
d(i, j) = (p − m) / p, where m is the number of matches (i.e., the number of variables for which i and j are in the same state) and p is the total number of variables.
Ordinal Variables
An ordinal variable resembles a nominal variable, except that its states are ordered in a meaningful sequence (e.g., ranks). It can be handled like an interval-scaled variable by replacing each value x_if with its rank r_if and mapping the rank onto [0, 1] via z_if = (r_if − 1) / (M_f − 1), where M_f is the number of ordered states.
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an exponential scale, such
as Ae^(Bt) or Ae^(-Bt).
Methods:
• First, treat them like interval-scaled variables: not a good choice! (why?)
• Then, apply a logarithmic transformation, i.e., y = log(x).
• Finally, treat them as continuous ordinal data and treat their ranks as interval-scaled.
Agglomerative hierarchical clustering proceeds as follows:
• Consider every data point as an individual cluster.
• Calculate the similarity of each cluster with all the other clusters (i.e., compute the proximity matrix).
• Merge the clusters that are highly similar or close to each other.
• Recalculate the proximity matrix for the new clusters.
• Repeat the previous two steps until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note:
This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and all the
proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, F.
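A sketch of the same procedure using SciPy's hierarchical clustering routines, with made-up coordinates for the six points and single-link proximity as an arbitrary choice:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Made-up 2-D coordinates for the six data points.
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[1, 1], [1.5, 1.2], [5, 5], [5.5, 5.3], [9, 1], [9.2, 1.4]])

# Agglomerative clustering: every point starts as its own cluster, and the
# closest pair of clusters is merged repeatedly.
merge_history = linkage(points, method="single")

dendrogram(merge_history, labels=labels)   # graphical representation of the merges
plt.show()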
The computational complexity of most clustering algorithms is at least linearly proportional to the size of the data set. The great advantage of grid-based clustering is
its significant reduction of the computational complexity, especially for clustering very large data sets.
The grid-based clustering approach differs from the conventional clustering algorithms in that it is concerned not with the data points but with the value space that
surrounds the data points. In general, a typical grid-based clustering algorithm consists of the following five basic steps (Grabusts and Borisov, 2002):
1. Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying cluster centers.
5. Traversal of neighbor cells.
STING
Wang et al. (1997) proposed a STatistical INformation Grid-based clustering method (STING) to cluster spatial databases. The algorithm can be used to facilitate
several kinds of spatial queries. The spatial area is divided into rectangle cells, which are represented by a hierarchical structure. Let the root of the hierarchy be at
level 1, its children at level 2, etc. The number of layers could be obtained by changing the number of cells that form a higher-level cell. A cell in level i corresponds to
the union of the areas of its children in level i + 1. In the algorithm STING, each cell has 4 children and each child corresponds to one quadrant of the parent cell. Only
two-dimensional spatial space is considered in this algorithm. Some related work can be found in Wang et al. (1999b).