DWDM Material
Database models:
A flat file database is a database that stores information in a single file or table. In a text file, every line contains one record, and the fields either have a fixed length or are separated by commas, whitespace, tabs or some other character. In a flat file database there is no structural relationship among the records, nor can it contain multiple tables.
Advantages:
A flat file database is simple to understand, easy to set up, and needs no special database software.
Hierarchical database:
In a hierarchical database, the entity type is the main table, rows of a table represent the records and columns represent the attributes. A typical example is a CUSTOMER record as the parent with two children (CHCKACCT & SAVEACCT).
Advantages:
In a hierarchical database, access to information is fast because of the predefined paths. This increases the performance of the database.
The relationships among different entities are easy to understand.
Disadvantages:
The hierarchical database model lacks flexibility. If a new relationship is to be established between two entities, a new and possibly redundant database structure has to be built.
Maintenance of data is inefficient in a hierarchical model. Any change in the relationships may require manual reorganization of the data.
This model is also inefficient for non-hierarchical accesses.
Network database (1970s – 1990s):
The inventor of the network model is Charles Bachman. Unlike the hierarchical database model, a network database allows multiple parent and child relationships, i.e., it maintains many-to-many relationships. A network database is basically a graph structure. The network database model was created to achieve three main objectives:
To represent complex data relationships more effectively.
To improve the performance of the database.
To implement a database standard.
In a network database a relationship is referred to as a set. Each set comprises two types of records: an owner record, which is the same as the parent type in the hierarchical model, and a member record, which is similar to the child type record in the hierarchical database model.
Advantages:
The network database model makes data access easy and efficient, since an application can access an owner record and all the member records within a set.
This model is conceptually easy to design.
This model ensures data integrity because no member can exist without an owner, so the user must create the owner entry first and then the member records.
The network model also ensures the data independence because the application works independently of the data.
Disadvantages:
The model lacks structural independence, which means that any change in the database structure also requires the application program to be modified before it can access the data.
A user-friendly database management system is difficult to build on the network model.
Implementation of network database:
Network database is implemented in:
Digital Equipment Corporation DBMS-10
Digital Equipment Corporation DBMS-20
RDM Embedded
Turbo IMAGE
Univac DMS-1100 etc.
Relational database (1980s – present):
The relational database model was proposed by E.F. Codd. After the hierarchical and network models, the birth of this model was a huge step forward. It allows entities to be related through a common attribute, so in order to relate two tables (entities) they simply need to share a common attribute. Tables contain primary keys and foreign keys, and a foreign key in one table refers to the primary key of another; this is what forms the relationship and makes the model extremely flexible.
Thus, using a relational database, ample information can be stored in small tables. Access to the data is also very efficient: the user only has to enter a query, and the application returns the requested information.
Relational databases are established using a computer language, Structured Query Language (SQL). This language forms the basis of all the
database applications available today, from Access to Oracle.
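As an illustration, here is a minimal sketch of relating two tables through a common attribute and querying them with SQL, using Python's built-in sqlite3 module; the customer/orders tables and their columns are purely hypothetical.

import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 2500.0), (11, 1, 400.0), (12, 2, 999.0)])

# The common attribute cust_id relates the two tables.
cur.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customer c JOIN orders o ON c.cust_id = o.cust_id
    GROUP BY c.name
""")
print(cur.fetchall())   # [('Asha', 2900.0), ('Ravi', 999.0)]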
Advantages:
Relational databases support mathematical set operations like union, intersection, difference and Cartesian product. They also support select, project, relational join and division operations.
Relational databases use normalization, which helps to achieve data independence more easily.
Security control can also be implemented more effectively by imposing an authorization control on the sensitive attributes present in a
table.
Relational database uses a language which is easy and human readable.
Disadvantages:
The response to a query becomes time-consuming and inefficient as the number of tables among which relationships are established increases.
Implementations of the relational database include:
Oracle
Microsoft
IBM
MySQL
PostgreSQL
SQLite
Object-oriented database (1990s – present):
An object-oriented database management system is a database system in which the data or information is represented in the form of objects, much as in an object-oriented programming language. Furthermore, object-oriented DBMSs also facilitate the user by offering transaction support, a query language, and indexing options. These database systems can also handle data efficiently over multiple servers.
Unlike a relational database, an object-oriented database works within the framework of real programming languages like Java or C++.
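As a rough illustration of persisting objects directly, here is a minimal sketch using Python's standard shelve module; a real object-oriented DBMS offers far more (queries, indexing, transactions), so this only conveys the idea that objects need no mapping to rows and columns.

import shelve

class Customer:
    # An ordinary application object; no translation to tables is needed.
    def __init__(self, cust_id, name, accounts):
        self.cust_id = cust_id
        self.name = name
        self.accounts = accounts   # e.g. ["CHCKACCT", "SAVEACCT"]

with shelve.open("customers_demo") as db:
    db["1"] = Customer(1, "Asha", ["CHCKACCT", "SAVEACCT"])   # store the object as-is

with shelve.open("customers_demo") as db:
    c = db["1"]
    print(c.name, c.accounts)   # the stored object comes back as an object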
Advantages:
If there are complex (many-to-many) relationships between the entities, the object-oriented database handles them much faster than any
of the above discussed database models.
Navigation through the data is much easier.
Objects do not require assembly or disassembly hence saving the coding and execution time.
Disadvantages:
Lower efficiency when the data and relationships are simple.
Data can only be accessed via a specific language using a particular API, which is not the case with relational databases.
Object-relational database (1990s – present):
Defined in simple terms, an object-relational database management system presents an object-oriented user view on top of an already implemented relational database management system. When software interacts with this modified database management system, it customarily operates as though the data were stored as objects.
The basic working of this database management system is that it translates the useful data into organized tables, distributed in rows and columns, and from then onwards it manages the data the same way as a relational database system. Similarly, when the data is to be accessed by the user, it is translated back from the tabular form into objects.
Advantages:
Data remains encapsulated in object-relational database.
Concept of inheritance and polymorphism can also be implemented in this database.
Disadvantages:
Object relational database is complex.
Proponents of the relational approach believe the simplicity and purity of the relational model are lost.
It is costly as well.
Data Warehouse
According to The Data Warehouse Institute, a data warehouse is the foundation for a successful BI program. The concept of data warehousing is pretty easy
to understand—to create a central location and permanent storage space for the various data sources needed to support a company’s analysis, reporting
and other BI functions.
But a data warehouse also costs money — big money. The problem is when big money is involved it’s tough to justify spending it on any project, especially
when you can’t really quantify the benefits upfront. When it comes to a data warehouse, it’s not easy to know what the benefits are until it’s up and running.
According to BI-Insider.com, here are the key benefits of a data warehouse once it’s launched.
By providing data from various sources, managers and executives will no longer need to make business decisions based on limited data or their gut. In
addition, “data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial
management, and sales.”
Since business users can quickly access critical data from a number of sources—all in one place—they can rapidly make informed decisions on key
initiatives. They won’t waste precious time retrieving data from multiple sources.
Not only that but the business execs can query the data themselves with little or no support from IT—saving more time and more money. That means the
business users won’t have to wait until IT gets around to generating the reports, and those hardworking folks in IT can do what they do best—keep the
business running.
A data warehouse stores large amounts of historical data so you can analyze different time periods and trends in order to make future predictions. Such data
typically cannot be stored in a transactional database or used to generate reports from a transactional system.
Finally, the piece de resistance—return on investment. Companies that have implemented data warehouses and complementary BI systems have generated
more revenue and saved more money than companies that haven’t invested in BI systems and data warehouses.
And that should be reason enough for senior management to jump on the data warehouse bandwagon.
Data Mining
What is the meaning of data mining? Data mining is the process by which companies turn raw data into useful information. In order to develop effective marketing strategies, and thereby decrease costs and increase sales, businesses use software to look for patterns in large batches of data. Data mining depends heavily on data warehousing, computer processing, and effective data collection. These processes are useful for building machine learning models that power applications like search engine technology and programs that recommend websites.
Data mining is used in various fields like research, business, marketing, sales, product development, education, and healthcare. When used appropriately, data mining provides a major advantage over competing establishments by providing more information about customers and by helping to develop better and more effective marketing strategies, which raise revenue and lower costs. In order to achieve excellent results from data mining, a number of tools and techniques are required.
• Data cleansing and preparation- in this step data is transformed into a form suitable for further processing and analysis, for example by identifying and removing errors and handling missing data.
• Artificial intelligence (AI)- these systems perform analytical activities associated with human intelligence, such as reasoning, planning, learning, and problem-solving.
• Association rule learning- also known as market basket analysis, these tools search the dataset for relationships between variables, such as determining which products customers tend to purchase together.
• Clustering- a process in which the dataset is partitioned into sets of related items called clusters, which help users understand the structure or natural groups in the data.
• Classification- this technique assigns the items in the dataset to target classes, with the goal of predicting the class of each case in the data.
• Data analytics- data analytics is the process of evaluating digital information and converting it into information useful for
business.
• Data warehousing- a large collection of data that is of foundational importance to most large-scale data mining efforts and is used for decision making in organizations.
• Machine learning- a computer programming technique that uses statistical probabilities to give the computer the capacity to ‘learn’ without being explicitly programmed.
• Regression- a technique used to predict a range of numeric values, such as sales, the price of a stock, or temperatures, based on a particular dataset.
Concepts of customers include, for example, big spenders and budget spenders, characterized by how they purchase items.
The classes and concepts must be described in clear and concise terms; this is known as a "class / concept description".
How To Find Such Descriptions ?
There are three ways to find class / concept description.
Data Characterization – This is summarizing the data of target class based on features.
Data Discrimination – This compares the target class with one or more comparative classes called the contrasting classes.
The third option is to use both data characterization and data discrimination.
What Are The Methods Of Data Characterization ?
There are simple methods to characterize the data. One is simple summaries based on statistical measures. Second is the roll-up operation on OLAP data
cube that also summarizes the data. The third method is to use the attribute oriented induction technique.
The output of characterization can be presented in various forms. For example, Pie Charts, Bar Charts, Curves, Multi-dimensional data cubes and multi-
dimensional tables.
The resulting descriptions can also be presented as generalized relations or in rule form called characteristic rules.
Data Characterization Example
Q1. Produce a description summarizing the characteristics of customers who spend Rs 3000 in the last 6 months.
The result could be a general profile of such customers, for example 40-50 years old, employed, with a good credit rating, and so on.
Data Discrimination Example
Q2. Compare the features of software products whose sales went up by 10% in the last year with those of software products whose sales went down by 30%.
The resultant description is similar to that of data characterization; however, it also includes comparative measures that distinguish the target class from the contrasting class.
Frequent Sub-sequence
A customer buying first a PC, then a camera, and then a memory card is an example of a frequent sub-sequence.
Frequent Sub-structure
The sub-structure means different structural forms such as graphs, trees, and lattices, that are combined with item-sets and/or sub-sequences.
Association Analysis
Association analysis identifies relationships between items that are bought together during transactions. Consider the following example.
Example: A store wants to know which items are purchased together, so it creates rules such as
buys(X, "computer") => buys(X, "computer accessories") [confidence = 50%]
where X is the customer. A confidence or certainty of 50% means that if X buys a computer, there is a 50% chance that he will also buy computer accessories.
Support is the percentage of all transactions in which the computer and the computer accessories were bought together.
Consider another rule like above.
age(X, "20 ... 29") And income(X, "20K ... 29K") => buys (X, "CD Player") [support = 2%, confidence = 60%]
The second rule is an example of a multi-dimensional rule, where more than one attribute or dimension (age, income and buys) is involved.
Therefore, frequent item-set mining is the simplest form of frequent pattern mining.
Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed and used, including association, classification,
clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata, and it helps to classify data into different classes (a minimal code sketch of classification on sample data follows the sub-classifications below). Data mining frameworks themselves can also be classified, as follows:
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example discrimination, classification, clustering, characterization, etc. Some frameworks are comprehensive and offer several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms,
visualization, statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account, the level of user interaction involved in the data mining procedure, such as query-driven
systems, autonomous systems, or interactive exploratory systems.
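As referenced above, here is a minimal, hedged sketch of classification on made-up data using scikit-learn's DecisionTreeClassifier; the feature names and values are purely illustrative.

from sklearn.tree import DecisionTreeClassifier

# Illustrative training data: [age, income_in_thousands] -> buys_computer (1 = yes, 0 = no)
X_train = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y_train = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Predict the class of new, unseen records.
print(model.predict([[40, 70], [22, 25]]))   # e.g. [1 0]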
2. Clustering:
Clustering is a division of data into groups of related objects. Describing the data by a few clusters loses certain fine details but achieves simplification: it models data by its clusters. Historically, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an outstanding role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that clustering analysis is a data mining technique for identifying similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
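A minimal clustering sketch on made-up two-dimensional points, assuming scikit-learn's KMeans is an acceptable stand-in for the clustering algorithm:

from sklearn.cluster import KMeans

# Illustrative points forming two natural groups, around (1, 1) and (8, 8).
points = [[1, 2], [1, 1], [2, 1], [8, 8], [9, 8], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(points)

print(kmeans.labels_)           # cluster assignment of each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the centre of each discovered group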
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to predict the value of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
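A minimal regression sketch with scikit-learn's LinearRegression on made-up data; the underlying relationship (roughly y = 2x + 1) is an assumption chosen only for illustration.

from sklearn.linear_model import LinearRegression

# Illustrative data roughly following y = 2x + 1.
X = [[1], [2], [3], [4], [5]]
y = [3.1, 4.9, 7.2, 9.0, 11.1]

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_, reg.intercept_)   # estimated slope and intercept
print(reg.predict([[6]]))          # predicted value for x = 6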
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to find sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have various data, for example a list of grocery items that you have been buying for the last six months, and it calculates the percentage of items being purchased together.
o Lift:
This measure gauges how much more often items A and B occur together than expected if they were independent; it is the confidence of the rule divided by the overall support of item B.
Lift(A => B) = Confidence(A => B) / Support(B)
o Support:
This measure gauges how often items A and B are purchased together, relative to the overall dataset.
Support(A => B) = (transactions containing both A and B) / (all transactions)
o Confidence:
This measure gauges how often item B is purchased when item A is purchased as well.
Confidence(A => B) = (transactions containing both A and B) / (transactions containing A)
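A minimal sketch computing the three measures above from a made-up list of transactions (item names are illustrative):

# Illustrative transactions from a grocery store.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: bread => milk
sup_rule = support({"bread", "milk"})        # support of the rule     = 0.4
conf_rule = sup_rule / support({"bread"})    # confidence of the rule  = 0.67
lift_rule = conf_rule / support({"milk"})    # lift of the rule        = 0.83

print(sup_rule, conf_rule, lift_rule)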
5. Outlier detection:
This type of data mining technique relates to observing data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas like network intrusion identification, credit or debit card fraud detection, and detecting outliers in wireless sensor network data.
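A minimal outlier-detection sketch using a simple z-score rule; the 2-standard-deviation threshold is an arbitrary choice for illustration.

import statistics

# Illustrative measurements with one obvious outlier.
values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]

mean = statistics.mean(values)
std = statistics.stdev(values)

# Flag points that lie more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / std > 2]
print(outliers)   # [25.0]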
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data in order to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Data mining is widely used in diverse areas. There are a number of commercial data mining system available today and yet there are many
challenges in this field. In this tutorial, we will discuss the applications and the trend of data mining.
• Design and construction of data warehouses for multidimensional data analysis and data mining.
Retail Industry
Data mining has great application in the retail industry because this industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Customer Retention.
• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become the major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection −
• Association and correlation analysis, aggregation to help select and build discriminating attributes.
• Data Types − The data mining system may handle formatted text, record-based data, and relational data. The data could also be in ASCII
text, relational database data or data warehouse data. Therefore, we should check what exact format the data mining system can handle.
• System Issues − We must consider the compatibility of a data mining system with different operating systems. One data mining system
may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow
XML data as input.
• Data Sources − Data sources refer to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files while others work on multiple relational sources. The data mining system should also support ODBC connections or OLE DB for ODBC connections.
• Data Mining functions and methodologies − There are some data mining systems that provide only one data mining function, such as classification, while others provide multiple data mining functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.
• Coupling data mining with databases or data warehouse systems − Data mining systems need to be coupled with a database or a data
warehouse system. The coupled components are integrated into a uniform information processing environment. Here are the types of
coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
• Scalability − There are two scalability issues in data mining −
o Row (Database size) Scalability − A data mining system is considered row scalable when, if the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the same query.
o Column (Dimension) Scalability − A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple
levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle the noise and incomplete objects while mining
the data regularities. If the data cleaning methods are not there then the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered may be uninteresting because they represent common knowledge or lack novelty; evaluating the interestingness of discovered patterns is therefore an important issue.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore mining knowledge from them adds challenges to data mining.
• An attribute is an object’s property or characteristic, for example a person’s hair colour or air humidity.
• An attribute set defines an object. The object is also referred to as a record, an instance or an entity.
Different types of attributes or data types:
1. Nominal Attribute:
Nominal attributes provide only enough information to distinguish one object from another, such as a student roll number or the sex of a person.
2. Ordinal Attribute:
The ordinal attribute value provides sufficient information to order the objects, such as rankings, grades or height.
3. Binary Attribute:
These take only the values 0 and 1, where 0 indicates the absence of a feature and 1 indicates its presence.
4. Numeric attribute: It is quantitative, such that the quantity can be measured and represented as integer or real values. Numeric attributes are of two types.
Interval-scaled attribute:
It is measured on a scale of equal-size units; the values of these attributes have order and can be compared, such as temperature in C or F.
• Accuracy:
There are many possible reasons for flawed or inaccurate data here, e.g. having incorrect attribute values, which could be due to human or computer errors.
• Completeness:
Incomplete data can occur for various reasons; attributes of interest, such as customer information for sales and transaction data, may not always be available.
• Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent input field formats. Duplicate tuples also need data cleaning.
• Timeliness:
Timeliness also affects the quality of the data. At the end of the month, several sales representatives fail to file their sales records on time, and several corrections and adjustments flow in after the end of the month. Data stored in the database is therefore incomplete for a time after each month.
• Believability:
It is reflective of how much users trust the data.
• Interpretability:
It is a reflection of how easy the users can understand the data.
*** Statistical Descriptions of Data: see the separate PDF for this topic ***
Data Visualization
Pros
• It can be accessed quickly by a wider audience.
Cons
• It can misrepresent information if an incorrect visual representation is made.
Here the total length of the red line gives the Manhattan distance between the two points.
3. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of their intersection divided by the size of their union; the Jaccard distance is one minus the Jaccard index.
Figure – Jaccard Index
4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan Distance Measure. In an N-dimensional space, a point is represented
as,
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance of order p between P1 and P2 is given as:
D(P1, P2) = (|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p)
For p = 1 it reduces to the Manhattan distance, and for p = 2 to the Euclidean distance.
5. Cosine similarity:
cos(theta) = (A · B) / (||A|| ||B||)
Here theta gives the angle between the two vectors, and A and B are n-dimensional vectors.
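A minimal sketch computing these measures for two illustrative points/vectors in plain Python (no special libraries assumed):

import math

p1 = [1.0, 2.0, 3.0]
p2 = [4.0, 0.0, 3.0]

manhattan = sum(abs(a - b) for a, b in zip(p1, p2))                  # order-1 Minkowski
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))     # order-2 Minkowski

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

dot = sum(a * b for a, b in zip(p1, p2))
cosine_similarity = dot / (math.sqrt(sum(a * a for a in p1)) * math.sqrt(sum(b * b for b in p2)))

set_a, set_b = {"milk", "bread", "butter"}, {"milk", "bread", "jam"}
jaccard_index = len(set_a & set_b) / len(set_a | set_b)

print(manhattan, euclidean, minkowski(p1, p2, 3), cosine_similarity, jaccard_index)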
• For example, the model will treat America and america as different classes or values, even though they represent the same value, or red, yellow and red-yellow as different classes, even though one class can be included in the other two. These structural errors make our model inefficient and give poor-quality results.
3. Managing Unwanted outliers
Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. Generally, we should not remove outliers unless we have a legitimate reason to remove them. Sometimes removing them improves performance, sometimes not, so one must have a good reason to remove an outlier, such as suspicious measurements that are unlikely to be part of the real data.
4. Handling missing data
Missing data is a deceptively tricky issue in machine learning. We cannot simply ignore or remove missing observations; they must be handled carefully, as they can be an indication of something important. The two most common ways to deal with missing data are:
1. Dropping observations with missing values.
Dropping missing values is sub-optimal because when you drop observations, you drop information.
• The fact that the value was missing may be informative in itself.
• Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
2. Imputing the missing values from past observations.
Imputing missing values is sub-optimal because the value was originally missing and you filled it in, which always leads to a loss of information, no matter how sophisticated your imputation method is.
• Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the
patterns already provided by other features.
Both of these approaches are sub-optimal: dropping an observation means dropping information, thereby reducing the data, while imputing values fills in values that were not present in the actual dataset, which also leads to a loss of information.
Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s
like trying to squeeze in a piece from somewhere else in the puzzle.
So, missing data is almost always informative and an indication of something important, and we must make our algorithm aware of missing data by flagging it. By using this technique of flagging and filling, you essentially allow the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
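A minimal flag-and-fill sketch with pandas on a made-up column; the column name and the fill constant are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000, None, 45000]})

# Flag missingness so the algorithm can learn from it...
df["income_missing"] = df["income"].isnull().astype(int)

# ...then fill the missing values (here with 0; the flag carries the "was missing" signal).
df["income"] = df["income"].fillna(0)

print(df)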
There are mainly 2 major approaches for data integration – one is “tight coupling approach” and another is “loose coupling
approach”.
Tight Coupling:
• Here, a data warehouse is treated as an information retrieval component.
• In this coupling, data is combined from different sources into a single physical location through the process of ETL –
Extraction, Transformation and Loading.
Loose Coupling:
• Here, an interface is provided that takes the query from the user, transforms it in a way the source database can
understand and then sends the query directly to the source databases to obtain the result.
• And the data only remains in the actual source databases.
2. Attribute Subset Selection:
Suppose there are the following attributes in the data set, of which a few are redundant. Stepwise forward selection adds the best remaining attribute at each step (a code sketch of this procedure follows the steps below):
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
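As referenced above, a minimal sketch of greedy forward selection, assuming scikit-learn and a made-up dataset in which the columns stand for X1..X6; it scores candidate subsets by cross-validation and keeps the best attribute at each step.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                        # columns stand for X1..X6
y = (X[:, 0] + X[:, 1] + X[:, 4] > 0).astype(int)    # assume only X1, X2, X5 actually matter

selected, remaining = [], list(range(6))
for _ in range(3):                                   # select three attributes, as in the steps above
    scores = {c: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [c]], y, cv=3).mean()
              for c in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print("selected so far:", [f"X{i + 1}" for i in selected])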
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (e.g. Huffman encoding and run-length encoding). We can divide it into two types based on the compression technique used.
• Lossless Compression –
Encoding techniques such as run-length encoding allow a simple but modest reduction in data size (a minimal run-length encoding sketch follows this list). Lossless data compression uses algorithms that restore the precise original data from the compressed data.
• Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format uses lossy compression, but we can still find meaning equivalent to the original image. In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from.
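As referenced above, a minimal run-length encoding sketch; being lossless, the original sequence can be restored exactly from the compressed form.

def rle_encode(data):
    # Collapse runs of identical symbols into (symbol, count) pairs.
    encoded = []
    for symbol in data:
        if encoded and encoded[-1][0] == symbol:
            encoded[-1][1] += 1
        else:
            encoded.append([symbol, 1])
    return encoded

def rle_decode(encoded):
    return "".join(symbol * count for symbol, count in encoded)

compressed = rle_encode("AAAABBBCCD")
print(compressed)                 # [['A', 4], ['B', 3], ['C', 2], ['D', 1]]
print(rle_decode(compressed))     # 'AAAABBBCCD' - exactly the original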
4. Numerosity Reduction:
In this reduction technique the actual data is replaced with a mathematical model or a smaller representation of the data; it is then only necessary to store the model parameters. Alternatively, non-parametric methods such as clustering, histograms and sampling can be used.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide attributes of a continuous nature into data with intervals. Many constant values of the attributes are replaced by labels of small intervals, so that mining results can be shown in a concise and easily understandable way.
• Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attribute values and repeat this method recursively to the end, the process is known as top-down discretization, also called splitting.
• Bottom-up discretization –
If you first consider all the constant values as split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization, also called merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) to high-level concepts
(categorical variables such as middle age or Senior).
For numeric data following techniques can be followed:
• Binning –
Binning is the process of converting numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user (a minimal binning sketch follows this list).
• Histogram analysis –
Like binning, histogram analysis is used to partition the values of the attribute X into disjoint ranges called brackets. There are several partitioning rules:
1. Equal Frequency partitioning: Partitioning the values based on their number of occurrences in the data set.
2. Equal Width Partitioning: Partitioning the values into bins of a fixed width based on the number of bins, e.g. a set of values ranging from 0-20.
3. Clustering: Grouping the similar data together.
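As referenced above, a minimal binning sketch with pandas showing equal-width bins (pd.cut) and equal-frequency bins (pd.qcut) on made-up ages; the labels are illustrative.

import pandas as pd

ages = pd.Series([3, 7, 15, 21, 22, 30, 43, 45, 52, 61, 70])

# Equal-width binning: the value range is split into 3 intervals of equal width.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each bin receives (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))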
• Min-Max Normalization:
• This transforms the original data linearly.
• Suppose that min_A is the minimum and max_A is the maximum of an attribute A, and [new_min, new_max] is the new range.
We have the formula:
v' = ((v - min_A) / (max_A - min_A)) * (new_max - new_min) + new_min
• Where v is the value you want to map into the new range.
• v' is the new value you get after normalizing the old value.
Solved example:
Suppose the minimum and maximum values for the attribute profit (P) are Rs. 10,000 and Rs. 100,000, and we want to map the profit into the range [0, 1]. Using min-max normalization, the value Rs. 20,000 for the attribute profit is mapped to:
v' = ((20,000 - 10,000) / (100,000 - 10,000)) * (1 - 0) + 0 = 0.111
For example:
In z-score normalization, v' = (v - mean_A) / std_A. Let the mean of the attribute P be 60,000 and its standard deviation be 10,000. Using z-score normalization, a value of 85,000 for P is transformed to:
v' = (85,000 - 60,000) / 10,000 = 2.5
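A minimal sketch reproducing both normalizations in Python, using the values from the worked examples above:

def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(min_max_normalize(20_000, 10_000, 100_000))   # 0.111...
print(z_score_normalize(85_000, 60_000, 10_000))    # 2.5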
Data discretization techniques can be used to divide the range of a continuous attribute into intervals. Numerous continuous attribute values are replaced by small interval labels.
Top-down discretization
If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then
repeats this recursively on the resulting intervals, then it is called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split-points, removes some by merging neighborhood
values to form intervals, then it is called bottom-up discretization or merging.
Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of the attribute values, known as
a concept hierarchy.
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of
abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives.
Data mining on a reduced data set means fewer input/output operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than
during mining.
1 Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning is an unsupervised discretization technique.
2 Histogram Analysis
Because histogram analysis does not use class information, it is an unsupervised discretization technique. Histograms partition the values of an attribute into disjoint ranges called buckets.
3 Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
• It possesses consolidated historical data, which helps the organization to analyze its business.
• A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.
• Tuning Production Strategies − The product strategies can be well tuned by repositioning the products and managing the product
portfolios by comparing the sales quarterly or yearly.
• Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.
• Operations Analysis − Data warehousing also helps in customer relationship management, and making environmental corrections. The
information also allows us to analyze business operations.
• Query-driven Approach
• Update-driven Approach
Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and integrators on top of multiple
heterogeneous databases. These integrators are also known as mediators.
• Now these queries are mapped and sent to the local query processor.
• The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
• Query-driven approach needs complex integration and filtering processes.
• This approach is also very expensive for queries that require aggregations.
Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow update-driven approach rather than the traditional approach
discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
Advantages
This approach has the following advantages −
• The data is copied, processed, integrated, annotated, summarized and restructured in semantic data store in advance.
• Query processing does not require an interface to process data at local sources.
• Data Transformation − Involves converting the data from legacy format to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
The general idea of this approach is to materialize certain expensive computations that are frequently queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be materialized into a set of eight views as shown in the figure, where psc indicates a view consisting of aggregate function values (such as total-sales) computed by grouping the three attributes part, supplier, and customer; p indicates a view composed of the corresponding aggregate function values calculated by grouping on part alone, and so on.
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, branch, and location. These dimensions enable the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many cases because not every cell in each
dimension may have corresponding data in the database.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how to make the best use of the
precomputed results stored in the data cube.
This model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme,
like sales and transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measure (such as
Rs_sold) and keys to each of the related dimensional tables.
Dimensions are the perspectives with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing the relationship between dimensions.
Example: In the 2-D representation, we look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown in the table, and the 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid. In this example, this is the total sales,
or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
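A minimal sketch of the cuboid idea with pandas on made-up sales records: grouping by all dimensions gives the base cuboid, grouping by fewer dimensions gives higher-level cuboids, and aggregating over everything gives the apex cuboid.

import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["PC", "TV", "PC", "TV"],
    "location": ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [605, 825, 680, 952],
})

base_cuboid = sales.groupby(["time", "item", "location"])["dollars_sold"].sum()  # lowest level of summarization
item_cuboid = sales.groupby(["item"])["dollars_sold"].sum()                      # a 1-D cuboid
apex_cuboid = sales["dollars_sold"].sum()                                        # highest level of summarization

print(base_cuboid, item_cuboid, apex_cuboid, sep="\n\n")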
OLAP is an acronym for Online Analytical Processing. Before we start explaining the basics of OLAP, you might wonder why OLAP can be handy for your organization. OLAP is a multi-dimensional database technology that permits quick data analysis on many data records. This analysis provides relevant information aimed at better decision taking, storytelling and planning. In summary, OLAP is a software technology that allows organizations to perform multidimensional analysis of collected data. It provides the capability for complex calculations, trend analysis and data modeling with one goal: understanding your business better.
When learning about OLAP, there is no getting away from the terms dimensions, cubes, measures and hierarchies. Here are some definitions that will make it easier to
understand their relevance.
1. Cubes
OLAP tools use multidimensional database structures, called cubes. An OLAP Cube, or a data cube, is a multidimensional data set that allows fast analysis of data, according
to the multiple dimensions you set up. You can compare a cube with a multidimensional spreadsheet: you can collect data from users and store that data in a transparent way
and calculate when needed. In order to form a cube you need dimensions.
2. Dimensions
Dimensions are lists of related items used to organize your data into similar categories, such as products, time and/or regions. Dimensions are the basis for the data structure of an OLAP data cube. For example, the months and quarters may make up your Year dimension. You can compare dimensions with the business parameters that you normally see in the rows and columns of a report. A model can consist of multiple dimensions.
In practice, the number of dimensions needs to be limited to roughly 12 in order to remain workable for end users and the calculation engine. Depending on the technology used, the number of dimensions can be higher without an impact on performance.
3. Measures
Each cube must have at least one measure, but in reality we see that cubes often contain multiple measures. An OLAP measure is a numeric value by which the dimensions are detailed or aggregated. It gives you information about the quantities you’re interested in. Do you have difficulties defining your OLAP measures? Ask yourself the question ‘how much…?’ and your answer will be your OLAP measure. Measures can be financial or non-financial, for example COAs, KPIs, FTEs, volumes, etc.
4. Hierarchies
Hierarchies are the subcategories of your dimensions. They have multiple levels and allow you to drill down or drill up your data. What is drilling, you may ask? Drilling allows
you to analyze your data at different levels of granularity. (example : total volume , volume by product group, volume by packaging by product group, volume by KSU by
packaging by product group ).
Conclusion
OLAP is a common technology behind many Business Intelligence and CPM applications and is still most relevant today. Using OLAP can help your organization with your
analyses, forecasting and planning. In short, it should contribute to better decision-making and eventually lead to more profit.
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP and statistical databases and OLTP.
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses
relational or extended-relational DBMS.
Multidimensional OLAP
MOLAP servers use array-based multidimensional storage engines to provide multidimensional views of data, rather than storing the data in a relational database.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed data, while the aggregations are stored separately in a MOLAP store.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in multidimensional data.
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed, one or more dimensions from the data cube are added.
• It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows
how slice works.
• Here Slice is performed for the dimension "time" using the criterion time = "Q1".
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data. Consider the
following diagram that shows the pivot operation.
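A minimal sketch of roll-up, slice, dice and pivot using pandas on made-up sales data; the dimension values and the city-to-country hierarchy are illustrative.

import pandas as pd

sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":     ["Chicago", "Toronto", "Chicago", "Toronto", "Vancouver", "Vancouver"],
    "country":  ["USA", "Canada", "USA", "Canada", "Canada", "Canada"],
    "item":     ["PC", "PC", "TV", "TV", "PC", "TV"],
    "dollars_sold": [440, 395, 605, 825, 555, 680],
})

# Roll-up: climb the location hierarchy from city to country.
roll_up = sales.groupby(["quarter", "country"])["dollars_sold"].sum()

# Slice: fix one dimension (time = "Q1") to obtain a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "PC")]

# Pivot: rotate the data axes of the view.
pivot = sales.pivot_table(index="item", columns="quarter", values="dollars_sold", aggfunc="sum")

print(roll_up, slice_q1, dice, pivot, sep="\n\n")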
OLAP vs OLTP
OLAP systems are used by knowledge workers such as executives, managers and analysts, whereas OLTP systems are used by clerks, DBAs, or database professionals.
Data warehouse design takes a different approach from view materialization in industry. It sees data warehouses as database systems with particular needs, such as answering management-related queries. The target of the design becomes how the records from multiple data sources should be extracted, transformed, and loaded (ETL) to be organized in a database as the data warehouse.
1. "top-down" approach
2. "bottom-up" approach
Developing new data marts from the data warehouse is very easy.
Data marts include the lowest-grain data and, if needed, aggregated data too. Instead of a normalized database for the data warehouse, a denormalized dimensional database is adopted to meet the data delivery requirements of data warehouses. Using this method, to use the set of data marts as the enterprise data warehouse, the data marts should be built with conformed dimensions in mind, meaning that common objects are represented the same way in different data marts. The conformed dimensions connect the data marts to form a data warehouse, which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data warehouse for a single subject,
takes far less time and effort than developing an enterprise-wide data warehouse. Also, the risk of failure is even less. This method is inherently
incremental. This method allows the project team to learn and grow.
Advantages of bottom-up design
It simply involves developing new data marts and then integrating them with the other data marts.
The locations of the data warehouse and the data marts are reversed in the bottom-up design approach.
Top-down design vs. bottom-up design:
• Top-down breaks the vast problem into smaller subproblems; bottom-up solves the essential low-level problems and integrates them into a higher one.
• Top-down is inherently architected, not a union of several data marts; bottom-up is inherently incremental and can schedule essential data marts first.
• Top-down may see quick results if implemented with iterations; bottom-up has less risk of failure, a favorable return on investment, and proof of techniques.
Simply defined, a data warehouse is a system that pulls together data from many different sources within an organization. On top of this
system, business users can create reports from complex queries that answer questions about business operations to improve business efficiency,
make better decisions, and even introduce competitive advantages.
It’s important to understand that a data warehouse is definitely different than a traditional database. Sure, data warehouses and databases are both
relational data systems, but they were definitely built to serve different purposes. A data warehouse is built to store large quantities of historical data
and enable fast, complex queries across all the data, typically using Online Analytical Processing (OLAP). A database was built to store current
transactions and enable fast access to specific transactions for ongoing business processes, known as Online Transaction Processing (OLTP).
So, data warehousing allows you to aggregate data, from various sources. This data, typically structured, can come from Online Transaction
Processing (OLTP) data such as invoices and financial transactions, Enterprise Resource Planning (ERP) data, and Customer Relationship
Management (CRM) data. Finally, data warehousing focuses on data relevant for business analysis, organizes and optimizes it to enable efficient
analysis.
Furthermore, there are some great benefits of leveraging a data warehouse architecture for your requirements. These include:
• Collects historical data from multiple periods and multiple data sources from across the organization, allowing strategic analysis
• Provides an easy interface for business analysts and data ready for analysis
Stepping back, data warehouse use-cases focus on providing high-level reporting and analysis that lead to more informed business decisions. Use-
cases include:
• Carrying out data mining to gain new insights from the information held in many large databases
Data modernization
Data warehouses are constantly evolving to support new technologies and business requirements—and remain relevant when it comes to big data
and analytics. This means that if you’re leveraging older data storage systems, you might be running into problems supporting new and advanced
data analytics solutions. And if you're trying to run a modern data operation solely on the back of a database, that can create a whole host of
issues.
This is a big point bridging from the previous note. It’s absolutely key to understand just how much is possible with big data platforms like Hadoop.
But can your current infrastructure support it? One big use-case for data warehouse design is the integration with big data systems like Hadoop. For
example, Panoply makes it easy to combine your Hadoop HDFS data into your Panoply data warehouse, giving you instant cloud access to your
HDFS data without any ETL or ELT process. HDFS, or the Hadoop Distributed File System, is an open source data storage software framework.
Panoply’s end-to-end data management solution is able to load Hadoop data into your Panoply data warehouse with only a few clicks, giving your
analysts and scientists instant access.
Data integration
One key aspect in working with data is the ability to integrate with other key systems. For example, maybe you have a lot of data in your data
warehouse architecture. How are you integrating it with data visualization? Or, do you have integration with things like reporting and analytics?
A great use-case for data warehousing is to integrate with amazing data services ranging from everything like business intelligence (BI), to data
visualization (Tableau). For example, you can quickly integrate Amazon Kinesis Firehose reporting and analysis into your data warehouse with the
Panoply Amazon Kinesis Firehose integration.
The future of data warehousing revolves around your ability to integrate and work with data. Leading data warehousing systems will allow you to
leverage integration as a great way to get the most of your data without requiring a complicated data infrastructure. Working with next-generation
data warehousing revolves around simplicity, more data delivery capabilities, and working to advance the business.
This might be a lot to take in, but it doesn’t have to be hard to get started. If you have data requirements that are being complicated by your current
systems, look at a data warehouse as a real-world option to make things easier. Remember, the future will be driven by your ability to
work with and house data, and this can’t be done with traditional databases. The organizations that capture the full benefits of data will be the
ones that can deeply understand the market and evolving customer requirements.
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based
characterization was first proposed in 1989 (KDD ’89 workshop), a few years before the introduction of
the data cube approach.
The data cube approach can be considered a data warehouse-based, precomputation-oriented,
materialized approach.
It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational
database query-oriented, generalization-based, on-line data analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on online aggregation
versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.
How is it done?
• Collect the task-relevant data (initial relation) using a relational database query.
• Perform generalization by attribute removal or attribute generalization.
• Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
• Reduce the size of the generalized data set.
• Present the results interactively to users.
Data focusing:
• Analyzing task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal:
• To remove attribute A if there is a large set of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher-level concepts are expressed in terms of other attributes.
Attribute-generalization:
• If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then
select an operator and generalize A.
Attribute-threshold control:
• If the number of distinct values of a generalized attribute is still larger than the attribute generalization threshold (typically 2 to 8), further generalization is performed; the threshold may be left at a default or specified by the user.
InitialRel:
• It is nothing but query processing of task-relevant data and deriving the initial relation.
PreGen:
• It is based on the analysis of the number of distinct values in each attribute and to determine the
generalization plan for each attribute: removal? or how high to generalize?
PrimeGen:
• It is based on the PreGen plan and performing the generalization to the right level to derive a “prime
generalized relation” and also accumulating the counts.
Presentation:
• User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
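As a rough sketch of these steps in code (the attribute names, concept-hierarchy mappings, and data below are made up for illustration and are not the University data used in the example that follows), attribute removal, attribute generalization, and merging of identical generalized tuples with accumulated counts might look like this in Python:

from collections import Counter

# Hypothetical task-relevant data (initial relation).
initial_relation = [
    {"name": "A. Rao",  "birth_place": "Hyderabad", "gpa": 3.7},
    {"name": "B. Khan", "birth_place": "Mumbai",    "gpa": 3.4},
    {"name": "C. Lee",  "birth_place": "Seoul",     "gpa": 3.8},
    {"name": "D. Devi", "birth_place": "Hyderabad", "gpa": 3.9},
]

# A tiny concept hierarchy: city -> country.
city_to_country = {"Hyderabad": "India", "Mumbai": "India", "Seoul": "South Korea"}

def generalize_gpa(gpa):
    # A tiny concept hierarchy for GPA: numeric value -> range label.
    return "excellent" if gpa >= 3.5 else "good"

def aoi(relation):
    generalized = []
    for row in relation:
        generalized.append((
            # Attribute removal: 'name' is dropped (many distinct values,
            # no generalization operator defined on it).
            city_to_country[row["birth_place"]],   # attribute generalization
            generalize_gpa(row["gpa"]),            # attribute generalization
        ))
    # Merge identical generalized tuples and accumulate their counts.
    return Counter(generalized)

print(aoi(initial_relation))
# Counter({('India', 'excellent'): 2, ('India', 'good'): 1, ('South Korea', 'excellent'): 1})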
Example
Let's say there is a University database that is to be characterized. Its corresponding DMQL query would
be:
use University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
InitialRel:
PreGen
• Now, we have generalized these results by removing a few attributes and retaining the important attributes.
• We have also generalized a few attributes, renaming them "Country" rather than "Birth_Place", "Age_Range"
rather than "Birth_Date", "City" rather than "Residence", and so on, as per the table given below.
PrimeGen
• Based on the PreGen plan we've performed generalization to the right level to derive a “prime
generalized relation” and also we've accumulated the counts.
Final Results
• Now we have analyzed and concluded our final generalized results as shown below.
Presentation Of Results
Generalized relation:
• Relations where some or all attributes are generalized, with counts or other aggregation values
accumulated.
Cross-tabulation:
• Mapping the generalized results into cross-tabulation (cross-tab) form.
Visualization techniques:
• Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
• Mapping the generalized results into characteristic rules with the associated quantitative information.
Summary
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization – based
characterization was first proposed in 1989 (KDD ‘89 workshop) a few years before the introduction of
the data cube approach.
Example
Prediction cubes for identification of interesting cube subspaces
Suppose a company has a customer table with the attributes time (with two granularity
levels: month and year), location (with two granularity levels: state and country), gender, salary, and one
class-label attribute: valued_customer. A manager wants to analyze the decision process of whether a
customer is highly valued with respect to time and location. In particular, he is interested in the
question “Are there times at and locations in which the value of a customer depended greatly on the
customer’s gender?” Notice that he believes time and location play a role in predicting valued customers,
but at what granularity levels do they depend on gender for this task? For example, is performing analysis
using {month, country} better than {year, state}?
Consider a data table D (e.g., the customer table). Let X be the set of attributes for which no concept hierarchy has been
defined (e.g., gender, salary). Let Y be the class-label attribute (e.g., valued_customer), and Z be the set of multilevel
attributes, that is, attributes for which concept hierarchies have been defined (e.g., time, location). Let V be the set of
attributes for which we would like to define their predictiveness. In our example, this set is {gender}. The predictiveness of V on a data subset
can be quantified by the difference in accuracy between the model built on that subset using X to predict Y
and the model built on that subset using X − V (e.g., {salary}) to predict Y. The intuition is that, if the
difference is large, V must play an important role in the prediction of class label Y.
Given a set of attributes, V, and a learning algorithm, the prediction cube at a given granularity (e.g., ⟨year, state⟩) is
a d-dimensional array, in which the value in each cell (e.g., [2010, Illinois]) is the predictiveness
of V evaluated on the subset defined by the cell (e.g., the records in the customer table with time in 2010
and location in Illinois).
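A minimal sketch of this predictiveness measure for a single cell, assuming a small pandas DataFrame stands in for the customer table and taking scikit-learn's decision tree as an arbitrary choice of learning algorithm:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def predictiveness(df, X_attrs, V_attrs, y_attr, cv=3):
    # Accuracy of a model built on X minus accuracy of a model built on X - V.
    y = df[y_attr]
    X_full = pd.get_dummies(df[X_attrs])
    X_minus_V = pd.get_dummies(df[[a for a in X_attrs if a not in V_attrs]])
    acc_full = cross_val_score(DecisionTreeClassifier(), X_full, y, cv=cv).mean()
    acc_minus_V = cross_val_score(DecisionTreeClassifier(), X_minus_V, y, cv=cv).mean()
    return acc_full - acc_minus_V

# Tiny synthetic stand-in for the customer table (hypothetical columns and values).
customers = pd.DataFrame({
    "year":   [2010] * 6 + [2011] * 6,
    "state":  ["Illinois"] * 12,
    "gender": ["M", "F"] * 6,
    "salary": [40, 60, 45, 65, 50, 70, 42, 58, 47, 63, 52, 68],
    "valued_customer": ["no", "yes"] * 6,
})

# Predictiveness of V = {gender} on the subset defined by the cell [2010, Illinois].
cell = customers[(customers["year"] == 2010) & (customers["state"] == "Illinois")]
print(predictiveness(cell, ["gender", "salary"], ["gender"], "valued_customer"))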
Supporting OLAP roll-up and drill-down operations on a prediction cube is a computational challenge
requiring the materialization of cell values at many different granularities. For simplicity, we can consider
only full materialization. A naïve way to fully materialize a prediction cube is to exhaustively build models
and evaluate them for each cell and granularity. This method is very expensive if the base data set is large.
An ensemble method called Probability-Based Ensemble (PBE) was developed as a more feasible
alternative. It requires model construction for only the finest-grained cells. OLAP-style bottom-up
aggregation is then used to generate the values of the coarser-grained cells.
The prediction of a predictive model can be seen as finding a class label that maximizes a scoring function.
The PBE method was developed to approximately make the scoring function of any predictive model
distributively decomposable. In our discussion of data cube measures in Section 4.2.4, we showed that
distributive and algebraic measures can be computed efficiently. Therefore, if the scoring function used is
distributively or algebraically decomposable, prediction cubes can also be computed with efficiency. In
this way, the PBE method reduces prediction cube computation to data cube computation.
For example, previous studies have shown that the naïve Bayes classifier has an algebraically
decomposable scoring function, and the kernel density–based classifier has a distributively decomposable
scoring function. Therefore, either of these could be used to implement prediction cubes efficiently. The
PBE method presents a novel approach to multidimensional data mining in cube space.
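As a rough illustration of why a distributive score keeps this cheap (this sketch shows only the aggregation idea, not the actual PBE algorithm), per-class counts computed once for the finest-grained cells can be rolled up by simple summation to serve coarser-grained cells:

from collections import Counter, defaultdict

# Class counts per finest-grained cell (month, state); the numbers are invented.
fine_cells = {
    ("2010-01", "Illinois"): Counter({"yes": 30, "no": 70}),
    ("2010-02", "Illinois"): Counter({"yes": 45, "no": 55}),
    ("2010-01", "Ohio"):     Counter({"yes": 10, "no": 90}),
}

def roll_up(cells, key_fn):
    # Distributive aggregation: coarser counts are sums of finer counts.
    coarse = defaultdict(Counter)
    for key, counts in cells.items():
        coarse[key_fn(key)] += counts
    return coarse

# Roll up (month, state) -> (year, state) without revisiting the base data.
year_state = roll_up(fine_cells, lambda k: (k[0][:4], k[1]))
print(year_state[("2010", "Illinois")])   # Counter({'no': 125, 'yes': 75})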
There are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends −
• Classification
• Prediction
Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For
example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction
model to predict the expenditures in dollars of potential customers on computer equipment, given their income and
occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which
are safe.
• A marketing manager at a company needs to analyze a customer with a given profile and determine whether that customer will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky
or safe for loan application data and yes or no for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company.
In this example we need to predict a numeric value, so the data analysis task is an example of numeric
prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered
value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
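As a small illustration, a numeric predictor can be fitted with ordinary least-squares regression; the incomes, occupation codes, and expenditures below are invented purely for the example:

from sklearn.linear_model import LinearRegression

# Hypothetical training data: [income in $1000s, occupation code] -> spend in $.
X = [[45, 0], [60, 1], [75, 1], [90, 2], [120, 2]]
y = [300, 520, 640, 810, 1050]

predictor = LinearRegression().fit(X, y)
print(predictor.predict([[80, 1]]))   # predicted expenditure for a new customer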
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree, it measures the randomness or impurity in
data sets.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is also called Entropy Reduction. Building a
decision tree is all about discovering attributes that return the highest data gain.
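Both measures are easy to write down directly; a small sketch with made-up class labels:

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels: -sum(p_i * log2(p_i)).
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    # Entropy of the parent set minus the weighted entropy of its split groups.
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

parent = ["yes", "yes", "yes", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no"]]   # a perfect split on some attribute
print(entropy(parent))                          # about 0.971
print(information_gain(parent, split))          # about 0.971, since each group has entropy 0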
In short, a decision tree is just like a flow chart diagram with the terminal nodes showing decisions. Starting with the
dataset, we can measure the entropy to find a way to segment the set until the data belongs to the same class.
It provides us with a framework to measure the values of outcomes and the probabilities of achieving them.
It helps us make the best decisions based on existing data and best speculations.
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an extensive
collection of records into smaller sets of records by applying a sequence of simple decision rules. A decision tree
model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, or
mutually exclusive classes. The attributes can be any type of variable (nominal, ordinal, binary, or
quantitative), whereas the classes must be of a qualitative type, such as categorical, ordinal, or binary. In brief,
given the data of attributes together with its class, a decision tree creates a set of rules that can be used to identify the
class. One rule is applied after another, resulting in a hierarchy of segments within a segment. The hierarchy is
known as the tree, and each segment is called a node. With each progressive division, the members of the
resulting sets become more and more similar to each other. Hence, the algorithm used to build a decision tree is
referred to as recursive partitioning. The algorithm is known as CART (Classification and Regression Trees).
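A minimal sketch of recursive partitioning in practice, using scikit-learn's CART-style DecisionTreeClassifier on invented loan-applicant data:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan applicants: [income in $1000s, years employed] -> class label.
X = [[25, 1], [40, 3], [60, 8], [80, 10], [30, 2], [90, 12]]
y = ["risky", "risky", "safe", "safe", "risky", "safe"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["income", "years_employed"]))
print(tree.predict([[55, 6]]))   # classify a new applicant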
Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%), which leads to $8 million profit, and
the probability of a bad economy is 0.4 (40%), which leads to $6 million profit.
Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which leads to $4 million profit, and
the probability of a bad economy is 0.4 (40%), which leads to $2 million profit.
The management team needs to take a data-driven decision on whether to expand or not, based on the given data.
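Working the numbers through as a simple expected-value calculation (assuming the quoted profits are gross of the expansion cost):

# Expand: pay $3M, then 60% chance of $8M profit and 40% chance of $6M profit.
ev_expand = 0.6 * 8 + 0.4 * 6 - 3        # = 4.2 ($ million)

# Do not expand: no cost, 60% chance of $4M and 40% chance of $2M profit.
ev_no_expand = 0.6 * 4 + 0.4 * 2         # = 3.2 ($ million)

print("Expand" if ev_expand > ev_no_expand else "Do not expand")   # Expand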
Initially, D is the entire set of training tuples and their associated class labels (the input training data).
Missing values in the data do not influence the process of building a decision tree to any considerable extent.
A decision tree model is automatic and simple to explain to the technical team as well as to stakeholders.
Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
Baye's Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior Probability [P(H/X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis. According to Bayes' Theorem, P(H/X) = P(X/H) P(H) / P(X).
In a Bayesian belief network, each node represents a random variable and each arc represents a probabilistic dependence; the arcs allow
the representation of causal knowledge. For example, lung cancer is influenced by a person's
family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable
PositiveXray is independent of whether the patient has a family history of lung cancer or that the patient is a smoker,
given that we know the patient has lung cancer.
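A tiny numeric illustration of Bayes' theorem in the spirit of the lung cancer example; all of the probabilities below are invented:

# Hypothetical numbers: H = "patient has lung cancer", X = "positive X-ray".
p_h = 0.01              # prior P(H)
p_x_given_h = 0.90      # likelihood P(X|H)
p_x_given_not_h = 0.05  # false-positive rate P(X|not H)

# Total probability of a positive X-ray, then Bayes' rule for the posterior.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # about 0.154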
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form
−
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
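A rule such as R1 can be applied to a tuple mechanically; a small sketch with a hypothetical attribute-value layout:

# R1: IF age = youth AND student = yes THEN buys_computer = yes
rule_R1 = {"antecedent": {"age": "youth", "student": "yes"},
           "consequent": ("buys_computer", "yes")}

def apply_rule(rule, tuple_):
    # Return the class prediction if the antecedent is satisfied, else None.
    if all(tuple_.get(attr) == value for attr, value in rule["antecedent"].items()):
        return rule["consequent"]
    return None

print(apply_rule(rule_R1, {"age": "youth", "student": "yes", "income": "high"}))
# ('buys_computer', 'yes')
print(apply_rule(rule_R1, {"age": "senior", "student": "yes"}))   # None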
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Sequential covering algorithm (rules are learned one at a time, and the tuples covered by each rule are removed):
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: a set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
repeat
Rule = Learn_One_Rule(D, Att_vals, c);
remove tuples covered by Rule from D;
Rule_set = Rule_set + Rule; // add the new rule to the rule set
until termination condition;
end for
return Rule_set;
Rule Pruning
Rules are pruned for the following reasons −
• The assessment of quality is made on the original set of training data. The rule may perform well on training data
but less well on subsequent data. That's why rule pruning is required.
• The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as
assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for
the pruned version of R, then we prune R.
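In code, the measure and the pruning test look like the following sketch (the pos/neg counts are invented):

def foil_prune(pos, neg):
    # FOIL_Prune(R) = (pos - neg) / (pos + neg) for the tuples covered by rule R.
    return (pos - neg) / (pos + neg)

# Keep the pruned rule only if its value on the pruning set is higher.
original = foil_prune(pos=80, neg=20)   # 0.60
pruned = foil_prune(pos=78, neg=12)     # about 0.73
print(pruned > original)                # True here, so the pruned version of R is preferred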
Some practical ways to improve the accuracy of a classification model:
1 - Cross-validation: Separate your training dataset into groups, always hold one group out for evaluation, and change the held-out group in each
execution. Then you will know which data trains a more accurate model.
2 - Cross-dataset: The same as cross-validation, but using different datasets.
3 - Tuning your model: This basically means changing the parameters you use to train your classification model (the best settings depend on which
classification algorithm you are using).
4 - Improve, or use (if you're not already using) a normalization process: Discover which techniques (changing the geometry, colors, etc.) will
provide more consistent data for training.
5 - Understand the problem you're treating better, and try to implement other methods to solve the same problem. There is always more
than one way to solve the same problem, and you may not be using the best approach.
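As an example of point 1, k-fold cross-validation with scikit-learn on a built-in dataset; the choice of classifier here is arbitrary:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for evaluation.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores.mean(), scores.std())   # average accuracy and its spread across folds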
::: Classification by Back Propagation
The backpropagation (BP) algorithm learns the classification model by training a multilayer feed-forward
neural network. The generic architecture of the neural network for BP is shown in the following diagrams, with
one input layer, some hidden layers, and one output layer. Each layer contains some units or perceptrons.
Each unit might be linked to others by weighted connections. The values of the weights are initialized before
the training. The number of units in each layer, number of hidden layers, and the connections will be
empirically defined at the very start.
The training tuples are assigned to the input layer. Each unit in the input layer calculates its result with
a certain function and the input attributes from the training tuple, and this output then serves as the input
for the hidden layer; the calculation happens layer by layer. As a consequence, the output of
the network consists of the outputs of the units in the output layer.
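A minimal NumPy sketch of a feed-forward network trained with backpropagation (one hidden layer of sigmoid units learning XOR; this is an illustrative architecture, not the one in the missing diagrams):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Training tuples (XOR) assigned to the input layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights and biases initialized before training: input->hidden and hidden->output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for _ in range(20000):
    # Forward pass: each layer's output serves as the next layer's input.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: propagate the error from the output layer back to the hidden layer.
    delta_out = (output - y) * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)

    # Gradient-descent updates of weights and biases.
    W2 -= lr * hidden.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ delta_hidden
    b1 -= lr * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(output.ravel(), 2))   # should approach [0, 1, 1, 0]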
::: Classification by K-Nearest Neighbours (KNN)
Given a training set of classified data points and another set of data points (also called testing data), we allocate these points to a group by analyzing the
training set. Note that the unclassified points are marked as ‘White’.
Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups. Now, given an unclassified
point, we can assign it to a group by observing what group its nearest neighbors belong to. This means a point
close to a cluster of points classified as ‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the second point (5.5, 4.5)
should be classified as ‘Red’.
Algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[]. This means each element of this array represents a
tuple (x, y).
2. for i=0 to m:
3. Calculate Euclidean distance d(arr[i], p).
4. Make set S of K smallest distances obtained. Each of these distances corresponds to an already classified
data point.
5. Return the majority label among S.
K can be kept as an odd number so that we can calculate a clear majority in the case where only two groups are
possible (e.g. Red/Blue). With increasing K, we get smoother, more defined boundaries across different
classifications. Also, the accuracy of the above classifier increases as we increase the number of data points in
the training set.
Example Program
Assume 0 and 1 as the two classifiers (groups).
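A small stand-in program that follows the algorithm above, with k = 3 and made-up training points for the two groups:

from collections import Counter
from math import dist

# Training samples: (x, y) point -> group label (0 or 1).
train = [((1.0, 1.5), 0), ((2.0, 1.0), 0), ((1.5, 2.5), 0),
         ((5.0, 5.0), 1), ((6.0, 4.5), 1), ((5.5, 6.0), 1)]

def knn_classify(p, train, k=3):
    # Sort the training points by Euclidean distance to p and keep the k nearest.
    nearest = sorted(train, key=lambda item: dist(item[0], p))[:k]
    # Return the majority label among the k nearest neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print("The value classified to unknown point is", knn_classify((2.5, 2.0), train))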
Output:
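The value classified to unknown point is 0 (for the sketch above)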
7. Cluster Analysis
::: Basic Concepts and issues in clustering
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster
and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign
the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful
features that distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition
will represent a cluster, and k ≤ n. This means that the method classifies the data into k groups, which satisfy the following
requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to
another, as in the sketch below.
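k-means is the best-known partitioning method built on this iterative relocation idea; a minimal sketch with scikit-learn on made-up two-dimensional points:

from sklearn.cluster import KMeans

# Made-up 2-D objects to be partitioned into k = 2 clusters.
points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment of each object
print(km.cluster_centers_)  # final cluster centroids after iterative relocation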
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods
on the basis of how the hierarchical decomposition is formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It
keeps merging the objects or groups that are close to one another. It keeps doing so until all of the groups are
merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In
continuous iterations, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination
condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density
in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, a neighborhood of a given radius
has to contain at least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model. This method locates
the clusters by clustering the density function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on standard statistics, taking
outlier or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented constraints. A constraint
refers to the user expectation or the properties of desired clustering results. Constraints provide us with an interactive
way of communication with the clustering process. Constraints can be specified by the user or the application requirement.
Types of variables in cluster analysis:
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
First of all, let us know what types of data structures are widely used in cluster analysis.
We shall know the types of data that often occur in cluster analysis and how to preprocess them for such analysis.
Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries,
and so on.
Main memory-based clustering algorithms typically operate on either of the following two data structures.
Types of data structures in cluster analysis are
Data Matrix
This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age,
height, weight, gender, race and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p
variables)
The Data Matrix is often called a two-mode matrix since the rows and columns of this represent the different entities.
Dissimilarity Matrix
This stores a collection of proximities that are available for all pairs of the n objects. It is often represented by an n-by-n
table, where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a non-negative
number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they
differ. Since d(i, j) = d(j, i) and d(i, i) = 0, we have the matrix shown in the figure.
This is also called a one-mode matrix, since its rows and columns represent the same entity.
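Both structures are straightforward to build in code; a sketch with SciPy that derives the dissimilarity matrix from a small made-up data matrix, using Euclidean distance for d(i, j):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (rows) by p = 2 variables (columns).
data_matrix = np.array([[1.0, 2.0],
                        [2.0, 2.0],
                        [8.0, 9.0],
                        [9.0, 9.0]])

# Dissimilarity matrix: n-by-n, d(i, j) = Euclidean distance, with d(i, i) = 0.
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(np.round(dissimilarity_matrix, 2))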
Interval-Scaled Variables
Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and
weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure.
In general, expressing a variable in smaller units will lead to a larger range for that variable, and thus a larger effect on
the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing
measurements attempts to give all variables an equal weight.
This is especially useful when given no prior knowledge of the data. However, in some applications, users may
intentionally want to give more weight to a certain set of variables than to others.
For example, when clustering basketball player candidates, we may prefer to give more weight to the variable height.
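A small sketch of such standardization, computing z-scores with the mean absolute deviation (one common, outlier-tolerant choice) on made-up height and weight measurements:

import numpy as np

# Made-up measurements: height in metres and weight in kilograms.
data = np.array([[1.60, 55.0],
                 [1.75, 80.0],
                 [1.90, 95.0]])

# Standardize each variable f: z = (x - mean_f) / s_f,
# where s_f is the mean absolute deviation of variable f.
mean = data.mean(axis=0)
mad = np.abs(data - mean).mean(axis=0)
z_scores = (data - mean) / mad

print(np.round(z_scores, 2))   # both variables now carry comparable weight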
Binary Variables
For example, generally, gender variables can take 2 variables male and female.
Let p=a+b+c+d
Nominal (Categorical) Variables
A nominal variable is a generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
The dissimilarity between two objects i and j can be computed based on simple matching:
d(i, j) = (p − m) / p, where m is the number of matches (i.e., the number of variables for which i and j are in the same state) and p is the total number of variables.
Ordinal Variables
An ordinal variable resembles a nominal variable, except that its states are ordered in a meaningful sequence (e.g., ranks). It can be handled like an interval-scaled variable by replacing each value x_if with its rank r_if and mapping the rank onto [0, 1] via z_if = (r_if − 1) / (M_f − 1), where M_f is the number of ordered states.
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an exponential scale, such
as Ae^(Bt) or Ae^(-Bt).
Methods:
• First, treat them like interval-scaled variables: not a good choice! (why?)
• Then, apply a logarithmic transformation, i.e., y = log(x).
• Finally, treat them as continuous ordinal data and treat their ranks as interval-scaled.
Agglomerative hierarchical clustering proceeds as follows:
• Consider every data point as an individual cluster.
• Calculate the similarity of each cluster with all the other clusters (i.e., compute the proximity matrix).
• Merge the clusters that are highly similar or close to each other.
• Recalculate the proximity matrix for the new clusters.
• Repeat the previous two steps until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note:
This is just a demonstration of how the actual algorithm works; no calculation has been performed below, and all the
proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, F.
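A sketch of the same procedure using SciPy's hierarchical clustering routines, with made-up coordinates for the six points and single-link proximity as an arbitrary choice:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Made-up 2-D coordinates for the six data points.
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[1, 1], [1.5, 1.2], [5, 5], [5.5, 5.3], [9, 1], [9.2, 1.4]])

# Agglomerative clustering: every point starts as its own cluster, and the
# closest pair of clusters is merged repeatedly.
merge_history = linkage(points, method="single")

dendrogram(merge_history, labels=labels)   # graphical representation of the merges
plt.show()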
The computational complexity of most clustering algorithms is at least linearly proportional to the size of the data set. The great advantage of grid-based clustering is
its significant reduction of the computational complexity, especially for clustering very large data sets.
The grid-based clustering approach differs from the conventional clustering algorithms in that it is concerned not with the data points but with the value space that
surrounds the data points. In general, a typical grid-based clustering algorithm consists of the following five basic steps (Grabusts and Borisov, 2002):
1. Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying cluster centers.
5. Traversal of neighbor cells.
STING
Wang et al. (1997) proposed a STatistical INformation Grid-based clustering method (STING) to cluster spatial databases. The algorithm can be used to facilitate
several kinds of spatial queries. The spatial area is divided into rectangle cells, which are represented by a hierarchical structure. Let the root of the hierarchy be at
level 1, its children at level 2, etc. The number of layers could be obtained by changing the number of cells that form a higher-level cell. A cell in level i corresponds to
the union of the areas of its children in level i + 1. In the algorithm STING, each cell has 4 children and each child corresponds to one quadrant of the parent cell. Only
two-dimensional spatial space is considered in this algorithm. Some related work can be found in Wang et al. (1999b).