DWDM Final
UNIT -1
PART-A
1. How is a data warehouse different from a database? Identify the similarity.
A data warehouse is a repository of multiple heterogeneous data sources, organized
under a unified schema at a single site in order to facilitate management decision-making. A
relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values. Similarity: both are used to store and manipulate data.
2. Differentiate metadata and data mart.
Metadata: it is used for maintaining, managing and using the data warehouse.
Data mart: it is used for rapid delivery of enhanced decision support functionality to
end users.
3. Analyze why one of the biggest challenges when designing a data ware house is the
data placement and distribution strategy.
One of the biggest challenges when designing a data warehouse is the data placement
and distribution strategy. Data volumes continue to grow. Therefore, it becomes
necessary to know how the data should be divided across multiple servers and which users
should get access to which types of data. The data can be distributed based on the subject
area, location (geographical region), or time (current, month, year).
4. How would you evaluate the goals of data mining?
5. List the two ways the parallel execution of the tasks within SQL statements can be
done.
Intra-query parallelism can be done in two ways: horizontal parallelism, where the
database is partitioned across multiple disks and the same task runs concurrently on different
processors against different sets of data, and vertical parallelism, where different tasks such
as scan, join and sort run in parallel in a pipelined fashion.
6. What elements would you use to relate the design of data warehouse?
➢ Quality Screens.
➢ External Parameters File / Table.
➢ Team and Its responsibilities.
➢ Up to date data connectors to external sources.
➢ Consistent architecture between environments (development / UAT (user acceptance
testing) / production)
➢ Repository of DDLs and other script files (.SQL, Bash / PowerShell)
➢ Testing processes – unit tests, integration tests, regression tests
➢ Audit tables, monitoring and alerting of audit tables
➢ Known and described data lineage
7. Define Data mart
A data mart is an inexpensive alternative to the data warehouse, focused on a single
subject area. A data mart is used in the following situations:
➢ Extremely urgent user requirement.
➢ The absence of a budget for a full-scale data warehouse strategy.
➢ The decentralization of business needs.
BENEFITS:
➢ Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
➢ Data warehousing is an efficient way to manage and report on data that is from a
variety of sources, non-uniform and scattered throughout a company.
➢ Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
➢ Data warehousing provides the capability to analyze large amounts of historical data
for nuggets of wisdom that can provide an organization with competitive advantage.
11. Describe the alternate technologies used to improve the performance in data
warehouse environment
Alternate technologies include column-based indexing (for example, the bitwise indexes
of SYBASE IQ), column local storage, and DBMS support for complex data types such as
text, image, video and sound.
15. Point out the major differences between the star schema and the snowflake schema
The dimension tables of the snowflake schema model may be kept in normalized
form to reduce redundancies; such tables are easy to maintain and save storage space. In the
star schema, dimension tables are denormalized, which keeps queries simpler but introduces
redundancy.
PART – B
1.What is data warehouse? Give the Steps for design and construction of Data
Warehouses and explain with three tier architecture diagram.
DATA WAREHOUSE:
A data warehouse is a repository of multiple heterogeneous data sources organized
under a unified schema at a single site to facilitate management decision making. (or) A data
warehouse is a subject-oriented, time-variant and non-volatile collection of data in support of
management’s decision-making process.
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
Top - Down Approach (Suggested by Bill Inmon)
Bottom - Up Approach (Suggested by Ralph Kimball)
Top - Down Approach
In the top down approach suggested by Bill Inmon, we build a centralized repository
to house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level. The reason to build the EDW on
the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
Once the EDW is implemented, we start building subject-area-specific data marts which
contain data in a denormalized form, also called a star schema. The data in the marts are
usually summarized based on the end users' analytical requirements. The reason to
denormalize the data in the mart is to provide faster access for end-user analytics. If we
queried a normalized schema for the same analytics, we would end up with complex
multi-level joins that would be much slower than queries on the denormalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas' data warehouse
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository to
cater for one version of truth for business data. This is very important for the data to be
reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented followed by building
the data marts before which they can access their reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined. A Conformed fact has the same definition of
measures, same dimensions joined to it and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We don’t have to wait for
knowing the overall requirements of the warehouse. We should implement the bottom up
approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.
The advantage of using the Bottom Up approach is that it does not require high initial costs
and has a faster implementation time; hence the business can start using the marts much
earlier as compared to the top-down approach.
The disadvantage of using the Bottom Up approach is that it stores data in a denormalized
format; hence there would be high space usage for detailed data. There is also a tendency not
to keep detailed data in this approach, hence losing out on the advantage of having detail
data, i.e. the flexibility to easily cater to future requirements. The bottom up approach is
more realistic, but the complexity of the integration may become a serious obstacle.
3.(i) Draw the data warehouse architecture and explain its components
Overall Architecture
• The data warehouse architecture is based on the data base management system server.
• The central information repository is surrounded by a number of key components
• Data warehouse is an environment, not a product which is based on relational
database management system that functions as the central repository for informational
data.
• The data entered into the data warehouse is transformed into an integrated structure and
format. The transformation process involves conversion, summarization, filtering and
condensation.
• The data warehouse must be capable of holding and managing large volumes of data
as well as different data structures over time.
Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system
Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two: technical metadata and business metadata.
Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info
Warehouse/database technology
Data warehouse database
This is the central part of the data warehousing environment. This is implemented
based on RDBMS technology.
Data marts
A data mart is an inexpensive alternative to the data warehouse, focused on a single
subject area. A data mart is used in the following situations:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs
1.MOLAP:
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats. That is, data stored in array-based structures.
Advantages:
✓ Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
✓ Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they return
quickly.
Disadvantages:
✓ Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large
amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
✓ Requires additional investment: Cube technologies are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and capital resources are needed.
Examples:
Hyperion Essbase, Fusion (Information Builders)
ROLAP:
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
The data is stored in relational tables.
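As a minimal sketch of how a ROLAP slice-and-dice maps onto SQL (the sales_fact and
time_dim tables and their columns are hypothetical, not from any specific product), the slice
and dice become WHERE predicates and the aggregation becomes a GROUP BY:

-- Hypothetical ROLAP query: slice on year 2023, dice on two regions,
-- and aggregate dollars sold by quarter and region.
SELECT t.quarter,
       s.region,
       SUM(s.dollars_sold) AS total_sales
FROM   sales_fact s
       JOIN time_dim t ON s.time_key = t.time_key
WHERE  t.year = 2023                     -- the "slice" on the Time dimension
  AND  s.region IN ('North', 'South')    -- the "dice" on the Region attribute
GROUP BY t.quarter, s.region
ORDER BY t.quarter, s.region;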
Advantages:
✓ Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
✓ Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
✓ Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
✓ Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements do
not fit all needs (for example, it is difficult to perform complex calculations using
SQL), ROLAP technologies are therefore traditionally limited by what SQL can do.
ROLAP vendors have mitigated this risk by building out-of-the-box complex functions
into the tool, as well as the ability to allow users to define their own functions.
Examples:
MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
The functions of the data warehouse are based on relational database technology, and the
relational database technology is implemented in a parallel manner. There are two advantages
of having parallel relational database technology for the data warehouse: linear speed-up
(queries run faster as more processing resources are added) and linear scale-up (query time
stays roughly constant as data volume and resources grow proportionally).
Types of parallelism
There are two types of parallelism:
➢ Inter query Parallelism:
In which different server threads or processes handle multiple requests
at the same time.
➢ Intra query Parallelism:
This form of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort etc. Then these lower-level operations are executed
concurrently in parallel.
Intra query parallelism can be done in either of two ways:
• Horizontal parallelism:
which means that the data base is partitioned across multiple disks and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different set of data.
• Vertical parallelism:
This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an
output from one task becomes an input into another task.
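As an illustrative sketch only (Oracle-style hint syntax; the table name and degree of
parallelism are assumptions, not prescribed by this text), intra-query parallelism is usually
requested declaratively and the engine then splits the scan, sort and aggregation work across
processes:

-- Ask the optimizer to scan and aggregate the hypothetical sales_fact table
-- with a degree of parallelism of 4: the scan is split horizontally across
-- partitions, and scan, sort and aggregation are pipelined vertically.
SELECT /*+ PARALLEL(s, 4) */
       product_key,
       SUM(dollars_sold) AS total_sales
FROM   sales_fact s
GROUP BY product_key;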
Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.
➢ Random partitioning
It includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the data base.
➢ Intelligent partitioning
It assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent
partitioning options include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row (see the DDL sketch after this list)
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K
are in partition 1, L to T are in partition 2 and so on.
• Schema partitioning:
An entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
• User-defined partitioning:
It allows a table to be partitioned on the basis of a user-defined
expression.
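The hash and key-range schemes above can be sketched with PostgreSQL-style declarative
partitioning (table and column names are hypothetical; other DBMSs use different but
analogous syntax):

-- Hash partitioning: each row's partition is computed from a hash of its key.
CREATE TABLE sales_fact (
    sale_id      BIGINT,
    customer_key INT,
    sale_date    DATE,
    dollars_sold NUMERIC
) PARTITION BY HASH (customer_key);

CREATE TABLE sales_fact_p0 PARTITION OF sales_fact FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sales_fact_p1 PARTITION OF sales_fact FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sales_fact_p2 PARTITION OF sales_fact FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sales_fact_p3 PARTITION OF sales_fact FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Key-range partitioning: rows are placed according to the value of the key.
CREATE TABLE orders (
    order_id   BIGINT,
    order_date DATE
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');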
(ii) Describe in detail on data warehouse Metadata
METADATA:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
➢ It is classified into two:
✓ Technical Meta data
✓ Business Meta data
Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
➢ Top - Down Approach (Suggested by Bill Inmon)
➢ Bottom - Up Approach (Suggested by Ralph Kimball)
The reason to build the EDW on the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important for the
data to be reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and
initial investment. The business has to wait for the EDW to be implemented followed by
building the data marts before which they can access their reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined.
A Conformed fact has the same definition of measures, same dimensions joined to it
and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing
and integrating data marts as and when the requirements are clear. We don’t have to wait for
knowing the overall requirements of the warehouse.
The advantage of using the Bottom Up approach is that it does not require high initial
costs and has a faster implementation time; hence the business can start using the marts
much earlier as compared to the top-down approach.
The disadvantage of using the Bottom Up approach is that it stores data in the
denormalized format; hence there would be high space usage for detailed data. We have a
tendency of not keeping detailed data in this approach hence losing out on advantage of
having detail data i.e. flexibility to easily cater to future requirements. Bottom up approach is
more realistic but the complexity of the integration may become a serious obstacle.
(ii) Analyze the information needed to support DBMS schemas for Decision support.
Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
OLAP Tools:
These are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties, they use MDDB and MRDB, where MDDB refers to a
multidimensional database and MRDB refers to a multirelational database.
Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system
Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two: technical metadata and business metadata.
Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info
Warehouse/database technology
Data warehouse database
This is the central part of the data warehousing environment. This is implemented
based on RDBMS technology.
Data marts
A data mart is an inexpensive alternative to the data warehouse, focused on a single
subject area. A data mart is used in the following situations:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs
7.(i) Discuss the different types of data repositories on which mining can be performed?
A data repository, often called a data archive or library, is a generic term that
refers to a segmented data set used for reporting or analysis. It’s a huge database
infrastructure that gathers, manages, and stores varying data sets for analysis, distribution,
and reporting.
Some common types of data repositories include:
➢ Data Warehouse
➢ Data Lake
➢ Data Mart
➢ Metadata Repository
➢ Data Cube
Data Warehouse
A data warehouse is a large data repository that brings together data from several
sources or business segments. The stored data is generally used for reporting and analysis to
help users make critical business decisions. In a broader perspective, a data warehouse offers
a consolidated view of either a physical or logical data repository gathered from numerous
systems. The main objective of a data warehouse is to establish a connection between data
from current systems: for example, product catalogue data stored in one system and
procurement orders for a client stored in another.
Data Lake
A data lake is a unified data repository that allows you to store structured, semi-
structured, and unstructured enterprise data at any scale. Data can be in raw form and used for
different tasks like reporting, visualizations, advanced analytics, and machine learning.
Data Mart:
A data mart is a subset of the data repository focused on a single subject area; it is an
inexpensive alternative to a full-scale data warehouse.
Metadata Repositories:
Metadata incorporates information about the structures that include the actual data.
Metadata repositories contain information about the data model that store and share this data.
They describe where the source of data is, how it was collected, and what it signifies. It may
define the arrangement of any data or subject deposited in any format. For businesses,
metadata repositories are essential in helping people understand administrative changes, as
they contain detailed information about the data.
Data Cubes:
Data cubes are lists of data with multiple dimensions (usually three or more)
stored as a table. They are used to describe the time sequence of data and help
assess gathered data from a range of standpoints. Each dimension of a data cube signifies
specific characteristics of the database such as day-to-day, monthly or annual sales. The data
contained within a data cube allows you to analyze all the information for almost any or all
clients, sales representatives, products, and more. Consequently, a data cube can help you
identify trends and scrutinize business performance.
Benefits include a decrease in production cost, reduced redundant processing, enhanced
customer relations, improvement in the selection of target markets, and increased customer
satisfaction.
Data extraction is the process of collecting or retrieving disparate types of data from a
variety of sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process, and refine data so that it can be stored in
a centralized location in order to be transformed. These locations may be on-site, cloud-
based, or a hybrid of the two. Data extraction is the first step in both ETL (extract, transform,
load) and ELT (extract, load, transform) processes. ETL/ELT are themselves part of a
complete data integration strategy. Proper attention must be paid to data extraction, which
represents a success factor for a data warehouse architecture. When implementing a data
warehouse, the following selection criteria, which affect the ability to transform, consolidate,
integrate and repair the data, should be considered:
➢ Timeliness of data delivery to the warehouse
➢ The tool must have the ability to identify the particular data so that it can be
read by the conversion tool
➢ The tool must support flat files and indexed files, since corporate data is still
maintained in these formats
➢ The tool must have the capability to merge data from multiple data stores
➢ The tool should have specification interface to indicate the data to be extracted
➢ The tool should have the ability to read data from data dictionary
➢ The code generated by the tool should be completely maintainable
➢ The tool should permit the user to extract the required data
➢ The tool must have the facility to perform data type and character set
translation
➢ The tool must have the capability to create summarization, aggregation and
derivation of records
➢ The data warehouse database system must be able to load data directly from
these tools
Index types:
The first release of SYBASE IQ provides five index techniques. Most users apply two
indexes to every column: the default index, called a projection index, and either a low- or
high-cardinality index. For low-cardinality data SYBASE IQ provides:
➢ Low Fast index: it is optimized for queries involving scalar functions like SUM,
AVERAGE and COUNT.
➢ Low Disk index: it is optimized for disk space utilization at the cost of being more
CPU intensive.
Performance:
SYBASE IQ technology achieves very good performance on ad hoc queries for
several reasons:
➢ Bitwise technology: this allows various data types in a query and supports fast
data aggregation and grouping.
➢ Compression: SYBASE IQ uses sophisticated algorithms to compress data into
bitmaps.
➢ Optimized memory-based processing: SYBASE IQ caches data columns in memory
according to the nature of users' queries, which speeds up processing.
➢ Column-wise processing: SYBASE IQ scans columns, not rows, which reduces the
amount of data the engine has to search.
➢ Low overhead: as an engine optimized for decision support, SYBASE IQ does not
carry the overhead associated with an OLTP-designed RDBMS.
➢ Large block I/O: the block size of SYBASE IQ can be tuned from 512 bytes to 64
Kbytes, so the system can read as much information as necessary in a single I/O.
➢ Operating system-level parallelism: SYBASE IQ breaks low-level operations such
as sorts, bitmap manipulation, load and I/O into non-blocking operations.
➢ Projection and ad hoc join capabilities: SYBASE IQ allows users to take advantage
of known join relationships between tables by defining them in advance and
building indexes between tables.
Shortcomings of indexing:
The shortcomings the user should be aware of when choosing to use SYBASE IQ include:
➢ No updates: SYBASE IQ does not support updates; users have to update the source
database and then load the updated data into SYBASE IQ on a periodic basis.
➢ Lack of core RDBMS features: it does not support all the robust features of SYBASE
SQL Server, such as backup and recovery.
➢ Less advantage for planned queries: SYBASE IQ's strength is ad hoc queries; it offers
less benefit for preplanned queries.
➢ High memory usage: it trades high memory usage for avoiding expensive I/O operations.
Column local storage:
➢ It is another approach to improve query performance in the data warehousing
environment.
➢ For example, Thinking Machines Corporation has developed an innovative data layout
solution that improves RDBMS query performance many times. Implemented in its
CM_SQL RDBMS product, this approach is based on storing data column-wise, as
opposed to the traditional row-wise approach.
➢ The row-wise approach works well for an OLTP environment in which a typical
transaction accesses a record at a time. However, in data warehousing the goal is to
retrieve multiple values of several columns.
➢ For example, if the problem is to calculate the average, minimum and maximum
salary, the column-wise storage of the salary field requires the DBMS to read only
the salary column rather than entire records (see the sketch below).
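A modern analogue of this column-wise layout can be sketched with SQL Server-style
columnstore syntax (used purely for illustration; this is not the CM_SQL product described
above, and the employee table is hypothetical):

-- Store the employee table column-wise instead of row-wise.
CREATE CLUSTERED COLUMNSTORE INDEX cci_employee ON employee;

-- The aggregate below now reads only the salary column segments,
-- not entire rows.
SELECT AVG(salary) AS avg_salary,
       MIN(salary) AS min_salary,
       MAX(salary) AS max_salary
FROM   employee;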
Complex data types:
➢ The DBMS architecture for data warehousing has traditionally been limited to
alphanumeric data types, but data management increasingly needs to support
complex data types, including text, image, full-motion video and sound.
➢ Large data objects are called binary large objects (BLOBs). What is required by
business is much more than just storage:
➢ the ability to retrieve a complex data type such as an image by its content, the
ability to compare the content of one image to another in order to make a rapid
business decision, and the ability to express all of this in a single SQL statement.
➢ The modern data warehouse DBMS has to be able to efficiently store, access and
manipulate complex data. The DBMS has to be able to define not only new data
structures but also new functions that manipulate them, and often new access
methods, to provide fast access to the data.
➢ An example of the advantage of handling complex data types is an insurance
company that wants to predict its financial exposure during a catastrophe such as a
flood, which requires support for such complex data.
11.(i) What is data Pre-processing? Explain the various data pre-processing techniques.
1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing from the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies, to provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.
3.Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0
or 0.0 to 1.0 (see the SQL sketch after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
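A minimal SQL sketch of min-max normalization into the range 0.0 to 1.0, using standard
window functions over a hypothetical employee table (the table and column names are
assumptions):

-- Min-max normalization of salary into [0, 1].
-- NULLIF guards against division by zero when every salary is identical.
SELECT emp_id,
       salary,
       (salary - MIN(salary) OVER ()) * 1.0
         / NULLIF(MAX(salary) OVER () - MIN(salary) OVER (), 0) AS salary_scaled
FROM   employee;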
4. Data Reduction:
Data mining is a technique used to handle huge amounts of data; when working
with huge volumes of data, analysis becomes harder. To address this, we use data reduction
techniques, which aim to increase storage efficiency and reduce data storage and analysis
costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct the data cube (see the
GROUP BY CUBE sketch after this list).
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
The two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
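For the data cube aggregation step above, a hedged SQL sketch: the CUBE grouping
operator, supported by most warehouse DBMSs, materializes the aggregate cells for every
combination of the listed dimensions of a hypothetical sales table:

-- Build all aggregate combinations of (year, region, product) in one pass;
-- NULLs in the output mark the rolled-up ("all") value of a dimension.
SELECT year,
       region,
       product,
       SUM(dollars_sold) AS total_sales
FROM   sales
GROUP BY CUBE (year, region, product);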
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple:
This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective
unless the tuple contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies considerably.
14. (i) Generalize the potential performance problems with star schema.
Potential performance problem with star schemas
1. Indexing
➢ Indexing improves performance in the star schema design.
➢ A dimension table in the star schema contains an entire hierarchy of attributes (for the
PERIOD dimension this hierarchy could be day -> week -> month -> quarter -> year).
One approach is to create a multipart key of day, week, month, quarter and year, but
this presents some problems in the star schema model:
Problems:
1. It requires multiple metadata definitions
2. Since the fact table must carry all key components as part of its primary key,
addition or deletion of levels requires physical modification of the affected table.
3. Carrying all the segments of the compound dimensional key in the fact table
increases the size of the index, thus impacting both performance and scalability.
Solutions:
1. One alternative to the compound key is to concatenate the keys into a single key for
the attributes (day, week, month, quarter, year); this solves the first two problems above.
2. The index size remains a problem. The best approach is to drop the use of meaningful
keys in favour of an artificial, generated (surrogate) key, which is the smallest possible
key that will ensure the uniqueness of each record (see the sketch below).
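A small sketch of the artificial (surrogate) key solution, with hypothetical table and column
names: the PERIOD dimension keeps the full hierarchy as ordinary attributes, while the fact
table carries only one small generated key:

-- PERIOD dimension: one compact surrogate key replaces the compound
-- (day, week, month, quarter, year) key.
CREATE TABLE period_dim (
    period_key INT PRIMARY KEY,   -- artificial, generated key
    day        DATE,
    week       INT,
    month      INT,
    quarter    INT,
    year       INT
);

CREATE TABLE sales_fact (
    period_key   INT REFERENCES period_dim (period_key),
    product_key  INT,
    dollars_sold NUMERIC
);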
2.Level indicator
Problems
1. Another potential problem with the star schema design is that, in order to navigate
the dimensions successfully, the dimension table design includes a level-of-hierarchy
indicator for every record.
2. Every query that retrieves detail records from a table that stores both details and
aggregates must use this indicator as an additional constraint to obtain a correct
result.
Solutions:
1. The best alternative to using the level indicator is the snowflake schema.
2. The snowflake schema contains separate fact tables for each level of aggregation, so
it is impossible to mistakenly mix detail and aggregate records. However, the
snowflake schema is even more complicated than a star schema.
(ii) Design and discuss about the star and snowflake schema models of a Data
warehouse.
STAR SCHEMA:
The multidimensional view of data that is expressed using relational data base
semantics is provided by the database schema design called the star schema. The basic
premise of the star schema is that information can be classified into two groups:
➢ Facts
➢ Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
1. Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts)
and a multipart index composed of foreign keys from the primary keys of related dimension
tables. A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain numeric facts. A fact table can contain facts at a detail or
aggregated level.
2. Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter,
year), Region dimension (profit by country, state, city), Product dimension (profit for
product1, product2). A dimension is a structure usually composed of one or more
hierarchies that categorizes data. If a dimension does not have hierarchies and levels, it is
called a flat dimension or list. The primary keys of each of the dimension tables are part of
the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.
3. Measures:
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents profit on each sale.
Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
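As a hedged sketch of this decomposition (table and column names are assumptions), a
denormalized star dimension and its snowflaked equivalent might look like this:

-- Star version: one denormalized LOCATION dimension.
CREATE TABLE location_dim (
    location_key INT PRIMARY KEY,
    city         VARCHAR(50),
    state        VARCHAR(50),
    country      VARCHAR(50)
);

-- Snowflake version: the city -> state -> country hierarchy is split into
-- normalized tables, reducing redundancy at the cost of extra joins.
CREATE TABLE country_dim (
    country_key INT PRIMARY KEY,
    country     VARCHAR(50)
);

CREATE TABLE state_dim (
    state_key   INT PRIMARY KEY,
    state       VARCHAR(50),
    country_key INT REFERENCES country_dim (country_key)
);

CREATE TABLE location_dim_snowflaked (
    location_key INT PRIMARY KEY,
    city         VARCHAR(50),
    state_key    INT REFERENCES state_dim (state_key)
);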
PART – C
1. Explain mapping data warehouse with multiprocessor architecture with the concept of
parallelism and data partitioning
The functions of the data warehouse are based on relational database technology, and the
relational database technology is implemented in a parallel manner. There are two advantages
of having parallel relational database technology for the data warehouse: linear speed-up
(queries run faster as more processing resources are added) and linear scale-up (query time
stays roughly constant as data volume and resources grow proportionally).
Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.
➢ Random partitioning
It includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the data base.
➢ Intelligent partitioning
It assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent
partitioning options include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K
are in partition 1, L to T are in partition 2 and so on.
• Schema partitioning:
An entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
• User-defined partitioning:
It allows a table to be partitioned on the basis of a user-defined
expression.
2. Design a star schema, snowflake schema and fact constellation schema for a data
warehouse that consists of the following four dimensions (Time, Item, Branch and
Location). Include the appropriate measures required for the schema.
STAR SCHEMA:
• In a Star schema, there is only one fact table and multiple dimension tables.
• In a Star schema, each dimension is represented by one-dimension table.
• Dimension tables are not normalized in a Star schema.
• Each dimension table is joined to a key in the fact table.
There is a fact table at the center. It contains the keys to each of four dimensions. The
fact table also contains the attributes, namely dollars sold and units sold.
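A hedged DDL sketch of this star schema (attribute names inside the dimension tables,
beyond the four stated dimensions and the two stated measures, are assumptions):

-- Four dimension tables arranged around one central fact table.
CREATE TABLE time_dim (
    time_key INT PRIMARY KEY,
    day      DATE,
    month    INT,
    quarter  INT,
    year     INT
);

CREATE TABLE item_dim (
    item_key  INT PRIMARY KEY,
    item_name VARCHAR(50),
    brand     VARCHAR(50)
);

CREATE TABLE branch_dim (
    branch_key  INT PRIMARY KEY,
    branch_name VARCHAR(50),
    branch_type VARCHAR(30)
);

CREATE TABLE location_dim (
    location_key INT PRIMARY KEY,
    city         VARCHAR(50),
    state        VARCHAR(50),
    country      VARCHAR(50)
);

-- Central sales fact table: a key to each dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INT REFERENCES time_dim (time_key),
    item_key     INT REFERENCES item_dim (item_key),
    branch_key   INT REFERENCES branch_dim (branch_key),
    location_key INT REFERENCES location_dim (location_key),
    dollars_sold NUMERIC,
    units_sold   INT
);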
SNOWFLAKE SCHEMA:
Some dimension tables in the Snowflake schema are normalized. The normalization
splits up the data into additional tables. Unlike in the Star schema, the dimension tables in a
Snowflake schema are normalized. Due to the normalization in the Snowflake schema, the
redundancy is reduced; therefore, the tables become easier to maintain and save storage space.
A fact constellation has multiple fact tables. It is also known as a Galaxy Schema. The
sales fact table is the same as that in the Star Schema. The shipping fact table has five
dimensions, namely item_key, time_key, shipper_key, from_location, to_location. The
shipping fact table also contains two measures, namely dollars sold and units sold. It is also
possible to share dimension tables between fact tables.
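Extending the star schema sketch above into a fact constellation (galaxy) schema, a second
shipping fact table can share the time, item and location dimensions already defined; the
shipper dimension and the from/to location keys follow the description in the text:

-- Second fact table sharing dimensions with sales_fact.
CREATE TABLE shipper_dim (
    shipper_key  INT PRIMARY KEY,
    shipper_name VARCHAR(50),
    shipper_type VARCHAR(30)
);

CREATE TABLE shipping_fact (
    item_key      INT REFERENCES item_dim (item_key),
    time_key      INT REFERENCES time_dim (time_key),
    shipper_key   INT REFERENCES shipper_dim (shipper_key),
    from_location INT REFERENCES location_dim (location_key),
    to_location   INT REFERENCES location_dim (location_key),
    dollars_sold  NUMERIC,   -- measures named in the text
    units_sold    INT
);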
The need for data pre-processing arises from the fact that real-world data, and in many
cases the data in the database, is often incomplete and inconsistent, which may result in
improper and inaccurate data mining results. Thus, to improve the quality of the data on
which the observation and analysis are to be done, it is treated with these four steps of data
pre-processing. The more the data is improved, the more accurate the observation and
prediction will be.
Techniques:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies, to provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.
3.Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.
4. Data Reduction:
Data mining is a technique used to handle huge amounts of data; when working
with huge volumes of data, analysis becomes harder. To address this, we use data reduction
techniques, which aim to increase storage efficiency and reduce data storage and analysis
costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
The two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
(ii) Explain the various methods of data cleaning and data reduction technique
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
Data Reduction:
Data mining is a technique used to handle huge amounts of data; when working
with huge volumes of data, analysis becomes harder. To address this, we use data reduction
techniques, which aim to increase storage efficiency and reduce data storage and analysis
costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
The two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
4.(i) Compare the similarities and differences between the database and data warehouse
DIFFERENCE:
➢ A database (OLTP system) supports day-to-day transaction processing, while a data
warehouse is a subject-oriented, time-variant and non-volatile collection of data that
supports management's decision making (OLAP).
➢ Database queries are typically short and update-intensive, while data warehouse
queries are complex, read-mostly and run over large volumes of historical data.
SIMILARITY:
➢ Both the database and the data warehouse are used for storing data; they are data
storage systems.
➢ Generally, the data warehouse bottom tier is a relational database system.
Databases are also relational database systems. Relational DB systems consist of
tables with rows and columns and hold large amounts of data.
➢ The DW and databases support multi-user access. A single instance of
database and data warehouse can be accessed by many users at a time.
➢ Both DW and database require queries for accessing the data. The Data
warehouse can be accessed using complex queries while OLTP database can
be accessed by simpler queries.
➢ The database and data warehouse servers can be present on the company
premise or on the cloud.
➢ A data warehouse is also a database.
It is used for maintaining, managing and They are used for rapid delivery of
using the data warehouse enhanced decision support functionality to
end users.
3. Analyze why one of the biggest challenges when designing a data ware house is the
data placement and distribution strategy.
One of the biggest challenges when designing a data warehouse is the data placement
and distribution strategy. Data volumes continue to grow in nature. Therefore, it becomes
necessary to know how the data should be divided across multiple servers and which users
should get access to which types of data. The data can be distributed based on the subject
area, location (geographical region), or time (current, month, year).
4. How would you evaluate the goals of data mining?
5. List the two ways the parallel execution of the tasks within SQL statements can be
done.
6. What elements would you use to relate the design of data warehouse?
➢ Quality Screens.
➢ External Parameters File / Table.
➢ Team and Its responsibilities.
➢ Up to date data connectors to external sources.
➢ Consistent architecture between environments (development / uat (user – acceptance –
testing / production)
➢ Repository of DDL’s and other script files (.SQL, Bash / Powershell)
➢ Testing processes – unit tests, integration tests, regression tests
➢ Audit tables, monitoring and alerting of audit tables
➢ Known and described data lineage
7. Define Data mart
It is inexpensive tool and alternative to the data ware house. it based on the subject
area Data mart is used in the following situation:
➢ Extremely urgent user requirement.
➢ The absence of a budget for a full-scale data warehouse strategy.
➢ The decentralization of business needs.
BENEFITS:
➢ Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
➢ Data warehousing is an efficient way to manage and report on data that is from a
variety of sources, non-uniform and scattered throughout a company.
➢ Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
➢ Data warehousing provides the capability to analyze large amounts of historical data
for nuggets of wisdom that can provide an organization with competitive advantage.
11. Describe the alternate technologies used to improve the performance in data
warehouse environment
15. Point out the major differences between the star schema and the snowflake schema
The dimension table of the snowflake schema model may be kept in normalized
form to reduce redundancies. Such a table is easy to maintain and saves storage space.
PART – B
1.What is data warehouse? Give the Steps for design and construction of Data
Warehouses and explain with three tier architecture diagram.
DATA WAREHOUSE:
A data warehouse is a repository of multiple heterogeneous data sources organized
under a unified schema at a single site to facilitate management decision making. (or)A data
warehouse is a subject-oriented, time-variant and non-volatile collection of data in support of
management’s decision-making process.
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
Top - Down Approach (Suggested by Bill Inmon)
Bottom - Up Approach (Suggested by Ralph Kimball)
Top - Down Approach
In the top down approach suggested by Bill Inmon, we build a centralized repository
to house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level. The reason to build the EDW on
the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
Once the EDW is implemented we start building subject area specific data marts which
contain data in a de normalized form also called star schema. The data in the marts are
usually summarized based on the end users analytical requirements. The reason to de
normalize the data in the mart is to provide faster access to the data for the end users
analytics. If we were to have queried a normalized schema for the same analytics, we
would end up in a complex multiple level joins that would be much slower as compared to
the one on the de normalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas data warehosue
requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the Top Down approach is that we build a centralized repository to
cater for one version of truth for business data. This is very important for the data to be
reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented followed by building
the data marts before which they can access their reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined. A Conformed fact has the same definition of
measures, same dimensions joined to it and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We don’t have to wait for
knowing the overall requirements of the warehouse. We should implement the bottom up
approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.
The advantage of using the Bottom Up approach is that they do not require high initial costs
and have a faster implementation time; hence the business can start using the marts much
earlier as compared to the top-down approach.
The disadvantages of using the Bottom Up approach is that it stores data in the de normalized
format, hence there would be high space usage for detailed data. We have a tendency of not
keeping detailed data in this approach hence loosing out on advantage of having detail data
.i.e.
flexibility to easily cater to future requirements. Bottom up approach is more realistic but the
complexity of the integration may become a serious obstacle.
3.(i) Draw the data warehouse architecture and explain its components
Overall Architecture
• The data warehouse architecture is based on the data base management system server.
• The central information repository is surrounded by number of key components
• Data warehouse is an environment, not a product which is based on relational
database management system that functions as the central repository for informational
data.
• The data entered into the data warehouse transformed into an integrated structure and
format. The transformation process involves conversion, summarization, filtering and
condensation.
• The data warehouse must be capable of holding and managing large volumes of data
as well as different structure of data structures over the time.
Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system
Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two:
Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info
Warehouse/database technology
Data ware house database
This is the central part of the data ware housing environment. This is implemented
based on RDBMS technology.
Data marts
It is inexpensive tool and alternative to the data ware house. it based on the subject
area Data mart is used in the following situation:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs
1.MOLAP:
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats. That is, data stored in array-based structures.
Advantages:
✓ Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
✓ Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they return
quickly.
Disadvantages:
✓ Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large
amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
✓ Requires additional investment: Cube technology are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and capital resources are needed.
Examples:
Hyperion Essbase, Fusion (Information Builders)
ROLAP:
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Data stored in relational tables
Advantages:
✓ Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
✓ Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
✓ Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
✓ Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements do
not fit all needs (for example, it is difficult to perform complex calculations using
SQL), ROLAP technologies are therefore traditionally limited by what SQL can do.
ROLAP vendors have mitigated this risk by building out-of-the-box complex
functions into the tool, as well as the ability to allow users to define their own functions.
Examples:
MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
The functions of a data warehouse are based on relational database technology, and that
relational database technology is implemented in a parallel manner. There are two advantages of
having parallel relational database technology for a data warehouse: linear speed-up (adding
resources reduces the time taken for the same workload) and linear scale-up (performance is
maintained as both the data volume and the resources grow proportionally).
Types of parallelism
There are two types of parallelism:
➢ Inter-query parallelism:
Different server threads or processes handle multiple requests
at the same time.
➢ Intra-query parallelism:
This form of parallelism decomposes a serial SQL query into lower-level
operations such as scan, join and sort. These lower-level operations are then executed
concurrently, in parallel.
Intra-query parallelism can be done in either of two ways:
• Horizontal parallelism:
The database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different sets of data.
• Vertical parallelism:
This occurs among different tasks. All query components such as scan,
join and sort are executed in parallel in a pipelined fashion. In other words, the
output from one task becomes the input to another task.
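A minimal sketch of horizontal parallelism, assuming Python's standard concurrent.futures module and an invented in-memory "partitioned" data set: the same low-level scan operation runs concurrently against each partition.

# Minimal sketch of horizontal (partitioned) parallelism: the same scan
# operation runs concurrently against different partitions of the data.
# The data and the filter are purely illustrative.
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [("A", 120), ("B", 340)],          # partition on disk 1
    [("C", 80),  ("D", 510)],          # partition on disk 2
    [("E", 230), ("F", 95)],           # partition on disk 3
]

def scan(partition, min_amount=100):
    # low-level "scan" operation applied to one partition
    return [row for row in partition if row[1] >= min_amount]

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    results = pool.map(scan, partitions)

print([row for part in results for row in part])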
Data partitioning:
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.
➢ Random partitioning
This includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the database.
➢ Intelligent partitioning
This assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent partitioning
techniques include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row.
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key; that is, all rows with key values from A to K
are in partition 1, L to T are in partition 2, and so on.
• Schema partitioning:
An entire table is placed on one disk, another table is placed on a
different disk, and so on. This is useful for small reference tables.
• User-defined partitioning:
This allows a table to be partitioned on the basis of a user-defined
expression.
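A minimal Python sketch of how a hash partition number and a key-range partition number might be computed; the key column, partition counts and key ranges are illustrative assumptions.

# Minimal sketch of two intelligent partitioning schemes. The key values,
# partition count, and key ranges are illustrative assumptions.
def hash_partition(key, num_partitions=4):
    # hash algorithm decides the partition number from the partitioning key
    return hash(key) % num_partitions

def key_range_partition(key):
    # rows with keys A-K go to partition 1, L-T to partition 2, the rest to 3
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

for customer in ["Anita", "Kiran", "Meena", "Zoya"]:
    print(customer, hash_partition(customer), key_range_partition(customer))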
(ii) Describe in detail on data warehouse Metadata
METADATA:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
➢ It is classified into two:
✓ Technical Meta data
✓ Business Meta data
Metadata helps users to understand the content and find the data. Metadata is stored in
a separate data store known as the information directory or metadata repository, which
helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of the information directory / metadata:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business-oriented keywords
➢ It should act as a launch platform for end users to access data and analysis tools
➢ It should support the sharing of information
There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
➢ Top - Down Approach (Suggested by Bill Inmon)
➢ Bottom - Up Approach (Suggested by Ralph Kimball)
The reason to build the EDW at the most detailed level is to provide:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important for the
data to be reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and a
higher initial investment. The business has to wait for the EDW to be implemented and the
data marts to be built on top of it before it can access its reports.
Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact are ones that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. A conformed dimension means exactly the
same thing with every fact table to which it is joined.
A Conformed fact has the same definition of measures, same dimensions joined to it
and at the same granularity across data marts.
The bottom-up approach helps us incrementally build the warehouse by developing
and integrating data marts as and when the requirements are clear. We do not have to wait
until the overall requirements of the warehouse are known.
The advantage of using the Bottom Up approach is that it does not require high initial
costs and has a faster implementation time; hence the business can start using the marts
much earlier as compared to the top-down approach.
The disadvantages of using the Bottom Up approach are that it stores data in a
denormalized format, so there is high space usage for detailed data, and there is a
tendency not to keep detailed data in this approach, thereby losing the advantage of
having detail data, i.e., the flexibility to easily cater to future requirements. The bottom-up
approach is more realistic, but the complexity of the integration may become a serious obstacle.
(ii) Analyze the information needed to support DBMS schemas for Decision support.
Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension
tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
OLAP Tools:
These are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties, they use MDDBs and MRDBs, where MDDB refers to a
multidimensional database and MRDB refers to a multirelational database.
Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system
Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two:
✓ Technical metadata
✓ Business metadata
Metadata helps users to understand the content and find the data. Metadata is stored in
a separate data store known as the information directory or metadata repository, which
helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of the information directory / metadata:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business-oriented keywords
➢ It should act as a launch platform for end users to access data and analysis tools
➢ It should support the sharing of information
Warehouse/database technology
Data warehouse database
This is the central part of the data warehousing environment. It is implemented
based on RDBMS technology.
Data marts
A data mart is an inexpensive alternative to the data warehouse and is based on a
single subject area. A data mart is used in the following situations:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full-scale data warehouse strategy
➢ The decentralization of business needs
7.(i) Discuss the different types of data repositories on which mining can be performed?
A data repository, often called a data archive or library, is a general term that
refers to a segmented data set used for reporting or analysis. It is a large database
infrastructure that gathers, manages, and stores varying data sets for analysis, distribution,
and reporting.
Some common types of data repositories include:
➢ Data Warehouse
➢ Data Lake
➢ Data Mart
➢ Metadata Repository
➢ Data Cube
Data Warehouse
A data warehouse is a large data repository that brings together data from several
sources or business segments. The stored data is generally used for reporting and analysis to
help users make critical business decisions. In a broader perspective, a data warehouse offers
a consolidated view of either a physical or logical data repository gathered from numerous
systems. The main objective of a data warehouse is to establish a connection between data
from current systems. For example, product catalogue data stored in one system and
procurement orders for a client stored in another one.
Data Lake
A data lake is a unified data repository that allows you to store structured, semi-
structured, and unstructured enterprise data at any scale. Data can be in raw form and used for
different tasks like reporting, visualizations, advanced analytics, and machine learning.
Data Mart:
A data mart is a subject-oriented subset of the data warehouse that serves the needs of
a specific department, business unit, or group of users.
Metadata Repositories:
Metadata incorporates information about the structures that include the actual data.
Metadata repositories contain information about the data model that store and share this data.
They describe where the source of data is, how it was collected, and what it signifies. It may
define the arrangement of any data or subject deposited in any format. For businesses,
metadata repositories are essential in helping people understand administrative changes, as
they contain detailed information about the data.
Data Cubes:
Data cubes are multidimensional collections of data (usually with three or more
dimensions) stored as tables. They help assess gathered data from a range of standpoints.
Each dimension of a data cube signifies a specific characteristic of the database, such as
daily, monthly or annual sales. The data contained within a data cube allows you to analyze
all the information for almost any or all clients, sales representatives, products, and more.
Consequently, a data cube can help you identify trends and scrutinize business performance.
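A minimal sketch of a three-dimensional data cube and a few roll-up style aggregations over it, assuming NumPy and invented sales figures (product x region x quarter):

# Minimal sketch of a 3-D data cube (product x region x quarter) and a few
# aggregations over it, using NumPy. The figures are invented sample data.
import numpy as np

# axes: 0 = product (2), 1 = region (3), 2 = quarter (4)
cube = np.array([
    [[10, 12, 9, 14], [7, 8, 6, 9], [5, 4, 6, 7]],
    [[20, 22, 19, 25], [11, 13, 12, 15], [9, 8, 10, 11]],
])

total_sales = cube.sum()                  # grand total
sales_by_product = cube.sum(axis=(1, 2))  # roll up region and quarter
sales_by_quarter = cube.sum(axis=(0, 1))  # roll up product and region

print(total_sales, sales_by_product, sales_by_quarter)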
Benefits of data warehousing include:
➢ Decrease in production costs
➢ Reduced redundant processing
➢ Enhanced customer relations
➢ Improvement in the selection of target markets
➢ Increased customer satisfaction
Data extraction is the process of collecting or retrieving disparate types of data from a
variety of sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process, and refine data so that it can be stored in
a centralized location in order to be transformed. These locations may be on-site, cloud-based,
or a hybrid of the two. Data extraction is the first step in both ETL (extract, transform,
load) and ELT (extract, load, transform) processes, which are themselves part of a
complete data integration strategy. Proper attention must be paid to data extraction, since it
represents a success factor for a data warehouse architecture. When implementing a data
warehouse, the following selection criteria, which affect the ability to transform,
consolidate, integrate and repair the data, should be considered:
➢ Timeliness of data delivery to the warehouse
➢ The tool must have the ability to identify the particular data so that it can be
read by the conversion tool
➢ The tool must support flat files and indexed files, since much corporate data is still
stored in these formats
➢ The tool must have the capability to merge data from multiple data stores
➢ The tool should have specification interface to indicate the data to be extracted
➢ The tool should have the ability to read data from data dictionary
➢ The code generated by the tool should be completely maintainable
➢ The tool should permit the user to extract the required data
➢ The tool must have the facility to perform data type and character set
translation
➢ The tool must have the capability to create summarization, aggregation and
derivation of records
➢ The data warehouse database system must be able to perform loading data
directly from these tools
Index types:
SYBASE IQ provides five index techniques. Most users apply two indexes to every
column: the default index, called the projection index, and either a low- or high-cardinality
index. For low-cardinality data, SYBASE IQ provides:
➢ Low Fast index: it is optimized for queries involving scalar functions like SUM,
AVERAGE, and COUNT.
➢ Low Disk index: it is optimized for disk space utilization at the cost of being more
CPU intensive.
Performance:
SYBASE IQ technology achieves very good performance on ad hoc queries for
several reasons:
➢ Bitwise technology: this allows various data types in a query and supports fast
data aggregation and grouping.
➢ Compression: SYBASE IQ uses sophisticated algorithms to compress data into bit
maps.
➢ Optimized memory-based processing: SYBASE IQ caches data columns in memory
according to the nature of users' queries, which speeds up processing.
➢ Column-wise processing: SYBASE IQ scans columns, not rows, which reduces the
amount of data the engine has to search.
➢ Low overhead: as an engine optimized for decision support, SYBASE IQ does not
carry the overhead associated with an OLTP-designed RDBMS, which improves
performance.
➢ Large block I/O: the block size of SYBASE IQ can be tuned from 512 bytes to 64
Kbytes, so the system can read as much information as necessary in a single I/O.
➢ Operating system-level parallelism: SYBASE IQ breaks low-level operations like
sorts, bitmap manipulation, load and I/O into non-blocking operations.
➢ Projection and ad hoc join capabilities: SYBASE IQ allows users to take advantage of
known join relationships between tables by defining them in advance and
building indexes between the tables.
Shortcomings of indexing:
Issues the user should be aware of when choosing to use SYBASE IQ include:
➢ No updates: SYBASE IQ does not support updates; users have to update the source
database and then load the updated data into SYBASE IQ on a periodic basis.
➢ Lack of core RDBMS features: it does not support all the robust features of SYBASE
SQL Server, such as backup and recovery.
➢ Less advantage for planned queries: SYBASE IQ offers less of an advantage for
preplanned queries.
➢ High memory usage: memory is used heavily to avoid expensive I/O operations.
Column local storage:
➢ This is another approach to improve query performance in the data warehousing
environment.
➢ For example, Thinking Machines Corporation has developed an innovative data layout
solution that improves RDBMS query performance many times. Implemented in its
CM_SQL RDBMS product, this approach is based on storing data column-wise, as
opposed to the traditional row-wise approach.
➢ The row-wise approach works well for an OLTP environment, in which a typical
transaction accesses one record at a time. However, in data warehousing the goal is
to retrieve multiple values of several columns.
➢ For example, if the problem is to calculate the average, minimum and maximum salary,
the column-wise storage of the salary field requires the DBMS to read only that one
column (the column-wise approach), rather than scanning every full record.
Complex data types:
➢ The DBMS architecture for data warehousing has been limited to traditional
alphanumeric data types. But data management now needs to support complex data
types, including text, image, full-motion video and sound.
➢ These large data objects are called binary large objects (BLOBs). What is required by
business is much more than just storage:
➢ the ability to retrieve a complex data type, such as an image, by its content; the ability
to compare the content of one image to another in order to make a rapid business
decision; and the ability to express all of this in a single SQL statement.
➢ The modern data warehouse DBMS has to be able to efficiently store, access and
manipulate complex data. The DBMS has to be able to define not only new data
structures but also new functions that manipulate them, and often new access methods,
to provide fast access to the data.
➢ An example of the advantage of handling complex data types is an insurance company
that wants to predict its financial exposure during a catastrophe such as a flood, and
therefore needs to support complex data.
11.(i) What is data Pre-processing? Explain the various data pre-processing techniques.
1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
(a) Missing Data:
This situation arises when some values are missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.
3.Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0
to 1.0).
4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder, so data reduction
techniques are applied. They aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
2. Numerosity Reduction:
This enables the model of the data to be stored instead of the whole data, for example
regression models.
3. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If, after reconstruction from the compressed data, the original data can be
retrieved, such reduction is called lossless reduction; otherwise it is called lossy
reduction. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
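As a hedged illustration of dimensionality reduction, the sketch below performs a bare-bones PCA with plain NumPy (centre the data, take the SVD, keep the top-k components); the sample matrix is invented.

# Minimal sketch of dimensionality reduction with PCA using plain NumPy.
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.3]])   # invented sample data: 5 tuples, 3 attributes

k = 2
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T   # project onto the top-k principal components

print(X_reduced.shape)  # (5, 2): same rows, fewer attributes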
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple:
This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective
unless the tuple contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies considerably.
14. (i) Generalize the potential performance problems with star schema.
Potential performance problem with star schemas
1. Indexing
➢ Indexing improves performance in the star schema design.
➢ A dimension table in a star schema contains the entire hierarchy of attributes (for the
PERIOD dimension this hierarchy could be day -> week -> month -> quarter -> year).
One approach is to create a multipart key of day, week, month, quarter and year, but
this presents some problems in the star schema model.
Problems:
1. It requires multiple metadata definitions.
2. Since the fact table must carry all the key components as part of its primary key, the
addition or deletion of levels in the hierarchy requires physical modification of the
affected table.
3. Carrying all the segments of the compound dimensional key in the fact table
increases the size of the index, thus impacting both performance and scalability.
Solutions:
1. One alternative to the compound key is to concatenate the keys into a single key for
the attributes (day, week, month, quarter, year). This solves the first two problems
above.
2. The size of the index remains a problem. The best approach is to drop the use of
meaningful keys in favour of an artificial, generated key, which is the smallest possible
key that will ensure the uniqueness of each record.
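A minimal sketch of such surrogate-key generation in Python; the attribute names and the simple in-memory lookup table are assumptions made for illustration.

# Minimal sketch of replacing a meaningful compound key (day, week, month,
# quarter, year) with a small artificial surrogate key. Names are illustrative.
from itertools import count

_next_key = count(1)
_surrogate_keys = {}

def surrogate_key(day, week, month, quarter, year):
    natural_key = (day, week, month, quarter, year)
    if natural_key not in _surrogate_keys:
        _surrogate_keys[natural_key] = next(_next_key)
    return _surrogate_keys[natural_key]

# the fact table stores only the small generated key
print(surrogate_key(14, 7, 2, 1, 2024))   # 1
print(surrogate_key(15, 7, 2, 1, 2024))   # 2
print(surrogate_key(14, 7, 2, 1, 2024))   # 1 again: uniqueness preserved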
2.Level indicator
Problems
1. Another potential problem with the star schema design is the need for a level indicator
in order to navigate the dimensions successfully.
2. The dimension table design includes a level-of-hierarchy indicator for every
record.
3. Every query that retrieves detail records from a table that stores both detail and
aggregate records must use this indicator as an additional constraint to obtain a correct
result.
Solutions:
1. The best alternative to using the level indicator is the snowflake schema.
2. The snowflake schema contains separate fact tables for each level of aggregation,
so it is impossible to make the mistake of selecting detail records when aggregates are
required. The snowflake schema is, however, even more complicated than a star schema.
(ii) Design and discuss about the star and snowflake schema models of a Data
warehouse.
STAR SCHEMA:
The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic
premise of the star schema is that information can be classified into two groups:
➢ Facts
➢ Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
1. Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts)
and a multipart index composed of foreign keys from the primary keys of related dimension
tables. A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain the numeric facts. A fact table can contain facts at the detail or
aggregated level.
2. Dimension Tables
Dimensions are categories by which summarized data can be viewed, e.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter,
year), a Region dimension (profit by country, state, city) or a Product dimension (profit for
product1, product2). A dimension is a structure usually composed of one or more
hierarchies that categorize data. If a dimension has no hierarchies and levels, it is
called a flat dimension or list. The primary keys of each of the dimension tables are part of
the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional values. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.
3. Measures:
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents profit on each sale.
Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension
tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
PART – C
1. Explain mapping data warehouse with multiprocessor architecture with the concept of
parallelism and data partitioning
The functions of a data warehouse are based on relational database technology, and that
relational database technology is implemented in a parallel manner. There are two advantages of
having parallel relational database technology for a data warehouse: linear speed-up (adding
resources reduces the time taken for the same workload) and linear scale-up (performance is
maintained as both the data volume and the resources grow proportionally).
Data partitioning:
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.
➢ Random partitioning
This includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the database.
➢ Intelligent partitioning
This assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent partitioning
techniques include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row.
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key; that is, all rows with key values from A to K
are in partition 1, L to T are in partition 2, and so on.
• Schema partitioning:
An entire table is placed on one disk, another table is placed on a
different disk, and so on. This is useful for small reference tables.
• User-defined partitioning:
This allows a table to be partitioned on the basis of a user-defined
expression.
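For completeness, a tiny Python sketch of the round-robin flavour of random partitioning mentioned above; the number of disks is an illustrative assumption.

# Minimal sketch of round-robin (random) partitioning: each incoming record
# is simply placed on the next disk in turn.
from itertools import cycle

disks = cycle([0, 1, 2, 3])           # four disks on a single server

def round_robin_assign(records):
    return [(record, next(disks)) for record in records]

print(round_robin_assign(["r1", "r2", "r3", "r4", "r5"]))
# r1 -> disk 0, r2 -> disk 1, r3 -> disk 2, r4 -> disk 3, r5 -> disk 0, ...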
2.Design a star-schema , snow-flake schema and fact- constellation schema for the data
warehouse that consists of the following four dimensions( Time , Item, Branch And
Location ). Include the appropriate measures required for the schema.
STAR SCHEMA:
• In a Star schema, there is only one fact table and multiple dimension tables.
• In a Star schema, each dimension is represented by one dimension table.
• Dimension tables are not normalized in a Star schema.
• Each dimension table is joined to a key in the fact table.
There is a fact table at the center. It contains the keys to each of the four dimensions. The
fact table also contains the measures, namely dollars sold and units sold.
SNOWFLAKE SCHEMA:
Some dimension tables in the Snowflake schema are normalized. The normalization
splits up the data into additional tables. Unlike in the Star schema, the dimension tables in a
Snowflake schema are normalized. Due to the normalization in the Snowflake schema, the
redundancy is reduced; therefore, it becomes easy to maintain and saves storage space.
FACT CONSTELLATION SCHEMA:
A fact constellation has multiple fact tables. It is also known as a Galaxy Schema. The
sales fact table is the same as that in the Star Schema. The shipping fact table has five
dimensions, namely item_key, time_key, shipper_key, from_location and to_location. The
shipping fact table also contains two measures, namely dollars sold and units sold. It is also
possible to share dimension tables between fact tables.
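A minimal sketch of the star schema described above, created in an in-memory SQLite database via Python's standard sqlite3 module; the column names and types are assumptions made for illustration. Snowflaking would, for example, move city/state/country out of location_dim into their own tables, and adding a second fact table (such as a shipping fact) that reuses these dimension tables would give the fact constellation.

# Minimal sketch of the Time/Item/Branch/Location star schema in SQLite.
# Column names and types are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key     INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key   INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("star schema created")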
The need for data pre-processing arises from the fact that real-world data, and many
times the data in the database, is often incomplete and inconsistent, which may result in
improper and inaccurate data mining results. Thus, to improve the quality of the data on
which the observation and analysis are to be done, it is treated with these four steps of
data pre-processing. The more the data is improved, the more accurate the observations
and predictions will be.
Techniques:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.
3.Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0
to 1.0).
2. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
3. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.
4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder, so data reduction
techniques are applied. They aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
2. Numerosity Reduction:
This enables the model of the data to be stored instead of the whole data, for example
regression models.
3. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If, after reconstruction from the compressed data, the original data can be
retrieved, such reduction is called lossless reduction; otherwise it is called lossy
reduction. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
(ii) Explain the various methods of data cleaning and data reduction technique
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder, so data reduction
techniques are applied. They aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
2. Numerosity Reduction:
This enables the model of the data to be stored instead of the whole data, for example
regression models.
3. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If, after reconstruction from the compressed data, the original data can be
retrieved, such reduction is called lossless reduction; otherwise it is called lossy
reduction. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
4.(i) Compare the similarities and differences between the database and data warehouse
DIFFERENCE:
➢ A database (typically an OLTP system) is designed for day-to-day transaction
processing and stores current, detailed, application-oriented data, whereas a data
warehouse is designed for analysis and decision support and stores historical,
summarized, subject-oriented data.
➢ Databases are optimized for fast reads and writes of individual records using simple
queries, while data warehouses are optimized for complex analytical queries over
large volumes of data.
SIMILARITY:
➢ Both the database and the data warehouse are used for storing data. They are data
storage systems.
➢ Generally, the data warehouse bottom tier is a relational database system.
Databases are also relational database system. Relational DB systems consist
of rows and columns and a large amount of data.
➢ The DW and databases support multi-user access. A single instance of
database and data warehouse can be accessed by many users at a time.
➢ Both DW and database require queries for accessing the data. The Data
warehouse can be accessed using complex queries while OLTP database can
be accessed by simpler queries.
➢ The database and data warehouse servers can be present on the company
premise or on the cloud.
➢ A data warehouse is also a database.
STEPS:
➢ business understanding
➢ data understanding
➢ data preparation
➢ modelling
➢ evaluation
➢ deployment.
2. List the steps involved in the process of KDD. How does it relate to data mining?
STEPS:
➢ Data cleaning
➢ Data integration
➢ Data selection
➢ Data transformation
➢ Data mining
➢ Pattern evaluation
➢ Knowledge presentation
KDD refers to the overall process of discovering useful knowledge from data, and data
mining refers to a particular step in this process: data mining is the application of specific
algorithms for extracting patterns from data.
11. Are all patterns generated are interesting and useful? Give reasons to justify.
Typically not. Only a small fraction of the patterns potentially generated would actually
be of interest to any given user. A pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
16. Consider that the minimum and maximum values for the attribute “salary” are 12,000
and 98,000 respectively and the mapping range of salary is [0.0 ,1.0]. Find the
transformation for the salary 73,600 using min-max normalization.
Min-max normalization: v' = ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0.0) + 0.0 = 0.716
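A one-line check of this calculation, as a small Python sketch:

# Minimal sketch that reproduces the min-max normalization calculation above.
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716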
17. Show how the attribute selection set is important in data reduction.
Attribute subset Selection is a technique which is used for data reduction in data mining
process. Data reduction reduces the size of data so that it can be used for analysis purposes
more efficiently. The data set may have a large number of attributes, but some of those
attributes can be irrelevant or redundant.
19. Formulate why do we need data transformation. Mention the ways by which data can
be transformed.
Data transformation in data mining is done for combining unstructured data with
structured data to analyze it later. It is also important when the data is transferred to a new
cloud data warehouse. When the data is homogeneous and well-structured, it is easier to
analyze and look for patterns
PART – B
1.(i) Demonstrate in detail about data mining steps in the process of knowledge discovery?
1.Task-relevant data:
This is the database portion to be investigated. For example, suppose that you are a manager
of All Electronics in charge of sales in the United States and Canada. In particular, you would
like to study the buying trends of customers in Canada. Rather than mining on the entire
database. These are referred to as relevant attributes
3.Background knowledge:
Users can specify background knowledge, or knowledge about the domain to be mined.
This knowledge is useful for guiding the knowledge discovery process, and for evaluating the
patterns found. There are several kinds of background knowledge.
4.Interestingness measures:
These functions are used to separate uninteresting patterns from knowledge. They may
be used to guide the mining process, or after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
Types of Attributes
There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
•Examples: temperature in Kelvin, length, time, counts
(ii) Design and discuss in detail about Primitives for specifying a data mining task
1. Task-relevant data:
This is the database portion to be investigated. For example, suppose that you are a manager
of All Electronics in charge of sales in the United States and Canada. In particular, you would
like to study the buying trends of customers in Canada. Rather than mining on the entire
database, you can specify that only the data relating to customer purchases in Canada be
retrieved, along with the attributes of interest. These are referred to as relevant attributes.
2. The kind of knowledge to be mined:
This specifies the data mining functions to be performed, such as characterization,
discrimination, association, classification, clustering, or evolution analysis.
3. Background knowledge:
Users can specify background knowledge, or knowledge about the domain to be mined.
This knowledge is useful for guiding the knowledge discovery process and for evaluating the
patterns found. There are several kinds of background knowledge.
4.Interestingness measures:
These functions are used to separate uninteresting patterns from knowledge. They may
be used to guide the mining process, or after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
5.Presentation and visualization of discovered patterns:
This refers to the form in which discovered patterns are to be displayed. Users can
choose from different forms for knowledge presentation, such as rules, tables, charts, graphs,
decision trees, and cubes.
4(i). Discuss whether or not each of the following activities is a data mining task.
1. Credit card fraud detection using transaction records.
2. Dividing the customers of a company according to their gender.
3. Computing the total sales of a company
4. Predicting the future stock price of a company using historical records.
5.Monitoring seismic waves for earthquake activities.
(ii) Discuss on descriptive and predictive data mining tasks with illustrations
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions. In
some cases, users may have no idea of which kinds of patterns in their data may be interesting,
and hence may like to search for several different kinds of patterns in parallel. Thus it is
important to have a data mining system that can mine multiple kinds of patterns to
accommodate different user expectations or applications. Furthermore, data mining systems
should be able to discover patterns at various granularities. To encourage interactive and
exploratory mining, users should be able to easily "play" with the output patterns, such as by
mouse clicking. Operations that can be specified by simple mouse clicks include adding or
dropping a dimension (or an attribute), swapping rows and columns (pivoting, or axis rotation),
changing dimension representations (e.g., from a 3-D cube to a sequence of 2-D cross
tabulations, or crosstabs), or using OLAP roll-up or drill-down operations along dimensions.
Such operations allow data patterns to be expressed from different angles of view and at
multiple levels of abstraction. Data mining systems should also allow users to specify hints to
guide or focus the search for interesting patterns. Since some patterns may not hold for all of
the data in the database, a measure of certainty or "trustworthiness" is usually associated with
each discovered pattern.
5(i). State and Explain the various classification of data mining systems with example.
Classification of data mining systems:
There are many data mining systems available or being developed. Some are
specialized systems dedicated to a given data source or are confined to limited data mining
functionalities; others are more versatile and comprehensive. Data mining systems can be
categorized according to various criteria; among other classifications are the following:
➢ Classification according to the type of data source mined
➢ Classification according to the data model drawn on
➢ Classification according to the kind of knowledge discovered
➢ Classification according to mining techniques used
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions. Decision trees can easily be converted to classification rules
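As a hedged illustration (the tiny tree below is hand-made for the example, not produced by any particular algorithm), each root-to-leaf path of a decision tree yields one IF-THEN classification rule:

# Minimal sketch of converting a small, illustrative decision tree into
# classification rules by following each root-to-leaf path.
tree = ("age<=30",
        ("student=yes", "buys_computer=yes", "buys_computer=no"),
        "buys_computer=yes")           # node = (test, yes-branch, no-branch)

def tree_to_rules(node, conditions=()):
    if isinstance(node, str):                       # leaf: emit one rule
        return [f"IF {' AND '.join(conditions) or 'TRUE'} THEN {node}"]
    test, yes_branch, no_branch = node
    return (tree_to_rules(yes_branch, conditions + (test,)) +
            tree_to_rules(no_branch, conditions + (f"NOT({test})",)))

for rule in tree_to_rules(tree):
    print(rule)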
A neural network, when used for classification, is typically a collection of neuron-like
processing units with weighted connections between the units. There are many other methods
for constructing classification models, such as naïve Bayesian classification, support vector
machines, and k-nearest neighbor classification. Whereas classification predicts categorical
(discrete, unordered) labels, prediction models continuous-valued functions; that is, it is used
to predict missing or unavailable numerical data values rather than class labels, although the
term prediction may refer to both numeric prediction and class label prediction.
➢ Cluster Analysis
Classification and prediction analyze class-labeled data objects, whereas
clustering analyzes data objects without consulting a known class label.
➢ Outlier Analysis
A database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection, the
rare events can be more interesting than the more regularly occurring ones. The analysis of
outlier data is referred to as outlier mining.
➢ Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time. Although this may include characterization,
discrimination, association and correlation analysis, classification, prediction, or clustering
of time related data, distinct features of such an analysis include time-series data analysis,
Sequence or periodicity pattern matching, and similarity-based data analysis.
6. Suppose that the data for analysis includes the attribute age. The age values for the
data tuples are 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.
(i).use smoothing by bin means to smooth the above data using a bin depth
of 3. Illustrate your steps.
Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 13, 15, 16    Bin 2: 16, 19, 20
Bin 3: 20, 21, 22    Bin 4: 22, 25, 25
Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45
Bin 9: 46, 52, 70
Step 3: Calculate the arithmetic mean of each bin and replace each value in the bin by the
bin mean.
Bin 1: 14.67, 14.67, 14.67    Bin 2: 18.33, 18.33, 18.33
Bin 3: 21, 21, 21             Bin 4: 24, 24, 24
Bin 5: 26.67, 26.67, 26.67    Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35             Bin 8: 40.33, 40.33, 40.33
Bin 9: 56, 56, 56
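A small Python sketch that reproduces this partitioning and the bin means (assuming the sorted data above):

# Minimal sketch: equi-depth bins of depth 3, then smoothing by bin means.
data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

for original, means in zip(bins, smoothed):
    print(original, "->", means)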
7. Sketch the various phases of data mining and explain the different steps involved in pre-
processing with their significance before mining, Give an example for each process.
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data cleaning:
Data cleaning is the process to remove incorrect data, incomplete data and inaccurate
data from the datasets, and it also replaces the missing values. There are some techniques in
data cleaning
• Standard values like “Not Available” or “NA” can be used to replace the
missing values.
• Missing values can also be filled manually but it is not recommended when that
dataset is big.
• The attribute's mean value can be used to replace the missing value when the
data is normally distributed, whereas in the case of a non-normal distribution the
median value of the attribute can be used.
• While using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
Noisy:
Noisy generally means random error or containing unnecessary data points.
Here are some of the methods to handle noisy data.
• Binning:
This method is used to smooth or handle noisy data. First the data is sorted,
and then the sorted values are separated and stored in the form of bins.
There are three methods for smoothing data in the bins. Smoothing by bin
means: in this method, the values in the bin are replaced by the mean
value of the bin. Smoothing by bin median: in this method, the values in the
bin are replaced by the median value. Smoothing by bin boundaries: in this
method, the minimum and maximum values of the bin are taken
and each value is replaced by the closest boundary value.
• Regression:
This is used to smooth the data and will help to handle data when
unnecessary data is present. For the analysis, purpose regression helps to decide
the variable which is suitable for our analysis.
• Clustering:
This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
Data integration:
The process of combining multiple sources into a single dataset. The Data integration
process is one of the main components in data management. There are some problems to be
considered during data integration.
• Schema integration:
Integrates metadata(a set of data that describes other data) from different
sources.
• Entity identification problem:
Identifying entities from multiple databases. For example, the system or the user
should know that student_id in one database and student_name in another database belong
to the same entity.
• Detecting and resolving data value conflicts:
The data taken from different databases may differ while merging; the attribute
values from one database may differ from those in another. For example, the
date format may differ, such as "MM/DD/YYYY" versus "DD/MM/YYYY".
Data reduction:
This process helps in the reduction of the volume of the data, which makes the analysis
easier yet produces the same or almost the same result. The reduction also helps to reduce
storage space. Some of the techniques of data reduction are dimensionality reduction,
numerosity reduction and data compression.
• Dimensionality reduction:
This process is necessary for real-world applications, as the data size is big. In
this process, the number of random variables or attributes is reduced so that the
dimensionality of the data set can be reduced, combining and merging the attributes of
the data without losing their original characteristics. This also helps in reducing
storage space and computation time. When the data is highly dimensional,
the problem called the "Curse of Dimensionality" occurs.
• Numerosity Reduction:
In this method, the representation of the data is made smaller by reducing the
volume. There will not be any loss of data in this reduction.
• Data compression:
Storing the data in a compressed form is called data compression. This compression
can be lossless or lossy. When there is no loss of information during compression, it is
called lossless compression, whereas lossy compression reduces the information but
removes only unnecessary information.
Data Transformation:
The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are some methods in data
transformation.
• Smoothing:
With the help of algorithms, we can remove noise from the dataset and helps in
knowing the important features of the dataset. By smoothing we can find even a simple
change that helps in prediction.
• Aggregation:
In this method, the data is stored and presented in the form of a summary. The
data set, which comes from multiple sources, is integrated with a data analysis description.
This is an important step since the accuracy of the data depends on the quantity and
quality of the data. When the quality and the quantity of the data are good, the results
are more relevant.
• Discretization:
The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3
pm-5 pm, 6 pm-8 pm).
• Normalization:
The data is scaled so that it falls within a smaller, specified range, such as
0.0 to 1.0.
2. Performance issues:
These include efficiency, scalability, and parallelization of data mining algorithms.
Heuristic methods:
Data compression:
In data compression, data encoding or transformations are applied so as to obtain a
reduced or "compressed" representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data compression
technique used is called lossless. If, instead, we can reconstruct only an approximation of the
original data, then the data compression technique is called lossy. The two popular and
effective methods of lossy data compression: wavelet transforms, and principal components
analysis.
Numerosity reduction:
Regression and log-linear models:
Regression and log-linear models can be used to approximate the given data. In linear
regression, the data are modeled to fit a straight line. For example, a random variable, Y (called
a response variable), can be modeled as a linear function of another random variable, X (called
a predictor variable), with the equation Y = α + βX, where α and β are the regression
coefficients and the variance of Y is assumed to be constant. These coefficients can be solved
for by the method of least squares, which minimizes the error between the actual data and the
estimate of the line.
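A minimal sketch of such a least-squares fit, assuming NumPy and invented sample points; the entire data set can then be replaced by the two coefficients:

# Minimal sketch of fitting Y = alpha + beta*X by least squares with NumPy.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # invented predictor values
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])     # invented response values

beta, alpha = np.polyfit(X, Y, deg=1)       # slope, intercept
print(f"Y ~ {alpha:.2f} + {beta:.2f} * X")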
Histograms:
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a
bucket typically reflects the average frequency of the values represented by the bucket.
Equi-width:
In an equi-width histogram, the width of each bucket range is constant (for example, a
width of $10 per bucket).
Equi-depth (or equi-height):
In an equi-depth histogram, the buckets are created so that, roughly, the frequency of
each bucket is constant (that is, each bucket contains roughly the same number of contiguous
data samples).
V-Optimal:
If we consider all of the possible histograms for a given number of buckets, the V-
optimal histogram is the one with the least variance. Histogram variance is a weighted sum of
the original values that each bucket represents, where bucket weight is equal to the number of
values in the bucket.
MaxDiff:
In a MaxDiff histogram, we consider the difference between each pair of adjacent
values. A bucket boundary is established between the β-1 pairs of adjacent values having
the largest differences, where β is user-specified.
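A small Python sketch contrasting equi-width and equi-depth bucketing for a sorted list of invented prices (V-Optimal and MaxDiff are omitted for brevity):

# Minimal sketch contrasting equi-width and equi-depth buckets.
prices = sorted([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15,
                 15, 18, 18, 20, 21, 21, 25, 25, 28, 30])

num_buckets = 5

# equi-width: every bucket covers the same value range
width = (prices[-1] - prices[0]) / num_buckets
equi_width = [[v for v in prices
               if prices[0] + i * width <= v < prices[0] + (i + 1) * width
               or (i == num_buckets - 1 and v == prices[-1])]
              for i in range(num_buckets)]

# equi-depth: every bucket holds roughly the same number of values
depth = len(prices) // num_buckets
equi_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]

print(equi_width)
print(equi_depth)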
10. Describe in detail about various data transformation techniques
Data transformation:
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following:
➢ Normalization:
where the attribute data are scaled so as to fall within a small specified range, such
as -1.0 to 1.0, or 0 to 1.0. There are three main methods for data normalization : min-max
normalization, z-score normalization, and normalization by decimal scaling.
(i) Min-max normalization
performs a linear transformation on the original data. Suppose that min_A and
max_A are the minimum and maximum values of an attribute A. Min-max normalization
maps a value v of A to v' in the range [new_min_A, new_max_A] by computing
v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A.
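A minimal Python sketch of the three normalization methods named above, using invented salary values; the decimal-scaling rule here chooses the smallest power of ten that brings every value below 1 in absolute value:

# Minimal sketch of min-max, z-score and decimal-scaling normalization.
values = [12000, 35000, 73600, 98000]        # invented sample salaries
v_min, v_max = min(values), max(values)
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
j = len(str(int(max(abs(v) for v in values))))   # digits so that |v'| < 1

def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score(v):
    return (v - mean) / std

def decimal_scaling(v):
    return v / (10 ** j)

for v in values:
    print(v, round(min_max(v), 3), round(z_score(v), 3), decimal_scaling(v))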
➢ Smoothing:
This works to remove noise from the data. Such techniques include binning,
clustering, and regression.
In this example, the data for price are first sorted and partitioned into equi-depth bins
(of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii). Partition into (equi-depth) bins:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii).Smoothing by bin means:
- Bin 1: 9, 9, 9,
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
(iv).Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
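A small Python sketch that reproduces the bin-boundary smoothing shown above (each value is replaced by whichever of the bin's minimum or maximum is closer):

# Minimal sketch of smoothing by bin boundaries for the price bins above.
bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

def smooth_by_boundaries(bin_values):
    lo, hi = min(bin_values), max(bin_values)
    return [lo if v - lo <= hi - v else hi for v in bin_values]

print([smooth_by_boundaries(b) for b in bins])
# [[4, 4, 15], [21, 21, 24], [25, 25, 34]]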
(ii). Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups or “clusters”. Intuitively, values which fall outside of the set of clusters may be
considered outliers.
➢ Aggregation:
where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total
amounts.
➢ Generalization of the data:
where low level or 'primitive' (raw) data are replaced by higher level concepts
through the use of concept hierarchies. For example, categorical attributes, like street,
can be generalized to higher level concepts, like city or county.
11. List and explain the primitives for specifying a data mining task.
Task-relevant data:
This primitive specifies the data upon which mining is to be performed. It involves
specifying the database and tables or data warehouse containing the relevant data, conditions
for selecting the relevant data, the relevant attributes or dimensions for exploration, and
instructions regarding the
ordering or grouping of the data retrieved.
Background knowledge:
This primitive allows users to specify knowledge they have about the domain to be
mined. Such knowledge can be used to guide the knowledge discovery process and evaluate
the patterns that are found. Of the several kinds of background knowledge, this chapter focuses
on concept hierarchies.
12(i). How will you handle missing value in a dataset before mining process?
The architecture of a typical data mining system may have the following major
components
➢ Knowledge base:
This is the domain knowledge that is used to guide the search, or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness
based on its unexpectedness, may also be included.
13(i). Explain how integration is done with a database or data warehouse system.
1.No coupling: No coupling means that a DM system will not utilize any function of a
DB or DW system. It may fetch data from a particular source (such as a file system), process
data using some data mining algorithms, and then store the mining results in another file.
2.Loose coupling: Loose coupling means that a DM system will use some facilities of
a DB or DW system, fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a file or in a designated
place in a database or data Warehouse. Loose coupling is better than no coupling because it
can fetch any portion of data stored in databases or data warehouses by using query processing,
indexing, and other system facilities. However, many loosely coupled mining systems are main
memory-based. Because mining does not explore data structures and query optimization
methods provided by DB or DW systems, it is difficult for loose coupling to achieve high
scalability and good performance with large data sets.
3. Semi-tight coupling: Semi-tight coupling means that, besides linking a DM system to a
DB or DW system, efficient implementations of a few essential data mining primitives (such
as sorting, indexing, aggregation, histogram analysis, multiway join, and the precomputation
of some statistical measures) are provided in the DB/DW system.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into
the DB/DW system. The data mining subsystem is treated as one functional component of the
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system.
(ii) Consider the following data for the attribute AGE:4,8,21,5,21,24,34,28,25. Perform
smoothing by bin means and bin boundaries using a bin depth of 3
Given:
AGE: 4, 8, 21, 5, 21, 24, 34, 28, 25
Step 1: Sort the data: 4, 5, 8, 21, 21, 24, 25, 28, 34
Step 2: Partition into equi-depth bins of depth 3:
Bin 1: 4, 5, 8    Bin 2: 21, 21, 24    Bin 3: 25, 28, 34
Smoothing by bin means (each value replaced by the bin mean):
Bin 1: 5.67, 5.67, 5.67    Bin 2: 22, 22, 22    Bin 3: 29, 29, 29
Smoothing by bin boundaries (each value replaced by the closest of the bin's minimum
or maximum):
Bin 1: 4, 4, 8    Bin 2: 21, 21, 24    Bin 3: 25, 25, 34
14. Analyze Using Equi-depth binning method, partition the data given below into 4 bins
and perform smoothing according to the following methods.(8)
1. Smoothing by bin means
2. Smoothing by bin median
3. Smoothing by bin boundaries
24,25,26,27,28,56,67,70,70,75,78,89,89,90,91,94,95,96,100,102,103,107,109,112.
Given: the 24 values listed above are already sorted and are to be split into 4 equi-depth
bins of 6 values each.
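As a sketch (assuming Python; ties in boundary smoothing are sent to the lower boundary, which is an assumption), the four equi-depth bins and the three smoothed versions asked for can be computed as follows:

# Minimal sketch: 4 equi-depth bins, then smoothing by means, medians, boundaries.
data = [24, 25, 26, 27, 28, 56, 67, 70, 70, 75, 78, 89,
        89, 90, 91, 94, 95, 96, 100, 102, 103, 107, 109, 112]

depth = len(data) // 4
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

for b in bins:
    mean = round(sum(b) / len(b), 2)
    median = (b[len(b) // 2 - 1] + b[len(b) // 2]) / 2   # even-sized bins
    lo, hi = b[0], b[-1]
    boundaries = [lo if v - lo <= hi - v else hi for v in b]
    print(b)
    print("  means:     ", [mean] * len(b))
    print("  median:    ", [median] * len(b))
    print("  boundaries:", boundaries)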
Database technology since the 1980s has been characterized by the popular adoption of
relational technology and an upsurge of research and development activities on new and
powerful database systems. The steady and amazing progress of computer hardware
technology in the past three decades has led to large supplies of powerful and affordable
computers, data collection equipment, and storage media. This technology provides a great
boost to the database information industry and makes a huge number of databases and
repositories available for transaction management, information retrieval, and data analysis. The
abundance of data, coupled with the need for powerful data analysis tools, has been described
as a data-rich but information-poor situation. The fast-growing, tremendous amount of data,
collected and stored in large and numerous data repositories, has far exceeded our human
ability for comprehension without powerful tools. As a result, data collected in large data
repositories become "data tombs", data archives that are seldom visited.
Data mining tools perform data analysis and may uncover important data patterns,
contributing greatly to business strategies, knowledge bases, and scientific and medical
research. The widening gap between data and information calls for the systematic development
of data mining tools that will turn data tombs into "golden nuggets" of knowledge.
PART – C
• Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These
data sources may be structured, semi-structured or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.
2.(i) What is interestingness of a pattern?
An interesting pattern represents knowledge. Several objective measures of pattern
interestingness exist. These are based on the structure of discovered patterns and the statistics
underlying them. An objective measure for association rules of the form X ⇒ Y is rule support,
representing the percentage of transactions from a transaction database that the given rule
satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction
contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability that
a transaction containing X also contains Y. More formally, support and confidence are defined
as
support(X ⇒ Y) = P(X ∪ Y)        confidence(X ⇒ Y) = P(Y | X)
In general, each interestingness measure is associated with a threshold, which may be
controlled by the user. For example, rules that do not satisfy a confidence threshold of, say,
50% can be considered uninteresting. Rules below the threshold likely reflect noise,
exceptions, or minority cases and are probably of less value.
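As a rough illustration of these formulas, the following Python sketch computes support and confidence over a small made-up transaction database; the transactions are hypothetical, not taken from the question.

```python
# Minimal sketch: computing support and confidence of a candidate rule
# X => Y from a tiny, made-up transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`, i.e. P(X u Y)."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Conditional probability P(Y | X) = support(X u Y) / support(X)."""
    return support(x | y) / support(x)

X, Y = {"bread"}, {"milk"}
print(support(X | Y))     # support(X => Y) = 0.6
print(confidence(X, Y))   # confidence(X => Y) = 0.75
```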
(ii) Summarize the integration of data mining system with a data warehouse?
3.List the major data pre-processing techniques and explain in detail with examples?
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete and inaccurate
data from the data set; it also fills in missing values. Some common techniques in
data cleaning are:
• Standard values like “Not Available” or “NA” can be used to replace the
missing values.
• Missing values can also be filled in manually, but this is not recommended when
the data set is large.
• The attribute’s mean value can be used to replace a missing value when the
data is normally distributed; for a non-normal distribution, the median value of the
attribute can be used instead.
• When using regression or decision tree algorithms, a missing value can be
replaced by the most probable value.
Noisy data:
Noisy data generally contain random errors or unnecessary data points.
Some of the methods to handle noisy data are:
• Binning:
This method is used to smooth noisy data. First, the data are sorted,
and then the sorted values are separated and stored in the form of bins.
There are three methods for smoothing data in a bin. Smoothing by bin
means: the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: the values in the bin are replaced by the median
value of the bin. Smoothing by bin boundaries: the minimum and maximum
values of the bin are taken as the bin boundaries, and each value is replaced
by the closest boundary value.
• Regression:
Regression is used to smooth the data and helps to handle
unnecessary data. For analysis purposes, regression also helps to decide
which variables are suitable for the analysis.
• Clustering:
This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
Data transformation:
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following:
➢ Normalization:
where the attribute data are scaled so as to fall within a small specified range, such
as -1.0 to 1.0, or 0 to 1.0. There are three main methods for data normalization : min-max
normalization, z-score normalization, and normalization by decimal scaling.
(i).Min-max normalization
performs a linear transformation on the original data. Suppose that minA and
maxA are the minimum and maximum values of an attribute A. Min-max normalization
maps a value v of A to v′ in the range [new_minA, new_maxA] by computing
v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA.
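A minimal sketch of min-max normalization, assuming illustrative income values; it simply applies the formula above.

```python
# Minimal sketch: min-max normalization of a numeric attribute into [0, 1].
# The income figures are made up for illustration.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

incomes = [12000, 35000, 54000, 73600, 98000]
lo, hi = min(incomes), max(incomes)
normalized = [round(min_max(v, lo, hi), 3) for v in incomes]
print(normalized)   # the same values rescaled into the range [0, 1]
```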
➢ Smoothing:
which works to remove the noise from the data. Such techniques include binning,
clustering, and regression.
In this example, the data for price are first sorted and partitioned into equi-depth bins
(of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.
(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii).Partition into equi-depth bins (of depth 3):
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii).Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29
(iv).Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
(ii). Clustering:
Outliers may be detected by clustering analysis, where similar values are organized into
groups or “clusters”. Intuitively, values that fall outside of the set of clusters may be
considered outliers.
➢ Aggregation:
where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total
amounts.
➢ Generalization of the data:
where low level or 'primitive' (raw) data are replaced by higher level concepts
through the use of concept hierarchies. For example, categorical attributes, like street,
can be generalized to higher level concepts, like city or county.
DATA REDUCTION:
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
Strategies for data reduction include the following.
➢ Data cube aggregation:
where aggregation operations are applied to the data in the construction of a
data cube.
➢ Dimension reduction:
where irrelevant, weakly relevant or redundant attributes or dimensions may be
detected and removed.
➢ Data compression:
where encoding mechanisms are used to reduce the data set size.
➢ Numerosity reduction:
where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data), or nonparametric methods such as clustering, sampling, and
the use of histograms.
➢ Discretization and concept hierarchy generation:
where raw data values for attributes are replaced by ranges or higher conceptual
levels. Concept hierarchies allow the mining of data at multiple levels of abstraction,
and are a powerful tool for data mining.
Data compression:
In data compression, data encoding or transformations are applied so as to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data compression
technique used is called lossless. If, instead, we can reconstruct only an approximation of the
original data, then the data compression technique is called lossy. The two popular and
effective methods of lossy data compression: wavelet transforms, and principal components
analysis.
Numerosity reduction :
Regression and log-linear models :
Regression and log-linear models can be used to approximate the given data. In linear
regression, the data are modeled to fit a straight line. For example, a random variable, Y (called
a response variable), can be modeled as a linear function of another random variable, X (called
a predictor variable), with the equation Y = α + βX, where the variance of Y is assumed to be
constant. The regression coefficients α and β can be solved for by the method of least squares,
which minimizes the error between the actual line separating the data and the estimate of the line.
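A small sketch of the least-squares computation described above; the (x, y) points are made up, and only the two coefficients would need to be stored in place of the data.

```python
# Minimal sketch: fitting Y = a + b*X by least squares, as used in
# regression-based numerosity reduction. Data points are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b = covariance(X, Y) / variance(X); a = mean_y - b * mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(f"Y ~ {a:.3f} + {b:.3f} * X")   # only a and b need to be stored
```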
Histograms :
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a
bucket typically reflects the average frequency of the values represented by the bucket.
Equi-width:
In an equi-width histogram, the width of each bucket range is constant (such as the
width of $10 for the buckets in Figure 3.8).
Equi-depth (or equi-height):
In an equi-depth histogram, the buckets are created so that, roughly, the frequency of
each bucket is constant (that is, each bucket contains roughly the same number of contiguous
data samples).
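The difference between the two bucketing rules can be illustrated with a short sketch; the bucket-building helpers below are ours, applied to the price list used earlier in this section.

```python
# Minimal sketch: equi-width vs. equi-depth buckets for a sorted price list.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

def equi_width_buckets(data, n_buckets):
    """Every bucket covers a value range of the same width."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in data:
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets

def equi_depth_buckets(data, n_buckets):
    """Every bucket holds (roughly) the same number of values."""
    depth = len(data) // n_buckets
    return [data[i:i + depth] for i in range(0, len(data), depth)]

print(equi_width_buckets(prices, 3))  # equal-width ranges, uneven counts
print(equi_depth_buckets(prices, 3))  # equal counts, uneven ranges
```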
V-Optimal:
If we consider all of the possible histograms for a given number of buckets, the V-
optimal histogram is the one with the least variance. Histogram variance is a weighted sum of
the original values that each bucket represents, where bucket weight is equal to the number of
values in the bucket.
MaxDiff:
In a MaxDiff histogram, we consider the difference between each pair of adjacent
values. A bucket boundary is established between each of the β−1 pairs having the largest
differences, where β is the user-specified number of buckets.
Data mining systems can be categorized according to various criteria; among other
classifications are the following:
➢ Classification according to the type of data source mined
➢ Classification according to the data model drawn on
➢ Classification according to the kind of knowledge discovered
➢ Classification according to the mining techniques used
4. Define Data pruning. State the need for pruning phase in decision tree construction.
Pruning means to change the model by deleting the child nodes of a branch node. The
pruned node is regarded as a leaf node. Leaf nodes themselves cannot be pruned. A decision tree consists
of a root node, several branch nodes, and several leaf nodes. The root node represents the top
of the tree. Pruning is needed because a fully grown tree may overfit the training data; removing
branches that reflect noise or outliers yields a smaller tree that generalizes better to unseen data.
➢ Pattern Generation:
FP growth generates pattern by constructing a FP tree whereas Apriori
generates pattern by pairing the items into singletons, pairs and triplets.
➢ Candidate Generation:
There is no candidate generation in FP growth whereas Apriori uses
candidate generation
➢ Process:
The FP-growth process is faster compared to Apriori: its runtime increases
linearly with an increase in the number of itemsets. In Apriori, the process is
comparatively slower, and the runtime increases exponentially with an
increase in the number of itemsets.
➢ Memory Usage:
A compact version of the database is saved in FP-growth. In the Apriori algorithm,
the candidate combinations are saved in memory.
6. Explain how will you generate association rules from frequent itemsets?
Association rule mining finds all sets of items (itemsets) that have support greater than the
minimum support and then uses those large itemsets to generate the desired rules that have
confidence greater than the minimum confidence.
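A minimal sketch of this rule-generation step, assuming the frequent itemsets and their supports have already been mined; the itemsets and support values below are illustrative.

```python
from itertools import combinations

# Minimal sketch: generating association rules from frequent itemsets.
# The support values are made up, not mined from a real database.
support = {
    frozenset({"bread"}): 0.8,
    frozenset({"milk"}): 0.8,
    frozenset({"bread", "milk"}): 0.6,
}
min_conf = 0.7

for itemset, s in support.items():
    if len(itemset) < 2:
        continue
    # Try every non-empty proper subset X as the antecedent of X => (itemset - X).
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = s / support[antecedent]
            if conf >= min_conf:
                consequent = itemset - antecedent
                print(f"{set(antecedent)} => {set(consequent)} "
                      f"(support={s:.2f}, confidence={conf:.2f})")
```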
(iii) Write and explain the algorithm for mining frequent item sets without candidate generation.
In many cases, the Apriori candidate generate-and-test method significantly reduces the size of
candidate sets, leading to good performance gains.
An interesting method in this attempt is called frequent-pattern growth, or simply FP-growth, which adopts a
divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a
frequent-pattern tree, or FP-tree, which retains the itemset association information. It then divides the
compressed database into a set of conditional databases (a special kind of projected database), each
associated with one frequent item or “pattern fragment,” and mines each such database separately.
Example:
FP-growth (finding frequent itemsets without candidate generation). We re-examine the mining of
transaction database, D, of Table 5.1 in Example 5.3 using the frequent pattern growth approach.
2(i) How would you summarize in detail about mining methods?
Apriori is the basic method that mines the complete set of frequent itemsets with candidate
generation.
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. Rule above contains three predicates (age, occupation,
and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated
predicates. Multidimensional association rules with no repeated predicates are called inter
dimensional association rules. We can also mine multidimensional association rules with
repeated predicates, which contain multiple occurrences of some predicates. These rules are
called hybrid-dimensional association rules. An example of such a rule is the following, where
the predicate buys is repeated:
Metarule-guided mining:-
Suppose that as a market analyst for AllElectronics, you have access to the data describing customers
(such as customer age, address, and credit rating) as well as the list of customer transactions. You are interested
in finding associations between customer traits and the items that customers buy. However, rather than finding
all of the association rules reflecting these relationships, you are particularly interested only in determining
which pairs of customer traits promote the sale of office software. A metarule can be used to specify this
information describing the form of rules you are interested in finding. An example of such a metarule is
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”),
where P1 and P2 are predicate variables that are instantiated to attributes from the given database during
the mining process, X is a variable representing a customer, and Y and W take on values of the attributes
assigned to P1 and P2, respectively. Typically, a user will specify a list of attributes to be considered for
instantiation with P1 and P2. Otherwise, a default set may be used.
4(i). Develop an algorithm for classification using decision trees. Illustrate the algorithm with a relevant
example.
Decision tree induction is the learning of decision trees from class-labeled training tuples. A
decision tree is a flowchart-like tree structure, where
➢ Each internal node denotes a test on an attribute.
➢ Each branch represents an outcome of the test.
➢ Each leaf node holds a class label.
➢ The topmost node in a tree is the root node.
The construction of decision tree classifiers does not require any domain knowledge or parameter
setting, and is therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast. In general,
decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application areas,
such as medicine, manufacturing and production, financial analysis, astronomy, and molecular
biology.
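A hedged sketch of decision tree induction using scikit-learn (assumed to be available); the tiny integer-encoded training set is made up and only illustrates the fit/predict workflow, not a specific data set from this document.

```python
# Minimal sketch (assumes scikit-learn): inducing a decision tree from a
# tiny, made-up training set. Features are encoded as integers:
# age (0=youth, 1=middle, 2=senior), student (0/1), income (0=low, 1=med, 2=high).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0, 2], [0, 1, 1], [1, 0, 2], [2, 1, 0], [2, 0, 1], [1, 1, 0]]
y = ["no", "yes", "yes", "yes", "no", "yes"]   # class label, e.g. buys_computer

clf = DecisionTreeClassifier(criterion="entropy")  # information-gain style splits
clf.fit(X, y)

print(export_text(clf, feature_names=["age", "student", "income"]))
print(clf.predict([[0, 1, 2]]))   # classify a new, unseen tuple
```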
(ii) What approach would you use to apply decision tree induction?
CLASSIFICATION:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
BAYESIAN CLASSIFICATION:
“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They
can predict class membership probabilities, such as the probability that a given tuple belongs
to a particular class. Bayesian classification is based on Bayes’ theorem; a simple Bayesian
classifier is known as the naïve Bayesian classifier. Bayesian classifiers have also exhibited high
accuracy and speed when applied to large databases.
NEED:
➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
➢ Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
➢ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured.
Bayesian Theorem:
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows
Bayes’ theorem:
P(h|D) = P(D|h) P(h) / P(D).
The naïve (class-conditional independence) assumption greatly reduces the computation cost,
since only the class distributions need to be counted.
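A minimal from-scratch sketch of a naïve Bayesian classifier over categorical attributes; the training tuples are made up, and the class-conditional independence assumption is applied exactly as described above.

```python
from collections import Counter, defaultdict

# Minimal sketch: naive Bayes over categorical attributes.
# Training tuples (outlook, windy) -> play are invented for illustration.
train = [
    (("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
    (("rain",  "no"), "yes"), (("rain",  "yes"), "no"),
    (("sunny", "no"), "yes"), (("rain",  "no"), "yes"),
]

prior = Counter(label for _, label in train)   # class counts for P(Ci)
cond = defaultdict(Counter)                    # cond[(attr_index, label)][value]
for attrs, label in train:
    for i, v in enumerate(attrs):
        cond[(i, label)][v] += 1

def classify(attrs):
    best, best_score = None, -1.0
    for label, n_label in prior.items():
        score = n_label / len(train)                     # P(Ci)
        for i, v in enumerate(attrs):
            score *= cond[(i, label)][v] / n_label       # P(xi | Ci), independence assumed
        if score > best_score:
            best, best_score = label, score
    return best

print(classify(("sunny", "no")))   # -> "yes" under these counts
```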
6(i) Giving concrete example , explain a method that performs frequent itemset mining by using the prior
knowledge of frequent item set properties.
(ii) Discuss in detail the constraint based association mining.
A data mining process may uncover thousands of rules from a given set of data, most of which
end up being unrelated or uninteresting to the users. Often, users have a good sense of which “direction” of
mining may lead to interesting patterns and the “form” of the patterns or rules they would like to find. Thus, a
good heuristic is to have the users specify such intuition or expectations as constraints to confine the search
space. This strategy is known as constraint-based mining.
The constraints can include the following:
Metarule-guided mining:-
Suppose that as a market analyst for AllElectronics, you have access to the data describing customers
(such as customer age, address, and credit rating) as well as the list of customer transactions. You are interested
in finding associations between customer traits and the items that customers buy. However, rather than finding
all of the association rules reflecting these relationships, you are particularly interested only in determining
which pairs of customer traits promote the sale of office software. A metarule can be used to specify this
information describing the form of rules you are interested in finding. An example of such a metarule is
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”),
where P1 and P2 are predicate variables that are instantiated to attributes from the given database during
the mining process, X is a variable representing a customer, and Y and W take on values of the attributes
assigned to P1 and P2, respectively. Typically, a user will specify a list of attributes to be considered for
instantiation with P1 and P2. Otherwise, a default set may be used.
(ii) Describe about the process of multi-layer feed-forward neural network classification
using back propagation learning?
➢ The inputs to the network correspond to the attributes measured for each training
tuple. The inputs are fed simultaneously into the units making up the input layer.
These inputs pass through the input layer and are then weighted and fed
simultaneously to a second layer known as a hidden layer.
➢ The outputs of the hidden layer units can be input to another hidden layer, and so on.
The number of hidden layers is arbitrary.
➢ The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network’s prediction for given tuples
Classification by Backpropagation:
➢ Backpropagation is a neural network learning algorithm.
➢ A neural network is a set of connected input/output units in which each connection
has a weight associated with it.
➢ During the learning phase, the network learns by adjusting the weights so as to be able
to predict the correct class label of the input tuples.
➢ Neural network learning is also referred to as connectionist learning due to the
connections between units.
➢ Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
➢ Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network’s prediction for each tuple with the actual known target value.
➢ The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
➢ For each training tuple, the weights are modified so as to minimize the mean squared
error between the network’s prediction and the actual target value. These
modifications are made in the “backwards” direction, that is, from the output layer,
through each hidden layer, down to the first hidden layer; hence the name
backpropagation.
➢ Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
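A minimal numpy sketch of backpropagation for a small feed-forward network (one hidden layer, sigmoid units, squared-error loss); the XOR data, layer sizes and learning rate are illustrative choices, not part of the original answer.

```python
import numpy as np

# Minimal sketch: a 2-4-1 feed-forward network trained with backpropagation
# on the XOR problem (illustrative data only).
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)        # target values

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))     # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))     # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(10000):
    # Forward pass: inputs flow through the hidden layer to the output layer.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error from the output layer to the hidden layer.
    err_out = (y - t) * y * (1 - y)            # error term at the output units
    err_hid = (err_out @ W2.T) * h * (1 - h)   # error term at the hidden units

    # Weight updates in the "backwards" direction.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0, keepdims=True)

print(np.round(y.ravel(), 2))   # predictions should approach [0, 1, 1, 0]
```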
8(i) Describe in detail about frequent pattern classification.
Frequent pattern mining can be classified in various ways, based on the following criteria:
• The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called support count or frequency of 1-itemset.
• The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
• The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken at
the top, the next itemset with lower count and so on. It means that the branch of the
tree is constructed with transaction itemsets in descending order of count.
• The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch (for example in the 1st transaction), then this transaction branch would
share a common prefix to the root. This means that the common itemset is linked to the
new node of another itemset in this transaction.
• Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked
according to transactions.
• The next step is to mine the created FP tree. For this, the lowest node is examined first,
along with the links of the lowest nodes. The lowest node represents a frequent
pattern of length 1. From this, traverse the path in the FP tree. This path or paths are
called a conditional pattern base. A conditional pattern base is a sub-database consisting
of prefix paths in the FP tree occurring with the lowest node (suffix).
• Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
• Frequent Patterns are generated from the Conditional FP Tree.
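Assuming the mlxtend library is available, FP-growth can be run on a one-hot encoded transaction table roughly as follows; the transactions and minimum support are made up for illustration.

```python
# Minimal sketch (assumes mlxtend and pandas): mining frequent itemsets
# with FP-growth from a one-hot encoded transaction table.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["bread", "milk"],
    ["bread", "diaper", "beer"],
    ["milk", "diaper", "beer"],
    ["bread", "milk", "diaper"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent)   # itemsets whose support is at least 0.5
```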
10. Generalize the Bayes theorem of posterior probability and explain the working of a
Bayesian classifier with an example.
Bayes’ Theorem:
➢ Let X be a data tuple. In Bayesian terms, X is considered “evidence,” and it is
described by measurements made on a set of n attributes.
➢ Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
➢ For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X.
➢ P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X).
Bayesian Classification:
➢ Bayesian classifiers are statistical classifiers.
➢ They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
➢ Bayesian classification is based on Bayes’ theorem
➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
➢ Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
➢ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
11. Explain and Apply the Apriori algorithm for discovering frequent item sets of the table.
12. (i).Define classification? With an example explain how support vector machines can be
used for classification
CLASSIFICATION:
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects whose class label is known).
A bank loan officer needs analysis of her data in order to learn which loan applicants
are “safe” and which are “risky” for the bank. A marketing manager at AllElectronics
needs data analysis to help guess whether a customer with a given profile will buy a new
computer. A medical researcher wants to analyze breast cancer data in order to predict which
one of three specific treatments a patient should receive. In each of these examples, the data
analysis task is classification, where a model or classifier is constructed to predict categorical
labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the
marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These
categories can be represented by discrete values, where the ordering among values has no
meaning. For example, the values 1, 2, and 3 may be used to represent
treatments A, B, and C, where there is no ordering implied among this group of treatment
regimes. Suppose that the marketing manager would like to predict how much a given
customer will spend during a sale at AllElectronics. This data analysis task is an example of
numeric prediction, where the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a categorical label. This model is a predictor. “How does
classification work?” Data classification is a two-step process, as shown for the loan
application data of Figure 6.1. (The data are simplified for illustrative purposes. In reality, we
may expect many more attributes to be considered.) In the first step, a classifier is built
describing a predetermined set of data classes or concepts. This is the learning step (or
training phase), where a classification algorithm builds the classifier by analyzing or
“learning from” a training set made up of database tuples and their associated class labels.
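A hedged sketch of SVM classification with scikit-learn (assumed available); the two-feature "safe"/"risky" applicant data is invented for illustration of the fit/predict workflow.

```python
# Minimal sketch (assumes scikit-learn): a linear support vector machine
# separating made-up "safe" / "risky" loan applicants described by two
# numeric features (income in $1000s, years employed).
from sklearn.svm import SVC

X = [[60, 8], [75, 10], [90, 12], [20, 1], [25, 2], [30, 1]]
y = ["safe", "safe", "safe", "risky", "risky", "risky"]

clf = SVC(kernel="linear")   # finds the maximum-margin separating hyperplane
clf.fit(X, y)

print(clf.predict([[70, 9], [22, 1]]))   # classify new applicants
print(clf.support_vectors_)              # the tuples that define the margin
```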
(ii) What are the prediction techniques supported by a data mining systems?
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be
estimated from the given data, as we shall see below. Bayes’ theorem is useful
in that it provides a way of calculating the posterior probability, P(H|X),
from P(H), P(X|H), and P(X). Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X).
(ii) Explain how the Bayesian Belief Networks are trained to perform classification
A belief network has one conditional probability table (CPT) for each variable.
The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)),
where Parents(Y) are the parents of Y. Figure(b) shows a CPT for the
variable LungCancer. The conditional probability for each known value
of LungCancer is given for each possible combination of values of its parents.
For instance, from the upper leftmost and bottom rightmost entries, respectively,
we see that
Let X = (x1, …, xn) be a data tuple described by the variables or attributes Y1,
…, Yn, respectively. Recall that each variable is conditionally independent of
its nondescendants in the network graph, given its parents. This allows the
network to provide a complete representation of the existing joint probability
distribution with the following equation:
P(x1, …, xn) = ∏ᵢ₌₁ⁿ P(xi | Parents(Yi)).
BAYESIAN CLASSIFICATION:
“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They
can predict class membership probabilities, such as the probability that a given tuple belongs
to a particular class. Bayesian classification is based on Bayes’ theorem, a simple Bayesian
classifier known as the naïve Bayesian classifier Bayesian classifiers have also exhibited high
accuracy and speed when applied to large databases.
➢ Bayesian classifiers are statistical classifiers.
➢ They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
➢ Bayesian classification is based on Bayes’ theorem
➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
➢ Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
➢ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
(ii) Classification by Back propagation
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
➢ Started by psychologists and neurobiologists to develop and test computational
analogues of neurons
➢ A neural network: A set of connected input/output units where each connection has
a weight associated with it.
➢ During the learning phase, the network learns by adjusting the weights so as to be
able to predict the correct class label of the input tuples
➢ Also referred to as connectionist learning due to the connections between units
o The inputs to the network correspond to the attributes measured for each training
tuple.
o Inputs are fed simultaneously into the units making up the input layer.
o They are then weighted and fed simultaneously to a hidden layer.
o The number of hidden layers is arbitrary, although usually only one.
o The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction.
o The network is feed-forward in that none of the weights cycles back to an input unit
or to an output unit of a previous layer.
o From a statistical point of view, networks perform nonlinear regression: given
enough hidden units and enough training samples, they can closely approximate any function.
PART – C
1. Find all frequent item sets for the given training set using Apriori and FP growth respectively. Compare
the efficiency of the two mining processes
2. Generalize and Discuss about constraint based association rule mining with examples and
state how association mining to correlation analysis is dealt with.
constraint based association rule mining:
A data mining process may uncover thousands of rules from a given set of data, most of which
end up being unrelated or uninteresting to the users. Often, users have a good sense of which
“direction” of mining may lead to interesting patterns and the “form” of the patterns or rules
they would like to find. Thus, a good heuristic is to have the users specify such intuition or
expectations as constraints to confine the search space. This strategy is known as constraint-
based mining. The constraints can include the following:
“How are metarules useful?” Metarules allow users to specify the syntactic form of rules
that they are interested in mining. The rule forms can be used as constraints to help improve
the efficiency of the mining process. Metarules may be based on the analyst’s experience,
expectations, or intuition regarding the data or may be automatically generated based on the
database schema.
Metarule-guided mining:- Suppose that as a market analyst for AllElectronics, you have
access to the data describing customers (such as customer age, address, and credit rating) as
well as the list of customer transactions. You are interested in finding associations between
customer traits and the items that customers buy. However, rather than finding all of the
association rules reflecting these relationships, you are particularly interested only in
determining which pairs of customer traits promote the sale of office software. A metarule
can be used to specify this information describing the form of rules you are interested in
finding. An example of such a metarule is
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”),
where P1 and P2 are predicate variables that are instantiated to attributes from the given
database during the mining process, X is a variable representing a customer, and Y and W take
on values of the attributes assigned to P1 and P2, respectively. Typically, a user will specify
a list of attributes to be considered for instantiation with P1 and P2. Otherwise, a default set
may be used.
Rule constraints specify expected set/subset relationships of the variables in the mined rules,
constant initiation of variables, and aggregate functions. Users typically employ their
knowledge of the application or data to specify rule constraints for the mining task. These rule
constraints may be used together with, or as an alternative to, metarule-guided mining. In this
section, we examine rule constraints as to how they can be used to make the mining process
more efficient. Let’s study an example where rule constraints are used to mine hybrid-
dimensional association rules.
Our association mining query is to “Find the sales of which cheap items (where the sum of
the prices is less than $100) may promote the sales of which expensive items (where the
minimum price is $500) of the same group for Chicago customers in 2004.” This can be
expressed in the DMQL data mining query language as follows,
That is, a correlation rule is measured not only by its support and confidence but also by the
correlation between itemsets A and B. There are many different correlation measures from
which to choose. In this section, we study various correlation measures to determine which
would be good for mining large data sets.
3. Discuss the single dimensional Boolean association rule mining for transaction database. Evaluate the
below transaction database
4. Construct the decision tree for the following training dataset using decision tree algorithm.
PART A
1.Identify what changes would you make to solve the problem in cluster analysis.
• Partitioning Method.
• Hierarchical Method.
• Density-based Method.
• Grid-Based Method.
• Model-Based Method.
• Constraint-based Method.
* In general, intrinsic methods evaluate a clustering by examining how well the clusters
are separated and how compact the clusters are. Many intrinsic methods have the advantage of a
similarity metric between objects in the data set.
* Similarity is an amount that reflects the strength of relationship between two data
items, it represents how similar 2 data patterns are. Clustering is done based on a similarity measure
to group similar data objects together
7. Define what is meant by K nearest neighbor algorithm.
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense application
in pattern recognition, data mining and intrusion detection.
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
An outlier is an observation that lies an abnormal distance from other values in a random sample
from a population. ... Examination of the data for unusual observations that are far removed from
the mass of data. These points are often referred to as outliers.
Example of an outlier box plot: The data set of N = 90 ordered observations as shown ...
Ways to describe data: Two activities are essential for characterizing a set of data
Low data quality and the presence of noise bring a huge challenge to outlier detection. ... Moreover,
noise and missing data may “hide” outliers and reduce the effectiveness of outlier detection: an
outlier may be disguised as a noise point, and an outlier detection method may mistakenly identify
a noise point as an outlier.
BASIS FOR COMPARISON: Basic
Classification: This model function classifies the data into one of numerous already defined classes.
Clustering: This function maps the data into one of the multiple clusters, where the arrangement of
data objects is based on the similarities between them.
1.Statistical Methods
Simply starting with visual analysis of the Univariate data by using Boxplots,
Scatter plots, Whisker plots, etc., can help in finding the extreme values in
the data.
2. Proximity Methods
3. Projection Methods
Projection methods utilize techniques such as the PCA to model the data into a
lower-dimensional subspace using linear correlations.
PART B
Answer:
(i)Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large
databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of
being applied to any kind of data such as interval-based (numerical) data, categorical,
and binary data.
• Discovery of clusters with attribute shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical cluster of small sizes.
• High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and
usable.
CLUSTER ANALYSIS
Points to Remember
• Scalability - We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes - Algorithms should be capable of being applied
to any kind of data such as interval-based (numerical) data, categorical, and binary data.
• Discovery of clusters with attribute shape - The clustering algorithm should be capable
of detecting clusters of arbitrary shape. They should not be bounded to only distance measures
that tend to find spherical clusters of small size.
• High dimensionality - The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
• Ability to deal with noisy data - Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability - The clustering results should be interpretable, comprehensible and usable.
Clustering Methods
• Kmeans
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
1. K-means
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is
the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no more new assignments are made.
2. Partitioning Methods
Typical methods: k-means and k-medoids.
3. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects:
• Agglomerative Approach
• Divisive Approach
4.Density-based Method
Clustering based on density (local cluster criterion), such as density-connected points
Major features: it can discover clusters of arbitrary shape and handle noise.
Two parameters: Eps (the maximum radius of the neighbourhood) and MinPts (the minimum
number of points required in an Eps-neighbourhood of a point).
Applications of Clustering
1. It is the backbone of search engine algorithms – where objects that are similar to each other must
be presented together and dissimilar objects should be ignored. Also, it is required to fetch
objects that are closely related to a search term, if not completely related.
2. A similar application of text clustering like search engine can be seen in academics where
clustering can help in the associative analysis of various documents – which can be in-turn used in
– plagiarism, copyright infringement, patent analysis etc.
3. Used in image segmentation in bioinformatics where clustering algorithms have proven their
worth in detecting cancerous cells from various medical imagery – eliminating the prevalent
human errors and other bias.
4. Netflix has used clustering in implementing movie recommendations for its users.
5. News summarization can be performed using Cluster analysis where articles can be divided into a
group of related topics.
3.What is clustering? Describe in detail about the features of K-means partitioning method. (13)
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the
same group than those in other groups. In simple words, the aim is to segregate groups
with similar traits and assign them into clusters.
• Compute the sum of the squared distances between the data points and all
centroids:
J = Σᵢ Σₖ wᵢₖ ‖xᵢ − μₖ‖²,
where wᵢₖ = 1 for data point xᵢ if it belongs to cluster k; otherwise, wᵢₖ = 0. Also,
μₖ is the centroid of xᵢ’s cluster.
It’s a minimization problem of two parts. We first minimize J w.r.t. wik and
treat μk fixed. Then we minimize J w.r.t. μk and treat wik fixed. Technically
speaking, we differentiate J w.r.t. wik first and update cluster assignments (E-
step). Then we differentiate J w.r.t. μk and recompute the centroids after the
cluster assignments from the previous step (M-step). Therefore, the E-step is:
wᵢₖ = 1 if k = argminⱼ ‖xᵢ − μⱼ‖², and wᵢₖ = 0 otherwise.
In other words, assign the data point xᵢ to the closest cluster as judged by its sum
of squared distance from the cluster’s centroid. The M-step is:
μₖ = Σᵢ wᵢₖ xᵢ / Σᵢ wᵢₖ,
which translates to recomputing the centroid of each cluster to reflect the new
assignments.
1. Get a meaningful intuition of the structure of the data we’re dealing with.
5.What is grid based clustering? With an example explain an algorithm for grid
based clustering.
Grid-Based Clustering
STING (a STatistical INformation Grid approach): the spatial area is divided into rectangular cells.
For each cell, the high level is partitioned into several smaller cells in the next lower level.
The statistical information of each cell is calculated and stored beforehand and is used to answer queries.
The parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells.
1. For each cell in the current level, compute the confidence interval.
2. Remove irrelevant cells from further consideration.
3. When finished examining the current layer, proceed to the next lower level.
WaveCluster
It is a multi-resolution clustering approach which applies wavelet transform to the feature space
A wavelet transform is a signal processing technique that decomposes a signal into different
frequency sub-band.
Input parameters:
CLIQUE (Clustering In QUEst)
It is based on automatically identifying the subspaces of high-dimensional data space that allow
better clustering than the original space.
1. It partitions each dimension into the same number of equal-length intervals.
2. It partitions an m-dimensional data space into non-overlapping rectangular units.
3. A unit is dense if the fraction of the total data points contained in the unit exceeds the input model
parameter.
Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution which is
characterized by the parameters:
▪ μk: the mean vector,
▪ Σk: the covariance matrix,
▪ an associated probability in the mixture. Each point has a probability of belonging to each
cluster.
2. DBSCAN
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to
find arbitrary shaped clusters and clusters with noise (i.e. outliers). The main idea behind
DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
DBSCAN is a clustering method that is used in machine learning to separate clusters of high
density from clusters of low density.
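A minimal sketch using scikit-learn's DBSCAN (assumed available); the coordinates, eps and min_samples values are illustrative only.

```python
# Minimal sketch (assumes scikit-learn): DBSCAN separating two dense groups
# from a lone noise point.
from sklearn.cluster import DBSCAN

points = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],     # dense group 1
          [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],     # dense group 2
          [20.0, 0.0]]                             # isolated point (noise)

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; label -1 marks noise/outliers
```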
Approaches:
1. Subspace clustering
Subspace clustering is an unsupervised learning problem that aims at
grouping data points into multiple clusters so that data points at a single
cluster lie approximately on a low-dimensional linear subspace. Subspace
clustering is an extension of feature selection just as with feature selection
subspace clustering requires a search method and evaluation criteria but in
addition subspace clustering limit the scope of evaluation criteria. The
subspace clustering algorithm localizes the search for relevant dimensions
and allows them to find the cluster that exists in multiple overlapping
subspaces. Subspace clustering was originally purposed to solved very
specific computer vision problems having a union of subspace structure in
the data but it gains increasing attention in the statistic and machine learning
community.
2. Projected clustering
3. Correlation clustering
Correlation clustering is performed on databases and other large data
sources to group together similar datasets, while also alerting the user to
dissimilar datasets. This can be done perfectly in some graphs, while
others will experience errors because it will be difficult to differentiate
similar from dissimilar data. In the case of the latter, correlation clustering
will help reduce error automatically. This is often used for data mining, or
to search unwieldy data for similarities. Dissimilar data are commonly
deleted, or placed into a separate cluster.
ii).Consider five points { X1, X2,X3, X4, X5} with the following coordinates as a two
dimensional sample for clustering: X1 = (0,2.5); X2 = (0,0); X3= (1.5,0); X4 = (5,0); X5 = (5,2)
Illustrate the K-means partitioning algorithm using the above data set. (7)
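A minimal from-scratch sketch of k-means (k = 2) on the five points above, starting from X1 and X4 as arbitrarily chosen initial centroids; empty clusters are not handled in this sketch.

```python
# Minimal sketch: k-means with k = 2 on the five points from the question.
points = [(0, 2.5), (0, 0), (1.5, 0), (5, 0), (5, 2)]   # X1 .. X5
centroids = [points[0], points[3]]                       # initial seeds X1 and X4

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for _ in range(10):                       # iterate until assignments stabilise
    clusters = [[], []]
    for p in points:                      # assignment step: nearest centroid
        k = min(range(2), key=lambda i: dist2(p, centroids[i]))
        clusters[k].append(p)
    centroids = [                         # update step: mean of each cluster
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]

print(clusters)    # {X1, X2, X3} and {X4, X5} for these seeds
print(centroids)   # final cluster centres
```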
8. i)How would you discuss the outlier analysis in detail ? (7)
Answer:
There are a wide range of techniques and tools used in outlier analysis.
Sorting: For an amateur data analyst, sorting is by far the easiest technique for
outlier analysis. The premise is simple: load your dataset into any kind of data
manipulation tool (such as a spreadsheet), and sort the values by their magnitude.
Graphing
An equally forgiving tool for outlier analysis is graphing. Once again, the premise is
straightforward: plot all of the data points on a graph, and see which points stand
out from the rest. The advantage of using a graphing approach over a sorting
approach is that it visualizes the magnitude of the data points, which makes it
much easier to spot outliers.
Z-score
A more statistical technique that can be used to identify outliers is the Z-score. The
Z-score measures how far a data point is from the average, as measured in
standard deviations. By calculating the Z-score for each data point, it’s easy to see
which data points are placed far from the average.
1. Statistical Methods
Simply starting with a visual analysis of the univariate data by using boxplots, scatter plots,
whisker plots, etc., can help in finding the extreme values in the data. Assuming a normal
distribution, calculate the z-score, which measures how many standard deviations (σ) a data
point lies from the sample’s mean. Because we know from the Empirical Rule that
68% of the data falls within one standard deviation, 95% within two standard
deviations, and 99.7% within three standard deviations from the mean, we can identify data
points that are more than three standard deviations away as outliers.
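A minimal sketch of the |z| > 3 rule; the sample values are made up, with one extreme value planted as the outlier.

```python
# Minimal sketch: flagging values more than three standard deviations
# from the mean (|z| > 3). Most values sit near 10-13; 95 is the planted outlier.
values = [10] * 5 + [11] * 5 + [12] * 5 + [13] * 4 + [95]

n = len(values)
mean = sum(values) / n
std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5

z_scores = {v: round((v - mean) / std, 2) for v in set(values)}
outliers = [v for v in values if abs((v - mean) / std) > 3]

print(z_scores)   # how far each distinct value lies from the mean, in sigmas
print(outliers)   # [95]
```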
2. Proximity Methods
Proximity-based methods deploy clustering techniques to identify the clusters in the data
and find out the centroid of each cluster. They assume that an object is an outlier if the
nearest neighbors of the object are far away in feature space; that is, the proximity of the
object to its neighbors significantly deviates from the proximity of most of the other objects
3. Projection Methods
Projection methods utilize techniques such as the PCA to model the data into a lower-
dimensional subspace using linear correlations. Post that, the distance of each data point to
a plane that fits the sub-space is calculated. This distance can be used then to find the
outliers. Projection methods are simple and easy to apply and can highlight irrelevant values.
(ii). With relevant examples summarize in detail about constraint based cluster analysis. (8)
10. Design statistical approaches in outlier detection with neat design and with examples. (13)
12.(i). Disucss in detail about the different types of data in cluster analysis. (5)
(I)
Data Matrix
This represents n objects, such as persons, with p variables (also called measurements or attributes),
such as age, height, weight, gender, race and so on. The structure is in the form of a relational table,
or n-by-p matrix (n objects x p variables)
The Data Matrix is often called a two-mode matrix since the rows and columns of this represent the
different entities.
Dissimilarity Matrix
This stores a collection of proximities that are available for all pairs of n objects. It is often
represented by a n – by – n table, where d(i,j) is the measured difference or dissimilarity between
objects i and j. In general, d(i,j) is a non-negative number that is close to 0 when objects i and j are
highly similar or “near” each other, and becomes larger the more they differ. Since d(i,j) = d(j,i) and
d(i,i) = 0, we have the matrix shown in the figure.
This is also called a one-mode matrix since its rows and columns represent the same entity.
(ii). Discuss the following clustering algorithm using examples.(8)
1. K-means
2. K-medoid.
13. Describe the applications and trends in data mining in detail.
Data mining concepts are still evolving and here are the latest trends that we get to see in
this field −
• Application Exploration.
• Scalable and interactive data mining methods.
• Integration of data mining with database systems, data warehouse systems and web
database systems.
• Standardization of data mining query language.
• Visual data mining.
• New methods for mining complex types of data.
• Biological data mining.
• Data mining and software engineering.
• Web mining.
• Distributed data mining.
• Real time data mining.
• Multi database data mining.
• Privacy protection and information security in data mining.
14. Why is outlier mining important? Briefly describe the different approaches behind
statistical-based outlier detection, distance-based outlier detection and deviation-based
outlier detection.
Outliers are an integral part of data analysis. An outlier can be defined as an observation point
that lies at an abnormal distance from other observations.
An outlier is important as it specifies an error in the experiment. Outliers are extensively
used in various areas such as detecting frauds, introducing potential new trends in the
market and others.
Usually, outliers are confused with noise. However, outliers are different from noise data in
the following sense:
1. Noise is a random error, but outlier is an observation point that is situated away from
different observations.
2. Noise should be removed for better outlier detection.
Distance-Based Outlier Detection:
For many discordancy tests, it can be shown that if an object, o, is an outlier according to
the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin.
For example, if objects that lie three or more standard deviations from the mean are
considered to be outliers, assuming a normal distribution, then this definition can be
generalized by a DB(0.9988, 0.13s) outlier. Several efficient algorithms for mining distance-
based outliers have been developed.
Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures
to identify exceptional objects. Instead, it identifies outliers by examining the main
characteristics of objects in a group. Objects that “deviate” from this description are
considered outliers. Hence, in this approach the term deviations is typically used to refer to
outliers. In this section, we study two techniques for deviation-based outlier detection. The
first sequentially compares objects in a set, while the second employs an OLAP data cube
approach.