DWM
Module-1
1. Explain the architecture of DW with a suitable diagram.
2. Write a short note on data warehouse design strategies.
3. Explain dimensional modeling with an example.
4. Define metadata with an example. Discuss the types of metadata with examples.
5. Define & differentiate star schema, snowflake schema, and fact constellation with examples.
6. Compare OLAP & OLTP systems. Describe the following OLAP operations using an
example: (a) Roll up (b) Drill down (c) Slice (d) Dice (e) Pivot
7. Explain the major steps in the ETL process with a suitable diagram and an example.
8. Differentiate between Data Warehouse & Data Mart.
9. Explain Factless Fact Table.
10. An electronics company has a sales department. Consider three dimensions, namely:
a) time b) product c) store
The schema contains a central fact table, sales, with two measures:
a) dollars_cost b) units_sold
Using the above example, describe the following operations:
a) Dice b) Slice c) Roll up d) Drill down
and draw the star and snowflake schemas.
Module-2
1. Explain the steps in the KDD process with a suitable diagram.
2. What is data mining? What are the techniques and applications of data mining?
3. Explain the architecture of data mining with a suitable diagram.
4. Explain types of attributes and data visualization for data exploration.
Module-3
2. What is tree pruning? Why is tree pruning useful in decision tree induction?
4. Create a classification model using a decision tree for the stock market involving only discrete
ranges; it has profit as a categorical attribute with values (up, down), and the training data is:
5. Why is naive Bayesian classification called "naive"? Apply the Naïve Bayes Classifier
algorithm to classify an unknown sample
X (outlook=sunny, Temperature=hot, Humidity=High, Windy=False); the sample data set is:
Module-4
Object | Attribute 1 | Attribute 2
Medicine A | 1 | 1
Medicine B | 2 | 1
Medicine C | 4 | 3
Medicine D | 5 | 4
5. Use the data given below. Create an adjacency matrix. Use the single link or complete link
algorithm to cluster the given data set. Draw the dendrogram.
Module-5
Transaction | Items
A | 1, 3, 4, 6
B | 2, 3, 5, 7
C | 1, 2, 3, 5, 8
D | 2, 5, 9, 10
E | 1, 4
Use the Apriori Algorithm with min-support 30% and min-confidence 75% to find all
frequent item sets and strong association rules.
Module-6
1. Explain spatial and web mining.
2. Explain web usage mining, text mining.
3. What is Web structure mining & the PageRank technique?
Priority 1 (Questions to prepare first)
A data warehouse is a single data repository where data from multiple data sources is integrated for online
business analytical processing (OLAP). This implies a data warehouse needs to meet the requirements of all
the business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy,
and hence error-prone process. Furthermore, business analytical functions change over time, which results in
changes in the requirements for the systems. Therefore, data warehouse and OLAP systems are dynamic, and
the design process is continuous.
Data warehouse design takes a different approach from view materialization in industry. It sees data
warehouses as database systems with particular needs, such as answering management-related queries. The
target of the design becomes how the data from multiple sources should be extracted, transformed, and
loaded (ETL) to be organized in a database as the data warehouse. There are two approaches:
1. "top-down" approach
2. "bottom-up" approach
In the "top-down" approach, the data warehouse is built first, and developing a new data mart from the data
warehouse is then very easy.
In the "bottom-up" approach, data marts include the lowest grain data and, if needed, aggregated data too.
Instead of a normalized database for the data warehouse, a denormalized dimensional database is adopted to
meet the data delivery requirements of data warehouses. Using this method, to use the set of data marts as the
enterprise data warehouse, data marts should be built with conformed dimensions in mind, meaning that
common objects are represented the same way in different data marts. The conformed dimensions connect the
data marts to form an integrated enterprise data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data
warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data
warehouse. Also, the risk of failure is even less. This method is inherently incremental: it is just developing new
data marts and then integrating them with other data marts. The locations of the data warehouse and the data
marts are reversed in the bottom-up approach design.
Design Approach
Top-Down Design | Bottom-Up Design
Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems first and integrates them.
Inherently architected - not a union of several data marts. | Inherently incremental; can schedule essential data marts first.
Single, central storage of information about the content. | Departmental information stored.
Centralized rules and control. | Departmental rules and control.
It includes redundant information. | Redundancy can be removed.
It may see quick results if implemented with repetitions. | Less risk of failure, favorable return on investment, and proof of techniques.
A star schema is a relational schema whose design represents a multidimensional data model. The star schema
is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of
this schema resembles a star, with points diverging from a central table. The center of the schema consists of a
large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key
of the fact table is generally a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables). A fact table generally contains facts with the same
level of aggregation.
Dimension Tables
A dimension is an architecture usually composed of one or more hierarchies that categorize data. If a
dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary keys of each of
the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to
define the dimensional values. They are generally descriptive, textual values. Dimension tables are usually
smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities),
clients, products, times, and channels.
o It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.
o It provides a parallel in design to how end-users typically think of and use the data.
Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema,
users can instantly analyze large, multidimensional data sets.
A star schema database has a limited number of tables and clear join paths, so queries run faster than they do
against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous.
Large join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two
dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two
tables. This design feature enforces accurate and consistent query results.
Structural simplicity also decreases the time required to load large batches of records into a star schema
database. By defining facts and dimensions and separating them into various tables, the impact of a load
operation is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts
regularly and selectively by appending records to the fact table.
A star schema has referential integrity built in when information is loaded. Referential integrity is enforced
because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate
foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a
dimension record violates this integrity.
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These
joins are more significant to the end-user because they represent the fundamental relationship between parts
of the underlying business. Customers can also browse dimension table attributes before constructing a query.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected
to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for
item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key,
branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state,
and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME, ITEM,
BRANCH, and LOCATION, instead of four columns for time data, four columns for item data, three columns for
branch data, and four columns for location data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension table, instead of
making many changes in the fact table.
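As a rough illustration of this layout, the sketch below builds a tiny SALES fact table and its dimension tables with pandas and runs a star-join query. The measure column dollars_sold and all sample values are assumptions added for illustration; they are not taken from the example above.

# a minimal star schema sketch using pandas (hypothetical sample values)
import pandas as pd

# dimension tables: one row per member, keyed by a surrogate key
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2023, 2023]})
item_dim = pd.DataFrame({"item_key": [10, 20], "item_name": ["TV", "Laptop"], "brand": ["A", "B"]})

# fact table: only foreign keys to the dimensions plus numeric measures
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 20, 10],
    "dollars_sold": [400.0, 900.0, 350.0],   # assumed measure
})

# a typical star-join query: total dollars sold per quarter and brand
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key")
          .groupby(["quarter", "brand"])["dollars_sold"].sum())
print(report)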
The snowflake schema is an expansion of the star schema where each point of the star explodes into more
points. It is called snowflake schema because the diagram of snowflake schema resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we
normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in
the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagramed with each fact
surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out
into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked
to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally
normalized to the third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake
schema can have any number of dimensions, and each dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table, with Store, Location, Time, Product, Line,
and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary
dimension table and Location as the outrigger dimension table. The Product dimension has three dimension
tables, with Product as the primary dimension table, and the Line and Family tables as the outrigger dimension
tables.
A star schema stores all attributes for a dimension in one denormalized table. This needs more disk space
than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with
low cardinality into separate dimension tables that relate to the core dimension table by using foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended, because it can adversely
affect query performance. In a snowflake schema, tables are normalized to remove redundancy; the dimension
tables are decomposed into multiple related tables.
The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes
quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension
tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized version now
extends to eleven tables. We will notice that in the snowflake schema, the attributes with low cardinality in each
original dimension table are removed to form separate tables. These new tables are connected back to the
original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is
suitable for many-to-many and one-to-many relationships between dimension levels.
1. The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due
to the increasing number of lookup tables.
o In a star schema, the fact table will be at the center and is connected to the dimension tables.
Snowflake Schema
o A snowflake schema is an extension of the star schema where the dimension tables are connected to one or
more other dimension tables.
o The performance of SQL queries is a bit lower when compared to a star schema, as more joins are involved.
o Data redundancy is low, and it occupies less disk space when compared to a star schema.
Star Schema | Snowflake Schema
4. It takes less time for the execution of queries. | It takes more time than star schema for the execution of queries.
5. In star schema, normalization is not used. | Both normalization and denormalization are used.
7. The query complexity of star schema is low. | The query complexity of snowflake schema is higher than star schema.
9. It has fewer foreign keys. | It has more foreign keys.
10. It has high data redundancy. | It has low data redundancy.
What is Fact Constellation Schema?
A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a Galaxy
schema or a multi-fact star schema.
Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact Constellation
Schema can be designed with a collection of de-normalized FACT, Shared, and Conformed Dimension tables.
Fact Constellation Schema is a sophisticated database design in which it is difficult to summarize information. A
Fact Constellation Schema can be implemented between aggregate fact tables or by decomposing a complex fact
table into independent simple fact tables.
The dimensions are the perspectives or entities concerning which an organization keeps records. For example,
a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and
location. These dimensions allow the store to keep track of things such as monthly sales of items and the
locations at which the items were sold. Each dimension has a table related to it, called a dimension table,
which describes the dimension further. For example, a dimension table for an item may contain the attributes
item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension tables.
Now, suppose we want to view the sales data with a third dimension. For example, suppose the data according to time
and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D
data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as shown in fig:
1. ETL stands for Extract, Transform, Load, and it is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a data
warehouse, and then load it into the warehouse. The process of ETL can be broken down into the
following stages:
2. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems.
3. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.
5. The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data warehouse is
accurate, complete, and up-to-date. It also helps to ensure that the data is in the format
required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as Informatica,
Talend, DataStage, and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing, and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms it in
the staging area, and then finally loads it into the Data Warehouse system.
1. Extraction:
The extracted data is loaded into the staging area first and not directly into the data warehouse, because the
extracted data is in various formats and can be corrupted also. Hence, loading it directly into the data
warehouse may damage it, and rollback will be much more difficult. Therefore, this is one of the
most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions
is applied on the extracted data to convert it into a single standard format. It may involve the
following processes/tasks:
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, etc. to a
single standard value such as USA, and so on.
ETL Tools: The most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse Builder,
CloverETL, and MarkLogic.
Data Warehouses: The most commonly used Data Warehouses are Snowflake, Redshift, BigQuery,
and Firebolt.
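To make the three stages concrete, here is a minimal, hypothetical ETL sketch in Python with pandas and SQLite; the file name, column names, and cleaning rules are assumptions chosen for illustration and do not correspond to any particular ETL tool.

# a minimal, hypothetical ETL sketch
import sqlite3
import pandas as pd

# Extract: read raw records from a source file (file name is an assumption)
raw = pd.read_csv("sales_source.csv")

# Transform: clean and standardize the data in a staging DataFrame
staged = raw.copy()
staged["country"] = staged["country"].replace({"U.S.A": "USA", "United States": "USA"})
staged["amount"] = staged["amount"].fillna(0).astype(float)   # fill NULLs with a default value
staged["loaded_at"] = pd.Timestamp.now()                      # new derived field

# Load: write the transformed rows into a warehouse table
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("sales_fact", conn, if_exists="append", index=False)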
ADVANTAGES AND DISADVANTAGES:
Advantages:
1. Improved data quality: The ETL process ensures that the data in the data warehouse is accurate,
complete, and up-to-date.
2. Better data integration: The ETL process helps to integrate data from multiple sources and systems.
Disadvantages:
1. High cost: ETL process can be expensive to implement and maintain, especially for
organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement, especially for
organizations that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be able to
handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it may not be able to
handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as large amounts
of data are collected, stored, and analyzed.
Explain Data Mining? Describe the steps involved in Data Mining when viewed as a process
of Knowledge Discovery.
In other words, we can say that Data Mining is the process of investigating hidden patterns in data from various
perspectives and categorizing them into useful information. This information is collected and assembled in particular
areas such as data warehouses, and it supports efficient analysis and decision making, eventually helping to cut
costs and generate revenue.
Data mining is the act of automatically searching for large stores of information to find trends and patterns
that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms for data
segments and evaluates the probability of future events. Data Mining is also called Knowledge Discovery of
Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve business
problems.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a particular data set,
with an objective. This process includes various types of services such as text mining, web mining, audio and
video mining, pictorial data mining, and social media mining. It is done through software that is simple or
highly specific. By outsourcing data mining, all the work can be done faster with low operation costs.
Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are
tonnes of information available on various platforms, but very little knowledge is accessible. The biggest
challenge is to analyze the data to extract important information that can be used to solve a problem or for
company development. There are many powerful instruments and techniques available to mine data and find
better insights from it.
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and columns
from which data can be accessed in various ways without having to recognize the database tables. Tables
convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization to
provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing
and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than for transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT professionals utilize
the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of
databases where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the Relational
database and the object-oriented model practices frequently utilized in many programming languages, for
example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the potential to undo a
database transaction if it is not performed appropriately. Even though this was a unique capability a very long
while back, today, most of the relational database systems support transactional database activities.
o Data mining enables organizations to make lucrative modifications in operation and production.
o It Facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviors.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short
time.
o There is a probability that organizations may sell useful data of their customers to other organizations for
money. As per reports, American Express has sold credit card purchases of their customers to other
organizations.
o Many data mining analytics software packages are difficult to operate and need advance training to work on.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their
design. Therefore, the selection of the right data mining tool is a very challenging task.
o The data mining techniques are not precise, so they may lead to severe consequences in certain
conditions.
The following are the areas where data mining is widely used:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for
better insights and to identify best practices that will enhance health care services and reduce costs. Analysts
use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft
computing, and statistics. Data mining can be used to forecast patients in each category. The procedures
ensure that the patients get intensive care at the right place and at the right time. Data mining also enables
the detection of fraud and abuse in healthcare.
Market basket analysis is a modeling method based on a hypothesis: if you buy a specific group of products,
then you are more likely to buy another group of products. This technique may enable the retailer to
understand the purchase behavior of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly. Using this, analytical comparisons of
results between various stores and between customers in different demographic groups can be done.
Education data mining is a newly emerging field, concerned with developing techniques that explore
knowledge from the data generated from educational Environments. EDM objectives are recognized as
affirming student's future learning behavior, studying the impact of educational support, and promoting
learning science. An organization can use data mining to make precise decisions and also to predict the results
of the student. With the results, the institution can concentrate on what to teach and how to teach.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial to
find patterns in a complex manufacturing process. Data mining can be used in system-level designing to obtain
the relationships between product architecture, product portfolio, and data needs of the customers. It can also
be used to forecast the product development period, cost, and expectations among the other tasks.
Customer Relationship Management (CRM) is all about obtaining and holding Customers, also enhancing
customer loyalty and implementing customer-oriented strategies. To get a decent relationship with the
customer, a business organization needs to collect data and analyze the data. With data mining technologies,
the collected data can be used for analytics.
Billions of dollars are lost to the action of frauds. Traditional methods of fraud detection are a little bit time
consuming and sophisticated. Data mining provides meaningful patterns and turning data into information. An
ideal fraud detection system should protect the data of all the users. Supervised methods consist of a collection
of sample records, and these records are classified as fraudulent or non-fraudulent. A model is constructed
using this data, and the technique is made to identify whether the document is fraudulent or not.
Apprehending a criminal is not a big deal, but bringing out the truth from him is a very challenging task. Law
enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it seeks meaningful patterns in data, which
is usually unstructured text. The information collected from previous investigations is compared, and a model
of the behavior is constructed to support the investigation.
The digitalization of the banking system is supposed to generate an enormous amount of data with every new
transaction. The data mining technique can help bankers to solve business-related problems in banking and
finance by identifying trends, causalities, and correlations in business information and market costs that are not
instantly evident to managers or executives, because the data volume is too large or is produced too rapidly for
experts to analyze on screen. Managers may use these data for better targeting, acquiring, and retaining
profitable customers.
The process of extracting useful data from large volumes of data is data mining. The data in the real world is
heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These
problems may occur due to errors of the data measuring instruments or because of human errors. Suppose a retail
chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the
information into their system. A person may make a digit mistake when entering a phone number, which
results in incorrect data. Even some customers may not be willing to disclose their phone numbers, which
results in incomplete data. The data could get changed due to human or system error. All these consequences
(noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-worlds data is usually stored on various platforms in a distributed computing environment. It might be in
a database, individual systems, or even on the internet. Practically, It is a quite tough task to make all the data
to a centralized data repository mainly due to organizational and technical concerns. For example, various
regional offices may have their own servers to store their data. It is not feasible to store all the data from all the
offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow
the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful
information is a tough task. Most of the time, new technologies, new tools, and methodologies would have to
be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques used. If
the designed algorithm and techniques are not up to the mark, then the efficiency of the data mining process
will be affected adversely.
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a
retailer analyzes the details of the purchased items, then it reveals data about buying habits and preferences of
the customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that shows the
output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends
to express. But many times, representing the information to the end-user in a precise and easy way is difficult.
Since the input data and the output information are complicated, very efficient and successful data visualization
processes need to be implemented to make it successful.
https://www.youtube.com/watch?v=XzSlEA4ck2I&ab_channel=MaheshHuddar
https://www.youtube.com/watch?v=coOTEc-0OGw&ab_channel=MaheshHuddar
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each
data point belongs to only one group that has similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the categories of groups
in the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats
the process until it does not find the best clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each
cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the model is ready.
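The steps above can be written as a short from-scratch sketch. The NumPy implementation below is only illustrative (Euclidean distance, a fixed number of iterations, and made-up sample points are assumptions); it is not the sklearn-based code used later in these notes.

# a minimal from-scratch K-means sketch with NumPy
import numpy as np

def kmeans(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step-1/2: choose k and pick k random points from the data as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step-3: assign each data point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4/5: move each centroid to the mean of the points assigned to it
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# made-up example with two variables (like M1 and M2 below)
data = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels, centroids = kmeans(data, k=2)
print(labels, centroids)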
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
gn
in
er
ne
gi
en
@
:-
am
gr
le
Te
in
Jo
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be either
the points from the dataset or any other point. So, here we are selecting the below two points as k
points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance between two
points. So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to
the right of the line are close to the yellow centroid. Let's color them as blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by choosing a new centroid. To
choose the new centroids, we will compute the center of gravity of these centroids, and will find new
centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and two blue points are right
to the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as
shown in the below image:
o As we got the new centroids so again will draw the median line and reassign the data points. So, the
image will be:
o We can see in the above image; there are no dissimilar data points on either side of the line, which
means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as
shown in the below image:
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient clusters that it forms. But
choosing the optimal number of clusters is a big task. There are some different ways to find the optimal
number of clusters, but here we are discussing the most appropriate method to find the number of clusters or
value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses
the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
∑(Pi in Cluster1) distance(Pi, C1)²: it is the sum of the squares of the distances between each data point and its centroid
within Cluster1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean distance
or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend or a point of the plot looks like an arm, then that point is considered as the
best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The
graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose the number of
clusters equal to the data points, then the value of WCSS becomes zero, and that will be the endpoint of the
plot.
Before implementation, let's understand what type of problem we will solve here. So, we have a dataset
of Mall_Customers, which is the data of customers who visit the mall and spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is
the calculated value of how much a customer has spent in the mall, the more the value, the more he has spent).
From this dataset, we need to calculate some patterns, as it is an unsupervised method, so we don't know what
to calculate exactly.
The steps to be followed for the implementation are given below:
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the dataset
o Visualizing the clusters
The first step will be the data pre-processing, as we did in our earlier topics of Regression and Classification.
But for the clustering problem, it will be different from other models. Let's discuss it:
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model, which is part of data pre-
processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, we have imported numpy for performing mathematical calculations, matplotlib for
plotting the graph, and pandas for managing the dataset.
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the
below image:
From the above dataset, we need to find some patterns in it.
Here we don't need any dependent variable for data pre-processing step as it is a clustering problem, and we
have no idea about what to determine. So we will just add a line of code for the matrix of features.
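That line is not reproduced in these notes; it presumably looks like the sketch below, where the dataset variable name, the file name, and the column positions are assumptions based on the surrounding description:

dataset = pd.read_csv('Mall_Customers.csv')   # assumed file name for the Mall_Customers data
x = dataset.iloc[:, [3, 4]].values            # columns for Annual Income and Spending Score (the "3rd and 4th features" mentioned below)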
As we can see, we are extracting only the 3rd and 4th features. It is because we need a 2D plot to visualize the model,
and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as
discussed above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on the Y-axis
and the number of clusters on the X-axis. So we are going to calculate the value of WCSS for different k values
ranging from 1 to 10.
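The code block itself did not survive in these notes; a sketch of what it presumably looked like is given below (the variable names wcss_list and x come from the surrounding description, while the KMeans parameters are assumptions):

# finding the optimal number of clusters using the elbow method (sketch)
from sklearn.cluster import KMeans

wcss_list = []                       # empty list to hold the WCSS value for each k
for i in range(1, 11):               # k ranging from 1 to 10
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)    # inertia_ is the WCSS of the fitted model
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()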
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable to initialize an empty list, which is used to contain the value of
wcss computed for different values of k ranging from 1 to 10.
After that, we have initialized the for loop for the iteration on different values of k ranging from 1 to 10; since a
for loop in Python excludes the upper bound, the range is taken as 11 to include the 10th value.
The rest part of the code is similar as we did in earlier topics, as we have fitted the model on a matrix of
features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will be 5.
As we have got the number of clusters, so we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the above section, but here,
instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given
below:
y_predict = kmeans.fit_predict(x)
The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the dependent variable y_predict to train the model.
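For completeness, the full two-line snippet described here would look roughly like this; the KMeans parameters are assumptions carried over from the elbow-method sketch above:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)   # 5 clusters, as found by the elbow method
y_predict = kmeans.fit_predict(x)                                  # cluster index assigned to each customer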
By executing the above lines of code, we will get the y_predict variable. We can check it under the variable
explorer option in the Spyder IDE. We can now compare the values of y_predict with our original dataset.
Consider the below image:
From the above image, we can now relate that CustomerID 1 belongs to cluster 3 (as the index starts from 0,
hence 2 will be considered as 3), CustomerID 2 belongs to cluster 4, and so on.
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster
one by one.
To visualize the clusters, we will use a scatter plot using the mtp.scatter() function of matplotlib.
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
In the above lines of code, we have written one scatter call for each cluster, ranging from 1 to 5. The first argument of
mtp.scatter, i.e., x[y_predict == 0, 0], selects the first feature (x values) of the points assigned to that cluster, and
x[y_predict == 0, 1] selects the second feature (y values); y_predict holds the cluster index of each point.
Output:
The output image clearly shows the five different clusters with different colors. The clusters are formed
between two parameters of the dataset: Annual Income of the customer and Spending Score. We can change the
colors and labels as per the requirement or choice. We can also observe some points from the above patterns,
which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize these
customers as standard.
o Cluster2 shows the customers with a high income but low spending, so we can categorize them as careful.
o Cluster3 shows the customers with low income and also low spending, so they can be categorized as sensible.
o Cluster4 shows the customers with low income but very high spending, so they can be categorized
as careless.
o Cluster5 shows the customers with high income and high spending, so they can be categorized as
target, and these customers can be the most profitable customers for the mall owner.
Explain Web Structure Mining + Explain the PageRank Algorithm in Web Mining in Detail
The challenge for Web structure mining is to deal with the structure of the hyperlinks within the web itself. Link
analysis is an old area of research. However, with the growing interest in Web mining, the research of structure
analysis has increased. These efforts resulted in a newly emerging research area called Link Mining, which is
located at the intersection of the work in link analysis, hypertext, web mining, relational learning, inductive logic
programming, and graph mining.
Web structure mining uses graph theory to analyze a website's node and connection structure. According to
the type of web structural data, web structure mining can be divided into two kinds:
o Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects
the web page to a different location.
o Mining the document structure: analysis of the tree-like structure of page structures to describe
HTML or XML tag usage.
The web contains a variety of objects with almost no unifying structure, with differences in the authoring style
and content much greater than in traditional collections of text documents. The objects in the WWW are web
pages, and links are in, out, and co-citation (two pages linked to by the same page). Attributes include HTML
tags, word appearances, and anchor texts. Web structure mining includes the following terminology, such as:
o Node: web pages.
o Edge: hyperlinks.
Link mining had produced some agitation on some traditional data mining tasks. Below we summarize some of
these possible tasks of link mining which are applicable in Web structure mining, such as:
1. Link-based Classification: The most recent upgrade of a classic data mining task to linked Domains.
The task is to predict the category of a web page based on words that occur on the page, links between
pages, anchor text, html tags, and other possible attributes found on the web page.
2. Link-based Cluster Analysis: The data is segmented into groups, where similar objects are grouped
together, and dissimilar objects are grouped into different groups. Unlike the previous task, link-based
cluster analysis is unsupervised and can be used to discover hidden patterns from data.
3. Link Type: There is a wide range of tasks concerning predicting the existence of links, such as
predicting the type of link between two entities or predicting the purpose of a link.
5. Link Cardinality: The main task is to predict the number of links between objects.
Page categorization is also used for:
o Finding duplicated websites and finding out the similarity between them.
Let us create a table of the 0th Iteration, 1st Iteration, and 2nd Iteration.
NODES | ITERATION 0 | ITERATION 1 | ITERATION 2
A | 1/6 = 0.16 | 0.3 | 0.392
B | 1/6 = 0.16 | 0.32 | 0.3568
C | 1/6 = 0.16 | 0.32 | 0.3568
D | 1/6 = 0.16 | 0.264 | 0.2714
E | 1/6 = 0.16 | 0.264 | 0.2714
F | 1/6 = 0.16 | 0.392 | 0.4141
Iteration 0:
For iteration 0, assume that each page has page rank = 1 / (total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 = 0.16
Iteration 1:
By using the PageRank formula PR(P) = (1 - d) + d * [PR(Q1)/C(Q1) + ... + PR(Qn)/C(Qn)], where Q1...Qn are the
pages linking to P, C(Q) is the number of outgoing links of page Q, and the damping factor d = 0.8:
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.16/4 + 0.16/2)
= 0.2 + 0.8 * 0.12 ≈ 0.3
So, what we have done here is: for node A, we see how many incoming links there are; here we
have PR(B) and PR(C). For each of the incoming links, we see the number of outgoing links from
that particular page, i.e., for PR(B) we have 4 outgoing links and for PR(C) we have 2
outgoing links. The same procedure applies for the remaining nodes and iterations.
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
The remaining nodes of iteration 1 are computed in the same way:
PR(C) = 0.32, PR(D) = 0.264, PR(E) = 0.264, PR(F) = 0.392
Iteration 2 (again using the updated ranks):
PR(A) = 0.392, PR(B) = 0.3568, PR(C) = 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(E) = 0.2714
PR(F) = 0.4141
So, the final PageRank values for the above-given question (after iteration 2) are:
NODES | ITERATION 0 | ITERATION 1 | ITERATION 2
A | 0.16 | 0.3 | 0.392
B | 0.16 | 0.32 | 0.3568
C | 0.16 | 0.32 | 0.3568
D | 0.16 | 0.264 | 0.2714
E | 0.16 | 0.264 | 0.2714
F | 0.16 | 0.392 | 0.4141
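The same iterative computation can be expressed in a few lines of Python. The sketch below is generic: it uses the simplified formula PR(P) = (1 - d) + d * Σ PR(Q)/C(Q) with d = 0.8 as in the worked example, but the small graph in it is hypothetical, since the actual graph figure for the question above is not reproduced in these notes.

# a generic iterative PageRank sketch (simplified formula used above)
def pagerank(links, d=0.8, iterations=2):
    # links maps each page to the list of pages it links to
    pages = list(links)
    pr = {p: 1 / len(pages) for p in pages}            # iteration 0: 1 / total number of nodes
    for _ in range(iterations):
        for p in pages:
            incoming = [q for q in pages if p in links[q]]
            # updated ranks are reused immediately, as in the note above
            pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    return pr

# hypothetical example graph, not the one from the question above
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))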
Priority 2
5. Integration and transformation rules used to deliver information to end-user analytical tools.
Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata allows users
access to help understand the content and find data.
1. A library catalog may be considered metadata. The directory metadata consists of several predefined
components representing specific attributes of a resource, and each item can have one or more values.
These components could be the name of the author, the name of the document, the publisher's name,
the publication date, and the methods to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the
person's weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the data
value 80.
4. Another example of metadata is data about the tables and figures in a report like this book. A table
(which is a record) has a name (e.g., the table title), and there are column names of the tables that may be
considered metadata.
o First, it acts as the glue that links all parts of the data warehouses.
o Next, it provides information about the contents and structures to the developers.
o Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.
Metadata is like a nerve center. Various processes during the building and administering of the data
warehouse generate parts of the data warehouse metadata, and parts of the metadata generated by one
process are used by another. In the data warehouse, metadata assumes a key position and enables
communication among various processes. It acts as a nerve center in the data warehouse.
Types of Metadata
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the enterprise. These
source systems include different data structures. The data elements selected for the data warehouse have
various field lengths and data types.
In selecting data from the source systems for the data warehouse, we split records, combine parts of
documents from different source files, and deal with multiple coding schemes and field lengths. When we
deliver information to the end-users, we must be able to tie that back to the source data sets. Operational
metadata contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the removal of data from the source systems,
namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this
category of metadata contains information about all the data transformation that takes place in the data
staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to find data
from the data warehouses. The end-user metadata allows the end-users to use their business terminology and
look for the information in those ways in which they usually think of the business.
2. Enabling users to control and manage the access and manipulation of metadata in their unique
environments.
3. Users are allowed to build tools that meet their needs and that will also enable them to adjust
accordingly to changing requirements.
4. Allowing individual tools to satisfy their metadata requirements freely and efficiently within the context
of an interchange model.
5. Describing a simple, clean implementation infrastructure which will facilitate compliance and speed up
adoption by minimizing the amount of modification.
6. To create a procedure and process not only for maintaining and establishing the interchange standard
specification but also for updating and extending it over time.
It is based on a framework that will translate an access request into the standard
interchange index.
o Procedural Approach
o ASCII Batch Approach
o Hybrid Approach
In a procedural approach, the communication with API is built into the tool. It enables the highest degree of
flexibility.
The ASCII Batch approach relies on an ASCII file format which contains information about the various
metadata items and standardized access requirements that make up the interchange standard metadata
model.
1) The standard metadata model, which refers to the ASCII file format used to represent the metadata
being exchanged.
2) The standard access framework that describes the minimum number of API functions.
4) The user configuration is a file explaining the legal interchange paths for metadata in the user's environment.
Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. The software of metadata
repository management can be used to map the source data to the target database, integrate and transform
the data, generate code for data transformation, and move data to the warehouse.
8. It gives a useful data administration tool to manage corporate information assets with the data dictionary.
OLAP vs OLTP
Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only operational current data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Purpose | It serves the purpose to extract information for analysis and decision-making. | It serves the purpose to insert, update, and delete information from the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, as the historical data is archived, in MB and GB.
Queries | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast, as the queries operate on only about 5% of the data.
Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database.
Backup and Recovery | It only needs a backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously.
Processing time | The processing of complex queries can take a lengthy time. | It is comparatively fast in processing because of simple and straightforward queries.
Types of users | This data is generally managed by the CEO, MD, and GM. | This data is managed by clerks and managers.
Operations | Only read and rarely write operations. | Both read and write operations.
Updates | Data is refreshed with lengthy, scheduled batch operations. | The user initiates data updates, which are brief and fast.
Nature of audience | The process is focused on the market. | The process is focused on the customer.
Database Design | Design with a focus on the subject. | Design that is focused on the application.
Productivity | Improves the efficiency of business analysts. | Enhances the user's productivity.
When creating a machine learning or data mining project, we do not always come across clean and formatted data, and before doing any operation with the data it is necessary to clean it and put it into a formatted form. For this, we use the data preprocessing task. The main steps include:
o Importing libraries
o Importing datasets
o Feature scaling
A minimal sketch of these steps is given below.
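The sketch below illustrates these preprocessing steps with pandas and scikit-learn; the file name data.csv and the column names Age, Salary are hypothetical and only serve the example.

```python
# Minimal data preprocessing sketch (assumed file "data.csv" with
# numeric columns "Age" and "Salary" -- hypothetical names).
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Importing the dataset
df = pd.read_csv("data.csv")

# Handling missing values by replacing them with the column mean
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

# Feature scaling: standardize numeric features to zero mean, unit variance
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

print(df.head())
```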
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean or boundary values (see the sketch after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
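A minimal sketch of smoothing by bin means, assuming equal-frequency bins over a small sorted list of values; the data values are made up for illustration:

```python
# Smoothing noisy data by bin means (equal-frequency binning).
# The sample values below are illustrative only.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4  # each bin holds 4 values

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # replace every value in the bin by the bin mean
    smoothed.extend([round(mean, 2)] * len(bin_values))

print(smoothed)
# [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
```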
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch follows this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
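A minimal sketch of min-max normalization to the range 0.0 to 1.0; the attribute values are illustrative assumptions:

```python
# Min-max normalization: x' = (x - min) / (max - min), scaling values to [0, 1].
values = [200, 300, 400, 600, 1000]  # illustrative attribute values

lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```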
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset while preserving the important information. This is done to improve the efficiency of data analysis and to avoid overfitting of the model. Some common steps involved in data reduction are listed below (a short sketch follows the list):
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features. It can be done using techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size of the dataset while preserving the important information. It can be done using techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
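A minimal sketch of two of these reduction techniques, PCA-based feature extraction and random sampling, using scikit-learn and NumPy on made-up data:

```python
# Data reduction sketch: PCA feature extraction and random sampling.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))     # illustrative dataset: 100 rows, 10 features

# Feature extraction: project onto the first 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)             # (100, 3)

# Sampling: keep a random 20% of the rows
idx = rng.choice(len(X), size=20, replace=False)
X_sample = X[idx]
print(X_sample.shape)              # (20, 10)
```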
Data Warehouse vs Data Mart
2. In a data warehouse, light denormalization takes place. While in a data mart, high denormalization takes place.
3. Data warehouse is a top-down model. While data mart is a bottom-up model.
4. To build a warehouse is difficult. While to build a mart is easy.
5. In a data warehouse, fact constellation schema is used. While in a data mart, star schema and snowflake schema are used.
6. Data warehouse is flexible. While data mart is not flexible.
7. Data warehouse is data-oriented in nature. While data mart is project-oriented in nature.
8. Data warehouse has a long life. While data mart has a shorter life than a warehouse.
9. In a data warehouse, data is contained in detailed form. While in a data mart, data is contained in summarized form.
10. Data warehouse is vast in size. While data mart is smaller than a warehouse.
11. The data warehouse might be somewhere between 100 GB and 1 TB+ in size. The size of a data mart is less than 100 GB.
12. The time it takes to implement a data warehouse might range from months to years. The data mart deployment procedure is time-limited to a few months.
13. It uses a lot of data and has comprehensive operational data. Operational data is not present in a data mart.
14. It collects data from various data sources. A data mart generally stores data from a data warehouse.
15. Long time for processing the data because of large data. Less time for processing the data because only a small amount of data is handled.
16. Complicated design process for creating schemas and views. Easy design process for creating schemas and views.
Hierarchical clustering is of two types:
o Agglomerative Clustering
o Divisive Clustering
Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects. In agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner. Initially, each data object is in its own cluster; at each iteration, the most similar clusters are merged until a single cluster remains. The basic steps are:
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (find the proximity matrix).
3. Merge the clusters that are closest to each other.
4. Recalculate the proximity matrix and repeat until only a single cluster remains.
Let's understand this concept with the help of a graphical representation using a dendrogram. With the given demonstration we can understand how the actual algorithm works; no calculation is done here, and all the proximities among the clusters are assumed.
Step 1:
Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance between each individual cluster and all the other clusters.
Step 2:
Now merge the comparable clusters into a single cluster. Let's say cluster Q and cluster R are similar to each other, so we merge them in this step, and similarly for S and T. We get the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)] together, giving [(P), (QR), (STV)].
Step 4:
Repeat the same process. The clusters (STV) and (QR) are comparable and are combined to form a new cluster. Now we have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)].
Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, all the data points are initially considered as a single cluster, and in every iteration the data points that are not similar are separated from the cluster. The separated data points are treated as individual clusters. Finally, we are left with N clusters.
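A minimal sketch of agglomerative (single-link) hierarchical clustering using SciPy; the six 2-D points are illustrative stand-ins for P..V:

```python
# Agglomerative hierarchical clustering with single linkage (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Illustrative 2-D points standing in for P, Q, R, S, T, V
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Build the merge tree bottom-up using single-link (minimum) distance
Z = linkage(points, method="single", metric="euclidean")

# Cut the tree into, e.g., 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # cluster label for each point
# dendrogram(Z)        # would draw the dendrogram with matplotlib
```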
Disadvantages of hierarchical clustering
o It breaks large clusters.
o Once a merge or split has been performed, it can never be undone.
Design the data warehouse for a wholesale furniture company. [10] The data warehouse has to allow analyzing the company's situation at least with respect to the Furniture, Customer and Time dimensions. Moreover, the company needs to analyze: the furniture with respect to its type, category and material; the customer with respect to their spatial location, by considering at least cities, regions and states. The company is interested in learning the quantity, income and discount of its sales.
A dimensional model is used to analyze data/business facts with respect to business dimensions, for example customers, products, etc.
DW for the furniture company (a schema sketch follows):
Dimensions:
1. Furniture: type, category, material.
2. Customer: name, street, city, region, state.
3. Time: date, day of week, day of month, week, month, quarter, half year.
Facts:
Quantity, income and % discount.
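A minimal sketch of this star schema created with Python's built-in sqlite3 module; the table and column names follow the dimensions and facts above and are otherwise assumptions:

```python
# Star schema sketch for the furniture data warehouse (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_furniture (furniture_id INTEGER PRIMARY KEY,
                            type TEXT, category TEXT, material TEXT);
CREATE TABLE dim_customer  (customer_id INTEGER PRIMARY KEY,
                            name TEXT, street TEXT, city TEXT,
                            region TEXT, state TEXT);
CREATE TABLE dim_time      (time_id INTEGER PRIMARY KEY,
                            date TEXT, day_of_week TEXT, month INTEGER,
                            quarter INTEGER, half_year INTEGER);
-- Fact table: one row per sale, with foreign keys to the three dimensions
CREATE TABLE fact_sales    (furniture_id INTEGER, customer_id INTEGER,
                            time_id INTEGER, quantity INTEGER,
                            income REAL, discount_pct REAL,
                            FOREIGN KEY(furniture_id) REFERENCES dim_furniture(furniture_id),
                            FOREIGN KEY(customer_id)  REFERENCES dim_customer(customer_id),
                            FOREIGN KEY(time_id)      REFERENCES dim_time(time_id));
""")
print("star schema created")
```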
Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The equation of linear regression is
Y = a + b*X + e
where Y is the target (dependent) variable, X is the independent variable, a is the intercept, b is the slope of the line, and e is the error term.
In linear regression, the best-fit line is obtained using the least squares method, which minimizes the total sum of the squares of the deviations of each data point from the regression line. Here, the positive and negative deviations do not get canceled, as all the deviations are squared.
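A minimal sketch of fitting a and b by least squares with NumPy; the x and y values are made up for illustration:

```python
# Least-squares fit of Y = a + b*X using NumPy (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# np.polyfit returns the slope b and intercept a that minimize
# the sum of squared deviations from the line
b, a = np.polyfit(x, y, deg=1)
print(f"a (intercept) = {a:.3f}, b (slope) = {b:.3f}")

y_pred = a + b * x
sse = np.sum((y - y_pred) ** 2)   # sum of squared errors being minimized
print(f"sum of squared errors = {sse:.4f}")
```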
Data Visualization
Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data. In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.
Data visualizations are common in everyday life, and they usually appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations in the form of line charts to display change over time. Bar and column charts are useful for observing relationships and making comparisons. A pie chart is a great way to show parts of a whole. And maps are the best way to share geographical data.
American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and complete. After the data is ready to visualize, you need to pick the right chart. After you have decided on the chart type, you need to design and customize your visualization to your liking. Simplicity is essential - you don't want to add any elements that distract from the data.
The concept of using pictures to understand data was launched in the 17th century with maps and graphs, and in the early 1800s the pie chart was invented.
Several decades later, one of the most advanced examples of statistical graphics occurred when Charles Minard mapped Napoleon's invasion of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, and ties that information to temperature and time scales for a more in-depth understanding of the event.
Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few years.
Data visualization is an easy and quick way to convey concepts universally. You can experiment with a different layout by making a slight adjustment.
Data visualization tools have been necessary for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easier to operate than earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.
Among the reasons to use data visualization:
5. To perform competitive analysis.
6. To improve insights.
What is the relationship between data warehousing and data replication? Which form of replication (synchronous or asynchronous) is better suited for data warehousing? Why? Explain with an appropriate example. [10]
Data warehousing and data replication can be used together to improve the performance, reliability, and
scalability of the data warehousing environment.
Data replication is the process of creating and maintaining multiple copies of the same data in different
locations or on different systems to improve fault tolerance, data availability, and disaster recovery capabilities.
It can be done in many ways, including synchronous replication and asynchronous replication.
Synchronous replication involves replicating changes to data in real-time as soon as they occur, ensuring that
multiple copies of the data are always synchronized with each other. Asynchronous replication, on the other
hand, involves replicating data changes on a scheduled or periodic basis, resulting in some delay between
updates to the original data and the replicated copies.
In a data warehousing environment, asynchronous replication can be the better choice because it allows for a more flexible and scalable architecture. Since data warehouses are often subject to large volumes of data and complex data transformations, synchronous replication can result in performance issues or delays in the data transformation processes. In contrast, asynchronous replication allows for a more staggered data transformation process that can take advantage of off-peak hours or idle processing time to transform and replicate data.
For example, a retail company might use data replication to keep multiple copies of their sales data across multiple locations to ensure that all stores have access to the same information. Asynchronous replication can be used in this case because it allows the company to collect and transform sales data at each store and periodically replicate it to the central data warehouse without affecting daily operations in the stores.
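A minimal sketch of this kind of asynchronous, batch-style replication, using sqlite3 in-memory databases to stand in for a store database and the central warehouse; all table names and the replication trigger are assumptions:

```python
# Asynchronous (periodic batch) replication sketch: copy new sales rows
# from a store database to the central warehouse on a schedule.
import sqlite3

store = sqlite3.connect(":memory:")      # stands in for a store's OLTP database
warehouse = sqlite3.connect(":memory:")  # stands in for the central warehouse

store.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
warehouse.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

last_replicated_id = 0

def replicate_once():
    """Copy all sales rows newer than the last replicated id to the warehouse."""
    global last_replicated_id
    rows = store.execute(
        "SELECT sale_id, amount FROM sales WHERE sale_id > ? ORDER BY sale_id",
        (last_replicated_id,)).fetchall()
    if rows:
        warehouse.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        warehouse.commit()
        last_replicated_id = rows[-1][0]

# The store keeps taking transactions during the day...
store.executemany("INSERT INTO sales (amount) VALUES (?)", [(10.0,), (25.5,)])
store.commit()

# ...and replication runs later, on a schedule (e.g. nightly), not per transaction.
replicate_once()
print(warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 2
```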
Paper / Subject Code: 31924 / Data Warehousing & Mining
Duration: (3 Hours) [80 Marks]
N.B. 1) Question No. 1 is compulsory.
2) Attempt any Three questions out of the remaining.
3) Assume suitable data wherever necessary and state them clearly.

Q.1 Solve any four of the following (20)
A. Compare OLTP vs OLAP systems.
B. Explain the KDD process of data mining.
C. Explain any two methods of evaluating the accuracy of a Classifier.
D. Explain K-means clustering algorithm and draw flowchart.
E. Explain multilevel association rule mining with example.

Q.2 A. Consider the following transaction database with minimum support 50% and minimum confidence 66%. Find the frequent patterns and strong association rules. (10)
Tid Items
10 A,C,D
20 B,C,E
30 A,B,C,E
40 B,E

Q.3 A. Find the clusters for the following dataset using a single link technique. Use Euclidean distance.
Sample No. X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30

Q.3 B. The college wants to record the Marks for the courses completed by students using the dimensions: I) Course, II) Student, III) Time & a measure Aggregate marks. Create a cube and describe following OLAP operations: I) Slice II) Dice III) Roll up IV) Drill Down V) Pivot (10)

Q.4 A. What is dimensional modeling? Design the data warehouse dimensional model for a wholesale furniture Company. The data warehouse has to analyze the company's situation at least with respect to the Furniture, Customer and Time. Moreover, the company needs to analyze: The furniture with respect to its type, category and material. The customer with respect to their spatial location, by considering at least cities, regions and states. The company is interested in learning the quantity, income and discount of its sales. (10)

Q.4 B. A data sample is given below. Find whether Patient X has flu or not using Naïve Bayes classifier.
If X = (chills=Y, runny nose=N, headache=Mild, fever=Y, flu=?) (10)
Chills Runny nose Headache Fever Flu
Y N Mild Y N
Y Y No N Y
Y N Strong Y Y
N Y Mild Y Y
N N No N N
N Y Strong Y Y
N Y Strong N N
Y Y Mild Y Y

B. FP Tree
*************************