DWM
Module-1
1. Explain the architecture of DW with a suitable diagram.
2. Write a short note on data warehouse design strategies.
3. Explain dimensional modeling with an example.
4. Define metadata with an example. Discuss the types of metadata with examples.
5. Define & differentiate star schema, snowflake schema, and fact constellation with examples.
6. Compare OLAP & OLTP systems. Describe the following OLAP operations using an
example: (a) Roll up (b) Drill down (c) Slice (d) Dice (e) Pivot
7. Explain the major steps in the ETL process with a suitable diagram and an example.
8. Differentiate between Data Warehouse & Data Mart.
9. Explain Factless Fact Table.
10. An electronics company has a sales department. Consider three dimensions, namely:
a) time b) product c) store
The schema contains a central fact table, sales, with two measures:
a) dollars_cost b) units_sold
Using the above example, describe the following operations:
a) Dice b) Slice c) Roll up d) Drill down
and draw the star and snowflake schemas.
Module-2
1. Explain the steps in the KDD process with a suitable diagram.
2. What is data mining? What are the techniques and applications of data mining?
3. Explain the architecture of data mining with a suitable diagram.
4. Explain types of attributes and data visualization for data exploration.
Module-3
2. What is tree pruning? Why is tree pruning useful in decision tree induction?
4. Create a classification model using a decision tree for the stock market involving only discrete
ranges; it has profit as a categorical attribute with values (up, down), and the training data is:
5. Why is naive Bayesian classification called "naive"? Apply the Naïve Bayes Classifier
algorithm to classify an unknown sample
X (outlook=sunny, Temperature=hot, Humidity=High, Windy=False); the sample data set is:
Module-4
Object | Attribute 1 | Attribute 2
Medicine A | 1 | 1
Medicine B | 2 | 1
Medicine C | 4 | 3
Medicine D | 5 | 4
5. Use the data given below. Create an adjacency matrix. Use the single link or complete link
algorithm to cluster the given data set. Draw the dendrogram.
Module-5
Transaction | Items
A | 1, 3, 4, 6
B | 2, 3, 5, 7
C | 1, 2, 3, 5, 8
D | 2, 5, 9, 10
E | 1, 4
Use the Apriori Algorithm with min-support 30% and min-confidence 75% to find all
frequent item sets and strong association rules.
Module-6
1. Explain spatial and web mining.
2. Explain web usage mining, text mining.
3. What is Web structure mining & the PageRank technique?
Priority 1 (Questions to prepare first)
A data warehouse is a single data repository where data from multiple data sources is integrated for online
business analytical processing (OLAP). This implies a data warehouse needs to meet the requirements of all
the business stages within the entire organization. Thus, data warehouse design is a hugely complex, lengthy,
and hence error-prone process. Furthermore, business analytical functions change over time, which results in
changes in the requirements for the systems. Therefore, data warehouse and OLAP systems are dynamic, and
the design process is continuous.
Data warehouse design takes a different approach from view materialization in industry. It sees data
warehouses as database systems with particular needs, such as answering management-related queries. The
target of the design becomes how the data from multiple sources should be extracted, transformed, and
loaded (ETL) to be organized in a database as the data warehouse. There are two approaches:
1. "top-down" approach
2. "bottom-up" approach
In the "top-down" approach, the data warehouse is built first, and developing a new data mart from the data
warehouse is then very easy.
In the "bottom-up" approach, data marts include the lowest grain data and, if needed, aggregated data too.
Instead of a normalized database for the data warehouse, a denormalized dimensional database is adopted to
meet the data delivery requirements of data warehouses. Using this method, to use the set of data marts as the
enterprise data warehouse, data marts should be built with conformed dimensions in mind, meaning that
common objects are represented the same way in different data marts. The conformed dimensions connect the
data marts to form an integrated enterprise data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data mart, a data
warehouse for a single subject, takes far less time and effort than developing an enterprise-wide data
warehouse. Also, the risk of failure is even less. This method is inherently incremental: it is just developing new
data marts and then integrating them with other data marts. The locations of the data warehouse and the data
marts are reversed in the bottom-up approach design.
Design Approach
Top-Down Design | Bottom-Up Design
Breaks the vast problem into smaller subproblems. | Solves the essential low-level problems first and integrates them.
Inherently architected - not a union of several data marts. | Inherently incremental; can schedule essential data marts first.
Single, central storage of information about the content. | Departmental information stored.
Centralized rules and control. | Departmental rules and control.
It includes redundant information. | Redundancy can be removed.
It may see quick results if implemented with repetitions. | Less risk of failure, favorable return on investment, and proof of techniques.
A star schema is a relational schema whose design represents a multidimensional data model. The star schema
is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of
this schema resembles a star, with points diverging from a central table. The center of the schema consists of a
large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key
of the fact table is generally a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables). A fact table generally contains facts with the same
level of aggregation.
Dimension Tables
A dimension is an architecture usually composed of one or more hierarchies that categorize data. If a
dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary keys of each of
the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to
define the dimensional values. They are generally descriptive, textual values. Dimension tables are usually
smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities),
clients, products, times, and channels.
o It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.
o It provides a parallel in design to how end-users typically think of and use the data.
Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema,
users can instantly analyze large, multidimensional data sets.
A star schema database has a limited number of tables and clear join paths, so queries run faster than they do
against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous.
Large join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two
dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two
tables. This design feature enforces accurate and consistent query results.
Structural simplicity also decreases the time required to load large batches of records into a star schema
database. By defining facts and dimensions and separating them into various tables, the impact of a load
operation is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts
regularly and selectively by appending records to the fact table.
A star schema has referential integrity built in when information is loaded. Referential integrity is enforced
because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate
foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a
dimension record violates this integrity.
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These
joins are more significant to the end-user because they represent the fundamental relationship between parts
of the underlying business. Customers can also browse dimension table attributes before constructing a query.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected
to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for
item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key,
branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state,
and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME, ITEM,
BRANCH, and LOCATION, instead of four columns for time data, four columns for item data, three columns for
branch data, and four columns for location data. Thus, the size of the fact table is significantly reduced.
When we need to change an item, we need only make a single change in the dimension table, instead of
making many changes in the fact table.
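As a rough illustration of this layout, the sketch below builds a tiny SALES fact table and its dimension tables with pandas and runs a star-join query. The measure column dollars_sold and all sample values are assumptions added for illustration; they are not taken from the example above.

# a minimal star schema sketch using pandas (hypothetical sample values)
import pandas as pd

# dimension tables: one row per member, keyed by a surrogate key
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2023, 2023]})
item_dim = pd.DataFrame({"item_key": [10, 20], "item_name": ["TV", "Laptop"], "brand": ["A", "B"]})

# fact table: only foreign keys to the dimensions plus numeric measures
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 20, 10],
    "dollars_sold": [400.0, 900.0, 350.0],   # assumed measure
})

# a typical star-join query: total dollars sold per quarter and brand
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key")
          .groupby(["quarter", "brand"])["dollars_sold"].sum())
print(report)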
The snowflake schema is an expansion of the star schema where each point of the star explodes into more
points. It is called snowflake schema because the diagram of snowflake schema resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we
normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in
the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagramed with each fact
surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out
into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked
to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally
normalized to the third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake
schema can have any number of dimensions, and each dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table, with Store, Location, Time, Product, Line,
and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary
dimension table and Location as the outrigger dimension table. The Product dimension has three dimension
tables, with Product as the primary dimension table, and the Line and Family tables as the outrigger dimension
tables.
A star schema stores all attributes for a dimension in one denormalized table. This needs more disk space
than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with
low cardinality into separate dimension tables that relate to the core dimension table by using foreign keys.
Snowflaking for the sole purpose of minimizing disk space is not recommended, because it can adversely
affect query performance. In a snowflake schema, tables are normalized to remove redundancy; the dimension
tables are decomposed into multiple related tables.
The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes
quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension
tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized version now
extends to eleven tables. We will notice that in the snowflake schema, the attributes with low cardinality in each
original dimension table are removed to form separate tables. These new tables are connected back to the
original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is
suitable for many-to-many and one-to-many relationships between dimension levels.
1. The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due
to the increasing number of lookup tables.
o In a star schema, the fact table will be at the center and is connected to the dimension tables.
Snowflake Schema
o A snowflake schema is an extension of the star schema where the dimension tables are connected to one or
more other dimension tables.
o The performance of SQL queries is a bit lower when compared to a star schema, as more joins are involved.
o Data redundancy is low, and it occupies less disk space when compared to a star schema.
Star Schema | Snowflake Schema
4. It takes less time for the execution of queries. | It takes more time than star schema for the execution of queries.
5. In star schema, normalization is not used. | Both normalization and denormalization are used.
7. The query complexity of star schema is low. | The query complexity of snowflake schema is higher than star schema.
9. It has fewer foreign keys. | It has more foreign keys.
10. It has high data redundancy. | It has low data redundancy.
What is Fact Constellation Schema?
A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a Galaxy
schema or a multi-fact star schema.
Fact Constellation Schema describes a logical structure of a data warehouse or data mart. A Fact Constellation
Schema can be designed with a collection of de-normalized FACT, Shared, and Conformed Dimension tables.
Fact Constellation Schema is a sophisticated database design in which it is difficult to summarize information. A
Fact Constellation Schema can be implemented between aggregate fact tables or by decomposing a complex fact
table into independent simple fact tables.
The dimensions are the perspectives or entities concerning which an organization keeps records. For example,
a shop may create a sales data warehouse to keep records of the store's sales for the dimensions time, item, and
location. These dimensions allow the store to keep track of things such as monthly sales of items and the
locations at which the items were sold. Each dimension has a table related to it, called a dimension table,
which describes the dimension further. For example, a dimension table for an item may contain the attributes
item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme is
represented by a fact table. Facts are numerical measures. The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension tables.
Now, suppose we want to view the sales data with a third dimension. For example, suppose the data according to time
and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D
data are shown in the table. The 3D data of the table are represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as shown in fig:
1. ETL stands for Extract, Transform, Load, and it is a process used in data warehousing to
extract data from various sources, transform it into a format suitable for loading into a data
warehouse, and then load it into the warehouse. The process of ETL can be broken down into the
following stages:
2. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems.
3. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.
4. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.
5. The ETL process is an iterative process that is repeated as new data is added to the
warehouse. The process is important because it ensures that the data in the data warehouse is
accurate, complete, and up-to-date. It also helps to ensure that the data is in the format
required for data mining and reporting.
Additionally, there are many different ETL tools and technologies available, such as Informatica,
Talend, DataStage, and others, that can automate and simplify the ETL process.
ETL is a process in Data Warehousing, and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms it in
the staging area, and then finally loads it into the Data Warehouse system.
1. Extraction:
The extracted data is loaded into the staging area first and not directly into the data warehouse, because the
extracted data is in various formats and can be corrupted also. Hence, loading it directly into the data
warehouse may damage it, and rollback will be much more difficult. Therefore, this is one of the
most important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions
is applied on the extracted data to convert it into a single standard format. It may involve the
following processes/tasks:
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, etc. to a
single standard value such as USA, and so on.
ETL Tools: The most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse Builder,
CloverETL, and MarkLogic.
Data Warehouses: The most commonly used Data Warehouses are Snowflake, Redshift, BigQuery,
and Firebolt.
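To make the three stages concrete, here is a minimal, hypothetical ETL sketch in Python with pandas and SQLite; the file name, column names, and cleaning rules are assumptions chosen for illustration and do not correspond to any particular ETL tool.

# a minimal, hypothetical ETL sketch
import sqlite3
import pandas as pd

# Extract: read raw records from a source file (file name is an assumption)
raw = pd.read_csv("sales_source.csv")

# Transform: clean and standardize the data in a staging DataFrame
staged = raw.copy()
staged["country"] = staged["country"].replace({"U.S.A": "USA", "United States": "USA"})
staged["amount"] = staged["amount"].fillna(0).astype(float)   # fill NULLs with a default value
staged["loaded_at"] = pd.Timestamp.now()                      # new derived field

# Load: write the transformed rows into a warehouse table
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("sales_fact", conn, if_exists="append", index=False)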
ADVANTAGES AND DISADVANTAGES:
Advantages:
1. Improved data quality: The ETL process ensures that the data in the data warehouse is accurate,
complete, and up-to-date.
2. Better data integration: The ETL process helps to integrate data from multiple sources and systems.
Disadvantages:
1. High cost: ETL process can be expensive to implement and maintain, especially for
organizations with limited resources.
2. Complexity: ETL process can be complex and difficult to implement, especially for
organizations that lack the necessary expertise or resources.
3. Limited flexibility: ETL process can be limited in terms of flexibility, as it may not be able to
handle unstructured data or real-time data streams.
4. Limited scalability: ETL process can be limited in terms of scalability, as it may not be able to
handle very large amounts of data.
5. Data privacy concerns: ETL process can raise concerns about data privacy, as large amounts
of data are collected, stored, and analyzed.
Explain Data Mining? Describe the steps involved in Data Mining when viewed as a process
of Knowledge Discovery.
In other words, we can say that Data Mining is the process of investigating hidden patterns in data from various
perspectives and categorizing them into useful information. This information is collected and assembled in particular
areas such as data warehouses, and it supports efficient analysis and decision making, eventually helping to cut
costs and generate revenue.
Data mining is the act of automatically searching for large stores of information to find trends and patterns
that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms for data
segments and evaluates the probability of future events. Data Mining is also called Knowledge Discovery of
Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve business
problems.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a particular data set,
with an objective. This process includes various types of services such as text mining, web mining, audio and
video mining, pictorial data mining, and social media mining. It is done through software that is simple or
highly specific. By outsourcing data mining, all the work can be done faster with low operation costs.
Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are
tonnes of information available on various platforms, but very little knowledge is accessible. The biggest
challenge is to analyze the data to extract important information that can be used to solve a problem or for
company development. There are many powerful instruments and techniques available to mine data and find
better insights from it.
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and columns
from which data can be accessed in various ways without having to recognize the database tables. Tables
convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization to
provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing
and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than for transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT professionals utilize
the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of
databases where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the Relational
database and the object-oriented model practices frequently utilized in many programming languages, for
example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the potential to undo a
database transaction if it is not performed appropriately. Even though this was a unique capability a very long
while back, today, most of the relational database systems support transactional database activities.
o Data mining enables organizations to make lucrative modifications in operation and production.
o It Facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviors.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short
time.
o There is a probability that organizations may sell useful data of their customers to other organizations for
money. As per reports, American Express has sold credit card purchases of their customers to other
organizations.
o Many data mining analytics software packages are difficult to operate and need advance training to work on.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their
design. Therefore, the selection of the right data mining tool is a very challenging task.
o The data mining techniques are not precise, so they may lead to severe consequences in certain
conditions.
The following are the areas where data mining is widely used:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for
better insights and to identify best practices that will enhance health care services and reduce costs. Analysts
use data mining approaches such as machine learning, multi-dimensional databases, data visualization, soft
computing, and statistics. Data mining can be used to forecast patients in each category. The procedures
ensure that the patients get intensive care at the right place and at the right time. Data mining also enables
the detection of fraud and abuse in healthcare.
Market basket analysis is a modeling method based on a hypothesis: if you buy a specific group of products,
then you are more likely to buy another group of products. This technique may enable the retailer to
understand the purchase behavior of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly. Using this, analytical comparisons of
results between various stores and between customers in different demographic groups can be done.
Education data mining is a newly emerging field, concerned with developing techniques that explore
knowledge from the data generated from educational Environments. EDM objectives are recognized as
affirming student's future learning behavior, studying the impact of educational support, and promoting
learning science. An organization can use data mining to make precise decisions and also to predict the results
of the student. With the results, the institution can concentrate on what to teach and how to teach.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial to
find patterns in a complex manufacturing process. Data mining can be used in system-level designing to obtain
the relationships between product architecture, product portfolio, and data needs of the customers. It can also
be used to forecast the product development period, cost, and expectations among the other tasks.
Customer Relationship Management (CRM) is all about obtaining and holding Customers, also enhancing
customer loyalty and implementing customer-oriented strategies. To get a decent relationship with the
customer, a business organization needs to collect data and analyze the data. With data mining technologies,
the collected data can be used for analytics.
Billions of dollars are lost to the action of frauds. Traditional methods of fraud detection are a little bit time
consuming and sophisticated. Data mining provides meaningful patterns and turning data into information. An
ideal fraud detection system should protect the data of all the users. Supervised methods consist of a collection
of sample records, and these records are classified as fraudulent or non-fraudulent. A model is constructed
using this data, and the technique is made to identify whether the document is fraudulent or not.
Apprehending a criminal is not a big deal, but bringing out the truth from him is a very challenging task. Law
enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it seeks meaningful patterns in data, which
is usually unstructured text. The information collected from previous investigations is compared, and a model
of the behavior is constructed to support the investigation.
The digitalization of the banking system is supposed to generate an enormous amount of data with every new
transaction. The data mining technique can help bankers to solve business-related problems in banking and
finance by identifying trends, causalities, and correlations in business information and market costs that are not
instantly evident to managers or executives, because the data volume is too large or is produced too rapidly for
experts to analyze on screen. Managers may use these data for better targeting, acquiring, and retaining
profitable customers.
The process of extracting useful data from large volumes of data is data mining. The data in the real world is
heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These
problems may occur due to errors of the data measuring instruments or because of human errors. Suppose a retail
chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the
information into their system. A person may make a digit mistake when entering a phone number, which
results in incorrect data. Even some customers may not be willing to disclose their phone numbers, which
results in incomplete data. The data could get changed due to human or system error. All these consequences
(noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-worlds data is usually stored on various platforms in a distributed computing environment. It might be in
a database, individual systems, or even on the internet. Practically, It is a quite tough task to make all the data
to a centralized data repository mainly due to organizational and technical concerns. For example, various
regional offices may have their own servers to store their data. It is not feasible to store all the data from all the
offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow
the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful
information is a tough task. Most of the time, new technologies, new tools, and methodologies would have to
be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques used. If
the designed algorithm and techniques are not up to the mark, then the efficiency of the data mining process
will be affected adversely.
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a
retailer analyzes the details of the purchased items, then it reveals data about buying habits and preferences of
the customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that shows the
output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends
to express. But many times, representing the information to the end-user in a precise and easy way is difficult.
Since the input data and the output information are complicated, very efficient and successful data visualization
processes need to be implemented to make it successful.
https://www.youtube.com/watch?v=XzSlEA4ck2I&ab_channel=MaheshHuddar
https://www.youtube.com/watch?v=coOTEc-0OGw&ab_channel=MaheshHuddar
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each
data point belongs to only one group that has similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the categories of groups
in the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats
the process until it does not find the best clusters. The value of k should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each
cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the model is ready.
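The steps above can be written as a short from-scratch sketch. The NumPy implementation below is only illustrative (Euclidean distance, a fixed number of iterations, and made-up sample points are assumptions); it is not the sklearn-based code used later in these notes.

# a minimal from-scratch K-means sketch with NumPy
import numpy as np

def kmeans(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step-1/2: choose k and pick k random points from the data as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step-3: assign each data point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4/5: move each centroid to the mean of the points assigned to it
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# made-up example with two variables (like M1 and M2 below)
data = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels, centroids = kmeans(data, k=2)
print(labels, centroids)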
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
gn
in
er
ne
gi
en
@
:-
am
gr
le
Te
in
Jo
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster. These points can be either
the points from the dataset or any other point. So, here we are selecting the below two points as k
points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance between two
points. So, we will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to
the right of the line are close to the yellow centroid. Let's color them as blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by choosing a new centroid. To
choose the new centroids, we will compute the center of gravity of these centroids, and will find new
centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of
finding a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and two blue points are right
to the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as
shown in the below image:
o As we got the new centroids so again will draw the median line and reassign the data points. So, the
image will be:
o We can see in the above image; there are no dissimilar data points on either side of the line, which
means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as
shown in the below image:
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient clusters that it forms. But
choosing the optimal number of clusters is a big task. There are some different ways to find the optimal
number of clusters, but here we are discussing the most appropriate method to find the number of clusters or
value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses
the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
∑(Pi in Cluster1) distance(Pi, C1)²: it is the sum of the squares of the distances between each data point and its centroid
within Cluster1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean distance
or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend or a point of the plot looks like an arm, then that point is considered as the
best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method. The
graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose the number of
clusters equal to the data points, then the value of WCSS becomes zero, and that will be the endpoint of the
plot.
Before implementation, let's understand what type of problem we will solve here. So, we have a dataset
of Mall_Customers, which is the data of customers who visit the mall and spend there.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is
the calculated value of how much a customer has spent in the mall, the more the value, the more he has spent).
From this dataset, we need to calculate some patterns, as it is an unsupervised method, so we don't know what
to calculate exactly.
The steps to be followed for the implementation are given below:
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the dataset
o Visualizing the clusters
The first step will be the data pre-processing, as we did in our earlier topics of Regression and Classification.
But for the clustering problem, it will be different from other models. Let's discuss it:
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our model, which is part of data pre-
processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, we have imported numpy for performing mathematical calculations, matplotlib for
plotting the graph, and pandas for managing the dataset.
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the
below image:
From the above dataset, we need to find some patterns in it.
Here we don't need any dependent variable for data pre-processing step as it is a clustering problem, and we
have no idea about what to determine. So we will just add a line of code for the matrix of features.
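That line is not reproduced in these notes; it presumably looks like the sketch below, where the dataset variable name, the file name, and the column positions are assumptions based on the surrounding description:

dataset = pd.read_csv('Mall_Customers.csv')   # assumed file name for the Mall_Customers data
x = dataset.iloc[:, [3, 4]].values            # columns for Annual Income and Spending Score (the "3rd and 4th features" mentioned below)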
As we can see, we are extracting only the 3rd and 4th features. It is because we need a 2D plot to visualize the model,
and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering problem. So, as
discussed above, here we are going to use the elbow method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS values on the Y-axis
and the number of clusters on the X-axis. So we are going to calculate the value of WCSS for different k values
ranging from 1 to 10.
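The code block itself did not survive in these notes; a sketch of what it presumably looked like is given below (the variable names wcss_list and x come from the surrounding description, while the KMeans parameters are assumptions):

# finding the optimal number of clusters using the elbow method (sketch)
from sklearn.cluster import KMeans

wcss_list = []                       # empty list to hold the WCSS value for each k
for i in range(1, 11):               # k ranging from 1 to 10
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)    # inertia_ is the WCSS of the fitted model
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()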
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable to initialize an empty list, which is used to contain the value of
wcss computed for different values of k ranging from 1 to 10.
After that, we have initialized the for loop for the iteration on different values of k ranging from 1 to 10; since a
for loop in Python excludes the upper bound, the range is taken as 11 to include the 10th value.
The rest part of the code is similar as we did in earlier topics, as we have fitted the model on a matrix of
features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters here will be 5.
As we have got the number of clusters, so we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the above section, but here,
instead of using i, we will use 5, as we know there are 5 clusters that need to be formed. The code is given
below:
y_predict = kmeans.fit_predict(x)
The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the dependent variable y_predict to train the model.
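For completeness, the full two-line snippet described here would look roughly like this; the KMeans parameters are assumptions carried over from the elbow-method sketch above:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)   # 5 clusters, as found by the elbow method
y_predict = kmeans.fit_predict(x)                                  # cluster index assigned to each customer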
By executing the above lines of code, we will get the y_predict variable. We can check it under the variable
explorer option in the Spyder IDE. We can now compare the values of y_predict with our original dataset.
Consider the below image:
From the above image, we can now relate that CustomerID 1 belongs to cluster 3 (as the index starts from 0,
hence 2 will be considered as 3), CustomerID 2 belongs to cluster 4, and so on.
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster
one by one.
To visualize the clusters, we will use a scatter plot using the mtp.scatter() function of matplotlib.
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
In the above lines of code, we have written one scatter call for each cluster, ranging from 1 to 5. The first argument of
mtp.scatter, i.e., x[y_predict == 0, 0], selects the first feature (x values) of the points assigned to that cluster, and
x[y_predict == 0, 1] selects the second feature (y values); y_predict holds the cluster index of each point.
Output:
The output image clearly shows the five different clusters with different colors. The clusters are formed
between two parameters of the dataset: Annual Income of the customer and Spending Score. We can change the
colors and labels as per the requirement or choice. We can also observe some points from the above patterns,
which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize these
customers as standard.
o Cluster2 shows the customers with a high income but low spending, so we can categorize them as careful.
o Cluster3 shows the customers with low income and also low spending, so they can be categorized as sensible.
o Cluster4 shows the customers with low income but very high spending, so they can be categorized
as careless.
o Cluster5 shows the customers with high income and high spending, so they can be categorized as
target, and these customers can be the most profitable customers for the mall owner.
Explain Web Structure Mining + Explain the PageRank Algorithm in Web Mining in Detail
The challenge for Web structure mining is to deal with the structure of the hyperlinks within the web itself. Link
analysis is an old area of research. However, with the growing interest in Web mining, the research of structure
analysis has increased. These efforts resulted in a newly emerging research area called Link Mining, which is
located at the intersection of the work in link analysis, hypertext, web mining, relational learning, inductive logic
programming, and graph mining.
Web structure mining uses graph theory to analyze a website's node and connection structure. According to
the type of web structural data, web structure mining can be divided into two kinds:
o Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects
the web page to a different location.
o Mining the document structure: analysis of the tree-like structure of page structures to describe
HTML or XML tag usage.
The web contains a variety of objects with almost no unifying structure, with differences in the authoring style
and content much greater than in traditional collections of text documents. The objects in the WWW are web
pages, and links are in, out, and co-citation (two pages linked to by the same page). Attributes include HTML
tags, word appearances, and anchor texts. Web structure mining includes the following terminology, such as:
o Node: web pages.
o Edge: hyperlinks.
Link mining had produced some agitation on some traditional data mining tasks. Below we summarize some of
these possible tasks of link mining which are applicable in Web structure mining, such as:
1. Link-based Classification: The most recent upgrade of a classic data mining task to linked Domains.
The task is to predict the category of a web page based on words that occur on the page, links between
pages, anchor text, html tags, and other possible attributes found on the web page.
2. Link-based Cluster Analysis: The data is segmented into groups, where similar objects are grouped
together, and dissimilar objects are grouped into different groups. Unlike the previous task, link-based
cluster analysis is unsupervised and can be used to discover hidden patterns from data.
3. Link Type: There is a wide range of tasks concerning predicting the existence of links, such as
predicting the type of link between two entities or predicting the purpose of a link.
5. Link Cardinality: The main task is to predict the number of links between objects.
Page categorization is also used for:
o Finding duplicated websites and finding out the similarity between them.
Let us create a table of the 0th Iteration, 1st Iteration, and 2nd Iteration.
NODES | ITERATION 0 | ITERATION 1 | ITERATION 2
A | 1/6 = 0.16 | 0.3 | 0.392
B | 1/6 = 0.16 | 0.32 | 0.3568
C | 1/6 = 0.16 | 0.32 | 0.3568
D | 1/6 = 0.16 | 0.264 | 0.2714
E | 1/6 = 0.16 | 0.264 | 0.2714
F | 1/6 = 0.16 | 0.392 | 0.4141
Iteration 0:
For iteration 0, assume that each page has page rank = 1 / (total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 = 0.16
Iteration 1:
By using the PageRank formula PR(P) = (1 - d) + d * [PR(Q1)/C(Q1) + ... + PR(Qn)/C(Qn)], where Q1...Qn are the
pages linking to P, C(Q) is the number of outgoing links of page Q, and the damping factor d = 0.8:
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.16/4 + 0.16/2)
= 0.2 + 0.8 * 0.12 ≈ 0.3
So, what we have done here is: for node A, we see how many incoming links there are; here we
have PR(B) and PR(C). For each of the incoming links, we see the number of outgoing links from
that particular page, i.e., for PR(B) we have 4 outgoing links and for PR(C) we have 2
outgoing links. The same procedure applies for the remaining nodes and iterations.
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
The remaining nodes of iteration 1 are computed in the same way:
PR(C) = 0.32, PR(D) = 0.264, PR(E) = 0.264, PR(F) = 0.392
Iteration 2 (again using the updated ranks):
PR(A) = 0.392, PR(B) = 0.3568, PR(C) = 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(E) = 0.2714
PR(F) = 0.4141
So, the final PageRank values for the above-given question (after iteration 2) are:
NODES | ITERATION 0 | ITERATION 1 | ITERATION 2
A | 0.16 | 0.3 | 0.392
B | 0.16 | 0.32 | 0.3568
C | 0.16 | 0.32 | 0.3568
D | 0.16 | 0.264 | 0.2714
E | 0.16 | 0.264 | 0.2714
F | 0.16 | 0.392 | 0.4141
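The same iterative computation can be expressed in a few lines of Python. The sketch below is generic: it uses the simplified formula PR(P) = (1 - d) + d * Σ PR(Q)/C(Q) with d = 0.8 as in the worked example, but the small graph in it is hypothetical, since the actual graph figure for the question above is not reproduced in these notes.

# a generic iterative PageRank sketch (simplified formula used above)
def pagerank(links, d=0.8, iterations=2):
    # links maps each page to the list of pages it links to
    pages = list(links)
    pr = {p: 1 / len(pages) for p in pages}            # iteration 0: 1 / total number of nodes
    for _ in range(iterations):
        for p in pages:
            incoming = [q for q in pages if p in links[q]]
            # updated ranks are reused immediately, as in the note above
            pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    return pr

# hypothetical example graph, not the one from the question above
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))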
Priority 2
5. Integration and transformation rules used to deliver information to end-user analytical tools.
Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata allows users
access to help understand the content and find data.
1. A library catalog may be considered metadata. The directory metadata consists of several predefined
components representing specific attributes of a resource, and each item can have one or more values.
These components could be the name of the author, the name of the document, the publisher's name,
the publication date, and the methods to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the
person's weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the data
value 80.
4. Another example of metadata is data about the tables and figures in a report like this book. A table
(which is a record) has a name (e.g., the table title), and there are column names of the tables that may be
considered metadata.
o First, it acts as the glue that links all parts of the data warehouses.
o Next, it provides information about the contents and structures to the developers.
o Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.
Metadata is like a nerve center. Various processes during the building and administering of the data
warehouse generate parts of the data warehouse metadata, and parts of the metadata generated by one
process are used by another. In the data warehouse, metadata assumes a key position and enables
communication among various processes. It acts as a nerve center in the data warehouse.
Types of Metadata
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the enterprise. These
source systems include different data structures. The data elements selected for the data warehouse have
various field lengths and data types.
In selecting data from the source systems for the data warehouse, we split records, combine parts of
documents from different source files, and deal with multiple coding schemes and field lengths. When we
deliver information to the end-users, we must be able to tie that back to the source data sets. Operational
metadata contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the removal of data from the source systems,
namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this
category of metadata contains information about all the data transformation that takes place in the data
staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to find data
from the data warehouses. The end-user metadata allows the end-users to use their business terminology and
look for the information in those ways in which they usually think of the business.
2. Enabling users to control and manage the access and manipulation of metadata in their unique
environments.
3. Users are allowed to build tools that meet their needs and that will also enable them to adjust
accordingly to changing requirements.
4. Allowing individual tools to satisfy their metadata requirements freely and efficiently within the context
of an interchange model.
5. Describing a simple, clean implementation infrastructure which will facilitate compliance and speed up
adoption by minimizing the amount of modification.
6. To create a procedure and process not only for maintaining and establishing the interchange standard
specification but also for updating and extending it over time.
It is based on a framework that will translate an access request into the standard
interchange index.
o Procedural Approach
o ASCII Batch Approach
o Hybrid Approach
In a procedural approach, the communication with API is built into the tool. It enables the highest degree of
flexibility.
The ASCII Batch approach relies on an ASCII file format which contains information about the various
metadata items and standardized access requirements that make up the interchange standard metadata
model.
1) The standard metadata model, which refers to the ASCII file format used to represent the metadata
being exchanged.
2) The standard access framework that describes the minimum number of API functions.
4) The user configuration is a file explaining the legal interchange paths for metadata in the user's environment.
Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. The software of metadata
repository management can be used to map the source data to the target database, integrate and transform
the data, generate code for data transformation, and move data to the warehouse.
8. It gives a useful data administration tool to manage corporate information assets with the data dictionary.
OLAP vs OLTP
Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only operational current data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Purpose | It serves the purpose to extract information for analysis and decision-making. | It serves the purpose to insert, update, and delete information from the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, as the historical data is archived, in MB and GB.
Queries | Relatively slow as the amount of data involved is large. Queries may take hours. | Very fast, as the queries operate on only about 5% of the data.
Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database.
Backup and Recovery | It only needs a backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously.
Processing time | The processing of complex queries can take a lengthy time. | It is comparatively fast in processing because of simple and straightforward queries.
Types of users | This data is generally managed by the CEO, MD, and GM. | This data is managed by clerks and managers.
Operations | Only read and rarely write operations. | Both read and write operations.
Updates | Data is refreshed with lengthy, scheduled batch operations. | The user initiates data updates, which are brief and fast.
Nature of audience | The process is focused on the market. | The process is focused on the customer.
Database Design | Design with a focus on the subject. | Design that is focused on the application.
Productivity | Improves the efficiency of business analysts. | Enhances the user's productivity.
When creating a machine learning or data mining project, we do not always come across clean and formatted data, and before doing any operation with the data it is necessary to clean it and put it into a formatted form. For this, we use the data preprocessing task. The main steps include:
o Importing libraries
o Importing datasets
o Feature scaling
A minimal sketch of these steps is given below.
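The sketch below illustrates these preprocessing steps with pandas and scikit-learn; the file name data.csv and the column names Age, Salary are hypothetical and only serve the example.

```python
# Minimal data preprocessing sketch (assumed file "data.csv" with
# numeric columns "Age" and "Salary" -- hypothetical names).
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Importing the dataset
df = pd.read_csv("data.csv")

# Handling missing values by replacing them with the column mean
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

# Feature scaling: standardize numeric features to zero mean, unit variance
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

print(df.head())
```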
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean or boundary values (see the sketch after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.
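A minimal sketch of smoothing by bin means, assuming equal-frequency bins over a small sorted list of values; the data values are made up for illustration:

```python
# Smoothing noisy data by bin means (equal-frequency binning).
# The sample values below are illustrative only.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4  # each bin holds 4 values

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # replace every value in the bin by the bin mean
    smoothed.extend([round(mean, 2)] * len(bin_values))

print(smoothed)
# [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
```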
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch follows this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
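A minimal sketch of min-max normalization to the range 0.0 to 1.0; the attribute values are illustrative assumptions:

```python
# Min-max normalization: x' = (x - min) / (max - min), scaling values to [0, 1].
values = [200, 300, 400, 600, 1000]  # illustrative attribute values

lo, hi = min(values), max(values)
normalized = [(x - lo) / (hi - lo) for x in values]

print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```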
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the dataset while preserving the important information. This is done to improve the efficiency of data analysis and to avoid overfitting of the model. Some common steps involved in data reduction are listed below (a short sketch follows the list):
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features. It can be done using techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the size of the dataset while preserving the important information. It can be done using techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information. Compression is often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques such as wavelet compression, JPEG compression, and gzip compression.
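A minimal sketch of two of these reduction techniques, PCA-based feature extraction and random sampling, using scikit-learn and NumPy on made-up data:

```python
# Data reduction sketch: PCA feature extraction and random sampling.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))     # illustrative dataset: 100 rows, 10 features

# Feature extraction: project onto the first 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)             # (100, 3)

# Sampling: keep a random 20% of the rows
idx = rng.choice(len(X), size=20, replace=False)
X_sample = X[idx]
print(X_sample.shape)              # (20, 10)
```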
Data Warehouse vs Data Mart
2. In a data warehouse, light denormalization takes place. While in a data mart, high denormalization takes place.
3. Data warehouse is a top-down model. While data mart is a bottom-up model.
4. To build a warehouse is difficult. While to build a mart is easy.
5. In a data warehouse, fact constellation schema is used. While in a data mart, star schema and snowflake schema are used.
6. Data warehouse is flexible. While data mart is not flexible.
7. Data warehouse is data-oriented in nature. While data mart is project-oriented in nature.
8. Data warehouse has a long life. While data mart has a shorter life than a warehouse.
9. In a data warehouse, data is contained in detailed form. While in a data mart, data is contained in summarized form.
10. Data warehouse is vast in size. While data mart is smaller than a warehouse.
11. The data warehouse might be somewhere between 100 GB and 1 TB+ in size. The size of a data mart is less than 100 GB.
12. The time it takes to implement a data warehouse might range from months to years. The data mart deployment procedure is time-limited to a few months.
13. It uses a lot of data and has comprehensive operational data. Operational data is not present in a data mart.
14. It collects data from various data sources. A data mart generally stores data from a data warehouse.
15. Long time for processing the data because of large data. Less time for processing the data because only a small amount of data is handled.
16. Complicated design process for creating schemas and views. Easy design process for creating schemas and views.
Hierarchical clustering is of two types:
o Agglomerative Clustering
o Divisive Clustering
Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects. In agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner. Initially, each data object is in its own cluster; at each iteration, the most similar clusters are merged until a single cluster remains. The basic steps are:
1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (find the proximity matrix).
3. Merge the clusters that are closest to each other.
4. Recalculate the proximity matrix and repeat until only a single cluster remains.
Let's understand this concept with the help of a graphical representation using a dendrogram. With the given demonstration we can understand how the actual algorithm works; no calculation is done here, and all the proximities among the clusters are assumed.
Step 1:
Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance between each individual cluster and all the other clusters.
Step 2:
Now merge the comparable clusters into a single cluster. Let's say cluster Q and cluster R are similar to each other, so we merge them in this step, and similarly for S and T. We get the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)] together, giving [(P), (QR), (STV)].
Step 4:
Repeat the same process. The clusters (STV) and (QR) are comparable and are combined to form a new cluster. Now we have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)].
Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, all the data points are initially considered as a single cluster, and in every iteration the data points that are not similar are separated from the cluster. The separated data points are treated as individual clusters. Finally, we are left with N clusters.
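A minimal sketch of agglomerative (single-link) hierarchical clustering using SciPy; the six 2-D points are illustrative stand-ins for P..V:

```python
# Agglomerative hierarchical clustering with single linkage (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Illustrative 2-D points standing in for P, Q, R, S, T, V
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Build the merge tree bottom-up using single-link (minimum) distance
Z = linkage(points, method="single", metric="euclidean")

# Cut the tree into, e.g., 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # cluster label for each point
# dendrogram(Z)        # would draw the dendrogram with matplotlib
```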
Disadvantages of hierarchical clustering
o It breaks large clusters.
o Once a merge or split has been performed, it can never be undone.
Design the data warehouse for a wholesale furniture company. [10] The data warehouse has to allow analyzing the company's situation at least with respect to the Furniture, Customer and Time dimensions. Moreover, the company needs to analyze: the furniture with respect to its type, category and material; the customer with respect to their spatial location, by considering at least cities, regions and states. The company is interested in learning the quantity, income and discount of its sales.
A dimensional model is used to analyze data/business facts with respect to business dimensions, for example customers, products, etc.
DW for the furniture company (a schema sketch follows):
Dimensions:
1. Furniture: type, category, material.
2. Customer: name, street, city, region, state.
3. Time: date, day of week, day of month, week, month, quarter, half year.
Facts:
Quantity, income and % discount.
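A minimal sketch of this star schema created with Python's built-in sqlite3 module; the table and column names follow the dimensions and facts above and are otherwise assumptions:

```python
# Star schema sketch for the furniture data warehouse (sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_furniture (furniture_id INTEGER PRIMARY KEY,
                            type TEXT, category TEXT, material TEXT);
CREATE TABLE dim_customer  (customer_id INTEGER PRIMARY KEY,
                            name TEXT, street TEXT, city TEXT,
                            region TEXT, state TEXT);
CREATE TABLE dim_time      (time_id INTEGER PRIMARY KEY,
                            date TEXT, day_of_week TEXT, month INTEGER,
                            quarter INTEGER, half_year INTEGER);
-- Fact table: one row per sale, with foreign keys to the three dimensions
CREATE TABLE fact_sales    (furniture_id INTEGER, customer_id INTEGER,
                            time_id INTEGER, quantity INTEGER,
                            income REAL, discount_pct REAL,
                            FOREIGN KEY(furniture_id) REFERENCES dim_furniture(furniture_id),
                            FOREIGN KEY(customer_id)  REFERENCES dim_customer(customer_id),
                            FOREIGN KEY(time_id)      REFERENCES dim_time(time_id));
""")
print("star schema created")
```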
Linear Regression
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The equation of linear regression is
Y = a + b*X + e
where Y is the target (dependent) variable, X is the independent variable, a is the intercept, b is the slope of the line, and e is the error term.
In linear regression, the best-fit line is obtained using the least squares method, which minimizes the total sum of the squares of the deviations of each data point from the regression line. Here, the positive and negative deviations do not get canceled, as all the deviations are squared.
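A minimal sketch of fitting a and b by least squares with NumPy; the x and y values are made up for illustration:

```python
# Least-squares fit of Y = a + b*X using NumPy (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# np.polyfit returns the slope b and intercept a that minimize
# the sum of squared deviations from the line
b, a = np.polyfit(x, y, deg=1)
print(f"a (intercept) = {a:.3f}, b (slope) = {b:.3f}")

y_pred = a + b * x
sse = np.sum((y - y_pred) ** 2)   # sum of squared errors being minimized
print(f"sum of squared errors = {sse:.4f}")
```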
Data Visualization
Data visualization converts large and small data sets into visuals, which are easy for humans to understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data. In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information.
Data visualizations are common in everyday life, and they usually appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations in the form of line charts to display change over time. Bar and column charts are useful for observing relationships and making comparisons. A pie chart is a great way to show parts of a whole. And maps are the best way to share geographical data.
American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and complete. After the data is ready to visualize, you need to pick the right chart. After you have decided on the chart type, you need to design and customize your visualization to your liking. Simplicity is essential - you don't want to add any elements that distract from the data.
The concept of using pictures to understand data was launched in the 17th century with maps and graphs, and in the early 1800s the pie chart was invented.
Several decades later, one of the most advanced examples of statistical graphics occurred when Charles Minard mapped Napoleon's invasion of Russia. The map represents the size of the army and the path of Napoleon's retreat from Moscow, and ties that information to temperature and time scales for a more in-depth understanding of the event.
Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few years.
Data visualization is an easy and quick way to convey concepts universally. You can experiment with a different layout by making a slight adjustment.
Data visualization tools have been necessary for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easier to operate than earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.
Among the reasons to use data visualization:
5. To perform competitive analysis.
6. To improve insights.
What is the relationship between data warehousing and data replication? Which form of replication (synchronous or asynchronous) is better suited for data warehousing? Why? Explain with an appropriate example. [10]
Data warehousing and data replication can be used together to improve the performance, reliability, and
scalability of the data warehousing environment.
Data replication is the process of creating and maintaining multiple copies of the same data in different
locations or on different systems to improve fault tolerance, data availability, and disaster recovery capabilities.
It can be done in many ways, including synchronous replication and asynchronous replication.
Synchronous replication involves replicating changes to data in real-time as soon as they occur, ensuring that
multiple copies of the data are always synchronized with each other. Asynchronous replication, on the other
hand, involves replicating data changes on a scheduled or periodic basis, resulting in some delay between
updates to the original data and the replicated copies.
In a data warehousing environment, asynchronous replication can be the better choice because it allows for a more flexible and scalable architecture. Since data warehouses are often subject to large volumes of data and complex data transformations, synchronous replication can result in performance issues or delays in the data transformation processes. In contrast, asynchronous replication allows for a more staggered data transformation process that can take advantage of off-peak hours or idle processing time to transform and replicate data.
For example, a retail company might use data replication to keep multiple copies of their sales data across multiple locations to ensure that all stores have access to the same information. Asynchronous replication can be used in this case because it allows the company to collect and transform sales data at each store and periodically replicate it to the central data warehouse without affecting daily operations in the stores.
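A minimal sketch of this kind of asynchronous, batch-style replication, using sqlite3 in-memory databases to stand in for a store database and the central warehouse; all table names and the replication trigger are assumptions:

```python
# Asynchronous (periodic batch) replication sketch: copy new sales rows
# from a store database to the central warehouse on a schedule.
import sqlite3

store = sqlite3.connect(":memory:")      # stands in for a store's OLTP database
warehouse = sqlite3.connect(":memory:")  # stands in for the central warehouse

store.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
warehouse.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

last_replicated_id = 0

def replicate_once():
    """Copy all sales rows newer than the last replicated id to the warehouse."""
    global last_replicated_id
    rows = store.execute(
        "SELECT sale_id, amount FROM sales WHERE sale_id > ? ORDER BY sale_id",
        (last_replicated_id,)).fetchall()
    if rows:
        warehouse.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        warehouse.commit()
        last_replicated_id = rows[-1][0]

# The store keeps taking transactions during the day...
store.executemany("INSERT INTO sales (amount) VALUES (?)", [(10.0,), (25.5,)])
store.commit()

# ...and replication runs later, on a schedule (e.g. nightly), not per transaction.
replicate_once()
print(warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 2
```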
Paper / Subject Code: 31924 / Data Warehousing & Mining
Duration: (3 Hours) [80 Marks]
N.B. 1) Question No. 1 is compulsory.
2) Attempt any Three questions out of the remaining.
3) Assume suitable data wherever necessary and state them clearly.

Q.1 Solve any four of the following (20)
A. Compare OLTP vs OLAP systems.
B. Explain the KDD process of data mining.
C. Explain any two methods of evaluating the accuracy of a Classifier.
D. Explain K-means clustering algorithm and draw flowchart.
E. Explain multilevel association rule mining with example.

Q.2 A. Consider the following transaction database with minimum support 50% and minimum confidence 66%. Find the frequent patterns and strong association rules. (10)
Tid Items
10 A,C,D
20 B,C,E
30 A,B,C,E
40 B,E

Q.3 A. Find the clusters for the following dataset using a single link technique. Use Euclidean distance.
Sample No. X Y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30

Q.3 B. The college wants to record the Marks for the courses completed by students using the dimensions: I) Course, II) Student, III) Time & a measure Aggregate marks. Create a cube and describe following OLAP operations: I) Slice II) Dice III) Roll up IV) Drill Down V) Pivot (10)

Q.4 A. What is dimensional modeling? Design the data warehouse dimensional model for a wholesale furniture Company. The data warehouse has to analyze the company's situation at least with respect to the Furniture, Customer and Time. Moreover, the company needs to analyze: The furniture with respect to its type, category and material. The customer with respect to their spatial location, by considering at least cities, regions and states. The company is interested in learning the quantity, income and discount of its sales. (10)

Q.4 B. A data sample is given below. Find whether Patient X has flu or not using Naïve Bayes classifier.
If X = (chills=Y, runny nose=N, headache=Mild, fever=Y, flu=?) (10)
Chills Runny nose Headache Fever Flu
Y N Mild Y N
Y Y No N Y
Y N Strong Y Y
N Y Mild Y Y
N N No N N
N Y Strong Y Y
N Y Strong N N
Y Y Mild Y Y

B. FP Tree
*************************