Assignment of DMDW kg11

DMDW assignment

Q.1 Why is it essential to clean data before analysis?

Ans: Clean data ultimately increases overall productivity and provides the highest-quality
information for decision-making. Benefits include:

● Removal of errors when multiple sources of data are in play.
● Fewer errors, which make for happier clients and less-frustrated employees.
● The ability to map the different functions and what your data is intended to do.
● Monitoring errors and better reporting, which show where errors come from and make it easier to
fix incorrect or corrupt data in future applications.
● Data-cleaning tools, which make for more efficient business practices and quicker
decision-making.
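
A minimal pandas sketch of typical cleaning steps; the file name and columns ("sales.csv" with
customer, amount, region) are hypothetical, invented only for illustration:

```python
import pandas as pd

# Hypothetical input: "sales.csv" with columns customer, amount, region.
df = pd.read_csv("sales.csv")

df = df.drop_duplicates()                                     # remove exact duplicate rows
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # corrupt values become NaN
df["amount"] = df["amount"].fillna(df["amount"].median())     # impute missing amounts
df["region"] = df["region"].str.strip().str.title()           # normalize inconsistent labels
df = df.dropna(subset=["customer"])                           # drop rows missing a key field
```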

Q.2 How does the STAR schema differ from the snowflake schema?

Ans: Difference between STAR schema and snowflake schema

| Parameters | Star Schema | Snowflake Schema |
|---|---|---|
| Definition and meaning | Contains dimension tables and fact tables. | Contains all three: dimension tables, fact tables, and sub-dimension tables. |
| Type of model | It is a top-down model type. | It is a bottom-up model type. |
| Space occupied | Makes use of more allotted space. | Makes use of less allotted space. |
| Time taken for queries | Query execution takes less time. | Query execution takes more time. |
| Use of normalization | Does not make use of normalization. | Makes use of both denormalization and normalization. |
| Complexity of design | The design is very simple. | The design is very complex. |
| Query complexity | Very low. | Comparatively much higher. |
| Complexity of understanding | Very easy to understand. | Comparatively more difficult to understand. |
| Total number of foreign keys | Fewer foreign keys. | More foreign keys. |
| Data redundancy | Comparatively higher. | Comparatively lower. |
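
To make the structural difference concrete, here is a small hedged sketch in Python/pandas; the
table names and columns are invented for illustration. A star-schema query joins the fact table
directly to each dimension:

```python
import pandas as pd

# Toy star schema: one fact table keyed to two denormalized dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["pen", "book"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Delhi", "Pune"]})
fact_sales  = pd.DataFrame({"product_id": [1, 2, 1],
                            "store_id":   [10, 10, 20],
                            "amount":     [50, 120, 75]})

# Typical star-schema query: one join per dimension, then aggregate.
report = (fact_sales.merge(dim_product, on="product_id")
                    .merge(dim_store, on="store_id")
                    .groupby(["product_name", "city"])["amount"].sum())
print(report)
```

In a snowflake schema, dim_store would itself be normalized into sub-dimension tables (for example,
a separate city table), so the same report would need an extra join; this saves space but makes
queries slower and more complex, matching the table above.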

Q.3 Illustrate the process of knowledge discovery in database.

Ans: KDD Process


KDD (Knowledge Discovery in Databases) is the process of extracting useful, previously unknown, and
potentially valuable information from large datasets. KDD is iterative: it usually takes several
passes through the steps below to extract accurate knowledge from the data. The KDD process includes
the following steps:

Data Cleaning
Data cleaning is the removal of noisy and irrelevant data from the collection. It covers:
1. Handling missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Using data discrepancy detection and data transformation tools.

Data Integration
Data integration combines heterogeneous data from multiple sources into a common store (the data
warehouse), using data migration tools, data synchronization tools, and the ETL (Extract, Transform,
Load) process.

Data Selection
Data selection is the process of deciding which data is relevant to the analysis and retrieving it
from the data collection. Methods such as neural networks, decision trees, Naive Bayes, clustering,
and regression can support this step.

Data Transformation
Data transformation converts data into the form required by the mining procedure. It is a two-step
process:
1. Data mapping: assigning elements from the source base to the destination to capture the
transformations.
2. Code generation: creating the actual transformation program.

Data Mining
Data mining applies techniques to extract potentially useful patterns. It transforms task-relevant
data into patterns and decides the purpose of the model, e.g., classification or characterization.

Pattern Evaluation
Pattern evaluation identifies the patterns that genuinely represent knowledge according to given
interestingness measures. It computes an interestingness score for each pattern and uses
summarization and visualization to make the results understandable to the user.

Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make decisions.

Note: KDD is an iterative process in which evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed to obtain different, more appropriate
results. Preprocessing of databases consists of data cleaning and data integration.
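
A compact, hedged sketch of the later KDD steps using scikit-learn; the Iris dataset stands in for
"task-relevant data" and is an illustration only, not part of the assignment:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data selection: retrieve the data relevant to the analysis.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data transformation: put the data into the form the mining step expects.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Data mining: extract patterns, here via a classification model.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Pattern evaluation: score how well the extracted patterns generalize.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```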

Q.4 Explain the difference between “Explorative data mining” and “predictive data mining”.
Ans: Exploratory Data Analysis: a data-analytics process for understanding the data in depth and
learning its different characteristics, often with visual means. This gives you a better feel for
your data and helps you find useful patterns in it.

It is crucial to understand your data in depth before you perform data analysis and run it through
an algorithm. You need to know the patterns in your data and determine which variables are important
and which do not play a significant role in the output. Further, some variables may be correlated
with other variables. You also need to recognize errors in your data.

All of this can be done with exploratory data analysis. It helps you gather insights and make better
sense of the data, and it removes irregularities and unnecessary values.

● Helps you prepare your dataset for analysis.
● Allows a machine learning model to fit your dataset better.
● Gives you more accurate results.
● Helps you choose a better machine learning model.

Predictive Data Mining:

The main goal of this mining is to say something about future outcomes rather than current
behaviour. It uses supervised learning functions to predict a target value. The methods in this
category are classification, time-series analysis, and regression. Modelling the data is the core of
predictive analysis: it uses some variables of the present to predict future, unknown values of
other variables.
Examples of predictive data mining include regression analysis, decision trees, and neural networks.
Regression analysis predicts a continuous outcome variable from one or more predictor variables.
Decision trees build a tree-like model that makes predictions based on a set of rules. Neural
networks build a model loosely based on the structure of the human brain to make predictions.
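
A small, hedged sketch contrasting the two: the exploratory step inspects the data, the predictive
step fits a supervised model. The spend/sales numbers are made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up dataset: advertising spend vs. resulting sales.
df = pd.DataFrame({"spend": [10, 20, 30, 40, 50],
                   "sales": [25, 44, 68, 81, 105]})

# Exploratory data analysis: summarize and inspect relationships first.
print(df.describe())
print(df.corr())   # a strong spend-sales correlation marks spend as important

# Predictive data mining: use present variables to predict unknown future values.
model = LinearRegression().fit(df[["spend"]], df["sales"])
print(model.predict(pd.DataFrame({"spend": [60]})))   # forecast for a new spend level
```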

Q.5 What is the relation between data mining and data warehousing?

Ans: Both data mining and data warehousing are business intelligence tools that are used to turn information
(or data) into actionable knowledge. The important distinctions between the two tools are the methods and
processes each uses to achieve this goal.

Data mining is a process of statistical analysis. Analysts use technical tools to query and sort through
terabytes of data looking for patterns. Usually, the analyst will develop a hypothesis, such as customers who
buy product X usually buy product Y within six months. Running a query on the relevant data to prove or
disprove this theory is data mining. Businesses then use this information to make better business decisions
based on how they understand their customers' and suppliers' behaviors.

Data warehousing describes the process of designing how the data is stored in order to improve
reporting and analysis. Data warehouse experts consider how the various stores of data are connected
and related to each other, conceptually as well as physically. A business's data is usually stored
across a number of databases. However, to be able to analyze the broadest range of data, each of
these databases needs to be connected in some way. This means that the data within them needs a way
of being related to other relevant data, and that the physical databases themselves have a
connection, so their data can be looked at together for reporting purposes.

Q.6 What type of benefits might you hope to get from data mining?

Ans: Data mining can offer numerous benefits across various domains. Here are some potential benefits:

1. Pattern discovery: Data mining helps identify patterns and trends within large datasets that
might not be immediately apparent. This can lead to insights that inform decision-making processes.

2. Prediction and forecasting: By analyzing historical data, data mining algorithms can make
predictions about future trends or events. This can be valuable for businesses in predicting
customer behavior, market trends, or resource demands.

3. Customer segmentation: Understanding customer segments based on their behavior, preferences, and
demographics can help businesses tailor their marketing strategies, product offerings, and customer
service to specific groups, ultimately improving customer satisfaction and retention.

4. Anomaly detection: Data mining techniques can detect unusual patterns or outliers in data, which
may indicate fraud, errors, or other anomalies. This can be particularly useful in industries such
as finance, cybersecurity, and healthcare.

5. Optimization: Data mining can help optimize processes and operations by identifying
inefficiencies, bottlenecks, or areas for improvement. This can lead to cost savings, increased
productivity, and better resource allocation.

6. Personalization: By analyzing large amounts of customer data, businesses can personalize their
products, services, and marketing efforts to better meet individual customer needs and preferences.

7. Risk management: Data mining can assist in assessing and managing risks by analyzing historical
data to identify potential risk factors and predict future risks. This is particularly important in
industries such as insurance, finance, and healthcare.

8. Research and innovation: Data mining can uncover new insights and knowledge in various fields,
leading to advancements in science, technology, healthcare, and other domains.

Q.7 Explain Naïve Bayes classification.

Ans:
o The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem, used for
solving classification problems.
o It is mainly used for text classification with high-dimensional training datasets.
o The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it
helps build fast machine learning models that can make quick predictions.
o It is a probabilistic classifier: it predicts on the basis of the probability of an object.
o Popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of one feature is independent of
the occurrence of the others. For example, if a fruit is identified on the basis of color, shape,
and taste, then a red, spherical, sweet fruit is recognized as an apple; each feature individually
contributes to identifying it as an apple, without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,

P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the marginal probability: the probability of the evidence.
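
A minimal hedged sketch of the spam-filtering use case mentioned above, using scikit-learn's
multinomial Naïve Bayes; the four example messages are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus for the classic spam-filtering application.
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer inside", "project status meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()                 # words become per-message feature counts
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)    # applies Bayes' theorem per word, "naively"

print(clf.predict(vec.transform(["claim your free prize"])))   # expected: ['spam']
```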

Q.8 Explain decision tree over other classification methods?

Ans: A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification
and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal
nodes and leaf nodes.
A decision tree starts with a root node, which has no incoming branches. The outgoing branches from
the root node feed into the internal nodes, also known as decision nodes. Based on the available
features, both node types conduct evaluations to form homogeneous subsets, which are denoted by leaf
nodes, or terminal nodes. The leaf nodes represent all the possible outcomes within the dataset.

As an example, imagine you are trying to decide whether or not you should go surfing. You might use
decision rules such as "Are the waves big?" and "Is the wind strong?": each internal node tests one
condition, and each leaf gives a final go/no-go decision.
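
A hedged sketch of this surfing example with scikit-learn; the condition columns and the six
observations are invented for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented observations: go surfing only when the wind is calm and the waves are big.
df = pd.DataFrame({"strong_wind": [1, 1, 0, 0, 1, 0],
                   "big_waves":   [1, 0, 1, 0, 0, 1],
                   "go_surf":     [0, 0, 1, 0, 0, 1]})

tree = DecisionTreeClassifier().fit(df[["strong_wind", "big_waves"]], df["go_surf"])

# Print the learned rules: root node, internal decision nodes, and leaves.
print(export_text(tree, feature_names=["strong_wind", "big_waves"]))
```
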
Q.9 Explain clustering and all its types, with examples.
Ans: Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping the data points into different clusters consisting
of similar data points. The objects with possible similarities remain in a group that has few or no
similarities with another group."

Example: Let's understand the clustering technique with the real-world example of a shopping mall:
when we visit a mall, we can observe that things with similar usage are grouped together. The
t-shirts are grouped in one section and trousers in another; similarly, in the vegetable section,
apples, bananas, mangoes, etc., are grouped separately so that we can easily find things. The
clustering technique works in the same way. Another example of clustering is grouping documents by
topic.

Types of Clustering Methods

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-Means
clustering algorithm.

In this type, the dataset is divided into k groups, where k defines the number of pre-defined
groups. The cluster centers are chosen so that the distance from each data point to its own
cluster's centroid is minimal compared to the distance to other cluster centroids.

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, forming arbitrarily
shaped distributions as long as the dense regions can be connected. The algorithm does this by
identifying different clusters in the dataset and connecting the areas of high density into
clusters. The dense areas in data space are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has varying densities
and high dimensionality.
Distribution Model-Based Clustering
In the distribution-model-based clustering method, the data is divided based on the probability that
a data point belongs to a particular distribution. The grouping is done by assuming some
distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian
Mixture Models (GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, called a dendrogram. The observations, or any
number of clusters, can be selected by cutting the tree at the appropriate level. The most common
example of this method is the agglomerative hierarchical algorithm.

Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one group
or cluster. Each data point has a set of membership coefficients that express its degree of
membership in each cluster. The fuzzy c-means algorithm is the classic example of this type of
clustering; it is sometimes also known as the fuzzy k-means algorithm.
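
A minimal sketch of partitioning clustering with scikit-learn's K-Means; the six 2-D points are made
up to form two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up dense regions of 2-D points.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [8, 9]])

# Partition the points into k = 2 clusters around learned centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # the two learned centroids
```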

Q.10 Explain FP-Growth algorithm?

Ans: The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate
generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of
the method is a special data structure named the frequent-pattern tree (FP-tree), which retains the
itemset association information.

This algorithm works as follows:

o First, it compresses the input database, creating an FP-tree instance to represent the frequent
items.
o After this first step, it divides the compressed database into a set of conditional databases,
each associated with one frequent pattern.
o Finally, each such database is mined separately.

Using this strategy, FP-Growth reduces the search cost by recursively looking for short patterns and
then concatenating them into longer frequent patterns.

In large databases, holding the FP-tree in main memory may be impossible. One strategy to cope with
this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.
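
A hedged sketch using the third-party mlxtend library (assumed installed via pip install mlxtend);
the three-item baskets are invented:

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth  # third-party library, not scikit-learn

# Invented one-hot transactions: each row is a basket, each column an item.
baskets = pd.DataFrame([[1, 1, 0],
                        [1, 1, 1],
                        [0, 1, 1],
                        [1, 1, 0]],
                       columns=["bread", "milk", "butter"]).astype(bool)

# Mine all itemsets appearing in at least 50% of baskets via the FP-tree.
print(fpgrowth(baskets, min_support=0.5, use_colnames=True))
```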

Q.11 Write APRIORI algorithm with example?

Ans: The Apriori algorithm is used to compute association rules between objects, i.e., how two or
more objects are related to one another. In other words, the Apriori algorithm is an association
rule learning method that analyzes, for example, whether people who bought product A also bought
product B.

The primary objective of the Apriori algorithm is to create association rules between different
objects. An association rule describes how two or more objects are related to one another. The
Apriori algorithm is also called frequent pattern mining. Generally, you run the Apriori algorithm
on a database that consists of a huge number of transactions. For instance, the products customers
buy at Big Bazar form such transactions; mining them helps customers buy their products with ease
and increases the sales performance of the Big Bazar.

Let's take an example to understand this concept.

As discussed above, you need a huge database containing a large number of transactions. Suppose you
have 4,000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and
Lift for two products, say Biscuits and Chocolate, because customers frequently buy these two items
together.

Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolate; 200 transactions
contain both Biscuits and Chocolate. Using this data, we will find the support, confidence, and
lift.

Support

Support refers to the default popularity of a product. You find the support by dividing the number
of transactions containing that product by the total number of transactions. Hence, we get

Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)

= 400/4000 = 10 percent.

Similarly, Support (Chocolate) = 600/4000 = 15 percent.

Confidence

Confidence refers to the likelihood that a customer who bought biscuits also bought chocolates. You
find it by dividing the number of transactions containing both biscuits and chocolates by the total
number of transactions containing biscuits.

Hence,

Confidence = (Transactions containing both Biscuits and Chocolate) / (Total transactions involving Biscuits)

= 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits also bought chocolates.

Lift

In the example above, lift refers to the increase in the likelihood of a chocolate sale when
biscuits are sold. The standard equation for lift is:

Lift (Biscuits → Chocolate) = Confidence (Biscuits → Chocolate) / Support (Chocolate)

= 50/15 ≈ 3.33

It means that customers who buy biscuits are about 3.3 times more likely to buy chocolates than
customers in general. If the lift value is below one, people are unlikely to buy both items
together; the larger the value, the better the combination.

Q.12 Explain whether association rule mining is a supervised or unsupervised type of learning, and
differentiate supervised and unsupervised machine learning techniques.

Ans: Association rule mining is typically considered an unsupervised learning technique. Here's why:

Association Rule Mining:

Association rule mining is a data mining technique used to discover interesting relationships or
associations between variables in large datasets. It identifies patterns or rules that describe how
items are associated with each other in transactions or events. For example, in a retail dataset,
association rule mining might discover that customers who buy bread are also likely to buy butter.

It has the characteristics of unsupervised learning:

1. No labeled data: Unsupervised learning algorithms work with unlabeled data, meaning there are no
predefined labels or outcomes that the algorithm needs to predict. In association rule mining, the
algorithm discovers patterns and relationships without the need for labeled data.

2. Goal of finding structure: Unsupervised learning aims to find hidden structure or patterns within
the data without guidance on what specific patterns to look for. In association rule mining, the
goal is to discover interesting associations or relationships between items without specifying them
beforehand.

Supervised vs. Unsupervised Learning:

Supervised Learning:
- Labeled data: Supervised learning algorithms require labeled training data, where each data point
is associated with a known outcome or label.
- Prediction: The goal of supervised learning is to learn a mapping from input features to output
labels, allowing the algorithm to make predictions on new, unseen data.
- Examples: Classification and regression are common tasks in supervised learning, e.g., predicting
whether an email is spam (classification) or predicting house prices (regression).

Unsupervised Learning:
- Unlabeled data: Unsupervised learning algorithms work with unlabeled data, where there are no
predefined output labels.
- Discovering patterns: The goal of unsupervised learning is to discover hidden patterns,
structures, or relationships within the data.
- Examples: Clustering, dimensionality reduction, and association rule mining are examples of
unsupervised learning tasks. Clustering algorithms group similar data points together,
dimensionality reduction techniques simplify data while retaining important information, and
association rule mining discovers relationships between variables.

Q.13 Explain the architecture of a data warehouse.

Ans: A data warehouse is a heterogeneous collection of different data sources organised under a
unified schema. There are two approaches for constructing a data warehouse, top-down and bottom-up;
the top-down approach is explained below.
1. Top-down approach:

The essential components are discussed below:

1. External Sources –
An external source is a source from which data is collected, irrespective of the type of data. Data
can be structured, semi-structured, or unstructured.

2. Staging Area –
Since the data extracted from the external sources does not follow a particular format, it needs to
be validated before being loaded into the data warehouse. For this purpose, it is recommended to use
an ETL tool.
● E (Extract): Data is extracted from the external data source.
● T (Transform): Data is transformed into the standard format.
● L (Load): Data is loaded into the data warehouse after being transformed into the standard format.

3. Data Warehouse –
After cleansing, the data is stored in the data warehouse as a central repository. It actually
stores the metadata, while the actual data gets stored in the data marts. Note that in this top-down
approach, the data warehouse stores the data in its purest form.

4. Data Marts –
A data mart is also part of the storage component. It stores the information of a particular
function of an organisation which is handled by a single authority. There can be as many data marts
in an organisation as there are functions. We can also say that a data mart contains a subset of the
data stored in the data warehouse.

5. Data Mining –
The practice of analysing the big data present in the data warehouse is data mining. It is used to
find the hidden patterns present in the database or data warehouse with the help of data mining
algorithms.
This approach is defined by Inmon as follows: the data warehouse is a central repository for the
complete organisation, and data marts are created from it after the complete data warehouse has been
created.
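
A minimal, hedged sketch of the staging-area ETL step in Python/pandas; the file names and columns
are hypothetical, and writing Parquet assumes pyarrow is installed:

```python
import pandas as pd

# E (Extract): pull raw data from a hypothetical external source file.
raw = pd.read_csv("orders.csv")

# T (Transform): validate and standardize into the warehouse format.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date"])    # discard rows that failed validation
raw["amount"] = raw["amount"].fillna(0)

# L (Load): write the cleaned data into the central warehouse store.
raw.to_parquet("warehouse/orders.parquet")
```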

Q.14 Explain differences between OLAP and OLTP .

Ans:

| Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
|---|---|---|
| Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system. |
| Data source | Consists of historical data from various databases. | Consists of only operational, current data. |
| Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS). |
| Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks. |
| Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF). |
| Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations. |
| Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks. |
| Purpose | It serves to extract information for analysis and decision-making. | It serves to insert, update, and delete information in the database. |
| Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small as the historical data is archived, in MB or GB. |
| Queries | Relatively slow as the amount of data involved is large; queries may take hours. | Very fast, as the queries operate on about 5% of the data. |
| Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database. |
| Backup and recovery | It only needs backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously. |
| Processing time | The processing of complex queries can take a long time. | It is comparatively fast in processing because of simple, straightforward queries. |
| Types of users | This data is generally managed by the CEO, MD, and GM. | This data is managed by clerks and managers. |
| Operations | Only read and rarely write operations. | Both read and write operations. |
| Updates | Data is refreshed on a regular basis with lengthy, scheduled batch operations. | The user initiates data updates, which are brief and quick. |
| Nature of audience | The process is focused on the customer. | The process is focused on the market. |
| Database design | Designed with a focus on the subject. | Designed with a focus on the application. |
| Productivity | Improves the efficiency of business analysts. | Enhances the user's productivity. |

Q.15 Explain the differences between “classification” and “clustering”.

Ans:
| Parameter | Classification | Clustering |
|---|---|---|
| Type | Used for supervised learning. | Used for unsupervised learning. |
| Basic | Process of classifying the input instances based on their corresponding class labels. | Grouping the instances based on their similarity, without the help of class labels. |
| Need | It has labels, so training and testing datasets are needed to verify the model created. | There is no need for training and testing datasets. |
| Complexity | More complex as compared to clustering. | Less complex as compared to classification. |
| Example algorithms | Logistic regression, Naive Bayes classifier, support vector machines, etc. | k-means clustering, fuzzy c-means clustering, Gaussian (EM) clustering, etc. |
SESSION: 2023 – 2024

NAME – Ketan Gupta

COURSE NAME – B.TECH – CSE

ENROLL NO. – 2102305016

SEMESTER – 6TH SEM

SUBMITTED BY – Ketan Gupta

SUBMITTED TO – Mr. Deepak Sir

DATE –
