Assignment of DMDW kg11
Q.1 What are the benefits of data cleaning?
Ans: Having clean data will ultimately increase overall productivity and allow for the highest-quality
information in your decision-making. Benefits include fewer errors when multiple data sources are
combined, more consistent reporting, and greater confidence in the decisions based on that data.
Q.2 How does the STAR schema differ from the snowflake schema?
Ans:

| Parameter | Star Schema | Snowflake Schema |
| --- | --- | --- |
| Definition and Meaning | A star schema contains both dimension tables and fact tables. | A snowflake schema contains all three: dimension tables, fact tables, and sub-dimension tables. |
| Space Occupied | It makes use of more allotted space. | It makes use of less allotted space. |
| Time Taken for Queries | With the Star Schema, the execution of queries takes less time. | With the Snowflake Schema, the execution of queries takes more time. |
| Use of Normalization | The Star Schema does not make use of normalization. | The Snowflake Schema makes use of both denormalization and normalization. |
| Complexity of Design | The design of a Star Schema is very simple. | The design of a Snowflake Schema is very complex. |
| Query Complexity | It is very low in the case of a Star Schema. | It is comparatively much higher in the case of a Snowflake Schema. |
| Total Number of Foreign Keys | The total number of foreign keys is less in the case of a Star Schema. | The total number of foreign keys is more in the case of a Snowflake Schema. |
Q.4 Explain the difference between “Explorative data mining” and “predictive data mining”.
Ans: Exploratory data mining (exploratory data analysis) is the process of understanding the data in depth
and learning its different characteristics, often with visual means. This allows you to get a better feel for
your data and find useful patterns in it.
It is crucial to understand the data in depth before you perform analysis and run it through an algorithm.
You need to know the patterns in your data and determine which variables are important and which do not
play a significant role in the output. Further, some variables may be correlated with other variables. You
also need to recognize errors in your data.
All of this can be done with exploratory data mining. It helps you gather insights, make better sense of the
data, and remove irregularities and unnecessary values.
Predictive data mining, by contrast, uses the patterns found in historical data to build models that estimate
future or unknown values, for example predicting which customers are likely to churn. In short, explorative
mining describes and summarizes what is already in the data, while predictive mining uses that data to make
forecasts about new cases.
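As an illustration, a minimal exploratory pass in Python with pandas (assuming a recent pandas version)
might look like the following; the file name and column names are placeholders, not part of the assignment:

```python
# A minimal exploratory data analysis sketch using pandas.
# "sales.csv" and its columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.head())                    # peek at the first rows
print(df.describe())                # summary statistics per numeric column
print(df.isnull().sum())            # count missing values per column
print(df.corr(numeric_only=True))   # pairwise correlations between variables

# drop duplicate rows and obviously invalid records as a first cleaning step
df = df.drop_duplicates()
df = df[df["amount"] >= 0]
```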
Q.5 What is the relation between data mining and data warehousing?
Ans: Both data mining and data warehousing are business intelligence tools that are used to turn information
(or data) into actionable knowledge. The important distinctions between the two tools are the methods and
processes each uses to achieve this goal.
Data mining is a process of statistical analysis. Analysts use technical tools to query and sort through
terabytes of data looking for patterns. Usually, the analyst will develop a hypothesis, such as customers who
buy product X usually buy product Y within six months. Running a query on the relevant data to prove or
disprove this theory is data mining. Businesses then use this information to make better business decisions
based on how they understand their customers' and suppliers' behaviors.
Data warehousing describes the process of designing how the data is stored in order to improve reporting and
analysis. Data warehouse experts consider that the various stores of data are connected and related to each
other conceptually as well as physically. A business's data is usually stored across a number of databases.
However, to be able to analyze the broadest range of data, each of these databases needs to be connected in
some way. This means that the data within them needs a way of being related to other relevant data, and that
the physical databases themselves have a connection so their data can be looked at together for reporting
purposes.
Q.6 What types of benefits might you hope to get from data mining?
Ans: Data mining can offer numerous benefits across various domains. Here are some potential benefits:
1. **Pattern Discovery:** Data mining helps identify patterns and trends within large datasets that might
not be immediately apparent. This can lead to insights that can inform decision-making processes.
2. **Prediction and Forecasting:** By analyzing historical data, data mining algorithms can make
predictions about future trends or events. This can be valuable for businesses in predicting customer
behavior, market trends, or resource demands.
3. **Customer Segmentation:** Understanding customer segments based on their behavior, preferences, and
demographics can help businesses tailor their marketing strategies, product offerings, and customer service to
specific groups, ultimately improving customer satisfaction and retention.
4. **Anomaly Detection:** Data mining techniques can detect unusual patterns or outliers in data, which
may indicate fraud, errors, or other anomalies. This can be particularly useful in industries such as finance,
cybersecurity, and healthcare.
5. **Optimization:** Data mining can help optimize processes and operations by identifying
inefficiencies, bottlenecks, or areas for improvement. This can lead to cost savings, increased productivity,
and better resource allocation.
6. **Personalization:** By analyzing large amounts of customer data, businesses can personalize their
products, services, and marketing efforts to better meet individual customer needs and preferences.
7. **Risk Management:** Data mining can assist in assessing and managing risks by analyzing historical data
to identify potential risk factors and predict future risks. This is particularly important in industries such as
insurance, finance, and healthcare.
8. **Research and Innovation:** Data mining can uncover new insights and knowledge in various fields,
leading to advancements in science, technology, healthcare, and other domains.
Q.7 Explain the Naïve Bayes algorithm.
o Ans: Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and
used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, and it helps in
building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of
the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste,
then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to
identifying it as an apple without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability
of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
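As an illustration, a minimal text-classification sketch with scikit-learn's Multinomial Naïve Bayes could
look like this; the toy documents and labels are invented for the example:

```python
# A minimal Naive Bayes text classification sketch using scikit-learn.
# The documents and spam/ham labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win a free prize now", "meeting agenda for monday",
        "free lottery winner", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # bag-of-words features

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["free prize meeting"])
print(model.predict(test))           # predicted class label
print(model.predict_proba(test))     # per-class probabilities from Bayes' rule
```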
Q.8 What is a decision tree?
Ans: A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification
and regression tasks. It has a hierarchical tree structure, which consists of a root node, branches, internal
nodes, and leaf nodes.
A decision tree starts with a root node, which does not have any incoming branches. The outgoing branches
from the root node then feed into the internal nodes, also known as decision nodes. Based on the available
features, both node types conduct evaluations to form homogeneous subsets, which are denoted by leaf
nodes, or terminal nodes. The leaf nodes represent all the possible outcomes within the dataset.
As an example, imagine that you are trying to assess whether or not you should go surfing; you might use
decision rules based on features such as wave height and wind speed to make the choice, as in the sketch below.
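A minimal sketch of such a classifier with scikit-learn; the surfing features, values, and labels are invented
for illustration:

```python
# A minimal decision tree sketch using scikit-learn.
# The surfing data below is made up for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [wave_height_m, wind_speed_kmh]
X = [[0.5, 30], [2.0, 10], [1.5, 15], [0.3, 5], [2.5, 40], [1.8, 12]]
y = ["no", "yes", "yes", "no", "no", "yes"]  # go surfing?

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# print the learned decision rules (root node, internal nodes, leaves)
print(export_text(tree, feature_names=["wave_height_m", "wind_speed_kmh"]))
print(tree.predict([[1.7, 20]]))  # classify a new day
```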
Q.9 Explain clustering and all its types with examples.
Ans: Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It can
be defined as "A way of grouping the data points into different clusters, consisting of similar data points.
The objects with the possible similarities remain in a group that has less or no similarities with another
group."
Example: Let's understand the clustering technique with the real-world example of a shopping mall: when
we visit any shopping mall, we can observe that things with similar usage are grouped together, such as
t-shirts in one section and trousers in another; similarly, in the fruit and vegetable sections, apples, bananas,
mangoes, etc. are grouped separately, so that we can easily find things. The clustering technique works in
the same way. Another example of clustering is grouping documents according to topic.
Types of Clustering Methods
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-
based method. The most common example of partitioning clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups.
The cluster centers are created in such a way that the distance between the data points within a cluster is
minimal compared to their distance to other cluster centroids.
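A minimal K-Means sketch using scikit-learn; the 2-D points are invented for the example:

```python
# A minimal K-Means (partitioning clustering) sketch using scikit-learn.
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # the learned centroids
```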
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped
clusters can be formed as long as the dense regions can be connected. The algorithm identifies dense regions
in the data space, which are separated from each other by sparser areas. These algorithms can have difficulty
clustering the data points if the dataset has varying densities and high dimensionality. A common example is
the DBSCAN algorithm, sketched below.
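A minimal density-based sketch with scikit-learn's DBSCAN; the points are invented, and points in sparse
regions are marked as noise:

```python
# A minimal density-based clustering sketch using scikit-learn's DBSCAN.
# Points that cannot be connected to a dense region get the label -1 (noise).
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # e.g. [0 0 0 1 1 -1]: two dense clusters plus one noise point
```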
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data
point belongs to a particular distribution. The grouping is done by assuming some underlying distributions,
commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture
Models (GMM).
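A minimal Gaussian Mixture sketch with scikit-learn, which fits the mixture via Expectation-Maximization;
the data is invented:

```python
# A minimal distribution model-based clustering sketch using a GMM,
# fitted with the Expectation-Maximization algorithm in scikit-learn.
from sklearn.mixture import GaussianMixture
import numpy as np

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
              [9.0, 9.5], [9.2, 9.1], [8.8, 9.3]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))         # hard cluster assignment per point
print(gmm.predict_proba(X))   # probability of belonging to each Gaussian
```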
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement
to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to
create a tree-like structure, which is also called a dendrogram. The observations, or any number of clusters,
can be selected by cutting the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical Clustering algorithm.
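A minimal agglomerative sketch using SciPy, where the merge tree is built and then "cut" at a chosen level;
the points are invented:

```python
# A minimal agglomerative hierarchical clustering sketch using SciPy:
# linkage() builds the dendrogram, fcluster() cuts it at a chosen level.
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [9, 7]])

Z = linkage(X, method="ward")                     # the dendrogram as a merge table
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)                                     # e.g. [1 1 1 2 2 2]
```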
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or
cluster. Each data point has a set of membership coefficients, which indicate its degree of membership in
each cluster. The Fuzzy C-Means algorithm is an example of this type of clustering; it is sometimes also
known as the Fuzzy K-Means algorithm.
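Since Fuzzy C-Means is not part of scikit-learn, a minimal from-scratch sketch is shown below; the toy data
and all variable names are illustrative only:

```python
# A minimal from-scratch Fuzzy C-Means sketch (not in scikit-learn).
# Each row of U holds one point's degrees of membership in the c clusters.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random initial membership matrix; each row sums to 1
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # cluster centers: membership-weighted means of the points
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # distances from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # membership update: closer centers receive higher degrees
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))  # each row: soft membership in the two clusters
```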
Q.10 Explain the FP-Growth algorithm.
Ans: The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate
generation, thus improving performance. For that, it uses a divide-and-conquer strategy. The core of this
method is the usage of a special data structure named the frequent-pattern tree (FP-tree), which retains the
itemset association information.
o First, it compresses the input database creating an FP-tree instance to represent frequent items.
o After this first step, it divides the compressed database into a set of conditional databases, each
associated with one frequent pattern.
o Finally, each such database is mined separately.
Using this strategy, the FP-Growth reduces the search costs by recursively looking for short patterns and then
concatenating them into the long frequent patterns.
In large databases, holding the FP tree in the main memory is impossible. A strategy to cope with this problem
is to partition the database into a set of smaller databases (called projected databases) and then construct an
FP-tree from each of these smaller databases.
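A minimal FP-Growth sketch assuming the third-party mlxtend library (installed with `pip install mlxtend`);
the transactions are invented:

```python
# A minimal FP-Growth sketch, assuming the mlxtend library is available.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["bread", "milk"],
                ["bread", "butter", "milk"],
                ["butter", "milk"],
                ["bread", "butter"]]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# itemsets appearing in at least 50% of the transactions
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```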
Q.11 Explain the Apriori algorithm with an example.
Ans: The Apriori algorithm is used to calculate the association rules between objects, i.e., how two or more
objects are related to one another. In other words, the Apriori algorithm is an association rule learning
technique that analyzes, for example, whether people who bought product A also bought product B.
The primary objective of the Apriori algorithm is to create association rules between different objects. An
association rule describes how two or more objects are related to one another. The Apriori algorithm is also
a form of frequent pattern mining. Generally, you operate the Apriori algorithm on a database that consists
of a huge number of transactions. Let's understand the Apriori algorithm with the help of an example:
suppose you go to Big Bazar and buy different products. Association rules help the customers buy their
products with ease and increase the sales performance of the Big Bazar.
As discussed above, you need a huge database containing a large number of transactions. Suppose you have
4,000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and Lift for two
products, say Biscuits and Chocolate, because customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 transactions contain
both Biscuits and Chocolate. Using this data, we will find the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the support by dividing the number of
transactions containing that product by the total number of transactions. Hence,
Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. To get the
confidence, you divide the number of transactions containing both biscuits and chocolates by the number of
transactions containing biscuits.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions involving Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift
Considering the above example, lift refers to the increase in the likelihood of selling chocolates when
biscuits are sold, relative to how often chocolates are sold in general. It is calculated as:
Lift = Confidence (Biscuits → Chocolate) / Support (Chocolate)
With Support (Chocolate) = 600/4000 = 15 percent, this gives
Lift = 50/15 ≈ 3.33
It means that customers are about 3.3 times more likely to buy chocolates together with biscuits than to buy
chocolates in general. If the lift value is below one, it indicates that people are unlikely to buy both items
together; the larger the value, the better the combination.
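A few lines of Python reproduce these numbers directly from the transaction counts:

```python
# Verifying support, confidence, and lift from the transaction counts above.
total = 4000      # total transactions
biscuits = 400    # transactions containing Biscuits
chocolate = 600   # transactions containing Chocolate
both = 200        # transactions containing both

support_biscuits = biscuits / total        # 0.10 -> 10 percent
confidence = both / biscuits               # 0.50 -> 50 percent
lift = confidence / (chocolate / total)    # 0.50 / 0.15 ≈ 3.33

print(support_biscuits, confidence, round(lift, 2))
```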
Q.12 Explain whether association rule mining is supervised or unsupervised type of learning
and differentiate supervised and unsupervised machine learning techniques ?
Ans: Association rule mining is an unsupervised type of learning, for two main reasons:
1. **No labeled data:** Unsupervised learning works on data that has no predefined labels or target
outcomes. Association rule mining operates on raw transaction data in which no outcome variable is
specified.
2. **Goal of finding structure:** Unsupervised learning aims to find hidden structure or patterns within the
data without guidance on what specific patterns to look for. In association rule mining, the goal is to
discover interesting associations or relationships between items without specifying them beforehand.
Supervised Learning:
- **Labeled Data:** Supervised learning algorithms require labeled training data, where each data point
is associated with a known outcome or label.
- **Prediction:** The goal of supervised learning is to learn a mapping from input features to output labels,
allowing the algorithm to make predictions on new, unseen data.
- **Examples:** Classification and regression are common tasks in supervised learning. For
example, predicting whether an email is spam (classification) or predicting house prices (regression).
Unsupervised Learning:
- **Unlabeled Data:** Unsupervised learning algorithms work with data that has no associated labels or
known outcomes.
- **Examples:** Clustering, dimensionality reduction, and association rule mining are examples of
unsupervised learning tasks. Clustering algorithms group similar data points together, dimensionality
reduction techniques simplify data while retaining important information, and association rule mining
discovers relationships between variables.
Q.13 Explain the top-down approach to building a data warehouse.
Ans: A data warehouse is a heterogeneous collection of different data sources organised under a unified
schema. There are two approaches for constructing a data warehouse, the top-down approach and the
bottom-up approach; the top-down approach is explained below.
1. Top-down approach:
1. External Sources –
External source is a source from where data is collected irrespective of the type of data. Data can be
structured, semi structured and unstructured as well.
2. Stage Area –
Since the data extracted from the external sources does not follow a particular format, it needs to be
validated before loading into the data warehouse. For this purpose, it is recommended to use an ETL
tool (a minimal sketch of this step follows after this list).
● E (Extract): Data is extracted from the external data source.
● T (Transform): Data is transformed into the standard format.
● L (Load): Data is loaded into the data warehouse after transformation.
3. Data warehouse –
After cleansing, the data is stored in the data warehouse as a central repository. The warehouse actually
stores the metadata, while the actual data gets stored in the data marts. Note that the data warehouse stores
the data in its purest form in this top-down approach.
4. Data Marts –
A data mart is also a part of the storage component. It stores the information of a particular function of an
organisation, handled by a single authority. There can be as many data marts in an organisation as there are
functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
5. Data Mining –
The practice of analysing the big data present in the data warehouse is data mining. It is used to find hidden
patterns present in the database or the data warehouse with the help of data mining algorithms.
This approach is defined by Inmon: the data warehouse is built as a central repository for the complete
organisation, and data marts are created from it after the complete data warehouse has been created.
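A minimal ETL sketch in Python with pandas, illustrating the stage-area step above; the file names, column
names, and cleaning rules are hypothetical:

```python
# A minimal Extract-Transform-Load sketch using pandas.
# All file and column names below are hypothetical placeholders.
import pandas as pd

# Extract: pull raw data from an external source
raw = pd.read_csv("external_sales_source.csv")

# Transform: validate and standardize into the warehouse format
clean = raw.dropna(subset=["order_id"])                 # drop invalid records
clean["order_date"] = pd.to_datetime(clean["order_date"])
clean["amount"] = clean["amount"].astype(float)

# Load: write the standardized data into the central repository
clean.to_parquet("warehouse/sales_fact.parquet")
```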
Q.14 Differentiate between OLAP and OLTP.
Ans:

| Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
| --- | --- | --- |
| Normalization | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF). |
| Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations. |
| Purpose | It serves the purpose of extracting information for analysis and decision-making. | It serves the purpose of inserting, updating, and deleting information in the database. |
| Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, in MB or GB, as the historical data is archived. |
| Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database. |
| Backup and Recovery | It only needs backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously. |
| Types of users | This data is generally used by CEOs, MDs, and GMs. | This data is used by clerks and managers. |
| Operations | Only read and rarely write operations. | Both read and write operations. |
| Nature of audience | The process is focused on the market. | The process is focused on the customer. |
| Database Design | Design with a focus on the subject. | Design that is focused on the application. |
Q.15 Differentiate between classification and clustering.
Ans:

| Parameter | CLASSIFICATION | CLUSTERING |
| --- | --- | --- |
| Basic | The process of classifying the input instances based on their corresponding class labels. | The process of grouping the instances based on their similarity, without the help of class labels. |
SUBMITTED BY – Ketan Gupta
SUBMITTED TO – Mr. Deepak Sir
DATE –