Data Warehouse

A data warehouse is a centralized repository that stores integrated data from various sources within an organization. It is designed to support business intelligence (BI) activities, data analysis, and reporting by providing a consolidated view of historical and current data. The primary goal of a data warehouse is to provide decision-makers and analysts with easy access to consistent, reliable, and relevant information for making informed business decisions.

Having a separate data warehouse offers several advantages compared to relying solely on operational databases or other data storage solutions. Here are some key reasons for having a dedicated data warehouse:
Performance and Scalability: Operational databases are designed for transactional processing, which means they prioritize speedy data insertion, updates, and deletions.
Data Consolidation and Integration: Data warehouses consolidate data from multiple sources across the organization, including different departments and systems.
Historical Data Storage: Data warehouses retain historical data over time, providing an opportunity for trend analysis and long-term insights.
Support for Complex Analytics: Data warehouses are designed to support complex analytical operations like OLAP (Online Analytical Processing), data mining, and advanced reporting.
Separation of Workloads: By having a separate data warehouse, organizations can segregate analytical workloads from operational workloads.
Data Transformation and Cleansing: Data warehouses often involve an ETL (Extract, Transform, Load) process, where data is extracted from various sources, transformed into a consistent format, and loaded into the warehouse.
Decision Support and Business Intelligence: Data warehouses serve as a foundation for business intelligence (BI) initiatives.

What do you mean by enterprise warehouse?
An enterprise data warehouse (EDW) is a type of data warehouse that serves as the central repository for integrated data from various sources across an entire organization. It goes beyond departmental or individual data warehouses and aims to provide a comprehensive view of an enterprise's data assets. The primary goal of an enterprise warehouse is to support strategic decision-making, business intelligence, and analytics for the entire organization.

What do you mean by data mart?
A data mart is a subset or specialized version of a data warehouse that focuses on providing data for a specific business unit, department, or functional area within an organization. While a data warehouse serves as a centralized repository for integrated data from various sources across the entire enterprise, data marts are designed to meet the specific analytical and reporting needs of individual teams or user groups.
Operations on concept hierarchies:
a. Aggregation: Hierarchies are computed using aggregation methods, where data at higher levels is summarized from data at lower levels. For example, in a time hierarchy, quarterly data is computed by aggregating monthly data, and yearly data is computed by aggregating quarterly data.
b. Roll-Up: Roll-up is the process of moving from a detailed level to a higher-level aggregation. It involves combining data from child members to their parent members. For instance, rolling up monthly sales data to quarterly or yearly sales.
c. Drill-Down: Drill-down is the process of moving from a higher-level aggregation to a more detailed level. It involves breaking down aggregated data into its component parts. For example, drilling down from yearly sales to monthly or daily sales.
d. Consolidation: Consolidation is the process of aggregating data from multiple hierarchies or dimensions to a higher level. It is used when data is distributed across different dimensions and needs to be combined.
e. Slicing and Dicing: Slicing involves selecting a single dimension from a multidimensional dataset to view a subset of data. Dicing involves selecting specific combinations of dimension values to focus on a specific subset of data.

What do you mean by OLAP operations?
OLAP (Online Analytical Processing) operations refer to a set of analytical operations that are performed on multidimensional data in a data warehouse or OLAP cube. These operations allow users to interactively analyze and explore data from various perspectives, facilitating data-driven decision-making and business intelligence. OLAP operations are designed to handle complex queries and aggregations efficiently, providing users with the ability to slice, dice, pivot, drill down, and roll up data.

What do you mean by operations in the multidimensional data model (OLAP)?
In the multidimensional data model (OLAP - Online Analytical Processing), operations refer to a set of analytical functions that are performed on multidimensional data to facilitate interactive and complex data analysis. The multidimensional data model organizes data into a data cube, where each dimension represents an attribute or category, and measures represent the quantitative data associated with these dimensions. The OLAP operations allow users to manipulate and explore the data in the cube to gain insights, perform data analysis, and make data-driven decisions.
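To make these operations concrete, here is a minimal sketch using pandas; the sales table, column names, and values are invented purely for illustration:

```python
import pandas as pd

# Toy fact table: one row per (year, quarter, month, region) with a sales measure.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q1"],
    "month":   ["Jan", "Feb", "Apr", "May", "Jan", "Feb"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "sales":   [100, 150, 120, 180, 110, 160],
})

# Roll-up: aggregate monthly figures up the time hierarchy to quarters, then years.
by_quarter = sales.groupby(["year", "quarter"])["sales"].sum()
by_year = sales.groupby("year")["sales"].sum()

# Drill-down: go back from the yearly view to the more detailed month level.
by_month = sales.groupby(["year", "quarter", "month"])["sales"].sum()

# Slice: fix a single dimension value (region = "East").
east_slice = sales[sales["region"] == "East"]

# Dice: pick specific values on several dimensions at once.
dice = sales[(sales["region"] == "West") & (sales["quarter"] == "Q1")]

print(by_quarter, by_year, by_month, east_slice, dice, sep="\n\n")
```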
OLAP stands for Online Analytical Processing, and OLAP queries refer to the type of queries used to perform multidimensional analysis on data stored in a data warehouse or a multidimensional database. OLAP enables users to gain insights and explore data from different perspectives, allowing for more complex and interactive analysis compared to traditional relational database queries. OLAP queries are typically used for analytical purposes, such as data exploration, trend analysis, aggregation, and drilling down into detailed data.

"Snowflakes" refer to a type of data modeling technique used to organize data within a dimensional data warehouse. The snowflake schema is an extension of the more commonly used "star schema" and is designed to optimize storage and improve data normalization. The main characteristic of a snowflake schema is that it breaks down some of the dimensions into multiple related tables, resulting in a more normalized structure.

K-means is a popular and widely used clustering algorithm in machine learning and data mining. It is an unsupervised learning technique used to group similar data points together into clusters based on their features or attributes. The objective of the K-means algorithm is to minimize the variance within each cluster and maximize the variance between different clusters.

Mining text databases, also known as text mining or text analytics, is the process of extracting valuable insights, patterns, and knowledge from large collections of unstructured text data. Unstructured text data includes documents, emails, social media posts, web pages, customer reviews, news articles, and other textual content. The main goal of mining text databases is to transform unstructured text into a structured format that can be analyzed using various natural language processing (NLP) and machine learning techniques.

MOLAP stands for "Multidimensional Online Analytical Processing." It is a type of data storage and analysis technology used in data warehousing and business intelligence (BI) systems. MOLAP provides a specialized and efficient way to store and process multidimensional data for fast and interactive analytical queries. In MOLAP, data is organized and stored in a multidimensional cube format, where each axis of the cube represents a different dimension or attribute of the data.

Regression is a statistical technique used in machine learning and statistics to model the relationship between a dependent variable and one or more independent variables. The main objective of regression analysis is to find the best-fitting mathematical model that describes how the independent variables influence the dependent variable.
What do you mean by virtual warehouse?
A virtual data warehouse (VDW) is a concept in the realm of data management and analytics that provides the functionalities of a traditional data warehouse without physically storing all the data in a centralized repository. Instead, a virtual warehouse allows users to access and analyze data from various sources in real time, without the need to replicate or move the data to a single location.

Difference between operational database system and data warehouse?
Operational Database
1. Operational systems are designed to support high-volume transaction processing.
2. Operational systems are usually concerned with current data.
3. Data within operational systems are mainly updated regularly according to need.
4. It is designed for real-time business dealings and processes.
5. It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table.
Data warehouse
1. Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
2. Data warehousing systems are usually concerned with historical data.
3. Non-volatile: new data may be added regularly, but once added it is rarely changed.
4. It is designed for analysis of business measures by subject area, categories, and attributes.
5. It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.

What do you mean by data cube?
A data cube, also known as a multidimensional cube or OLAP (Online Analytical Processing) cube, is a data structure used in data warehouses and business intelligence systems to facilitate multidimensional analysis of data. It represents data in a way that allows users to perform complex analytical queries and gain insights across multiple dimensions simultaneously. The concept of a data cube is derived from the idea of a "hypercube" in geometry, where each axis represents a different dimension.

What do you mean by data warehouse design and usages?
Data warehouse design refers to the process of creating and structuring a data warehouse to meet the specific analytical and reporting needs of an organization. It involves designing the architecture, data models, ETL (Extract, Transform, Load) processes, and other components required to build a comprehensive and efficient data warehousing solution.

Conceptual modelling of data warehouse?
Conceptual modeling of a data warehouse involves creating a high-level, abstract representation of the data warehouse's structure, content, and relationships between different components. It focuses on understanding and defining the business requirements and objectives of the data warehouse before delving into technical details. The main goal of conceptual modeling is to create a clear and comprehensive blueprint of the data warehouse that aligns with the organization's needs.

Concept hierarchies
Concept hierarchies are a fundamental component of data warehousing and multidimensional data modeling. They define the relationships and levels of granularity within dimensions, allowing data to be organized and analyzed at various levels of detail. Concept hierarchies provide a way to drill down or roll up through data to gain insights and perform analytical operations effectively. They are commonly used in OLAP (Online Analytical Processing) cubes and other multidimensional databases.
Categorization of Hierarchies:
a. Balanced vs. Unbalanced Hierarchies: In a balanced hierarchy, each level has an equal number of child members. For example, a balanced time hierarchy might have four quarters for each year. In contrast, an unbalanced hierarchy has a varying number of child members at each level. For instance, a product hierarchy may have different numbers of subcategories for different product categories.
b. Single-Level vs. Multi-Level Hierarchies: A single-level hierarchy has only one level of granularity. For instance, a simple list of countries would be a single-level hierarchy. In contrast, a multi-level hierarchy has multiple levels of granularity, as seen in time hierarchies with years, quarters, months, and days.
c. Ragged vs. Regular Hierarchies: In a regular hierarchy, all branches have the same depth, meaning all leaves are at the same level. A ragged hierarchy, however, has different depths for different branches, resulting in some leaves being at different levels.
d. Natural vs. Unnatural Hierarchies: Natural hierarchies are based on intrinsic relationships in the data, such as time hierarchies or geographical hierarchies. Unnatural hierarchies are derived based on specific requirements or business rules.
Data warehouse usages:
Business Intelligence: Data warehouses are the foundation for business intelligence (BI) tools and applications. They provide a consolidated and reliable data source for creating reports, dashboards, and visualizations, empowering business users with data-driven insights.
Data Analysis: Data warehouses facilitate complex data analysis and data mining tasks. Users can perform multidimensional analysis, drill-down, slice, dice, and other OLAP operations to gain deeper insights into business data.
Decision Making: With a data warehouse, organizations can make informed and data-driven decisions based on historical and real-time data. Decision-makers can identify trends, patterns, and correlations to guide strategic planning.
Trend and Pattern Identification: Data warehouses enable trend analysis and pattern identification by storing historical data. This information is crucial for identifying long-term trends and making predictions.
Data Consistency: Data warehouses provide a single source of truth, ensuring that all users across the organization access consistent and standardized data.
Performance Monitoring: Data warehouses enable organizations to monitor and analyze key performance indicators (KPIs) and track business performance over time.
Regulatory Compliance: Data warehouses assist in compliance with data regulations and auditing requirements by providing a centralized, controlled environment for data management.

Online Analytical Processing (OLAP):
OLAP is a technology that enables users to interactively analyze and explore data from multiple dimensions and hierarchies. It is based on the multidimensional data model, represented by data cubes, where data is organized into dimensions, measures, and hierarchies. OLAP operations such as slicing, dicing, drilling down, and rolling up allow users to navigate and analyze data efficiently. OLAP is primarily used for multidimensional reporting, data visualization, and ad-hoc analysis, providing insights into historical data and facilitating informed decision-making.

Data Mining:
Data mining is a process that involves discovering patterns, relationships, and insights from large datasets using various statistical, machine learning, and artificial intelligence techniques. Unlike OLAP, data mining is not focused on interactive exploration but on uncovering hidden patterns and knowledge that might not be readily apparent in the data. Data mining goes beyond simple querying and allows for more sophisticated analyses, including classification, clustering, and regression.

Data warehouse implementation
Data warehouse implementation is the process of designing, building, and populating a data warehouse to support business intelligence (BI) and analytical reporting needs. A data warehouse is a central repository that integrates data from various sources, transforming it into a structured and easily accessible format for analysis and decision-making.
The implementation process typically involves the following steps: Requirements Gathering; Data Source Identification; Data Extraction, Transformation, and Loading; Data Modeling; Data Storage; Data Quality Assurance; Metadata Management; Security and Access Control; Indexing and Performance Optimization; Testing and Validation; User Training and Adoption; and Deployment and Maintenance.

What is data mining? Process of knowledge discovery?
Data mining is the process of extracting valuable and meaningful patterns, insights, and knowledge from large volumes of data. It involves using various techniques and algorithms to analyze and discover hidden relationships, trends, and patterns within the data, which can then be used for decision-making and gaining a competitive advantage.
The process of knowledge discovery:
Data Selection: In this initial stage, relevant data is selected from various data sources such as databases, data warehouses, spreadsheets, or even raw data files.
Data Preprocessing: Raw data often contains noise, inconsistencies, missing values, and irrelevant information.
Data Exploration: Data exploration aims to gain a preliminary understanding of the data, its distributions, and the relationships between variables.
Feature Selection: In some cases, the dataset may contain a large number of features (attributes or variables).
Data Transformation: Data transformation involves converting the data into a suitable format for analysis.
Model Building: In this stage, various data mining algorithms are applied to the preprocessed and transformed data to build models that capture patterns and relationships within the data.
Model Evaluation: The models built in the previous step are evaluated to assess their accuracy and performance.
Knowledge Representation: Once a satisfactory model is obtained, the knowledge discovered during the data mining process is represented in a human-readable form.
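As a rough sketch of how the model-building and evaluation stages above might look in code, here is a minimal scikit-learn pipeline; the dataset (Iris) and the decision-tree model are arbitrary illustrative choices, not part of the original notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data selection: load a small, already-cleaned dataset.
X, y = load_iris(return_X_y=True)

# Data transformation + model building in one pipeline:
# scaling is the preprocessing step, the decision tree is the mining model.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# Hold out part of the data so the model can be evaluated on unseen records.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)

# Model evaluation: accuracy on the held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```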
Data mining query languages are specialized languages or extensions of existing languages that allow users to interact with data mining systems and perform various data mining operations. These languages are designed to support the extraction of knowledge and patterns from large datasets using data mining techniques. Data mining query languages provide a higher level of abstraction than traditional database query languages (e.g., SQL). They are specifically tailored for data mining tasks and typically offer functionalities for data preprocessing, model building, model evaluation, and result interpretation.

Text mining, also known as text analytics, is the process of extracting valuable information, patterns, and knowledge from unstructured text data. Unstructured text data includes documents, emails, social media posts, web pages, customer reviews, news articles, and any other textual content that does not have a predefined data model. Text mining involves using various natural language processing (NLP) and machine learning techniques to transform unstructured text into a structured format that can be analyzed and interpreted by computers.

What do you mean by data mining task?
Data mining refers to the process of discovering patterns, relationships, or useful information from large sets of data. A data mining task refers to a specific objective or goal that involves applying various data mining techniques and algorithms to extract valuable insights or knowledge from the data.

What do you mean by data mining trends?
Data mining trends refer to the emerging patterns, technologies, and methodologies that gain traction within the data mining and machine learning community. These trends reflect the current advancements and developments in the field and often shape the direction of research, industry applications, and best practices.

What do you mean by data mining issues?
Data mining issues refer to the challenges and obstacles that practitioners and researchers encounter when applying data mining techniques to extract valuable insights from large datasets. These issues can arise due to the complexity of the data, the limitations of data mining algorithms, ethical concerns, and other practical considerations.

What do you mean by the Apriori algorithm?
The Apriori algorithm is a classic and fundamental algorithm used in association rule mining, a data mining task that aims to discover interesting relationships (or associations) between items in a transactional database. These associations are commonly expressed in the form of "if-then" rules, such as "If a customer buys item A, then they are likely to buy item B." The Apriori algorithm is designed to efficiently find frequent itemsets in the dataset, where an itemset is a collection of one or more items that often co-occur together. Frequent itemsets are crucial for generating meaningful association rules.

What do you mean by the FP-Growth algorithm?
The FP-Growth (Frequent Pattern Growth) algorithm is another popular and efficient data mining algorithm used for frequent itemset mining and association rule generation, just like the Apriori algorithm. However, the FP-Growth algorithm uses a different approach that makes it more scalable and faster, especially for large datasets. The FP-Growth algorithm uses a data structure called the FP-tree to store the frequent itemsets compactly. The key idea behind FP-Growth is to compress the original dataset into an FP-tree and then recursively mine frequent itemsets from this tree without the need for generating candidate itemsets as done in the Apriori algorithm.

Application of association rule learning
Association rule learning is an important machine learning technique, and it is employed in market basket analysis, web usage mining, and manufacturing and supply chain optimization. Market Basket Analysis: One of the most well-known applications of association rule learning is in market basket analysis. Retailers and supermarkets use association rules to identify products that are often purchased together. Web Usage Mining: Web usage data can be analyzed using association rule learning to understand the browsing behavior of website visitors. This information aids in website optimization, content recommendation, and personalization. Manufacturing and Supply Chain Optimization: Association rules can be applied to analyze manufacturing processes and supply chain data to identify relationships between production variables, optimize processes, and reduce costs.
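To make the Apriori idea concrete, the sketch below counts itemset support over a handful of invented transactions and keeps only itemsets that meet a minimum support threshold; it is a simplified level-wise illustration, not a full Apriori implementation:

```python
from itertools import combinations

# Toy transactional database (invented for illustration).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.6  # an itemset must appear in at least 60% of transactions

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise search in the spirit of Apriori: 1-itemsets, then 2-itemsets, ...
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c): support(set(c))
             for c in combinations(items, k)
             if support(set(c)) >= min_support}
    if not level:
        break  # Apriori property: supersets of infrequent itemsets cannot be frequent
    frequent.update(level)

for itemset, sup in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(set(itemset), round(sup, 2))
```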
What do you mean by K-means clustering?
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning data into K clusters based on their similarities. The goal of K-means is to group data points with similar features into clusters, where each cluster is represented by its centroid (the mean of the data points in the cluster). It is an iterative algorithm that aims to minimize the variance within each cluster and maximize the separation between clusters.

What do you mean by K-medoids clustering?
K-medoids clustering is a variant of the K-means clustering algorithm that addresses some of the limitations of K-means, especially when dealing with non-Euclidean distance metrics and noisy data. In K-medoids clustering, instead of using the mean (average) of data points as the cluster centroid, it uses an actual data point from the dataset as the medoid, which is the most centrally located point within a cluster.
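As a minimal illustration of the K-means idea described above, the following sketch clusters a handful of invented 2-D points with scikit-learn; the points and the choice of K=2 are assumptions made purely for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points that form two visually obvious groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Fit K-means with K=2; n_init controls how many random initialisations are tried,
# which matters because the algorithm only converges to a local optimum.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("labels:   ", km.labels_)           # cluster assignment of each point
print("centroids:", km.cluster_centers_)  # mean of the points in each cluster
```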
What do you mean by K-medoids clustering Engineering: Selecting the right set of features and engineering new Classification: The classification process proceeds in multiple levels, where classes
K-medoids clustering is a variant of the K-means clustering algorithm that addresses informative features is crucial for effective classification. Curse of at higher levels represent broader categories, and classes at lower levels represent
some of the limitations of K-means, especially when dealing with non-Euclidean Dimensionality: High-dimensional feature spaces can lead to the curse more specific subcategories. One-vs-All (OvA) or One-vs-Rest (OvR) Classification:
distance metrics and noisy data. In K-medoids clustering, instead of using the mean of dimensionality, where the dataset becomes sparse and increases In OvA classification, each class is treated as a positive class, and all other classes
(average) of data points as the cluster centroid, it uses an actual data point from the the risk of overfitting. Noise and Outliers: Noisy data and outliers can are combined into a single negative class. One-vs-One (OvO) Classification:
dataset as the medoid, which is the most centrally located point within a cluster. negatively affect the performance of classifiers. Computational During prediction, each binary classifier votes for a class, and the class with the
Outlier detection and analysis Complexity: Some classification algorithms can be computationally most votes is chosen as the final prediction. Error-Correcting Output Codes
Outlier detection, also known as anomaly detection, is the process of identifying and expensive, especially for large datasets or complex models. (ECOC): ECOC is a coding scheme that represents each multiclass classification
analyzing data points that deviate significantly from the majority of the data. These Interpretablity vs. Performance: Complex models like deep neural problem as a binary classification problem.
unusual data points are called outliers and can arise due to various reasons such as networks may offer high performance but can be challenging to Bayesian Classification also know as naïve bayes
Issues regarding classification in supervised learning
Imbalanced Data: Imbalanced class distributions occur when one class significantly outnumbers the other(s) in the training data.
Overfitting: Overfitting occurs when a model learns to perform well on the training data but fails to generalize to new, unseen data.
Underfitting: Underfitting happens when a model is too simplistic to capture the underlying patterns in the data.
Feature Selection and Engineering: Selecting the right set of features and engineering new informative features is crucial for effective classification.
Curse of Dimensionality: High-dimensional feature spaces can lead to the curse of dimensionality, where the dataset becomes sparse and increases the risk of overfitting.
Noise and Outliers: Noisy data and outliers can negatively affect the performance of classifiers.
Computational Complexity: Some classification algorithms can be computationally expensive, especially for large datasets or complex models.
Interpretability vs. Performance: Complex models like deep neural networks may offer high performance but can be challenging to interpret.
Data Bias and Fairness: Biases present in the training data can lead to biased predictions and unfair decision-making.
Concept Drift: In dynamic environments, the underlying data distribution can change over time, causing concept drift.

Types of binary classification
Balanced Binary Classification: In balanced binary classification, both classes have a relatively equal number of instances in the training dataset.
Imbalanced Binary Classification: Imbalanced binary classification occurs when one class significantly outnumbers the other in the training dataset.
Positive vs. Negative Classification: In positive vs. negative classification, one class is considered the positive class, while the other is the negative class.
Anomaly Detection: Anomaly detection is a type of binary classification where the focus is on identifying rare and abnormal instances (anomalies) that deviate significantly from the normal or majority class.
One-Class Classification: One-class classification is a special type of binary classification where the goal is to detect instances belonging to a specific class (the target class) in the absence of examples from the other class.
Sensitivity vs. Specificity Trade-off: In some binary classification problems, there is a trade-off between sensitivity (true positive rate) and specificity (true negative rate).
Types of multiclass classification
Single-Label Multiclass Classification: The task is to predict the correct class label among several possible classes for each data point.
Multilabel Classification: Multilabel classification occurs when each data instance can belong to multiple classes simultaneously.
Ordinal Classification: Ordinal classification deals with ordered or ranked classes. The classes have a specific order, and the task is to predict the correct rank or order for each data instance.
Hierarchical Classification: The classification process proceeds in multiple levels, where classes at higher levels represent broader categories, and classes at lower levels represent more specific subcategories.
One-vs-All (OvA) or One-vs-Rest (OvR) Classification: In OvA classification, each class is treated in turn as the positive class, and all other classes are combined into a single negative class.
One-vs-One (OvO) Classification: A binary classifier is trained for each pair of classes; during prediction, each binary classifier votes for a class, and the class with the most votes is chosen as the final prediction.
Error-Correcting Output Codes (ECOC): ECOC is a coding scheme that represents a multiclass classification problem as a set of binary classification problems.

Bayesian Classification, also known as Naïve Bayes
Bayesian Classification, also known as Naïve Bayes, is a simple probabilistic classifier based on Bayes' theorem. It is widely used for classification tasks in machine learning and natural language processing due to its efficiency and effectiveness, especially with high-dimensional feature spaces. The classifier is called "naïve" because it makes a strong assumption that the features are conditionally independent given the class label. This assumption simplifies the calculations and allows the classifier to handle a large number of features efficiently. Though this assumption is rarely satisfied in real-world datasets, Naïve Bayes still often performs surprisingly well in practice.
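A minimal Naïve Bayes sketch for a text-classification task using scikit-learn; the example messages, labels, and the choice of a multinomial model over word counts are illustrative assumptions, not part of the original notes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: spam vs. ham messages.
docs = ["win a free prize now", "cheap meds free offer",
        "meeting at noon tomorrow", "project report attached"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features + multinomial Naïve Bayes, which treats the word
# counts as conditionally independent given the class label.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["free prize offer", "see you at the meeting"]))
```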
Rule-based classifier
A rule-based classifier is a type of classification model that makes predictions based on a set of predefined rules. These rules are derived from the patterns and relationships found in the training data, and they represent decision boundaries that separate different classes. Each rule typically consists of a combination of conditions on the input features (attributes) and an associated class label.

Hierarchical clustering is a popular technique used in unsupervised machine learning to group data points into a hierarchy of clusters. It is a bottom-up or agglomerative approach, meaning that it starts with each data point as its own cluster and then iteratively merges clusters based on their similarity, forming a tree-like structure known as a dendrogram.
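A minimal sketch of agglomerative (bottom-up) clustering with SciPy; the sample points and the Ward linkage choice are assumptions made only for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points; hierarchical clustering starts with each point as its own cluster.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9]])

# Bottom-up (agglomerative) merging using Ward linkage; Z encodes the merge tree
# that a dendrogram would draw.
Z = linkage(X, method="ward")

# Cut the tree so that two flat clusters remain.
print(fcluster(Z, t=2, criterion="maxclust"))
```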
Graph-based clustering
Graph-based clustering is a type of unsupervised learning technique that uses the concept of graph theory to partition data points into clusters based on their similarity or proximity. In this approach, data points are represented as nodes in a graph, and edges are used to represent the relationships or connections between the nodes. The goal is to identify densely connected subgraphs, which correspond to clusters in the data.

Difference between KDD and data mining
KDD
1. KDD is a multi-step process that encourages the conversion of data to useful information.
2. It aims to discover useful and valuable knowledge from large datasets.
3. The KDD process involves multiple components, including data cleaning, data integration, and data transformation.
4. KDD encompasses a broader range of methods and techniques beyond data mining, such as data warehousing.
Data mining
1. Data mining is one of the steps of Knowledge Discovery in Databases (KDD).
2. It focuses on using algorithms and techniques to extract patterns, relationships, or knowledge from the data, such as association rules, clusters, classifications, or regression models.
3. Data mining is a subset of the KDD process and specifically refers to the application of data mining algorithms to extract patterns or models from the data.
4. Data mining focuses solely on the application of algorithms for pattern discovery, including clustering.

What is the purpose of cluster analysis in data mining?
The purpose of cluster analysis in data mining is to identify groups, or clusters, of similar objects within a dataset. Cluster analysis is an unsupervised learning technique that helps uncover patterns, similarities, and structures in the data without any predefined class labels or target variables. It aims to discover inherent relationships or groupings among data points based on their similarities or dissimilarities.
Difference between data mart and data cube
Data mart
1. It contains a focused collection of data that is relevant to the particular requirements of the intended users.
2. Each data mart caters to the data needs of the users in that particular area.
3. Data marts are designed to support specific reporting and analytical needs of the business unit they serve.
4. It focuses on providing ready-to-use data for analytical tasks.
Data cube
1. It allows users to perform complex analysis and aggregate data along multiple dimensions to gain insights and answer analytical queries efficiently.
2. Data cubes provide a comprehensive view of the data and enable users to analyze it from different perspectives.
3. Data cubes use a multidimensional model to organize data in a more efficient way for OLAP operations.
4. They provide a more interactive and exploratory approach to data analysis.

How are different schemas used to model a data warehouse?
In data warehousing, different schema types are used to model and organize the data to meet specific analytical requirements and optimize query performance.
Star Schema: The star schema is one of the most commonly used schema types in data warehousing. It features a central fact table connected to multiple dimension tables, forming a star-like structure.
Snowflake Schema: The snowflake schema is an extension of the star schema, where dimension tables are normalized, leading to reduced data redundancy.
Galaxy Schema: The galaxy schema, also known as the constellation schema, is a combination of multiple star schemas that share dimension tables.
Fact Constellation Schema: The fact constellation schema is used when multiple fact tables share the same dimension tables, but the dimension tables themselves are not directly related.
Hybrid Schema: The hybrid schema is a combination of two or more schema types, created to address specific analytical requirements.
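To illustrate the star schema idea, here is a minimal sketch that joins an invented fact table to two invented dimension tables with pandas and then aggregates a measure; the table and column names are assumptions for the example:

```python
import pandas as pd

# Dimension tables: descriptive attributes keyed by surrogate ids.
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["Pen", "Book"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["East", "West"]})

# Central fact table: foreign keys to each dimension plus a numeric measure.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1, 2],
    "store_id":   [10, 10, 20, 20],
    "amount":     [50, 200, 70, 300],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["city", "product"])["amount"].sum())
print(report)
```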
Advantages of the K-means algorithm
1. Relatively efficient and easy to implement; terminates at a local optimum.
2. Applies even to large data sets.
3. The clusters are non-hierarchical and they do not overlap.
Disadvantages of the K-means algorithm
1. Sensitive to initialization.
2. The selection of the initial centroids is random.
3. Limited to a fixed, pre-specified number of clusters (K).
Advantages of the K-medoids algorithm
1. K-medoids is less sensitive to outliers compared to K-means.
2. The cluster centers in K-medoids are real data points from the input dataset.
3. It can be run for different K values, and the optimal K can be chosen based on various evaluation metrics.
Disadvantages of the K-medoids algorithm
1. K-medoids can be computationally more expensive than K-means.
2. The performance of K-medoids can be sensitive to the initial selection of medoids; as a result, the algorithm might not find the globally optimal clustering solution.
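The trade-offs listed above can be seen in a small hand-rolled sketch of the medoid idea; this is a simplified alternating scheme (not the full PAM algorithm), and the data points are invented for illustration:

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # The medoid is the member with the smallest total distance
                # to the other members -- always an actual data point.
                new_medoids[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(dist[:, medoids], axis=1)

# Two tight groups plus one extreme outlier, which pulls a mean far more than a medoid.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [50.0, 50.0]])
medoids, labels = k_medoids(X, k=2)
print("medoid points:\n", X[medoids])
print("labels:", labels)
```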
Describe the different components of a data warehouse
There are mainly 5 components of data warehouse architecture:
1) Database: an organized collection of structured information, or data, typically stored electronically in a computer system.
2) ETL Tools: enable data integration strategies by allowing companies to gather data from multiple data sources and consolidate it into a single, centralized location.
3) Metadata: the data providing information about one or more aspects of the data; it is used to summarize basic information about data that can make tracking and working with specific data easier.
4) Query Tools: a feature-rich environment that allows users to execute arbitrary SQL commands against the warehouse and review the result sets.
5) Data Marts: a simple form of a data warehouse that is focused on a single subject or line of business, such as sales, finance, or marketing. Given their focus, data marts draw data from fewer sources than data warehouses.
Describe the types of data used in data mining
Structured Data: Structured data refers to data that is organized and stored in a fixed format, typically in tabular form, where each attribute or field has a predefined data type.
Unstructured Data: Unstructured data lacks a predefined data model or structure. It does not fit into traditional rows and columns of a relational database.
Semi-Structured Data: Semi-structured data lies between structured and unstructured data. It has some organizational structure but is not fully defined in a tabular format like structured data.
Time-Series Data: Time-series data consists of data points collected over successive points in time, with each observation associated with a specific timestamp.
Spatial Data: Spatial data includes geographical or location-based information. This type of data is relevant in fields like geographic information systems.
Transactional Data: Transactional data records events or transactions, such as sales transactions, website interactions, customer purchases, and more.
Graph Data: Graph data represents entities and their relationships as nodes and edges in a graph structure.
Sensor Data: Sensor data is widely used in Internet of Things (IoT) applications and monitoring systems, where data mining helps identify anomalies, patterns, and trends in sensor readings.

Difference between data warehouse and operational database
Data warehouse
1. A central location which stores consolidated data from multiple databases.
2. Contains summarized data.
3. Uses Online Analytical Processing (OLAP).
4. Helps to analyze the business.
5. Tables and joins are simple because they are denormalized.
Operational database
1. An organized collection of related data which stores data in a tabular format.
2. Contains detailed data.
3. Uses Online Transactional Processing (OLTP).
4. Helps to perform the fundamental operations of a business.
5. Tables and joins are complex because they are normalized.
How classification plays a significant role in data mining
Classification plays a significant role in data mining as it is one of the fundamental techniques used to organize and categorize data into predefined classes or categories based on patterns and characteristics. It enables predictive modeling and decision-making by assigning class labels to new, unseen data based on its similarity to previously labeled data. Here are some key ways classification is significant in data mining: predictive modelling, pattern recognition, customer segmentation, fraud detection, medical diagnosis, sentiment analysis, image and speech recognition, recommendation systems, risk assessment, and quality control.

How the multidimensional data model helps in retrieving information
A multidimensional data model plays a crucial role in retrieving information efficiently and effectively. It is specifically designed to organize and represent data in a format optimized for analytical processing and querying. The multidimensional data model is commonly used in data warehousing and business intelligence systems. Here's how it helps in retrieving information: Simplified Data Representation, Fast Query Performance, Support for OLAP Operations, Flexible Data Slicing and Dicing, Hierarchical Data Navigation, Easy Report Generation, Enhanced Data Analysis, and Support for Complex Business Metrics.

Explain the use of the frequent itemset generation process
The frequent itemset generation process is a fundamental step in association rule mining, which is a data mining technique used to discover interesting relationships and patterns within large transactional databases. Frequent itemset generation aims to identify sets of items that frequently co-occur together in transactions. These sets are known as frequent itemsets, and they are crucial for subsequent steps in association rule mining. The process involves finding all the itemsets that meet a predefined minimum support threshold. The support of an itemset is defined as the proportion of transactions in the database that contain that itemset. The minimum support threshold is set by the data analyst or data miner and typically represents the minimum frequency required for an itemset to be considered "frequent." Higher support thresholds lead to more selective and relevant frequent itemsets, while lower thresholds yield a larger number of frequent itemsets, some of which may be less significant.
Write down the two measures of association rules
In association rule mining, there are two essential measures used to evaluate the significance and interestingness of association rules:
1. Support: Support measures the frequency or prevalence of an itemset in the dataset, indicating how often the itemset appears in the transactions. It is calculated as the ratio of the number of transactions containing the itemset to the total number of transactions in the database. A high support value suggests that the itemset is widely occurring in the dataset.
2. Confidence: Confidence measures the strength of the association between the antecedent (X) and consequent (Y) of an association rule. It represents the likelihood of finding the consequent in a transaction given that the antecedent is present. A high confidence value indicates a strong correlation between the items in the antecedent and the consequent.
What are the three types of regression?
Linear Regression: Linear regression is one of the most commonly used regression techniques. It models the relationship between the dependent variable and one or more independent variables as a linear equation.
Logistic Regression: Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or category, given the values of the independent variables.
Polynomial Regression: Polynomial regression is an extension of linear regression that allows for fitting data with a curvilinear relationship between the dependent and independent variables. Instead of using a straight line, polynomial regression fits a higher-degree polynomial equation to the data.
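A brief sketch of the three regression types using scikit-learn; the synthetic data and model settings are assumptions chosen only to keep the example small:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.arange(10).reshape(-1, 1)

# Linear regression: fit a straight line y = a*x + b to a continuous target.
y_lin = 3 * x.ravel() + 2
print(LinearRegression().fit(x, y_lin).coef_)

# Logistic regression: binary target, the model outputs class probabilities.
y_bin = (x.ravel() > 4).astype(int)
print(LogisticRegression().fit(x, y_bin).predict([[3], [8]]))

# Polynomial regression: expand x into polynomial features, then fit linearly.
y_poly = x.ravel() ** 2
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
print(poly.fit(x, y_poly).predict([[11]]))
```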
List down the functionality of metadata
1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of the data warehouse and end-user views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate the data warehouse.