Viva Preparation Notes

Uploaded by

Mansi Khanvilkar

1. Data Warehouse and OLAP (9 hours)

Data Warehousing:

A data warehouse is a centralized repository for storing large amounts of structured and historical data. It supports decision-making by providing insights into trends and patterns.

Example: A retail company may store years of sales data to analyze long-term trends.

Dimensional Modeling:

Organizes data into facts (numerical data such as sales, revenue) and dimensions (contextual information like time, geography, product).

Example: In a sales scenario, the fact table could store sales amounts, while dimension tables would store information about time, product, and customer.

OLAP (Online Analytical Processing):

Allows multidimensional data analysis. It enables operations such as:

- Roll-up: Aggregating data (e.g., monthly sales to yearly sales).

- Drill-down: Breaking data down (e.g., yearly sales to monthly sales).

- Slicing and Dicing: Filtering the data. A slice fixes one dimension to a single value (e.g., sales in 2022); a dice selects values on two or more dimensions (e.g., sales in 2022 for region X).

Example: An OLAP cube for a company could have dimensions like Time, Region, and Product to analyze sales figures from multiple angles.
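
The OLAP operations above can be illustrated on a tiny in-memory fact table. This is only a sketch: the rows, the tuple layout, and the helper name aggregate are made up for illustration, and a real OLAP engine would work on a precomputed cube rather than plain lists.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, product, sales_amount)
sales = [
    (2022, 1, "X", "laptop", 100.0),
    (2022, 1, "Y", "laptop", 80.0),
    (2022, 2, "X", "mouse", 20.0),
    (2023, 1, "X", "laptop", 120.0),
]

def aggregate(rows, key):
    """Sum sales_amount for each group produced by key(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

# Drill-down level: sales per (year, month)
monthly = aggregate(sales, key=lambda r: (r[0], r[1]))

# Roll-up: aggregate the months into years
yearly = aggregate(sales, key=lambda r: r[0])

# Slice: fix one dimension (year = 2022)
slice_2022 = [r for r in sales if r[0] == 2022]

# Dice: restrict two dimensions (year = 2022 and region = "X")
dice_2022_x = [r for r in slice_2022 if r[2] == "X"]

print(yearly)
print(monthly)
```

Going from monthly to yearly is a roll-up, and reading the same data back at the monthly level is the corresponding drill-down.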

OLTP vs. OLAP:

OLTP (Online Transaction Processing) handles day-to-day operations like transactions (e.g., bank withdrawals), while OLAP deals with complex queries for data analysis.

Data Warehouse Schemas:

- Star Schema: A simple schema where a central fact table is connected to dimension tables.

Example: A sales database with a central sales fact table and dimension tables for Time, Product, and Customer.

- Snowflake Schema: An extension of the star schema where dimension tables are normalized into multiple related tables.

Example: A more detailed product dimension table may be split into Product and Supplier.

- Fact Constellation Schema: Multiple fact tables share dimension tables, suitable for more complex data models.

Example: One fact table for Sales and another for Returns, both sharing the same Product and Customer dimensions.
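
A star schema can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the table names (fact_sales, dim_product, dim_customer), the column names, and the sample rows are assumptions, and the Time dimension is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table that references two dimension tables (hypothetical names).
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'laptop'), (2, 'mouse');
INSERT INTO dim_customer VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO fact_sales VALUES (1, 1, 900.0), (2, 1, 25.0), (1, 2, 950.0);
""")

# Typical star-schema query: join the fact table to a dimension and aggregate.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()

print(rows)
```

A snowflake variant would move, say, supplier details out of dim_product into their own table, and a fact constellation would add a second fact table (for example returns) that reuses the same dimension tables.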

ETL Process (Extract, Transform, Load):

Extract: Pulling data from different sources.

Transform: Cleaning, filtering, and aggregating data.

Load: Loading data into the warehouse.

Example: Extracting sales data from different branches, transforming it by cleaning duplicates, and loading it into a central warehouse.
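
The three ETL steps can be sketched in a few lines. Everything specific here is an assumption made for illustration: the two CSV strings stand in for branch exports, the column names order_id and amount are invented, and an in-memory SQLite database plays the role of the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical exports from two branches; order 2 appears in both.
branch_a = "order_id,amount\n1,10.5\n2,20.0\n"
branch_b = "order_id,amount\n2,20.0\n3,7.25\n"

def extract(csv_text):
    """Extract: read raw rows from one source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: convert types and drop duplicate order ids."""
    seen = set()
    clean = []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        clean.append((order_id, float(row["amount"])))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(branch_a) + extract(branch_b)))
total = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)
```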

2. Introduction to Data Mining, Data Exploration, and Data Preprocessing (8 hours)

Data Mining:

The process of discovering patterns, trends, and useful information from large datasets. It involves techniques like clustering, classification, and association.

Example: Mining customer transaction data to discover which products are frequently bought together (Market Basket Analysis).

Data Preprocessing:

- Data Cleaning: Handling missing values, removing duplicates, and correcting errors.

Example: Filling missing customer age data with the average age in the dataset.

- Data Integration: Combining data from different sources into a unified view.

- Data Reduction: Reducing data volume through techniques like attribute subset selection (choosing relevant attributes) and sampling.

- Data Transformation: Converting data into a suitable format for mining (e.g., normalizing data).

Example: Normalizing sales data so that all values fall between 0 and 1 for uniformity.
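
Two of the preprocessing steps above, filling missing values with the mean and scaling values into the range 0 to 1, can be sketched as follows. The sample values and the helper name min_max are made up for illustration; min-max scaling is one common way to normalize, not the only one.

```python
# Data cleaning: fill a missing age with the mean of the known ages.
ages = [20, None, 30, 40]
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)
filled = [mean_age if a is None else a for a in ages]

def min_max(values):
    """Data transformation: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

sales = [200.0, 500.0, 800.0]
normalized = min_max(sales)

print(filled)
print(normalized)
```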

Data Visualization:

Used to explore data distributions and patterns using graphs like histograms, bar charts, scatter plots, etc.

Example: A scatter plot showing the relationship between customer age and spending.

3. Classification (6 hours)

Basic Concepts:

Classification is a predictive modeling task where a model assigns a category label (class) to an observation based on its attributes.

Example: Classifying whether an email is "spam" or "not spam" based on its content.

Decision Tree:

A popular classification algorithm that uses a tree-like model of decisions. Each internal node represents a test on an attribute, each branch represents the outcome of a test, and each leaf node represents a class label.

Example: A decision tree to classify if a customer will buy a product based on age, income, and previous purchases.

Bayesian Classification:

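Uses Bayes' theorem to calculate the probability that an observation belongs to a certain class, given its attributes. The Naive Bayes classifier additionally assumes that the attributes are conditionally independent given the class.

Example: A Naive Bayes classifier could predict if a person will enjoy a movie based on previous preferences.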

Regression Analysis:

Predicts continuous outcomes rather than categorical. Types include:

- Simple Linear Regression: Predicts an outcome based on one predictor variable.

- Multiple Linear Regression: Predicts an outcome based on multiple predictor variables.

Example: Predicting house prices based on factors like area, number of rooms, and locality.
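
Simple linear regression can be fitted with the ordinary least-squares formulas for slope and intercept. The sketch below uses made-up house areas and prices (constructed so that price = 2 * area + 10), and the function name fit_line is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: area in square metres, price in arbitrary units.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [110.0, 170.0, 210.0, 250.0]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90.0 + intercept  # predicted price for an area of 90

print(slope, intercept, predicted)
```

Multiple linear regression follows the same idea with several predictors (area, rooms, locality), but needs a matrix solution instead of these two closed-form expressions.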

Evaluation Metrics:

- Accuracy: Percentage of correctly classified instances.

- Precision: True positives / (True positives + False positives).

- Recall: True positives / (True positives + False negatives).

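- F1-Score: Harmonic mean of precision and recall, 2 * Precision * Recall / (Precision + Recall). Useful when classes are imbalanced, because accuracy alone can be misleading in that case.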
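
The four metrics can be computed directly from the true and predicted labels. This is a minimal sketch for a binary problem; the label lists and the function name classification_metrics are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so accuracy is 6/8 while precision, recall, and F1 are all 2/3.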

4. Clustering (4 hours)

Clustering:

Unsupervised learning technique to group similar data points into clusters based on features.

Example: Grouping customers into clusters based on their buying behavior.

K-Means Clustering:

A partitioning method where the data is divided into K clusters. The algorithm aims to minimize the sum of squared distances between points and their respective cluster centroids, by repeatedly assigning each point to its nearest centroid and then recomputing each centroid as the mean of its points.

Example: Segmenting a customer base into 5 clusters based on their purchasing frequency and total spend.
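
The assign-then-update loop of K-Means can be sketched in plain Python. This is a simplified illustration, not a production implementation: the customer points are made up, and the first k points are used as initial centroids so the result is deterministic (real implementations usually pick initial centroids randomly or with a smarter scheme).

```python
def kmeans(points, k, iterations=10):
    """Very small K-Means: returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep the old centroid for an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (purchases per month, total spend)
customers = [(1, 10), (10, 100), (2, 12), (11, 110), (1, 11), (12, 105)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
```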

Hierarchical Clustering:

Builds a tree of clusters (dendrogram) either by agglomerating small clusters into larger ones (agglomerative) or dividing large clusters into smaller ones (divisive).

Density-Based Clustering (DBSCAN):

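Forms clusters from regions where points are densely packed. It can find clusters of arbitrary shape, does not need the number of clusters in advance, and labels points in sparse regions as outliers (noise).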

Example: Identifying regions in geographic data where accidents frequently occur.

5. Frequent Pattern Mining (8 hours)

Frequent Pattern Mining:

The task of discovering itemsets that frequently occur together in a dataset.

Example: Finding that customers who buy bread also frequently buy butter (Market Basket Analysis).

Apriori Algorithm:

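A popular algorithm for discovering frequent itemsets. It uses a bottom-up approach where frequent itemsets are extended one item at a time (candidate generation), and candidates are pruned if they fall below the minimum support. Pruning relies on the Apriori property: every subset of a frequent itemset must also be frequent.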

Example: Identifying frequent itemsets like {milk, bread, butter} in supermarket data.
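
A compact version of the level-by-level Apriori search might look like the sketch below. The transactions and the minimum support of 0.4 are invented for illustration, and the code favors readability over speed (it rescans all transactions for every candidate).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise candidate generation."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical supermarket baskets.
transactions = [
    frozenset({"milk", "bread", "butter"}),
    frozenset({"milk", "bread"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"milk"}),
]
frequent = apriori(transactions, min_support=0.4)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```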

Association Rules:

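Rules that express relationships between itemsets, like "If a customer buys A, they are likely to buy B." The strength of a rule is usually reported as support (how often A and B occur together) and confidence (how often B occurs in transactions that contain A).

Example: "If a customer buys a laptop, they are 70% likely to buy a mouse." Here 70% is the confidence of the rule.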

Lift:

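A measure of how much more likely two items are to be bought together compared to if they were independent: Lift(A -> B) = P(A and B) / (P(A) * P(B)). A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.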
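
Support, confidence, and lift for a single rule can be computed by counting transactions. The baskets below are made up, the function name rule_metrics is hypothetical, and the sketch assumes the antecedent and the consequent each occur in at least one basket.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_b = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = count_both / n              # P(A and B)
    confidence = count_both / count_a     # P(B | A); assumes count_a > 0
    lift = confidence / (count_b / n)     # P(B | A) / P(B); assumes count_b > 0
    return support, confidence, lift

# Hypothetical baskets: 10 transactions in total.
baskets = (
    [{"laptop", "mouse"}] * 3
    + [{"laptop"}]
    + [{"mouse"}] * 2
    + [{"bread"}] * 4
)

support, confidence, lift = rule_metrics(baskets, {"laptop"}, {"mouse"})
print(support, confidence, lift)
```

With these baskets the rule laptop -> mouse has support 0.3, confidence 0.75, and lift 1.5, so buying a laptop makes a mouse purchase 1.5 times as likely as it is overall.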

6. Web Mining (4 hours)

Web Mining:

The process of extracting useful information from the web, including web content, structure, and usage.

Web Content Mining:

Focuses on extracting useful information from the content of web pages like text, images, and videos.

Example: Analyzing the text of reviews on an e-commerce site to extract customer sentiment.

Web Structure Mining:

Analyzes the structure of hyperlinks between web pages to discover the relationship between them.

Example: Using PageRank to rank web pages based on the number and quality of links pointing to them.
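
The idea behind PageRank can be sketched with a simple power iteration over a link graph. This is a simplified illustration, not the algorithm as deployed by any search engine: the four-page graph is invented, the damping factor of 0.85 is the commonly quoted default, and the code assumes every link target also appears as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for each page in a {page: [outgoing links]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share, independent of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest rank because three pages link to it, while D, which nothing links to, keeps only the base share.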

Web Usage Mining:

Involves analyzing web server logs to understand user behavior, such as which pages they visit or how long they stay on a website.

Example: Discovering that users who visit the homepage of an e-commerce website are more likely to navigate to the sale section.
