Viva Preparation Notes
Data Warehousing:
A data warehouse is a centralized repository for storing large amounts of structured and historical
data.
Example: A retail company may store years of sales data to analyze long-term trends.
Dimensional Modeling:
Organizes data into facts (numerical data such as sales, revenue) and dimensions (contextual
data that describes the facts).
Example: In a sales scenario, the fact table could store sales amounts, while dimension tables
describe the context of each sale.
- Slicing and Dicing: Filtering specific aspects of data (e.g., sales in 2022 for region X).
Example: An OLAP cube for a company could have dimensions like Time, Region, and Product to
OLTP (Online Transaction Processing) handles day-to-day operations like transactions (e.g., bank
withdrawals), while OLAP deals with complex queries for data analysis.
Data Warehouse Schemas:
- Star Schema: A simple schema where a central fact table is connected to dimension tables.
Example: A sales database with a central sales fact table and dimension tables for Time, Product,
and Customer.
- Snowflake Schema: An extension of the star schema where dimension tables are normalized into
multiple related tables.
Example: A more detailed product dimension table may be split into Product and Supplier.
- Fact Constellation Schema: Multiple fact tables share dimension tables, suitable for more complex
data models.
Example: One fact table for Sales and another for Returns, both sharing the same Product and
Customer dimensions.
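A star schema can be sketched with Python's built-in sqlite3 module. This is a minimal
illustration, not a production design: the table names (fact_sales, dim_product, dim_customer),
the columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One central fact table referencing two dimension tables (hypothetical names and data).
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Mouse');
INSERT INTO dim_customer VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales VALUES (1, 1, 1000.0), (2, 1, 25.0), (2, 2, 30.0);
""")

# A typical analytical query: join the fact table to a dimension and aggregate.
rows = cur.execute("""
SELECT p.name, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY p.name
ORDER BY p.name
""").fetchall()

sales_by_product = dict(rows)
print(sales_by_product)  # {'Laptop': 1000.0, 'Mouse': 55.0}
```

A snowflake variant would split dim_product further (for example into product and supplier
tables), at the cost of one more join per query.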
Example: Extracting sales data from different branches, transforming it by cleaning duplicates, and
Data Mining:
The process of discovering patterns, trends, and useful information from large datasets. It involves
Example: Mining customer transaction data to discover which products are frequently bought
together.
- Data Cleaning: Handling missing values, removing duplicates, and correcting errors.
Example: Filling missing customer age data with the average age in the dataset.
- Data Integration: Combining data from different sources into a unified view.
- Data Reduction: Reducing data volume through techniques like attribute subset selection.
- Data Transformation: Converting data into a suitable format for mining (e.g., normalizing data).
Example: Normalizing sales data so that all values fall between 0 and 1 for uniformity.
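The cleaning and transformation examples above can be sketched in a few lines of Python. The
function names and sample values are hypothetical, and missing values are assumed to be
represented as None.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, None, 40]
filled_ages = fill_missing_with_mean(ages)
print(filled_ages)  # [20, 30.0, 40]

sales = [100, 150, 200]
normalized_sales = min_max_normalize(sales)
print(normalized_sales)  # [0.0, 0.5, 1.0]
```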
Data Visualization:
Used to explore data distributions and patterns using graphs like histograms, bar charts, scatter
plots, etc.
Example: A scatter plot showing the relationship between customer age and spending.
3. Classification (6 hours)
Basic Concepts:
Classification is a predictive modeling task where a model assigns a category label (class) to an
input instance.
Example: Classifying whether an email is "spam" or "not spam" based on its content.
Decision Tree:
A popular classification algorithm that uses a tree-like model of decisions. Each internal node
represents a test on an attribute, each branch represents the outcome of a test, and each leaf node
represents a class label.
Example: A decision tree to classify if a customer will buy a product based on age, income, and
previous purchases.
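The node, branch, and leaf structure can be illustrated with a small hand-built tree. This sketch
only shows how a finished tree classifies a record; it does not learn the tree from data. The
attributes, thresholds, and labels are hypothetical.

```python
# Internal nodes are dicts holding a test; leaves are plain class labels (strings).
tree = {
    "attribute": "previous_purchases",
    "threshold": 1,
    "below": {
        "attribute": "income",
        "threshold": 50000,
        "below": "no",
        "at_or_above": "yes",
    },
    "at_or_above": "yes",
}


def classify(node, record):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        branch = "at_or_above" if value >= node["threshold"] else "below"
        node = node[branch]
    return node


print(classify(tree, {"previous_purchases": 0, "income": 60000}))  # yes
print(classify(tree, {"previous_purchases": 0, "income": 30000}))  # no
```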
Bayesian Classification:
Uses Bayes' theorem to calculate the probability that an observation belongs to a certain class.
Example: A Naive Bayes classifier could predict if a person will enjoy a movie based on previous
preferences.
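A minimal Naive Bayes sketch for the movie example follows. The priors and per-feature
likelihoods are made-up numbers (in practice they would be estimated from past data), and the
"naive" assumption is that features are independent given the class.

```python
def posterior(priors, likelihoods, observed):
    """Return P(class | observed features) for each class via Bayes' theorem."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            # Naive assumption: multiply the per-feature likelihoods.
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())  # the evidence term P(observed)
    return {label: score / total for label, score in scores.items()}


# Hypothetical probabilities for illustration only.
priors = {"like": 0.5, "dislike": 0.5}
likelihoods = {
    "like": {"comedy": 0.8, "short": 0.5},
    "dislike": {"comedy": 0.2, "short": 0.5},
}

result = posterior(priors, likelihoods, ["comedy", "short"])
predicted = max(result, key=result.get)
print(predicted)  # like
```

With these numbers the posterior for "like" is about 0.8, so the classifier predicts that the
person will enjoy the movie.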
Regression Analysis:
Example: Predicting house prices based on factors like area, number of rooms, and locality.
Evaluation Metrics:
- F1-Score: Harmonic mean of precision and recall, used when classes are imbalanced.
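As a worked illustration, precision, recall, and the F1-score can be computed directly from true
and predicted labels. The function name and sample labels are hypothetical; label 1 is assumed to
be the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here accuracy would be 6/8 = 0.75, while precision, recall, and F1 are all 2/3, which shows why
F1 can be more informative than accuracy when the positive class is rare.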
4. Clustering (4 hours)
Clustering:
Unsupervised learning technique to group similar data points into clusters based on features.
K-Means Clustering:
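A partitioning method where the data is divided into K clusters. The algorithm aims to minimize the
within-cluster sum of squared distances between points and their cluster centroid.
Example: Grouping customers into segments based on total spend.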
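The K-Means loop (assign each point to its nearest centroid, then recompute the centroids) can be
sketched for one-dimensional data such as total spend. The starting centroids are passed in
explicitly to keep the example deterministic; real implementations usually pick them at random or
with a seeding heuristic. The sample values are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


spend = [10, 12, 11, 100, 105, 95]
final_centroids, final_clusters = k_means_1d(spend, centroids=[10, 100])
print(final_centroids)  # [11.0, 100.0]
print(final_clusters)   # [[10, 12, 11], [100, 105, 95]]
```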
Hierarchical Clustering:
Builds a tree of clusters (dendrogram) either by agglomerating small clusters into larger ones
(bottom-up) or by dividing large clusters into smaller ones (top-down).
Density-Based Clustering:
Forms clusters based on density. It works well for clusters of varying shapes and can detect
outliers.
Example: Finding that customers who buy bread also frequently buy butter (Market Basket
Analysis).
Apriori Algorithm:
A popular algorithm for discovering frequent itemsets. It uses a bottom-up approach where frequent
itemsets are extended one item at a time (candidate generation) and then pruned.
Example: Identifying frequent itemsets like {milk, bread, butter} in supermarket data.
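The level-wise idea behind Apriori can be sketched as follows. This is a simplified teaching
version, not an optimized implementation: support is recomputed by scanning all transactions, and
the sample baskets and the min_support value are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support (a fraction)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
print(frequent_itemsets[frozenset({"milk", "bread", "butter"})])  # 0.5
```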
Association Rules:
Rules that express relationships between itemsets, like "If a customer buys A, they are likely to buy
B."
Example: "If a customer buys a laptop, they are 70% likely to buy a mouse."
Lift:
A measure of how much more likely two items are to be bought together compared to if they were
independent.
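Support, confidence, and lift for a rule A -> B can be computed directly from transaction counts.
In this sketch, lift is taken as confidence(A -> B) divided by support(B); the transactions are
hypothetical, and the antecedent is assumed to occur at least once.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_c = sum(1 for t in transactions if c <= set(t))
    count_both = sum(1 for t in transactions if (a | c) <= set(t))
    support = count_both / n
    confidence = count_both / count_a   # assumes the antecedent occurs at least once
    lift = confidence / (count_c / n)   # values above 1 suggest a positive association
    return support, confidence, lift


orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse"},
    {"laptop"},
    {"mouse"},
    {"keyboard"},
]
support, confidence, lift = rule_metrics(orders, {"laptop"}, {"mouse"})
print(round(support, 3), round(confidence, 3), round(lift, 3))  # 0.4 0.667 1.111
```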
Web Mining:
The process of extracting useful information from the web, including web content, structure, and
usage.
Web Content Mining:
Focuses on extracting useful information from the content of web pages like text, images, and
videos.
Example: Analyzing the text of reviews on an e-commerce site to extract customer sentiment.
Web Structure Mining:
Analyzes the structure of hyperlinks between web pages to discover the relationship between them.
Example: Using PageRank to rank web pages based on the number and quality of links pointing to
them.
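A simplified PageRank can be sketched with power iteration. The three-page link graph is
hypothetical, the damping factor 0.85 is the commonly quoted default, and every link target is
assumed to also appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump" term).
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank equally among the pages it links to.
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
ranking = sorted(ranks, key=ranks.get, reverse=True)
print(ranking)
```

Page C is linked from both A and B, so it ends up ranked above B, which receives only half of
A's share.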
Web Usage Mining:
Involves analyzing web server logs to understand user behavior, such as which pages they visit or
Example: Discovering that users who visit the homepage of an e-commerce website are more likely