Viva Preparation Notes

Uploaded by

Mansi Khanvilkar
1. Data Warehouse and OLAP (9 hours)

Data Warehousing:

A data warehouse is a centralized repository for storing large amounts of structured and historical

data. It supports decision-making by providing insights into trends and patterns.

Example: A retail company may store years of sales data to analyze long-term trends.

Dimensional Modeling:

Organizes data into facts (numerical data such as sales, revenue) and dimensions (contextual

information like time, geography, product).

Example: In a sales scenario, the fact table could store sales amounts, while dimension tables

would store information about time, product, and customer.

OLAP (Online Analytical Processing):

Allows multidimensional data analysis. It enables operations such as:

- Roll-up: Aggregating data (e.g., monthly sales to yearly sales).

- Drill-down: Breaking data down (e.g., yearly sales to monthly sales).

- Slicing and Dicing: Slicing fixes a single value on one dimension (e.g., all sales in 2022), while dicing selects values on two or more dimensions (e.g., sales in 2022 for region X).

Example: An OLAP cube for a company could have dimensions like Time, Region, and Product to

analyze sales figures from multiple angles.
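
The operations above can be sketched on a tiny in-memory fact table. This is a minimal illustration, not a real OLAP engine: the rows, the column order, and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, product, sales_amount)
sales = [
    (2021, 12, "X", "Pen", 100),
    (2022, 1, "X", "Pen", 120),
    (2022, 1, "Y", "Pen", 80),
    (2022, 2, "X", "Book", 200),
    (2022, 2, "Y", "Book", 150),
]

def roll_up_to_year(rows):
    """Roll-up: aggregate the monthly rows into yearly totals."""
    totals = defaultdict(int)
    for year, _month, _region, _product, amount in rows:
        totals[year] += amount
    return dict(totals)

def drill_down_to_month(rows, year):
    """Drill-down: break one year's total into monthly totals."""
    totals = defaultdict(int)
    for row_year, month, _region, _product, amount in rows:
        if row_year == year:
            totals[month] += amount
    return dict(totals)

def slice_by_year(rows, year):
    """Slice: fix one value on a single dimension (Time)."""
    return [r for r in rows if r[0] == year]

def dice(rows, year, region):
    """Dice: restrict two dimensions at once (Time and Region)."""
    return [r for r in rows if r[0] == year and r[2] == region]

yearly = roll_up_to_year(sales)
monthly_2022 = drill_down_to_month(sales, 2022)
sales_2022_x = dice(sales, 2022, "X")
```

In a real warehouse these would typically be GROUP BY and WHERE clauses over the fact table; the plain-Python version only shows what each operation does to the data.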

OLTP vs. OLAP:

OLTP (Online Transaction Processing) handles day-to-day operations as many short read/write transactions (e.g., bank withdrawals), while OLAP runs complex, mostly read-only queries over historical data for analysis.

Data Warehouse Schemas:

- Star Schema: A simple schema where a central fact table is connected to dimension tables.

Example: A sales database with a central sales fact table and dimension tables for Time, Product,

and Customer.

- Snowflake Schema: An extension of the star schema where dimension tables are normalized into

multiple related tables.

Example: A more detailed product dimension table may be split into Product and Supplier.

- Fact Constellation Schema: Multiple fact tables share dimension tables, suitable for more complex

data models.

Example: One fact table for Sales and another for Returns, both sharing the same Product and

Customer dimensions.
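
A star schema can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: the table names (fact_sales, dim_product, dim_customer), the columns, and the sample rows are assumptions, and the Time dimension is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table that references two dimension tables (star shape).
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Mouse');
INSERT INTO dim_customer VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO fact_sales VALUES (1, 1, 900.0), (2, 1, 20.0), (2, 2, 25.0);
""")

# Typical star-schema query: join the fact table to a dimension and aggregate.
sales_by_product = conn.execute("""
SELECT p.name, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY p.name
ORDER BY p.name
""").fetchall()
```

A snowflake version would split dim_product further (for example into product and supplier tables), and a fact constellation would add a second fact table, such as returns, that reuses the same dimension tables.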

ETL Process (Extract, Transform, Load):

Extract: Pulling data from different sources.

Transform: Cleaning, filtering, and aggregating data.

Load: Loading data into the warehouse.

Example: Extracting sales data from different branches, transforming it by cleaning duplicates, and

loading it into a central warehouse.
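
The three ETL steps can be sketched as three small functions. Everything here is assumed for illustration: the branch records, the rule that a duplicate is a repeated (branch, order_id) pair, and the sales table in an in-memory SQLite database standing in for the warehouse.

```python
import sqlite3

def extract():
    """Extract: pull raw rows from each (hypothetical) branch source."""
    branch_a = [("A", 1, "100.0"), ("A", 2, "50.5"), ("A", 2, "50.5")]
    branch_b = [("B", 1, " 75 "), ("B", 2, "")]
    return branch_a + branch_b

def transform(rows):
    """Transform: drop duplicates and empty amounts, convert text to numbers."""
    seen = set()
    clean = []
    for branch, order_id, amount in rows:
        key = (branch, order_id)
        amount = amount.strip()
        if key in seen or not amount:
            continue
        seen.add(key)
        clean.append((branch, order_id, float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(branch TEXT, order_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(transform(extract()), warehouse)
row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```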

2. Introduction to Data Mining, Data Exploration, and Data Preprocessing (8 hours)

Data Mining:

The process of discovering patterns, trends, and useful information from large datasets. It involves

techniques like clustering, classification, and association.

Example: Mining customer transaction data to discover which products are frequently bought

together (Market Basket Analysis).


Data Preprocessing:

- Data Cleaning: Handling missing values, removing duplicates, and correcting errors.

Example: Filling missing customer age data with the average age in the dataset.

- Data Integration: Combining data from different sources into a unified view.

- Data Reduction: Reducing data volume through techniques like attribute subset selection

(choosing relevant attributes) and sampling.

- Data Transformation: Converting data into a suitable format for mining (e.g., normalizing data).

Example: Normalizing sales data so that all values fall between 0 and 1 for uniformity.
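
Two of the steps above, mean imputation (data cleaning) and min-max normalization (data transformation), are easy to show in a few lines. This is a minimal sketch; the function names and sample values are made up for the example.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values to [0, 1] with (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = fill_missing_with_mean([20, None, 40])
scaled_sales = min_max_normalize([10, 20, 30])
```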

Data Visualization:

Used to explore data distributions and patterns using graphs like histograms, bar charts, scatter

plots, etc.

Example: A scatter plot showing the relationship between customer age and spending.

3. Classification (6 hours)

Basic Concepts:

Classification is a predictive modeling task where a model assigns a category label (class) to an

observation based on its attributes.

Example: Classifying whether an email is "spam" or "not spam" based on its content.

Decision Tree:

A popular classification algorithm that uses a tree-like model of decisions. Each internal node

represents a test on an attribute, each branch represents the outcome of a test, and each leaf node

represents a class label.

Example: A decision tree to classify if a customer will buy a product based on age, income, and

previous purchases.
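
The node/branch/leaf structure can be written down directly as nested dictionaries. The tree below is hand-made for illustration and is not learned from data; the attributes, their values, and the labels are assumptions.

```python
# Internal nodes test an attribute, branches are test outcomes, leaves hold a class label.
tree = {
    "attribute": "income",
    "branches": {
        "high": {"label": "buys"},
        "low": {
            "attribute": "previous_purchases",
            "branches": {
                "yes": {"label": "buys"},
                "no": {"label": "does not buy"},
            },
        },
    },
}

def classify(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while "label" not in node:
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node["label"]

prediction = classify(tree, {"income": "low", "previous_purchases": "no"})
```

A learning algorithm would choose which attribute to test at each node from training data; here only the classification step is shown.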

Bayesian Classification:

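Uses Bayes' theorem to calculate the probability that an observation belongs to a certain class. Naive Bayes additionally assumes that the attributes are conditionally independent given the class.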

Example: A Naive Bayes classifier could predict if a person will enjoy a movie based on previous

preferences.
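
A small count-based Naive Bayes for the movie example might look like the sketch below. The viewing history, the two attributes (genre, length), and the function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class probability zero.

```python
from collections import Counter, defaultdict

# Hypothetical history: ((genre, length), enjoyed?)
history = [
    (("comedy", "short"), "yes"),
    (("comedy", "long"), "yes"),
    (("drama", "long"), "no"),
    (("drama", "short"), "yes"),
    (("horror", "long"), "no"),
]

def train(data):
    """Count how often each class and each (class, attribute value) occurs."""
    class_counts = Counter(label for _, label in data)
    value_counts = defaultdict(Counter)
    for features, label in data:
        for position, value in enumerate(features):
            value_counts[(label, position)][value] += 1
    return class_counts, value_counts

def posteriors(features, class_counts, value_counts):
    """Bayes' theorem: P(class | x) is proportional to P(class) * product of P(x_i | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for position, value in enumerate(features):
            score *= value_counts[(label, position)][value] / count
        scores[label] = score
    evidence = sum(scores.values())  # assumed non-zero in this small example
    return {label: s / evidence for label, s in scores.items()}

class_counts, value_counts = train(history)
result = posteriors(("drama", "long"), class_counts, value_counts)
predicted = max(result, key=result.get)
```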

Regression Analysis:

Predicts continuous outcomes rather than categorical. Types include:

- Simple Linear Regression: Predicts an outcome based on one predictor variable.

- Multiple Linear Regression: Predicts an outcome based on multiple predictor variables.

Example: Predicting house prices based on factors like area, number of rooms, and locality.
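
Simple linear regression has a closed-form least-squares solution, sketched below. The area and price figures are invented so that the fitted line is easy to check by hand.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house area vs. price (arbitrary units)
areas = [50, 100, 150]
prices = [100, 200, 300]

slope, intercept = fit_simple_linear(areas, prices)
predicted_price = slope * 120 + intercept
```

Multiple linear regression follows the same idea with several predictors, and is usually solved with a linear algebra library rather than by hand.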

Evaluation Metrics:

- Accuracy: Percentage of correctly classified instances.

- Precision: True positives / (True positives + False positives).

- Recall: True positives / (True positives + False negatives).

- F1-Score: Harmonic mean of precision and recall, 2 x Precision x Recall / (Precision + Recall). Useful when classes are imbalanced and accuracy alone can be misleading.
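
All four metrics follow from the confusion-matrix counts. The counts used below are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

With these counts accuracy is 0.7, precision is 0.8, recall is 8/12 and F1 is 16/22. The sketch assumes the denominators are non-zero.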

4. Clustering (4 hours)

Clustering:

Unsupervised learning technique to group similar data points into clusters based on features.

Example: Grouping customers into clusters based on their buying behavior.

K-Means Clustering:

A partitioning method where the data is divided into K clusters. The algorithm aims to minimize the sum of squared distances between points and their respective cluster centroids.


Example: Segmenting a customer base into 5 clusters based on their purchasing frequency and

total spend.
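
The usual iterative procedure (assign each point to its nearest centroid, then recompute the centroids) can be sketched as follows. Taking the first K points as starting centroids is a simplifying assumption made here for reproducibility; practical implementations use random or smarter initialization. The sample points are invented.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k, max_iterations=100):
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical customers: (purchase frequency, total spend)
customers = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, labels = kmeans(customers, k=2)
```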

Hierarchical Clustering:

Builds a tree of clusters (dendrogram) either by agglomerating small clusters into larger ones

(agglomerative) or dividing large clusters into smaller ones (divisive).

Density-Based Clustering (DBSCAN):

Forms clusters from dense regions of points. It can find clusters of arbitrary shape and labels points in sparse regions as outliers (noise).

Example: Identifying regions in geographic data where accidents frequently occur.
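
A compact version of DBSCAN is sketched below, assuming Euclidean distance and a brute-force neighbor search. The parameter names eps (neighborhood radius) and min_pts (minimum neighbors for a core point) follow common usage, and the coordinates are invented.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Every point within eps of point i (including i itself).
        return [
            j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point, so keep growing
    return labels

# Hypothetical accident locations: two dense areas and one isolated point.
accidents = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
accident_labels = dbscan(accidents, eps=1.5, min_pts=3)
```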

5. Frequent Pattern Mining (8 hours)

Frequent Pattern Mining:

The task of discovering itemsets that frequently occur together in a dataset.

Example: Finding that customers who buy bread also frequently buy butter (Market Basket

Analysis).

Apriori Algorithm:

A popular algorithm for discovering frequent itemsets. It uses a bottom-up approach where frequent itemsets are extended one item at a time (candidate generation), and candidates that fall below the minimum support, or that contain an infrequent subset, are pruned.

Example: Identifying frequent itemsets like {milk, bread, butter} in supermarket data.
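
The level-by-level search can be sketched in plain Python. This is an unoptimized illustration: the transactions and the 0.5 minimum support are assumptions, and support is recomputed by scanning every transaction each time.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical supermarket baskets
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
```

With these baskets every single item and every pair is frequent, but {milk, bread, butter} appears in only 2 of 5 baskets (support 0.4) and is dropped.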

Association Rules:

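Rules that express relationships between itemsets, like "If a customer buys A, they are likely to buy B."

Example: "If a customer buys a laptop, they are 70% likely to buy a mouse." Here 70% is the confidence of the rule: the fraction of transactions containing a laptop that also contain a mouse.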

Lift:

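A measure of how much more likely two items are to be bought together compared to if they were independent: Lift(A -> B) = Confidence(A -> B) / Support(B). A lift above 1 means the items occur together more often than expected under independence.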
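
Support, confidence, and lift for a single rule can be computed directly from transaction counts. The five transactions below are invented for the laptop and mouse example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence(A -> B) = Support(A and B) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical transactions
orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse"},
    {"laptop"},
    {"mouse"},
    {"bag"},
]
rule_confidence = confidence(orders, {"laptop"}, {"mouse"})
rule_lift = lift(orders, {"laptop"}, {"mouse"})
```

Here the confidence is 2/3 and the lift is about 1.11, so in this toy data a laptop purchase makes a mouse purchase slightly more likely than its baseline rate of 3/5.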

6. Web Mining (4 hours)

Web Mining:

The process of extracting useful information from the web, including web content, structure, and

usage.

Web Content Mining:

Focuses on extracting useful information from the content of web pages like text, images, and

videos.

Example: Analyzing the text of reviews on an e-commerce site to extract customer sentiment.

Web Structure Mining:

Analyzes the structure of hyperlinks between web pages to discover the relationship between them.

Example: Using PageRank to rank web pages based on the number and quality of links pointing to

them.
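
PageRank can be approximated by repeatedly passing each page's rank along its outgoing links. The sketch below assumes a damping factor of 0.85 (a commonly used value) and a made-up four-page link graph; pages without outgoing links spread their rank evenly over all pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
top_page = max(ranks, key=ranks.get)
```

Page C receives links from three pages and ends up with the highest rank, while D, which nothing links to, keeps only the baseline share.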

Web Usage Mining:

Involves analyzing web server logs to understand user behavior, such as which pages they visit or

how long they stay on a website.

Example: Discovering that users who land on the homepage of an e-commerce website most often navigate next to the sale section.
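
One simple form of usage mining is counting page-to-page transitions per user. The log format below, (user_id, timestamp in seconds, page), is a simplified assumption; real server logs need parsing and session splitting first.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed log entries: (user_id, timestamp_seconds, page)
log = [
    ("u1", 0, "/home"), ("u1", 30, "/sale"),
    ("u2", 5, "/home"), ("u2", 50, "/sale"),
    ("u3", 10, "/home"), ("u3", 20, "/about"),
]

def transition_counts(entries):
    """Count how often users move directly from one page to another."""
    pages_by_user = defaultdict(list)
    for user, _timestamp, page in sorted(entries, key=lambda e: (e[0], e[1])):
        pages_by_user[user].append(page)
    counts = Counter()
    for pages in pages_by_user.values():
        for source, destination in zip(pages, pages[1:]):
            counts[(source, destination)] += 1
    return counts

transitions = transition_counts(log)
most_common_move = transitions.most_common(1)[0]
```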
