Data Mining WBSU Solution 1
What is Data Warehousing? Why is it needed? [WBSU]
Data warehousing is a process of collecting, storing, managing, and organizing large volumes of data from different
sources within an organization. The purpose of a data warehouse is to provide a centralized and integrated repository of
data that can be used for analysis, reporting, and business intelligence.
---------------------------------------------------------------------------------------------------------------------------------------
• Centralized Data Storage: Data warehousing provides a centralized repository where data from disparate
sources can be integrated and stored in a structured manner.
• Integration of Heterogeneous Data: Different departments and business units often use diverse data formats and
structures. Data warehousing facilitates the integration of heterogeneous data.
• Historical Data Analysis: Data warehouses retain historical data over long periods, enabling trend analysis and comparisons across time.
• Support for Decision-Making: Decision-makers require timely access to accurate and relevant information. Data warehouses serve as a foundation for business intelligence.
• Complex Query Performance: Data warehousing systems are designed to handle complex queries and analytical
processing efficiently.
• Data Quality Improvement: Data warehouses often involve processes for cleaning, transforming, and
standardizing data before it is stored.
• Strategic Planning: With access to comprehensive and integrated data, organizations can engage in strategic planning.
------------------------------------------------------------------------------------------------------------------------------------------------
Market Basket Analysis is a data analysis technique used in the field of data mining and business intelligence. It involves
discovering relationships or associations between products or items that are frequently purchased together by
customers.
• Association Rules: These rules highlight relationships between different items in a dataset.
• Support: Support measures how frequently an itemset (a combination of items) appears in the dataset. It
indicates the popularity or occurrence of a particular combination of items.
• Confidence: Confidence measures the likelihood that if a customer buys one item (antecedent), they will also buy
another item (consequent). It represents the strength of the association between items.
• Lift: Lift is a measure of how much more likely an item (consequent) is to be bought when another item
(antecedent) is purchased, compared to when the items are bought independently. A lift value greater than 1
indicates a positive association.
If customers frequently buy bread (item A) and butter (item B) together, the association rule might be: {Bread} =>
{Butter}. The support could be the percentage of transactions that contain both bread and butter, the confidence could
be how often customers who buy bread also buy butter, and the lift could show if the purchase of bread influences the
purchase of butter.
Support
• Definition: Support measures the frequency or occurrence of a particular itemset in the dataset. It indicates how
often a specific combination of items appears together in transactions.
• Formula: Support(X) = (Number of transactions containing X) / (Total number of transactions).
• Example: If you're analyzing the association {A, B}, the support would be the proportion of transactions that
include both A and B.
• Significance: High support values suggest that the itemset is frequently present in transactions, making it a more
significant association.
Confidence
• Definition: Confidence measures the likelihood that an item B is purchased when item A is purchased. In other
words, it quantifies the strength of the association rule.
• Formula: Confidence(A → B) = Support(A ∪ B) / Support(A).
• Example: If you have an association rule {A} => {B}, the confidence would be the proportion of transactions
containing both A and B relative to the transactions containing A.
• Significance: High confidence values indicate a strong association between items A and B. It represents the
probability that if A is purchased, B will also be purchased.
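To make the formulas above concrete, here is a minimal Python sketch that computes support, confidence, and lift for the rule {Bread} => {Butter}; the five transactions are invented purely for illustration.

# Support, confidence, and lift for the rule {Bread} => {Butter},
# computed from a small made-up list of transactions.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Eggs"},
    {"Milk", "Eggs"},
    {"Bread", "Butter", "Eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_ab = support({"Bread", "Butter"})        # Support(Bread ∪ Butter) = 3/5 = 0.6
confidence = sup_ab / support({"Bread"})     # Confidence(Bread → Butter) = 0.6 / 0.8 = 0.75
lift = confidence / support({"Butter"})      # Lift(Bread → Butter) = 0.75 / 0.6 = 1.25

print(sup_ab, confidence, lift)

Here the lift of 1.25 is greater than 1, matching the positive bread and butter association described above.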
--------------------------------------------------------------------------------------------------------------------------------------------------------
What is meant by an Outlier? How are Outliers detected using Data Mining? [WBSU]
An outlier is an observation or data point that differs significantly from the rest of the dataset. In other words, an outlier
is a value that lies an abnormal distance from the other values in a random sample. Outliers can distort statistical analysis
and reduce the accuracy of predictive models.
Several techniques and methods are employed in data mining to identify outliers. Here are some common approaches:
• Distance- and Density-Based Methods: These algorithms calculate how far each data point lies from its neighbors, or how dense its neighborhood is. Points that lie significantly farther away than the rest, or that fall in low-density regions, are flagged as outliers. Examples include the k-Nearest Neighbors (k-NN) distance and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which labels points in sparse regions as noise.
• Isolation Forest: The Isolation Forest algorithm isolates outliers by constructing decision trees. Outliers are
expected to be isolated in shorter paths through the tree, making them easier to identify.
• Machine Learning Models: Some machine learning models, especially those sensitive to outliers, can indirectly
help in outlier detection. Models like One-Class SVM (Support Vector Machine) are designed to learn the normal
pattern and flag deviations as potential outliers.
• Visualization Techniques: Data visualization, such as box plots, scatter plots, and histograms, can help identify
data points that deviate from the overall pattern.
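As an illustration of two of the approaches above, here is a small Python sketch; it assumes NumPy and scikit-learn are installed, and the numbers are invented so that one value (55.0) clearly stands apart.

# Flagging an outlier with a simple z-score rule and with Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

values = np.array([10.2, 9.8, 10.5, 10.1, 55.0, 9.9, 10.3])

# Statistical rule: flag points more than two standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
print("z-score outliers:", values[np.abs(z_scores) > 2])

# Isolation Forest: fit_predict returns -1 for isolated (outlying) points, 1 for inliers.
labels = IsolationForest(contamination=0.15, random_state=0).fit_predict(values.reshape(-1, 1))
print("Isolation Forest outliers:", values[labels == -1])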
--------------------------------------------------------------------------------------------------------------------------------------------------------
Differentiate between Clustering and Classification. [WBSU]
Supervision
• Clustering: Clustering is an unsupervised learning task, meaning that it does not require labeled training data.
The algorithm identifies patterns or groups without prior knowledge of the classes.
• Classification: Classification is a supervised learning task, relying on labeled training data to learn and make
predictions. The model is trained on input-output pairs where the correct classes are provided.
Nature of Output
• Clustering: The output of a clustering algorithm is a grouping or clustering of data points, revealing similarities
within each cluster and dissimilarities between clusters.
• Classification: The output of a classification model is a decision or prediction about the class or label to which
each input belongs.
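A minimal sketch of this contrast, assuming scikit-learn is available; the points and labels below are invented toy data.

# Clustering groups unlabeled points; classification learns from labeled examples.
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]       # input points
y = ["small", "small", "large", "large"]   # class labels (used only by the classifier)

# Clustering (unsupervised): no labels are given; the algorithm discovers the groups.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification (supervised): trained on input-output pairs, then predicts a class.
model = DecisionTreeClassifier().fit(X, y)
print(clusters)                 # cluster ids such as [0 0 1 1]
print(model.predict([[2, 1]]))  # ['small']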
----------------------------------------------------------------------------------------------------------------------------------------------------------
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two types of database systems that
serve different purposes in the world of data management.
Purpose
• OLTP: OLTP systems are designed for transaction-oriented processing. They handle day-to-day operations and
transactions such as inserting, updating, and deleting records in real-time.
• OLAP: OLAP systems are geared towards analytical processing. They are used for complex queries, data analysis,
and reporting.
Database Design
• OLTP: OLTP databases are normalized to minimize redundancy and ensure data consistency. They typically have a
relational database structure with a focus on efficient transaction processing.
• OLAP: OLAP databases are often denormalized to improve query performance. They use a multidimensional
database structure (cubes, dimensions, and measures) that allows for quick and flexible analysis of data.
Examples
• OLTP: Examples include order processing systems, banking systems, and inventory management systems.
• OLAP: Examples include data warehouses, business intelligence systems, and decision support systems.
How can you check the efficiency of a classifier model? [WBSU]
Evaluating the efficiency of a classifier model is crucial to understand its performance and make informed decisions.
Several metrics are commonly used to assess the effectiveness of a classifier. The choice of metric depends on the nature
of the problem (binary classification, multiclass classification) and the specific requirements of the application.
Accuracy
Definition: The ratio of correctly predicted instances to the total number of instances in the dataset.
Considerations: Accuracy is a straightforward metric, but it may not be suitable for imbalanced datasets, where one class
significantly outnumbers the others.
Precision
Definition: The ratio of correctly predicted positive observations to the total predicted positives.
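A short sketch of computing these two metrics, assuming scikit-learn is available; the true and predicted labels below are invented.

# Accuracy and precision for a binary classifier's predictions.
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classes predicted by the model

print("accuracy :", accuracy_score(y_true, y_pred))    # 6 correct out of 8 = 0.75
print("precision:", precision_score(y_true, y_pred))   # 3 true positives / 4 predicted positives = 0.75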
------------------------------------------------------------------------------------------------------------------------------------------------------------
Explain the difference between data mining and data warehousing. [WBSU]
1. Focus
• Data mining focuses on extracting patterns and insights from data to make predictions or decisions.
• Data warehousing focuses on the efficient storage, retrieval, and analysis of structured data for reporting and
decision support.
2. Process
• Data mining involves the application of algorithms to discover patterns and knowledge from large datasets.
• Data warehousing involves the collection, storage, and organization of data into a centralized repository.
3. Purpose
• The purpose of data mining is knowledge discovery and predictive modeling.
• The purpose of data warehousing is to provide a unified and efficient platform for reporting and analysis.
4. Techniques vs. Infrastructure
• Data mining involves techniques and algorithms for analyzing data.
• Data warehousing involves the infrastructure and architecture for storing and managing data.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
The output of the Apriori algorithm is a set of frequent itemsets, which represent combinations of items that frequently
occur together in a dataset. Additionally, the algorithm generates association rules based on these frequent itemsets.
These association rules express relationships between different items and indicate the likelihood of the occurrence of
one item given the presence of another.
For example, if the Apriori algorithm is applied to retail transaction data, it might identify frequent itemsets like {bread,
milk} or {eggs, cheese}. The associated rules could then reveal insights such as "Customers who buy bread are 80% likely
to buy milk as well." These insights can guide business strategies, such as product placement, marketing, and cross-
selling efforts.
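A brute-force Python sketch of what this output looks like (frequent itemsets above a support threshold); the transactions are invented, and a real Apriori implementation would additionally prune candidate itemsets level by level.

# Enumerate itemsets of size 1 and 2 and keep those meeting the minimum support.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"eggs", "cheese"},
    {"bread", "milk", "cheese"},
]
min_support = 0.5
items = sorted(set().union(*transactions))

frequent = {}
for size in (1, 2):
    for itemset in combinations(items, size):
        sup = sum(set(itemset) <= t for t in transactions) / len(transactions)
        if sup >= min_support:
            frequent[itemset] = sup

print(frequent)   # includes ('bread', 'milk'): 0.75, from which rules like {bread} => {milk} are derived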
---------------------------------------------------------------------------------------------------------------------------------------------------------------
What do you understand by predictive data mining? [WBSU]
Predictive data mining, also known as predictive analytics, is a subset of data mining that involves the use of statistical
algorithms, machine learning techniques, and modeling to analyze historical data and make predictions about future
events or trends. The primary objective of predictive data mining is to uncover patterns and relationships within data
that can be used to anticipate future outcomes.
Various algorithms and techniques can be used for predictive modeling, depending on the nature of the data and the
prediction task. Common algorithms include linear regression, decision trees, support vector machines, neural networks,
and ensemble methods like random forests.
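A minimal predictive-modeling sketch, assuming scikit-learn is available; the "historical" numbers below (advertising spend versus sales) are invented.

# Fit a model on past observations, then predict an unseen future value.
from sklearn.linear_model import LinearRegression

X = [[10], [20], [30], [40]]   # historical advertising spend
y = [25, 45, 65, 85]           # historical sales

model = LinearRegression().fit(X, y)
print(model.predict([[50]]))   # predicted sales for a spend of 50 (about 105)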
--------------------------------------------------------------------------------------------------------------------------------------------------------
OLAP (Online Analytical Processing) involves various operations that allow users to interactively analyze
multidimensional data to gain insights. Here are the key OLAP operations along with examples:
Roll-up (Drill-Up)
Definition: Aggregating data along a dimension hierarchy to a higher, coarser level of granularity.
Example: Consider a sales data cube with dimensions like Time (Year, Quarter, Month) and Product (Category,
Subcategory). Rolling up by Time from Monthly to Quarterly would aggregate monthly sales into quarterly sales.
Drill-down (Roll-Down)
Definition: Breaking down aggregated data into a more detailed level of granularity.
Example: Using the same sales data cube, drilling down by Time from Quarterly to Monthly would break down quarterly
sales into monthly sales.
Pivot (Rotate)
Definition: Changing the orientation of the data cube by rotating it to view it from a different perspective.
Example: For a sales data cube with dimensions like Region, Product, and Time, pivoting could involve changing the
orientation to view sales across different products for each region.
Slice
Definition: Selecting a single value for one dimension to view a "slice" of the cube.
Example: Slicing the sales data cube by selecting a specific month would show sales for all products and regions for that
particular month.
Dice
Definition: Selecting a subcube by choosing specific values for two or more dimensions.
Example: Dicing the sales data cube by selecting a specific region and product category would show sales for that
particular region and product category across all time periods.
Drill Across
Definition: Navigating from one data cube to another to access related information.
Example: If there are separate data cubes for sales and customer information, drilling across might involve moving from
the sales cube to the customer cube to analyze customer details related to specific sales.
Ranking
Example: Ranking products based on their sales volume to identify the top-selling products.
Top N / Bottom N
Example: Showing the top 10 customers based on their total purchase amount.
Swing (Rotation)
Definition: Changing the axis of the cube to view it from a different perspective.
Example: For a sales data cube with dimensions like Product, Time, and Region, swinging or rotating the cube could
involve changing the axis to focus on sales trends over time for each product in a specific region.
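Several of these operations can be imitated on a flat table with pandas; the sketch below assumes pandas is installed and uses an invented four-row sales table.

# Roll-up, slice, dice, and pivot on a toy sales table.
import pandas as pd

sales = pd.DataFrame({
    "Region":  ["East", "East", "West", "West"],
    "Product": ["Bread", "Butter", "Bread", "Butter"],
    "Month":   ["Jan", "Feb", "Jan", "Feb"],
    "Amount":  [100, 150, 200, 250],
})

rollup = sales.groupby("Region")["Amount"].sum()                       # roll-up to Region level
jan_slice = sales[sales["Month"] == "Jan"]                             # slice: fix Month = Jan
dice = sales[(sales["Region"] == "East") & (sales["Month"] == "Jan")]  # dice: fix Region and Month
pivot = sales.pivot_table(index="Product", columns="Region",
                          values="Amount", aggfunc="sum")              # pivot: rotate the view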
------------------------------------------------------------------------------------------------------------------------------------------------------
Explain the different methods of Data Cleaning and Data Transformation. [WBSU]
Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, involves the process of identifying and correcting errors
or inconsistencies in datasets. Here are different methods used for data cleaning:
Handling Outliers
• Statistical Methods: Identify outliers using statistical measures such as Z-scores or the interquartile range (IQR).
• Visualization: Use box plots, scatter plots, or histograms to visually identify outliers.
• Treatment: Decide whether to remove, transform, or impute outlier values based on the nature of the data and the analysis requirements.
De-duplication
• Identify and remove duplicate records so that each real-world entity appears only once in the dataset.
Handling Inconsistent Data
• Standardizing Formats: Ensure consistency in date formats, units, and other data representations.
• Correcting Typos: Use algorithms or manual methods to identify and correct typographical errors.
Data Transformation
Data transformation involves modifying the original data to make it more suitable for analysis or modeling. Here are
different methods used for data transformation:
Feature Engineering
Definition: Create new features from existing ones so that patterns in the data are easier for a model to capture.
Example: Combining date and time features into a single datetime feature.
Aggregation
Combine multiple records into a summary representation, often using aggregation functions like sum, mean, or max.
Dimensionality Reduction
A technique (e.g., Principal Component Analysis) that transforms the data into a lower-dimensional space while retaining as much variance as possible.
Data Discretization
Convert continuous attributes into a small number of discrete intervals or bins, for example grouping exact ages into age bands.
Both data cleaning and data transformation are essential steps in the data preprocessing pipeline, ensuring that the data
is accurate, consistent, and suitable for analysis or modeling purposes.
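A brief sketch of a few of these steps with pandas; it assumes pandas is installed, and the small table is invented (one duplicate row and one impossible age of 300).

# De-duplication, outlier removal (IQR rule), normalization, and discretization.
import pandas as pd

df = pd.DataFrame({"age": [25, 31, 29, 300, 25],
                   "income": [40000, 52000, 48000, 61000, 40000]})

df = df.drop_duplicates()                                   # data cleaning: remove duplicate rows
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)]  # drop the age-300 outlier

df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())  # min-max normalization
df["age_band"] = pd.cut(df["age"], bins=[20, 30, 40], labels=["20s", "30s"])  # discretization into bins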