BI Unit 4 Final
Q1) Data Pre-processing – 1) Data pre-processing is the step where raw data is cleaned, transformed & prepared before it is used in analysis, reporting or ML. 2) Pre-processing ensures that data is accurate, well-structured and ready for use.
#Need for Pre-processing – 1) In BI we often collect data from many sources like databases, sensors or files. 2) This raw data may contain missing values, duplicate records and errors, which can cause wrong results or errors in reports. 3) That's why data pre-processing is important. 4) It helps in cleaning data, handling missing or duplicate values, changing formats & converting data into a consistent structure.
#Data Pre-processing Techniques – 1) Data Cleaning - This step fixes or removes incorrect, incomplete, duplicate or irrelevant data. It also includes removing outliers & making sure column names & labels are consistent. E.g. If some entries in a table have missing prices, you can fill them using the average price. 2) Data Transformation - This changes data into the right format and structure for analysis. It also helps in merging data from different sources by making formats compatible. E.g. changing all date formats to DD-MM-YYYY (a small sketch of both examples follows below).
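#Example (Python sketch): A minimal illustration of the two examples above (filling a missing price with the average and standardizing date formats) using pandas. The table, column names and values are made-up placeholders, not from the notes; a fuller cleaning example follows Q2.
```python
import pandas as pd
from dateutil import parser

# Hypothetical product table; values and column names are made up for illustration
df = pd.DataFrame({
    "order_date": ["January 1, 2025", "01/01/2025", "2025-01-01"],
    "price": [250.0, None, 400.0],
})

# Data cleaning: fill the missing price using the average price
df["price"] = df["price"].fillna(df["price"].mean())

# Data transformation: bring every date into one standard DD-MM-YYYY format
df["order_date"] = df["order_date"].apply(lambda s: parser.parse(s).strftime("%d-%m-%Y"))
print(df)
```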
Q2) Data Cleaning – 1) Data cleaning is the process of removing or correcting inaccurate, incomplete, duplicate or irrelevant data from a dataset.
#Need for Data Cleaning – 1) In real-world scenarios, data often comes from various sources like files, sensors or databases. 2) This raw data may contain errors, missing values, duplicates or wrong formats. 3) If this unclean data is used in BI or analysis, it can lead to incorrect results and poor decision-making. 4) That's why data cleaning is necessary. It helps improve the accuracy, quality and reliability of data, which in turn improves the quality of reports.
#Methods of Data Cleaning – 1) Removing Duplicates - Finds and deletes records that appear more than once. E.g. If the same customer is listed twice in a database, keep only one entry. 2) Handling Missing Values - Fills missing data using techniques like mean, median or mode, or removes rows/columns if too many values are missing. 3) Standardizing Formats - Converts all data into a uniform format (like dates in DD-MM-YYYY or currency in INR). E.g. Change "Jan 1, 2025" to "01-01-2025". 4) Outlier Detection & Handling - Finds & handles unusual or extreme values that don't fit the overall pattern. E.g. If a product price is entered as 1,00,000 instead of 1000, it can be flagged or corrected.
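#Example (Python sketch): The four cleaning methods above applied with pandas. The orders table, names and prices are hypothetical and only meant to show the pattern of each step.
```python
import pandas as pd
from dateutil import parser

# Hypothetical customer orders with duplicates, a missing price, mixed date formats and an outlier
orders = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", "Meena", "John", "Sara", "Amit"],
    "order_date": ["Jan 1, 2025", "Jan 1, 2025", "2 January 2025", "2025-01-03",
                   "4 Jan 2025", "Jan 5, 2025", "2025-01-06"],
    "price": [1000, 1000, None, 100000, 1200, 900, 1100],
})

# 1) Removing duplicates: keep only one row per identical record
orders = orders.drop_duplicates()

# 2) Handling missing values: fill the missing price with the median price
orders["price"] = orders["price"].fillna(orders["price"].median())

# 3) Standardizing formats: convert every date to DD-MM-YYYY
orders["order_date"] = orders["order_date"].apply(lambda s: parser.parse(s).strftime("%d-%m-%Y"))

# 4) Outlier detection: flag prices far outside the interquartile range for review
q1, q3 = orders["price"].quantile([0.25, 0.75])
iqr = q3 - q1
orders["price_outlier"] = (orders["price"] < q1 - 1.5 * iqr) | (orders["price"] > q3 + 1.5 * iqr)
print(orders)
```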
Q3) Data Validation, Incompleteness, Noise, Inconsistency in Quality of Input Data.
A) Data Validation – 1) Data validation is the process of checking whether input data is correct, meaningful & useful before it is used in a report or analysis. 2) It ensures that data follows rules & formats, like proper dates, correct values and no missing fields. 3) Without validation, incorrect data may enter the system & cause errors or misleading results in reports.
B) Incompleteness – 1) Incomplete data means that some values are missing in the dataset. 2) This can happen due to human errors, data collection issues or system failures. 3) If not handled properly, incomplete data can lead to inaccurate reports or misleading insights. 4) It is important to either fill missing values or remove incomplete entries during cleaning.
C) Noise – 1) Noise refers to random or meaningless data that does not follow the pattern of the other data points. 2) Noise can affect the quality of reports & make it hard to find real patterns in the data. 3) It should be identified and removed during data pre-processing.
D) Inconsistency in Quality of Input Data – 1) Inconsistency means data is not uniform or standardized. 2) E.g. Dates written in different formats (like 01-01-2025 & Jan 1, 2025) or product names written differently (like "TV" and "Television") cause confusion. 3) Inconsistent data reduces the quality of reports & leads to wrong conclusions.
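#Example (Python sketch): A few simple checks with pandas that show validation, incompleteness and inconsistency in practice. The table and its column names are assumptions made only for this illustration.
```python
import pandas as pd

# Hypothetical input table with a missing ID, wrongly formatted dates and inconsistent labels
data = pd.DataFrame({
    "customer_id": [101, 102, None, 104],
    "signup_date": ["01-01-2025", "2025/01/05", "15-01-2025", "not a date"],
    "product": ["TV", "Television", "Laptop", "TV"],
})

# Incompleteness: count missing required fields per column
print("Missing values per column:\n", data.isnull().sum())

# Validation: check that dates follow the expected DD-MM-YYYY format
parsed = pd.to_datetime(data["signup_date"], format="%d-%m-%Y", errors="coerce")
print("Rows with invalid dates:", data.loc[parsed.isna(), "signup_date"].tolist())

# Inconsistency: the same product written in different ways
print("Distinct product labels:", data["product"].unique())
```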
Q4) Data Transformation:- 1) Data transformation is the process of converting raw data into a clean & usable format so it can be used in reports and dashboards. 2) This step is very important because raw data collected from various sources is often in different formats. 3) Data transformation helps by reorganizing, converting or standardizing this data to make it easier to understand & analyse. 4) Example - Imagine you have a dataset with customer birthdates written as 'January 1, 2025', '01/01/2025' and '2025-01-01'. These formats are different, so using data transformation you convert all dates into one standard format like DD-MM-YYYY (01-01-2025).
#Process:- 1) First, the data is collected from different sources like databases, CSV files and APIs. 2) Next, the data is inspected to find inconsistencies and errors. 3) Then, appropriate transformation rules are applied, such as changing formats and converting data types. 4) After transformation, the data is stored or loaded into a data warehouse for analysis.
#Techniques for Data Transformation:- 1) Normalization - Scales numeric data to a common range like 0 to 1. It helps remove the effect of different units (e.g. dollars vs rupees) & improves performance. E.g. Income values like 10000, 100000 and 500000 can be rescaled (for example, divided by 100000) to 0.1, 1.0 and 5.0, or mapped into the 0-1 range with min-max scaling. 2) Encoding (Categorical to Numerical) - Converts text labels into numbers so they can be analysed or used in models. Text values (like Yes or No) cannot be used directly in calculations, so encoding helps. E.g. Convert 'Yes' = 1 and 'No' = 0. 3) Data Aggregation - Groups data and calculates summary values like totals and averages. It helps in understanding trends and making summarized reports. E.g. For sales data, group by region and calculate total sales per region.
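#Example (Python sketch): The three transformation techniques above (normalization, encoding, aggregation) on a small made-up sales table. Column names and values are placeholders for illustration only.
```python
import pandas as pd

# Hypothetical sales table used only to illustrate the three techniques
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "income": [10000, 100000, 500000, 250000],
    "is_member": ["Yes", "No", "Yes", "Yes"],
})

# 1) Normalization: scale income into the 0-1 range (min-max scaling)
sales["income_scaled"] = (sales["income"] - sales["income"].min()) / (
    sales["income"].max() - sales["income"].min()
)

# 2) Encoding: convert Yes/No text labels into 1/0
sales["is_member_encoded"] = sales["is_member"].map({"Yes": 1, "No": 0})

# 3) Aggregation: total income grouped by region
print(sales.groupby("region")["income"].sum())
```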
Q5) Data Reduction – 1) Data reduction is the process of minimizing the amount of data while still maintaining its meaning and usefulness. 2) It is useful when you are working with large datasets that are hard to store, process or analyse. 3) E.g. Imagine a company has customer data with 100 columns, but only 10 are important for sales analysis. 4) By reducing the data to those 10 useful columns, analysis becomes faster, cleaner & more focused.
#Techniques for Data Reduction: 1) Sampling - i) A basic yet powerful technique where a subset of data is selected from a large dataset to represent the whole. ii) The goal is to perform the analysis on this smaller portion, which still reflects the overall characteristics of the original data. iii) It saves time and computing resources, especially when working with massive datasets. 2) Feature Selection - i) The process of identifying and selecting the most relevant attributes in a dataset. ii) Irrelevant or redundant columns are removed. iii) This helps in reducing dimensionality, improves model performance and simplifies understanding. 3) Principal Component Analysis (PCA) - i) PCA is a mathematical technique used to transform high-dimensional data into a smaller set of variables called principal components. ii) These components are linear combinations of the original features but are uncorrelated and arranged so that the first few capture most of the variation in the data. iii) PCA is very useful when datasets have many correlated variables, such as financial or health data. iv) It helps visualize complex data in 2D or 3D and improves performance in analytics.
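#Example (Python sketch): Sampling and feature selection on a hypothetical 100-column customer table (the column names are placeholders). A PCA sketch follows the dimensionality reduction answer (Q6) below.
```python
import pandas as pd

# Hypothetical wide customer table: 100 columns, only a few matter for sales analysis
customers = pd.DataFrame({f"col_{i}": range(1000) for i in range(100)})

# 1) Sampling: analyse a 10% random subset instead of the full table
sample = customers.sample(frac=0.1, random_state=42)

# 2) Feature selection: keep only the columns needed for the analysis
useful_columns = ["col_0", "col_1", "col_2"]   # placeholder names for the "useful" columns
reduced = customers[useful_columns]

print(sample.shape, reduced.shape)
```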
Q7) Data Discretization – 1) Data discretization is the process of converting continuous data into discrete groups or intervals. 2) This is helpful in BI and data mining to simplify the data, make patterns easier to detect and improve the performance of algorithms. 3) E.g. Instead of using exact ages like 23, 24, 25, ..., we can discretize age into ranges like 18-25 = young, 26-35 = adult, 36-60 = senior.
#Methods of Data Discretization: 1) Binning - One of the most common discretization methods. It divides a continuous variable into a fixed number of bins, either of equal size or based on the data distribution. #Working - There are three types of binning: a) Equal-width binning: The range is divided into equal-sized intervals. E.g. For values 0-100 and 5 bins, the bins are 0-20, 21-40, 41-60, 61-80, 81-100. b) Equal-frequency binning: Each bin has the same number of data points regardless of their actual range. E.g. A list of 10 values divided into 2 bins gives each bin about 5 values. c) Clustering-based binning: Similar values are grouped using algorithms like K-Means. 2) Histogram-based discretization: In this method, a histogram is used to decide how to divide the data into intervals based on its distribution. Unlike simple binning, this approach looks at how frequently values occur and sets boundaries at natural gaps in the data. E.g. If most values are between 0-30, the histogram might create more bins in that range and fewer in ranges where the data is sparse.
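#Example (Python sketch): Equal-width, equal-frequency and labelled binning with pandas. The age values are invented; the labelled bins mirror the young/adult/senior example above.
```python
import pandas as pd

# Hypothetical ages to discretize
ages = pd.Series([19, 23, 24, 25, 31, 38, 45, 52, 58, 60])

# Equal-width binning: split the age range into 3 equally wide intervals
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: each bin gets roughly the same number of values
equal_freq = pd.qcut(ages, q=2)

# Labelled bins matching the example above (18-25 young, 26-35 adult, 36-60 senior)
labelled = pd.cut(ages, bins=[18, 25, 35, 60], labels=["young", "adult", "senior"])
print(labelled.value_counts())
```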
Q6) Dimensionality Reduction – 1) Dimensionality reduction aims to reduce the number of input variables in a dataset without losing key information. 2) High-dimensional data often causes problems like overfitting, increased storage cost and slower processing. 3) By reducing dimensions, we simplify the data structure, improve clarity in visualization and speed up ML processes. 4) Dimensionality reduction can be supervised or unsupervised.
#Data Compression: 1) Data compression is the process of reducing the physical size of a dataset to save space or speed up transfer. 2) Unlike sampling or feature reduction, compression focuses more on efficient storage and data handling. 3) It can be lossless (the original data is preserved) or lossy (some information is permanently removed). 4) Examples: tools like ZIP for files, or the ORC format in data warehousing.
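#Example (Python sketch): Reducing 10 correlated columns to 2 principal components with scikit-learn's PCA. The data is synthetic and generated only to show the idea of the technique.
```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 200 rows, 10 highly correlated numeric features (illustration only)
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])   # 10 correlated columns

# Reduce the 10 columns to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # share of variation captured by each component
```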
Q8) Data Exploration – 1) Data exploration is the process of understanding the structure, patterns and key features of a dataset before going into any deep analysis. 2) It's like getting to know your data: what kinds of values it contains, how complete and clean it is, etc. 3) This step is also called Exploratory Data Analysis and is a very important part of BI. 4) During data exploration, we look at: a) What types of data are present. b) How the data is distributed. c) Missing values or duplicates. d) Relationships between variables, and outliers. 5) It often involves summary statistics and visualizations such as histograms, scatter plots, etc. to make data patterns easier to see. 6) Example: Suppose you are analysing superstore sales data in Power BI. You might begin exploration like this: i) Understand data columns: columns like order ID, customer name, region, sales, profit, etc. ii) Check summary statistics: average sales = 5000, maximum profit = 15000, etc. iii) Detect missing values: you notice that some rows have a missing shipping date. iv) Visualize patterns: you create a bar chart showing that the technology category has the highest sales; a scatter plot shows that higher sales often lead to higher profits, but not always. v) Spot outliers: a few orders have huge sales amounts but zero profit, which could indicate an error. Data exploration acts as a bridge between raw data and meaningful business decisions. It ensures that you are not just working with data, but working with the right, clean data.
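#Example (Python sketch): The same exploration steps on a tiny made-up superstore-style table using pandas. Column names mirror the example above; the values are invented.
```python
import pandas as pd

# Hypothetical superstore-style table
superstore = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "category": ["Technology", "Furniture", "Technology", "Office"],
    "sales": [12000, 3000, 15000, 2500],
    "profit": [2000, 400, 0, 300],
    "ship_date": ["05-01-2025", None, "09-01-2025", "10-01-2025"],
})

print(superstore.dtypes)                               # what types of data are present
print(superstore.describe())                           # summary statistics (mean, min, max, ...)
print(superstore.isnull().sum())                       # missing values per column
print(superstore.groupby("category")["sales"].sum())   # which category sells the most
print(superstore[(superstore["sales"] > 10000) & (superstore["profit"] == 0)])  # possible outliers
```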
Q10) Univariate Analysis:- 1) Univariate analysis is the simplest form of data analysis. 2) It focuses on one variable at a time. 3) The main aim is to understand the basic characteristics of that single variable, like mean, mode, minimum and maximum. 4) Visualization tools like bar charts, pie charts, histograms, etc. are often used. 5) Example - If a company wants to analyse the ages of its customers, univariate analysis will help show the age distribution, i.e. whether customers are young, middle-aged or old. 6) Applications:- Surveys, customer profiling, and preparing data for further analysis.
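#Example (Python sketch): Univariate summary of a single made-up age column with pandas; the histogram line assumes matplotlib is installed.
```python
import pandas as pd

# Hypothetical customer ages, used only to illustrate univariate analysis
ages = pd.Series([22, 25, 25, 31, 35, 35, 35, 42, 58])

print(ages.mean(), ages.median(), ages.mode().iloc[0])     # central tendency of the single variable
print(ages.min(), ages.max())                              # range of values
ages.plot(kind="hist", bins=5, title="Age distribution")   # histogram (requires matplotlib)
```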
Q11) Bivariate Analysis:- 1) Bivariate analysis involves the analysis of two variables to understand the relationship or association between them. 2) This analysis is useful in comparison and correlation studies. 3) Techniques like scatter plots, correlation coefficients, etc. are commonly used. 4) E.g. A business may want to see if there is a connection between marketing spend and sales revenue. 5) If spending increases and sales also increase, the two variables have a positive relationship. #Applications:- Marketing, education, business.
#Need/Importance of Bivariate Analysis: 1) Bivariate analysis is important because it reveals how two variables interact. 2) This helps businesses and researchers make better decisions. 3) It also helps in identifying cause-effect relationships and predictive patterns.
#Types of Bivariate Analysis: 1) Numerical vs. Numerical: Both variables are numbers and we study their relationship using tools like scatter plots and correlation. E.g. height vs. weight. 2) Numerical vs. Categorical: One variable is numerical and the other is a category. We compare numeric values across different groups using bar charts / box plots. E.g. salary across job roles. 3) Categorical vs. Categorical: Both variables are categories. We check their association using contingency tables or the chi-square test. E.g. gender vs. product preference.
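#Example (Python sketch): The marketing spend vs. sales revenue example above, using a correlation coefficient and a scatter plot. The monthly figures are invented; the scatter plot assumes matplotlib is installed.
```python
import pandas as pd

# Hypothetical monthly figures for marketing spend vs. sales revenue
df = pd.DataFrame({
    "marketing_spend": [10, 20, 30, 40, 50],
    "sales_revenue": [100, 180, 310, 390, 520],
})

# A correlation coefficient close to +1 indicates a positive relationship
print(df["marketing_spend"].corr(df["sales_revenue"]))

# Scatter plot of the two variables
df.plot(kind="scatter", x="marketing_spend", y="sales_revenue")
```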
Q12) Multivariate Analysis:- 1) Multivariate analysis deals with three or more variables at the same time. 2) It helps in understanding complex relationships among several variables and is used in advanced analytics like ML, deep data insights, etc. 3) Techniques include multiple regression, PCA and cluster analysis. 4) Example: a business might analyse how age, income and education together affect the likelihood of purchasing a product. #Applications - Customer segmentation, risk modelling, predicting disease, etc.
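#Example (Python sketch): A multiple regression, one of the techniques named above, relating purchase amount to age, income and education at the same time. The data and column names are made up for illustration.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: does purchase amount depend on age, income and education together?
df = pd.DataFrame({
    "age": [25, 32, 40, 28, 51, 45],
    "income": [30000, 45000, 60000, 35000, 80000, 70000],
    "education_years": [12, 15, 16, 14, 18, 16],
    "purchase_amount": [200, 350, 500, 260, 720, 610],
})

# Multiple regression: one target explained by several variables at once
model = LinearRegression().fit(df[["age", "income", "education_years"]], df["purchase_amount"])
print(model.coef_, model.intercept_)
```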
Q9) Contingency Table & Marginal Distribution.
A) Contingency Table – 1) A contingency table is a type of table used in statistics to organize and display the relationship between two or more categorical variables. 2) It shows the frequency of observations that fall into each combination of categories. 3) This helps in understanding how one variable may be related to another. 4) Contingency tables are often used in data analysis, surveys and reports to identify patterns, trends and relationships. 5) Example: Suppose a company wants to analyse the relationship between Gender (Male / Female) & Product Preference (Electronics / Clothing). The data might be arranged in a 2x2 contingency table showing how many males & females prefer each product category. This gives a clear view of how preferences differ across gender & can help make business decisions.
B) Marginal Distribution – 1) It refers to the totals of each row and column in a contingency table. 2) It shows the distribution of only one variable, ignoring the effect of the others. 3) Marginal distributions help summarize data and give an overview of how the categories of a single variable are spread across the dataset. 4) These totals are usually found in the margins of the table, which is why it is called marginal. 5) Example: Using the same example of gender and product preference, the marginal distribution of gender is male = 30 and female = 40, and the marginal distribution of product is electronics = 35 and clothing = 35. These marginal totals are useful for calculating probabilities and percentages.
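#Example (Python sketch): Building the gender vs. product preference contingency table with marginal totals in pandas. The individual responses are invented, but they reproduce the totals used above (male = 30, female = 40, electronics = 35, clothing = 35).
```python
import pandas as pd

# Hypothetical survey responses matching the example above
survey = pd.DataFrame({
    "gender": ["Male"] * 30 + ["Female"] * 40,
    "preference": ["Electronics"] * 20 + ["Clothing"] * 10
                  + ["Electronics"] * 15 + ["Clothing"] * 25,
})

# Contingency table with the marginal distributions in the "All" row and column
table = pd.crosstab(survey["gender"], survey["preference"], margins=True)
print(table)
```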
Q13) Mean, Median, Mode.
A) Mean:- The mean is the sum of all values divided by the number of values. #Formula (grouped data):- Mean = Σ(f · x) / Σf, where x is the midpoint of each class and f is the class frequency.
B) Median:- The median is the middle value when the data is arranged in order. #Formula (grouped data):- Median = L + ((N/2 − CF) / f) · h, where L = lower boundary of the median class, N = total frequency, CF = cumulative frequency before the median class, f = frequency of the median class, h = class width (upper boundary − lower boundary).
C) Mode:- The mode is the value that appears most often. #Formula (grouped data):- Mode = L + ((f1 − f0) / (2f1 − f0 − f2)) · h, where L = lower boundary of the modal class, f1 = frequency of the modal class, f0 = frequency of the previous class, f2 = frequency of the next class, h = class width.
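#Example (Python sketch): A small worked application of the grouped-data formulas above. The class intervals and frequencies are invented; the class width h is 10.
```python
# Hypothetical grouped data: class intervals (lower, upper) and their frequencies
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freq = [5, 8, 12, 5]
n = sum(freq)       # N = total frequency
h = 10              # class width

# Mean = sum(f * midpoint) / sum(f)
midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(f * m for f, m in zip(freq, midpoints)) / n

# Median = L + ((N/2 - CF) / f) * h, using the class where cumulative frequency reaches N/2
cum = 0
for (lo, _hi), f in zip(classes, freq):
    if cum + f >= n / 2:
        median = lo + ((n / 2 - cum) / f) * h
        break
    cum += f

# Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h, using the class with the highest frequency
i = freq.index(max(freq))
f1 = freq[i]
f0 = freq[i - 1] if i > 0 else 0
f2 = freq[i + 1] if i < len(freq) - 1 else 0
mode = classes[i][0] + ((f1 - f0) / (2 * f1 - f0 - f2)) * h

print(mean, median, mode)   # approx. 20.67, 21.67, 23.64 for these made-up numbers
```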
Q14) How Mean, Median and Mode are used during data cleaning.
1) Mean:- Used to replace missing numeric values when the data is roughly normally distributed (no outliers). Risk: it is affected by outliers, which can skew the filled values. 2) Median:- Used to replace missing values in skewed distributions or when outliers are present. Advantage: it is not affected by outliers, so it gives a better central value in such cases. 3) Mode:- Used to fill in missing categorical values (like gender, country, etc.) and to detect and fix inconsistent labelling (e.g. "USA", "usa", "United States").
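#Example (Python sketch): Using mean, median and mode exactly as described above to fill gaps and fix labels during cleaning. The table, column names and values are hypothetical.
```python
import pandas as pd

# Hypothetical table with missing values and inconsistent country labels
df = pd.DataFrame({
    "age": [25, 30, None, 35, 40],                     # roughly symmetric -> mean
    "income": [40000, None, 55000, 900000, None],      # skewed by one very large value -> median
    "country": ["USA", "usa", None, "United States", "India"],
})

# Mean for roughly normal numeric columns, median when outliers skew the data
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Mode for categorical columns, after standardizing inconsistent labels
df["country"] = df["country"].replace({"usa": "USA", "United States": "USA"})
df["country"] = df["country"].fillna(df["country"].mode()[0])
print(df)
```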