BI Unit 4 Final
Q1) Data Pre-processing – 1) Data pre-processing is the step where raw data is cleaned, transformed & prepared before it is used in analysis, reporting or ML. 2) Pre-processing ensures that data is accurate, well-structured and ready for use.
#Need for Pre-processing – 1) In BI we often collect data from many sources like databases, sensors or files. 2) This raw data may contain missing values, duplicate records and errors, which can cause wrong results or errors in reports. 3) That's why data pre-processing is important. 4) It helps in cleaning data, handling missing or duplicate values, changing formats & converting data into a consistent structure.
#Data Pre-processing Techniques – 1) Data Cleaning - This step fixes or removes incorrect, incomplete, duplicate or irrelevant data. It also includes removing outliers & making sure column names & labels are consistent. E.g. If some entries in a table have missing prices, you can fill them using the average price. 2) Data Transformation - This changes data into the right format and structure for analysis. It also helps in merging data from different sources by making formats compatible. E.g. changing all date formats to DD-MM-YYYY (a small sketch of both examples follows below).
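#Example (Python sketch): A minimal illustration of the two examples above (filling a missing price with the average and standardizing date formats) using pandas. The table, column names and values are made-up placeholders, not from the notes; a fuller cleaning example follows Q2.
```python
import pandas as pd
from dateutil import parser

# Hypothetical product table; values and column names are made up for illustration
df = pd.DataFrame({
    "order_date": ["January 1, 2025", "01/01/2025", "2025-01-01"],
    "price": [250.0, None, 400.0],
})

# Data cleaning: fill the missing price using the average price
df["price"] = df["price"].fillna(df["price"].mean())

# Data transformation: bring every date into one standard DD-MM-YYYY format
df["order_date"] = df["order_date"].apply(lambda s: parser.parse(s).strftime("%d-%m-%Y"))
print(df)
```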
Q2) Data Cleaning – 1) Data cleaning is the process of removing or correcting inaccurate, incomplete, duplicate or irrelevant data from a dataset.
#Need for Data Cleaning – 1) In real-world scenarios, data often comes from various sources like files, sensors or databases. 2) This raw data may contain errors, missing values, duplicates or wrong formats. 3) If this unclean data is used in BI or analysis, it can lead to incorrect results and poor decision-making. 4) That's why data cleaning is necessary. It helps improve the accuracy, quality and reliability of data, which in turn improves the quality of reports.
#Methods of Data Cleaning – 1) Removing Duplicates - Finds and deletes records that appear more than once. E.g. If the same customer is listed twice in a database, keep only one entry. 2) Handling Missing Values - Fills missing data using techniques like mean, median or mode, or removes rows/columns if too many values are missing. 3) Standardizing Formats - Converts all data into a uniform format (like dates in DD-MM-YYYY or currency in INR). E.g. Change "Jan 1, 2025" to "01-01-2025". 4) Outlier Detection & Handling - Finds & handles unusual or extreme values that don't fit the overall pattern. E.g. If a product price is entered as 1,00,000 instead of 1000, it can be flagged or corrected.
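#Example (Python sketch): The four cleaning methods above applied with pandas. The orders table, names and prices are hypothetical and only meant to show the pattern of each step.
```python
import pandas as pd
from dateutil import parser

# Hypothetical customer orders with duplicates, a missing price, mixed date formats and an outlier
orders = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", "Meena", "John", "Sara", "Amit"],
    "order_date": ["Jan 1, 2025", "Jan 1, 2025", "2 January 2025", "2025-01-03",
                   "4 Jan 2025", "Jan 5, 2025", "2025-01-06"],
    "price": [1000, 1000, None, 100000, 1200, 900, 1100],
})

# 1) Removing duplicates: keep only one row per identical record
orders = orders.drop_duplicates()

# 2) Handling missing values: fill the missing price with the median price
orders["price"] = orders["price"].fillna(orders["price"].median())

# 3) Standardizing formats: convert every date to DD-MM-YYYY
orders["order_date"] = orders["order_date"].apply(lambda s: parser.parse(s).strftime("%d-%m-%Y"))

# 4) Outlier detection: flag prices far outside the interquartile range for review
q1, q3 = orders["price"].quantile([0.25, 0.75])
iqr = q3 - q1
orders["price_outlier"] = (orders["price"] < q1 - 1.5 * iqr) | (orders["price"] > q3 + 1.5 * iqr)
print(orders)
```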
Q3) Data Validation, Incompleteness, Noise, Inconsistency in Quality of Input Data.
A) Data Validation – 1) Data validation is the process of checking whether input data is correct, meaningful & useful before it is used in a report or analysis. 2) It ensures that data follows rules & formats, like proper dates, correct values and no missing fields. 3) Without validation, incorrect data may enter the system & cause errors or misleading results in reports.
B) Incompleteness – 1) Incomplete data means that some values are missing in the dataset. 2) This can happen due to human errors, data collection issues or system failures. 3) If not handled properly, incomplete data can lead to inaccurate reports or misleading insights. 4) It is important to either fill missing values or remove incomplete entries during cleaning.
C) Noise – 1) Noise refers to random or meaningless data that does not follow the pattern of the other data points. 2) Noise can affect the quality of reports & make it hard to find real patterns in the data. 3) It should be identified and removed during data pre-processing.
D) Inconsistency in Quality of Input Data – 1) Inconsistency means data is not uniform or standardized. 2) E.g. Dates written in different formats (like 01-01-2025 & Jan 1, 2025) or product names written differently (like "TV" and "Television") cause confusion. 3) Inconsistent data reduces the quality of reports & leads to wrong conclusions.
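#Example (Python sketch): A few simple checks with pandas that show validation, incompleteness and inconsistency in practice. The table and its column names are assumptions made only for this illustration.
```python
import pandas as pd

# Hypothetical input table with a missing ID, wrongly formatted dates and inconsistent labels
data = pd.DataFrame({
    "customer_id": [101, 102, None, 104],
    "signup_date": ["01-01-2025", "2025/01/05", "15-01-2025", "not a date"],
    "product": ["TV", "Television", "Laptop", "TV"],
})

# Incompleteness: count missing required fields per column
print("Missing values per column:\n", data.isnull().sum())

# Validation: check that dates follow the expected DD-MM-YYYY format
parsed = pd.to_datetime(data["signup_date"], format="%d-%m-%Y", errors="coerce")
print("Rows with invalid dates:", data.loc[parsed.isna(), "signup_date"].tolist())

# Inconsistency: the same product written in different ways
print("Distinct product labels:", data["product"].unique())
```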
Q4) Data Transformation:- 1) Data transformation is the process of converting raw data into a clean & usable format so it can be used in reports and dashboards. 2) This step is very important because raw data collected from various sources is often in different formats. 3) Data transformation helps by reorganizing, converting or standardizing this data to make it easier to understand & analyse. 4) Example - Imagine you have a dataset with customer birthdates written as 'January 1, 2025', '01/01/2025' and '2025-01-01'. These formats are different, so using data transformation you convert all dates into one standard format like DD-MM-YYYY (01-01-2025).
#Process:- 1) First, the data is collected from different sources like databases, CSV files and APIs. 2) Next, the data is inspected to find inconsistencies and errors. 3) Then, appropriate transformation rules are applied, such as changing formats and converting data types. 4) After transformation, the data is stored or loaded into a data warehouse for analysis.
#Techniques for Data Transformation:- 1) Normalization - Scales numeric data to a common range like 0 to 1. It helps remove the effect of different units (e.g. dollars vs rupees) & improves performance. E.g. Income values like 10000, 100000 and 500000 can be rescaled (for example, divided by 100000) to 0.1, 1.0 and 5.0, or mapped into the 0-1 range with min-max scaling. 2) Encoding (Categorical to Numerical) - Converts text labels into numbers so they can be analysed or used in models. Text values (like Yes or No) cannot be used directly in calculations, so encoding helps. E.g. Convert 'Yes' = 1 and 'No' = 0. 3) Data Aggregation - Groups data and calculates summary values like totals and averages. It helps in understanding trends and making summarized reports. E.g. For sales data, group by region and calculate total sales per region.
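#Example (Python sketch): The three transformation techniques above (normalization, encoding, aggregation) on a small made-up sales table. Column names and values are placeholders for illustration only.
```python
import pandas as pd

# Hypothetical sales table used only to illustrate the three techniques
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "income": [10000, 100000, 500000, 250000],
    "is_member": ["Yes", "No", "Yes", "Yes"],
})

# 1) Normalization: scale income into the 0-1 range (min-max scaling)
sales["income_scaled"] = (sales["income"] - sales["income"].min()) / (
    sales["income"].max() - sales["income"].min()
)

# 2) Encoding: convert Yes/No text labels into 1/0
sales["is_member_encoded"] = sales["is_member"].map({"Yes": 1, "No": 0})

# 3) Aggregation: total income grouped by region
print(sales.groupby("region")["income"].sum())
```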
Q5) Data Reduction – 1) Data reduction is the process of minimizing the amount of data while still maintaining its meaning and usefulness. 2) It is useful when you are working with large datasets that are hard to store, process or analyse. 3) E.g. Imagine a company has customer data with 100 columns, but only 10 are important for sales analysis. 4) By reducing the data to those 10 useful columns, analysis becomes faster, cleaner & more focused.
#Techniques for Data Reduction: 1) Sampling - i) A basic yet powerful technique where a subset of data is selected from a large dataset to represent the whole. ii) The goal is to perform the analysis on this smaller portion, which still reflects the overall characteristics of the original data. iii) It saves time and computing resources, especially when working with massive datasets. 2) Feature Selection - i) The process of identifying and selecting the most relevant attributes in a dataset. ii) Irrelevant or redundant columns are removed. iii) This helps in reducing dimensionality, improves model performance and simplifies understanding. 3) Principal Component Analysis (PCA) - i) PCA is a mathematical technique used to transform high-dimensional data into a smaller set of variables called principal components. ii) These components are linear combinations of the original features but are uncorrelated and arranged so that the first few capture most of the variation in the data. iii) PCA is very useful when datasets have many correlated variables, such as financial or health data. iv) It helps visualize complex data in 2D or 3D and improves performance in analytics.
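#Example (Python sketch): Sampling and feature selection on a hypothetical 100-column customer table (the column names are placeholders). A PCA sketch follows the dimensionality reduction answer (Q6) below.
```python
import pandas as pd

# Hypothetical wide customer table: 100 columns, only a few matter for sales analysis
customers = pd.DataFrame({f"col_{i}": range(1000) for i in range(100)})

# 1) Sampling: analyse a 10% random subset instead of the full table
sample = customers.sample(frac=0.1, random_state=42)

# 2) Feature selection: keep only the columns needed for the analysis
useful_columns = ["col_0", "col_1", "col_2"]   # placeholder names for the "useful" columns
reduced = customers[useful_columns]

print(sample.shape, reduced.shape)
```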
Q7) Data Discretization – 1) Data discretization is the process of converting continuous data into discrete groups or intervals. 2) This is helpful in BI and data mining to simplify the data, make patterns easier to detect and improve the performance of algorithms. 3) E.g. Instead of using exact ages like 23, 24, 25, ..., we can discretize age into ranges like 18-25 = young, 26-35 = adult, 36-60 = senior.
#Methods of Data Discretization: 1) Binning - One of the most common discretization methods. It divides a continuous variable into a fixed number of bins, either of equal size or based on the data distribution. #Working - There are three types of binning: a) Equal-width binning: The range is divided into equal-sized intervals. E.g. For values 0-100 and 5 bins, the bins are 0-20, 21-40, 41-60, 61-80, 81-100. b) Equal-frequency binning: Each bin has the same number of data points regardless of their actual range. E.g. A list of 10 values divided into 2 bins gives each bin about 5 values. c) Clustering-based binning: Similar values are grouped using algorithms like K-Means. 2) Histogram-based discretization: In this method, a histogram is used to decide how to divide the data into intervals based on its distribution. Unlike simple binning, this approach looks at how frequently values occur and sets boundaries at natural gaps in the data. E.g. If most values are between 0-30, the histogram might create more bins in that range and fewer in ranges where the data is sparse.
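#Example (Python sketch): Equal-width, equal-frequency and labelled binning with pandas. The age values are invented; the labelled bins mirror the young/adult/senior example above.
```python
import pandas as pd

# Hypothetical ages to discretize
ages = pd.Series([19, 23, 24, 25, 31, 38, 45, 52, 58, 60])

# Equal-width binning: split the age range into 3 equally wide intervals
equal_width = pd.cut(ages, bins=3)

# Equal-frequency binning: each bin gets roughly the same number of values
equal_freq = pd.qcut(ages, q=2)

# Labelled bins matching the example above (18-25 young, 26-35 adult, 36-60 senior)
labelled = pd.cut(ages, bins=[18, 25, 35, 60], labels=["young", "adult", "senior"])
print(labelled.value_counts())
```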
Q6) Dimensionality Reduction – 1) Dimensionality reduction aims to reduce the number of input variables in a dataset without losing key information. 2) High-dimensional data often causes problems like overfitting, increased storage cost and slower processing. 3) By reducing dimensions, we simplify the data structure, improve clarity in visualization and speed up ML processes. 4) Dimensionality reduction can be supervised or unsupervised.
#Data Compression: 1) Data compression is the process of reducing the physical size of a dataset to save space or speed up transfer. 2) Unlike sampling or feature reduction, compression focuses more on efficient storage and data handling. 3) It can be lossless (the original data is preserved) or lossy (some information is permanently removed). 4) Examples: tools like ZIP for files, or the ORC format in data warehousing.
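#Example (Python sketch): Reducing 10 correlated columns to 2 principal components with scikit-learn's PCA. The data is synthetic and generated only to show the idea of the technique.
```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 200 rows, 10 highly correlated numeric features (illustration only)
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])   # 10 correlated columns

# Reduce the 10 columns to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # share of variation captured by each component
```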
Q8) Data Exploration – 1) Data exploration is the process of understanding the structure, patterns and key features of a dataset before going into any deep analysis. 2) It's like getting to know your data: what kinds of values it contains, how complete and clean it is, etc. 3) This step is also called Exploratory Data Analysis and is a very important part of BI. 4) During data exploration, we look at: a) What types of data are present. b) How the data is distributed. c) Missing values or duplicates. d) Relationships between variables, and outliers. 5) It often involves summary statistics and visualizations such as histograms, scatter plots, etc. to make data patterns easier to see. 6) Example: Suppose you are analysing superstore sales data in Power BI. You might begin exploration like this: i) Understand data columns: columns like order ID, customer name, region, sales, profit, etc. ii) Check summary statistics: average sales = 5000, maximum profit = 15000, etc. iii) Detect missing values: you notice that some rows have a missing shipping date. iv) Visualize patterns: you create a bar chart showing that the technology category has the highest sales; a scatter plot shows that higher sales often lead to higher profits, but not always. v) Spot outliers: a few orders have huge sales amounts but zero profit, which could indicate an error. Data exploration acts as a bridge between raw data and meaningful business decisions. It ensures that you are not just working with data, but working with the right, clean data.
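#Example (Python sketch): The same exploration steps on a tiny made-up superstore-style table using pandas. Column names mirror the example above; the values are invented.
```python
import pandas as pd

# Hypothetical superstore-style table
superstore = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "category": ["Technology", "Furniture", "Technology", "Office"],
    "sales": [12000, 3000, 15000, 2500],
    "profit": [2000, 400, 0, 300],
    "ship_date": ["05-01-2025", None, "09-01-2025", "10-01-2025"],
})

print(superstore.dtypes)                               # what types of data are present
print(superstore.describe())                           # summary statistics (mean, min, max, ...)
print(superstore.isnull().sum())                       # missing values per column
print(superstore.groupby("category")["sales"].sum())   # which category sells the most
print(superstore[(superstore["sales"] > 10000) & (superstore["profit"] == 0)])  # possible outliers
```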
Q10) Univariate Analysis:- 1) Univariate analysis is the simplest form of data analysis. 2) It focuses on one variable at a time. 3) The main aim is to understand the basic characteristics of that single variable, like mean, mode, minimum and maximum. 4) Visualization tools like bar charts, pie charts, histograms, etc. are often used. 5) Example - If a company wants to analyse the ages of its customers, univariate analysis will help show the age distribution, i.e. whether customers are young, middle-aged or old. 6) Applications:- Surveys, customer profiling, and preparing data for further analysis.
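#Example (Python sketch): Univariate summary of a single made-up age column with pandas; the histogram line assumes matplotlib is installed.
```python
import pandas as pd

# Hypothetical customer ages, used only to illustrate univariate analysis
ages = pd.Series([22, 25, 25, 31, 35, 35, 35, 42, 58])

print(ages.mean(), ages.median(), ages.mode().iloc[0])     # central tendency of the single variable
print(ages.min(), ages.max())                              # range of values
ages.plot(kind="hist", bins=5, title="Age distribution")   # histogram (requires matplotlib)
```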
Q11) Bivariate Analysis:- 1) Bivariate analysis involves the analysis of two variables to understand the relationship or association between them. 2) This analysis is useful in comparison and correlation studies. 3) Techniques like scatter plots, correlation coefficients, etc. are commonly used. 4) E.g. A business may want to see if there is a connection between marketing spend and sales revenue. 5) If spending increases and sales also increase, the two variables have a positive relationship. #Applications:- Marketing, education, business.
#Need/Importance of Bivariate Analysis: 1) Bivariate analysis is important because it reveals how two variables interact. 2) This helps businesses and researchers make better decisions. 3) It also helps in identifying cause-effect relationships and predictive patterns.
#Types of Bivariate Analysis: 1) Numerical vs. Numerical: Both variables are numbers and we study their relationship using tools like scatter plots and correlation. E.g. height vs. weight. 2) Numerical vs. Categorical: One variable is numerical and the other is a category. We compare numeric values across different groups using bar charts / box plots. E.g. salary across job roles. 3) Categorical vs. Categorical: Both variables are categories. We check their association using contingency tables or the chi-square test. E.g. gender vs. product preference.
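#Example (Python sketch): The marketing spend vs. sales revenue example above, using a correlation coefficient and a scatter plot. The monthly figures are invented; the scatter plot assumes matplotlib is installed.
```python
import pandas as pd

# Hypothetical monthly figures for marketing spend vs. sales revenue
df = pd.DataFrame({
    "marketing_spend": [10, 20, 30, 40, 50],
    "sales_revenue": [100, 180, 310, 390, 520],
})

# A correlation coefficient close to +1 indicates a positive relationship
print(df["marketing_spend"].corr(df["sales_revenue"]))

# Scatter plot of the two variables
df.plot(kind="scatter", x="marketing_spend", y="sales_revenue")
```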
Q12) Multivariate Analysis:- 1) Multivariate analysis deals with three or more variables at the same time. 2) It helps in understanding complex relationships among several variables and is used in advanced analytics like ML, deep data insights, etc. 3) Techniques include multiple regression, PCA and cluster analysis. 4) Example: a business might analyse how age, income and education together affect the likelihood of purchasing a product. #Applications - Customer segmentation, risk modelling, predicting disease, etc.
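#Example (Python sketch): A multiple regression, one of the techniques named above, relating purchase amount to age, income and education at the same time. The data and column names are made up for illustration.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: does purchase amount depend on age, income and education together?
df = pd.DataFrame({
    "age": [25, 32, 40, 28, 51, 45],
    "income": [30000, 45000, 60000, 35000, 80000, 70000],
    "education_years": [12, 15, 16, 14, 18, 16],
    "purchase_amount": [200, 350, 500, 260, 720, 610],
})

# Multiple regression: one target explained by several variables at once
model = LinearRegression().fit(df[["age", "income", "education_years"]], df["purchase_amount"])
print(model.coef_, model.intercept_)
```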
Q9) Contingency Table & Marginal Distribution.
A) Contingency Table – 1) A contingency table is a type of table used in statistics to organize and display the relationship between two or more categorical variables. 2) It shows the frequency of observations that fall into each combination of categories. 3) This helps in understanding how one variable may be related to another. 4) Contingency tables are often used in data analysis, surveys and reports to identify patterns, trends and relationships. 5) Example: Suppose a company wants to analyse the relationship between Gender (Male / Female) & Product Preference (Electronics / Clothing). The data might be arranged in a 2x2 contingency table showing how many males & females prefer each product category. This gives a clear view of how preferences differ across gender & can help make business decisions.
B) Marginal Distribution – 1) It refers to the totals of each row and column in a contingency table. 2) It shows the distribution of only one variable, ignoring the effect of the others. 3) Marginal distributions help summarize data and give an overview of how the categories of a single variable are spread across the dataset. 4) These totals are usually found in the margins of the table, which is why it is called marginal. 5) Example: Using the same example of gender and product preference, the marginal distribution of gender is male = 30 and female = 40, and the marginal distribution of product is electronics = 35 and clothing = 35. These marginal totals are useful for calculating probabilities and percentages.
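#Example (Python sketch): Building the gender vs. product preference contingency table with marginal totals in pandas. The individual responses are invented, but they reproduce the totals used above (male = 30, female = 40, electronics = 35, clothing = 35).
```python
import pandas as pd

# Hypothetical survey responses matching the example above
survey = pd.DataFrame({
    "gender": ["Male"] * 30 + ["Female"] * 40,
    "preference": ["Electronics"] * 20 + ["Clothing"] * 10
                  + ["Electronics"] * 15 + ["Clothing"] * 25,
})

# Contingency table with the marginal distributions in the "All" row and column
table = pd.crosstab(survey["gender"], survey["preference"], margins=True)
print(table)
```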
Q13) Mean, Median, Mode.
A) Mean:- The mean is the sum of all values divided by the number of values. #Formula (grouped data):- Mean = Σ(f · x) / Σf, where x is the midpoint of each class and f is the class frequency.
B) Median:- The median is the middle value when the data is arranged in order. #Formula (grouped data):- Median = L + ((N/2 − CF) / f) · h, where L = lower boundary of the median class, N = total frequency, CF = cumulative frequency before the median class, f = frequency of the median class, h = class width (upper boundary − lower boundary).
C) Mode:- The mode is the value that appears most often. #Formula (grouped data):- Mode = L + ((f1 − f0) / (2f1 − f0 − f2)) · h, where L = lower boundary of the modal class, f1 = frequency of the modal class, f0 = frequency of the previous class, f2 = frequency of the next class, h = class width.
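#Example (Python sketch): A small worked application of the grouped-data formulas above. The class intervals and frequencies are invented; the class width h is 10.
```python
# Hypothetical grouped data: class intervals (lower, upper) and their frequencies
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freq = [5, 8, 12, 5]
n = sum(freq)       # N = total frequency
h = 10              # class width

# Mean = sum(f * midpoint) / sum(f)
midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(f * m for f, m in zip(freq, midpoints)) / n

# Median = L + ((N/2 - CF) / f) * h, using the class where cumulative frequency reaches N/2
cum = 0
for (lo, _hi), f in zip(classes, freq):
    if cum + f >= n / 2:
        median = lo + ((n / 2 - cum) / f) * h
        break
    cum += f

# Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h, using the class with the highest frequency
i = freq.index(max(freq))
f1 = freq[i]
f0 = freq[i - 1] if i > 0 else 0
f2 = freq[i + 1] if i < len(freq) - 1 else 0
mode = classes[i][0] + ((f1 - f0) / (2 * f1 - f0 - f2)) * h

print(mean, median, mode)   # approx. 20.67, 21.67, 23.64 for these made-up numbers
```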
Q14) How Mean, Median and Mode are used during data cleaning.
1) Mean:- Used to replace missing numeric values when the data is roughly normally distributed (no outliers). Risk: it is affected by outliers, which can skew the filled values. 2) Median:- Used to replace missing values in skewed distributions or when outliers are present. Advantage: it is not affected by outliers, so it gives a better central value in such cases. 3) Mode:- Used to fill in missing categorical values (like gender, country, etc.) and to detect and fix inconsistent labelling (e.g. "USA", "usa", "United States").
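#Example (Python sketch): Using mean, median and mode exactly as described above to fill gaps and fix labels during cleaning. The table, column names and values are hypothetical.
```python
import pandas as pd

# Hypothetical table with missing values and inconsistent country labels
df = pd.DataFrame({
    "age": [25, 30, None, 35, 40],                     # roughly symmetric -> mean
    "income": [40000, None, 55000, 900000, None],      # skewed by one very large value -> median
    "country": ["USA", "usa", None, "United States", "India"],
})

# Mean for roughly normal numeric columns, median when outliers skew the data
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Mode for categorical columns, after standardizing inconsistent labels
df["country"] = df["country"].replace({"usa": "USA", "United States": "USA"})
df["country"] = df["country"].fillna(df["country"].mode()[0])
print(df)
```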