
BI Vimlu Virginkar Services

Unit 3: Data Analysis and Visualization

a) Drill Up & Drill Down:

1. Drill Down:
○ Drill down involves moving from a higher level of summary data to a more detailed level.
○ For example, in financial reporting, if you start with the total revenue for a company, you can drill down to see revenue by region, then by country, and then by city.
○ In a data visualization tool or dashboard, this might involve clicking on a chart element representing a higher-level category to reveal more detailed data beneath it.
○ It provides users with the ability to explore data and identify specific trends or outliers at lower levels of granularity.
2. Drill Up:
○ Drill up, on the other hand, involves moving from a detailed level of data to a higher, summary level.
○ Using the previous example, after drilling down to see revenue by city, you might drill up to see revenue by country, then by region, and finally back to the total revenue for the company.
○ This allows users to maintain context and understand how individual data points contribute to the overall picture.
○ Drill up is particularly useful for understanding trends and patterns at higher levels of aggregation.

b) Multidimensional Data Model: A multidimensional data model organizes data into multiple dimensions, allowing users to analyze and explore data from different perspectives. It is commonly used in data warehousing and OLAP (Online Analytical Processing) systems.

● Example:
○ Dimensions:
■ Product: Categories, subcategories, brands.
■ Time: Year, quarter, month, day.
■ Location: Country, region, city.
○ Facts:
■ Sales revenue, quantity sold, profit.
● Using this model, users can analyze sales performance by various dimensions.

c) Data Grouping and Sorting:

● Data Grouping:
○ Data grouping involves arranging data into logical categories or groups based on common attributes. It helps in organizing and summarizing data for analysis.
○ Example: In a sales dataset, you can group sales data by product category to calculate total sales revenue for each category (see the pandas sketch below).
● Data Sorting:
○ Data sorting involves arranging data in a specified order based on one or more criteria, such as alphabetical order or numerical order.
○ Example: Sorting a list of customer names alphabetically.
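To illustrate grouping and sorting concretely, here is a minimal pandas sketch on a small, made-up sales table; the column names and figures are illustrative assumptions, not data from these notes:

import pandas as pd

sales = pd.DataFrame({
    "category": ["Electronics", "Clothing", "Electronics", "Grocery", "Clothing"],
    "revenue": [1200, 300, 800, 150, 450],
})

# Data grouping: total sales revenue per product category
revenue_by_category = sales.groupby("category")["revenue"].sum()

# Data sorting: arrange the categories from highest to lowest revenue
print(revenue_by_category.sort_values(ascending=False))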
d) Different Types of Reports:

1. Tabular Reports:
○ Tabular reports present data in rows and columns, similar to a spreadsheet.
○ They are used for detailed, structured data presentation.
○ Example: Sales report showing products, quantities sold, and revenue.
2. Summary Reports:
○ Summary reports provide aggregated data, typically in the form of totals, averages, or other summary statistics.
○ They offer a high-level overview of key metrics.
○ Example: Monthly sales summary showing total revenue and average order value.
3. Drill-Down Reports:
○ Drill-down reports allow users to navigate from summary information to detailed data.
○ They provide interactive capabilities for exploring data at different levels of detail.
○ Example: Financial report allowing users to drill down from total revenue to revenue by product category, then by region.
4. Dashboard Reports:
○ Dashboard reports present multiple visualizations and key performance indicators (KPIs) on a single screen.
○ They provide a comprehensive view of business performance at a glance.
○ Example: Sales dashboard showing revenue trends, top-selling products, and customer satisfaction scores.
5. Ad Hoc Reports:
○ Ad hoc reports are customizable reports generated on-the-fly to meet specific user requirements.
○ Users can define criteria, select data fields, and format the report as needed.
○ Example: Customized sales report showing revenue by product category and region for a specific time period.

e) Relational Data Model: The relational data model organizes data into tables (relations) consisting of rows and columns, where each row represents a record and each column represents an attribute. Relationships between tables are established through keys.

● Example:
○ Tables:
■ Books: Contains information about books, with columns for book ID, title, author, and genre.
■ Authors: Contains information about authors, with columns for author ID, name, and nationality.
■ Members: Contains information about library members, with columns for member ID, name, and contact information.
■ Borrowings: Contains information about books borrowed by members, with columns for borrowing ID, book ID, member ID, borrow date, and return date.
○ In this example, the tables are related as follows:
■ Each book has one author, while an author can write many books, establishing a one-to-many relationship between the Authors and Books tables.
■ Each borrowing is associated with one book and one member, establishing one-to-many relationships between the Borrowings table and the Books and Members tables.

f) Filtering Reports: Filtering reports involves applying criteria to data to display only the information that meets specific conditions. It helps in focusing on relevant data and excluding irrelevant or unwanted data from the report.

● Example: In a sales report, you can filter data to show sales only for a specific time period, particular product category, or target market segment. Filtering reports enhance data analysis by allowing users to customize views based on their requirements and make informed decisions.

g) Best Practices in Dashboard Design:

1. Clarity and Simplicity:
○ Design dashboards with a clear and simple layout to avoid overwhelming users with unnecessary information. Use concise labels and intuitive visualizations.
2. Consistent Design:
○ Maintain consistency in design elements such as colors, fonts, and layout across the dashboard to provide a cohesive user experience.
3. Relevant Information:
○ Include only relevant data and key performance indicators (KPIs) aligned with the dashboard's purpose. Avoid cluttering the dashboard with excessive details.
4. Interactivity:
○ Incorporate interactive features such as drill-down capabilities, filters, and tooltips to enable users to explore data and gain deeper insights.
5. Responsive Design:
○ Ensure that the dashboard is responsive and adapts to different screen sizes and devices for an optimal viewing experience.
6. Feedback Mechanism:
○ Provide feedback mechanisms such as notifications or alerts to keep users informed about important updates or changes in data.

h) Difference between Relational and Multidimensional Data Model:

1. Structure:
○ Relational Data Model: Organizes data into tables with rows and columns, where each table represents an entity and relationships between entities are established using keys.
○ Multidimensional Data Model: Organizes data into multiple dimensions, with each dimension representing a different attribute or aspect of the data.
2. Complexity:
○ Relational Data Model: Supports complex relationships between entities, allowing for flexible querying and analysis of data.
○ Multidimensional Data Model: Simplifies data analysis by pre-aggregating data along different dimensions, making it easier to analyze data from various perspectives.
3. Querying:
○ Relational Data Model: Queries involve joining tables based on common keys to retrieve data.
○ Multidimensional Data Model: Queries involve slicing and dicing data along different dimensions to analyze subsets of data.
4. Usage:
○ Relational Data Model: Commonly used in transactional databases and OLTP (Online Transaction Processing) systems.
○ Multidimensional Data Model: Commonly used in analytical databases and OLAP (Online Analytical Processing) systems for decision support and business intelligence purposes.

i) Use of Data Grouping & Sorting, Filtering Reports:

1. Data Grouping & Sorting:
○ Use: Data grouping and sorting are used to organize and present data in a structured manner, making it easier to analyze and interpret.
○ Example: In a sales report, you can group sales data by product category to analyze total revenue generated by each category. Sorting the products based on revenue allows you to identify top-selling categories.
2. Filtering Reports:
○ Use: Filtering reports help in focusing on specific subsets of data based on user-defined criteria, providing customized views.
○ Example: In a customer feedback report, you can filter feedback responses to display only those related to product quality issues. This allows management to address specific areas of concern efficiently.

j) File Extension: A file extension is a suffix attached to the end of a filename, indicating the format or type of the file. It helps operating systems and applications identify the file's contents and determine how to handle it.

● Structure of CSV File (Comma-Separated Values):
○ A CSV file is a plain text file format used to store tabular data, with each line representing a row of data and commas separating values within each row.

Example:
Name, Age, Gender
John, 25, Male
Jane, 30, Female

● In this example:
○ Each row represents a record, with values separated by commas.
○ The first row often contains headers, indicating the names of columns.
○ Values can be enclosed in quotes if they contain special characters or spaces.
contain special characters or spaces.
● CSV files are widely used for exchanging data between different applications and systems due to their simplicity and ease of use.
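As a small illustration (not part of the original notes), the file above could be read with Python's built-in csv module; the filename people.csv is a hypothetical name:

import csv

with open("people.csv", newline="") as f:
    reader = csv.DictReader(f, skipinitialspace=True)  # the first row is used as the header
    for row in reader:
        print(row["Name"], row["Age"], row["Gender"])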

k) Charts: Charts are graphical representations of data, used to visually illustrate trends, patterns, and relationships within datasets. Different types of charts are used based on the nature of the data and the insights to be communicated.

● Different Types of Charts: Bar Chart, Line Chart, Pie Chart, Scatter Plot, Histogram, Area Chart, Bubble Chart, Box Plot, Radar Chart, Gantt Chart

Pie Chart:

● Description: A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions. Each slice represents a proportion of the whole, and the size of each slice is proportional to the quantity it represents.
● Components:
○ Slices: Each slice represents a
category or segment of the data.
○ Labels: Labels are used to identify
each slice and its corresponding
category.
○ Center: Often, additional
information such as the total value
or percentage of each category is
displayed in the center of the pie
chart.
● Use and Application: Pie charts are
commonly used to show the
composition of a whole and highlight
the relative proportions of different
categories within a dataset. They are
effective for visualizing data with a
small number of categories and for
conveying percentages or proportions.
However, they may not be suitable for
displaying large datasets with many
categories or for comparing individual
values within each category.
● Example: A pie chart can be used to illustrate the distribution of sales revenue across different product categories, where each slice represents the revenue generated by a specific category, and the entire pie represents the total revenue.
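A minimal matplotlib sketch of such a pie chart, using made-up category names and revenue figures, might look like this:

import matplotlib.pyplot as plt

categories = ["Electronics", "Clothing", "Grocery", "Furniture"]
revenue = [45000, 25000, 20000, 10000]

# Each slice is sized by its share of the total revenue
plt.pie(revenue, labels=categories, autopct="%1.1f%%")
plt.title("Sales Revenue by Product Category")
plt.show()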
Unit 4

a) Data Exploration:

Data exploration is the initial step in data analysis where the primary focus is on understanding the characteristics of the dataset. It involves summarizing the main characteristics of the data, often using visualization techniques and statistical methods. The goal is to gain insights into the underlying structure, patterns, distributions, and relationships within the data.

Example: Let's say we have a dataset containing information about housing prices in a certain city. To explore this dataset, we might perform the following steps:

1. Summary Statistics: Calculate summary statistics such as mean, median, standard deviation, minimum, maximum, and quartiles for variables like house price, square footage, number of bedrooms, etc.
2. Data Visualization: Create visualizations such as histograms for continuous variables (e.g., house price distribution), box plots to identify outliers, scatter plots to explore relationships between variables (e.g., house price vs. square footage), and a heatmap to visualize correlations between variables.
3. Data Cleaning: Identify and handle missing values, outliers, and inconsistencies in the data. This may involve imputing missing values, removing outliers, and correcting errors.
4. Feature Engineering: Derive new features from existing ones if necessary. For example, creating a new feature like price per square foot by dividing house price by square footage.
5. Exploratory Data Analysis (EDA): Perform in-depth analysis to uncover patterns or trends in the data. This may involve segmenting the data based on different criteria (e.g., location, house type) and comparing distributions or relationships within each segment.

By exploring the data, we can gain a better understanding of factors influencing housing prices in the city and make informed decisions in subsequent analysis or modeling tasks.
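A minimal pandas sketch of steps 1, 2 and 4 on a small, made-up housing table (column names such as price and sqft are assumptions, not data from these notes):

import pandas as pd

housing = pd.DataFrame({
    "price": [250000, 320000, 180000, 410000, 295000],
    "sqft": [1400, 1800, 1100, 2400, 1600],
    "bedrooms": [3, 4, 2, 5, 3],
})

print(housing.describe())   # step 1: summary statistics (mean, std, quartiles, ...)
print(housing.corr())       # correlations between the variables
housing["price_per_sqft"] = housing["price"] / housing["sqft"]   # step 4: feature engineering
housing.hist(column="price")   # step 2: histogram of house prices (needs matplotlib installed)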
b) Data Transformation:

Data transformation involves converting raw data into a more suitable format for analysis or modeling. This process may include normalization, standardization, encoding categorical variables, and creating new features through mathematical operations or transformations.

Example: Consider a dataset containing information about students' exam scores in different subjects. Here's how we might perform data transformation:

1. Normalization: Scale numerical features to a standard range, such as between 0 and 1, to ensure that all variables contribute equally to the analysis. For instance, we can normalize exam scores using Min-Max scaling.
2. Standardization: Standardize numerical features to have a mean of 0 and a standard deviation of 1. This is particularly useful for algorithms that assume normally distributed data, such as linear regression. We can standardize exam scores using Z-score normalization.
3. Encoding Categorical Variables: Convert categorical variables into numerical representations that can be understood by machine learning algorithms. For example, we can use one-hot encoding to represent students' grade levels (e.g., freshman, sophomore, junior, senior) as binary variables.
4. Feature Transformation: Create new features by applying mathematical transformations to existing ones. For instance, we can calculate the logarithm of exam scores to reduce skewness in the data.
5. Handling Text Data: Process and tokenize text data to extract meaningful features, such as word frequencies or TF-IDF scores, for natural language processing tasks.

After data transformation, the dataset is ready for analysis or modeling, with features that are standardized, encoded appropriately, and possibly augmented with new derived features.
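A minimal sketch of steps 1-3 using pandas and scikit-learn on made-up exam data (the column names and values are illustrative assumptions):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scores = pd.DataFrame({
    "exam_score": [45, 67, 89, 72, 55],
    "grade_level": ["freshman", "senior", "junior", "sophomore", "freshman"],
})

# Step 1: Min-Max normalization of exam_score to the range [0, 1]
scores["score_minmax"] = MinMaxScaler().fit_transform(scores[["exam_score"]]).ravel()
# Step 2: Z-score standardization (mean 0, standard deviation 1)
scores["score_zscore"] = StandardScaler().fit_transform(scores[["exam_score"]]).ravel()
# Step 3: one-hot encoding of the categorical grade_level column
encoded = pd.get_dummies(scores, columns=["grade_level"])
print(encoded)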
c) Data Validation, Incompleteness, Noise, Inconsistency:

Data validation: Data validation refers to the process of ensuring that the data collected is accurate, consistent, and reliable for analysis or modeling purposes. Incompleteness, noise, and inconsistency are common challenges encountered in real-world datasets that can affect data quality.

Incompleteness: Incompleteness refers to missing values in the dataset. Missing data can arise due to various reasons such as human error, equipment malfunction, or intentional omission. It's essential to identify missing values and handle them appropriately through techniques like imputation or removal.

Noise: Noise refers to irrelevant or erroneous data present in the dataset that can obscure patterns or relationships. Noise can arise due to measurement errors, data entry mistakes, or variability in the data collection process. It's important to identify and filter out noisy data using techniques like smoothing, outlier detection, or data cleaning algorithms.

Inconsistency: Inconsistency occurs when there are contradictions or conflicts between different parts of the dataset. This can include discrepancies in attribute values, logical errors, or violations of integrity constraints. Inconsistencies can arise due to data integration from multiple sources, data entry errors, or changes in data over time. Data validation techniques such as cross-validation, rule-based checks, and anomaly detection can help identify and resolve inconsistencies in the data.

d) Data Reduction:

Data reduction refers to the process of reducing the volume of data while retaining its integrity and meaningfulness. It aims to simplify complex datasets by eliminating redundant or irrelevant information, thereby improving efficiency in storage, processing, and analysis.

Example: Consider a large dataset containing customer transaction histories for a retail business. Here's how we might perform data reduction:

1. Feature Selection: Identify and select a subset of relevant features that are most informative for the analysis or modeling task. This can involve using techniques such as correlation analysis, feature importance ranking, or domain knowledge.
2. Dimensionality Reduction: Reduce the number of dimensions in the dataset while preserving its essential structure and patterns. Techniques such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) can be used to project high-dimensional data onto a lower-dimensional space.
3. Sampling: Instead of using the entire dataset, extract a representative sample that captures the essential characteristics of the population. This can help reduce computational complexity and memory requirements while still providing reliable insights.
4. Aggregation: Aggregate data at a higher level of granularity to reduce the number of records. For example, instead of storing individual transactions, aggregate sales data by day, week, or month.
5. Data Compression: Apply compression techniques to reduce the storage space required for the dataset while preserving its original information content. Techniques such as gzip compression or delta encoding can be used to compress data efficiently.

e) Difference between Univariate, Bivariate, and Multivariate Analysis:

1. Univariate Analysis:
○ Definition: Univariate analysis involves the examination of a single variable at a time.
○ Goal: The primary goal is to understand the distribution, central tendency, dispersion, and shape of the variable's values.
○ Techniques: Common techniques used in univariate analysis include histograms, box plots, summary statistics (mean, median, mode), and measures of variability (standard deviation, variance).
2. Bivariate Analysis:
○ Definition: Bivariate analysis examines the relationship between two variables simultaneously.
○ Goal: The focus is on understanding how changes in one variable correlate with changes in another variable.
○ Techniques: Common techniques used in bivariate analysis include scatter plots, correlation analysis, and cross-tabulation.
3. Multivariate Analysis:
○ Definition: Multivariate analysis involves the simultaneous examination of three or more variables.
○ Goal: The goal is to understand complex relationships and interactions between multiple variables.
○ Techniques: Common techniques used in multivariate analysis include multiple regression, factor analysis, cluster analysis, and principal component analysis.

f) Data Discretization:

Data discretization is the process of converting continuous variables into discrete intervals or categories. It is often performed to simplify data analysis, reduce complexity, and facilitate decision-making in various applications.

Note on Data Discretization:

● Purpose: Data discretization is used to convert continuous data into categorical or ordinal data, making it easier to analyze and interpret.
● Techniques: There are several techniques for data discretization, including equal width binning, equal frequency binning, clustering-based discretization, and decision tree-based discretization.
○ Equal Width Binning: Divides the range of continuous values into a specified number of intervals of equal width.
○ Equal Frequency Binning: Divides the data into intervals such that each interval contains approximately the same number of data points.
○ Clustering-based Discretization: Uses clustering algorithms to group similar data points into discrete bins.
○ Decision Tree-based Discretization: Utilizes decision tree algorithms to identify optimal split points for discretizing continuous variables.
● Considerations: When discretizing data, it's essential to consider the trade-off between granularity and information loss. Too few intervals may oversimplify the data, while too many intervals may lead to overfitting or noisy results.
● Applications: Data discretization is commonly used in data mining, machine learning, and statistical analysis tasks such as classification, clustering, and association rule mining.
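A minimal pandas sketch of equal width and equal frequency binning, using made-up score values and an arbitrary choice of four bins:

import pandas as pd

scores = pd.Series([12, 35, 47, 51, 58, 63, 71, 77, 84, 96])

equal_width = pd.cut(scores, bins=4)   # four intervals of equal width
equal_freq = pd.qcut(scores, q=4)      # four intervals with roughly equal numbers of points
print(equal_width.value_counts())
print(equal_freq.value_counts())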
g) Computing Mean, Median, and Mode:

To compute the mean, median, and mode for the given data, we first need to calculate the midpoint of each class interval. Then, we can apply the formulas for mean, median, and mode.

Class Frequency:

● 10-15: 2
● 15-20: 28
● 20-25: 125
● 25-30: 270
● 30-35: 303
● 35-40: 197
● 40-45: 65
● 45-50: 10

The total frequency is N = 2 + 28 + 125 + 270 + 303 + 197 + 65 + 10 = 1000, and the class midpoints are 12.5, 17.5, 22.5, 27.5, 32.5, 37.5, 42.5 and 47.5.

Mean:

Mean = Σ(Midpoint × Frequency) / ΣFrequency
= [(12.5×2) + (17.5×28) + (22.5×125) + (27.5×270) + (32.5×303) + (37.5×197) + (42.5×65) + (47.5×10)] / 1000
= 31225 / 1000 ≈ 31.23

Median: The median is the value of the (N/2)th item when the data are arranged in ascending order. Since the data is already grouped, we find the median by locating the class interval containing the (N/2)th = 500th item. The cumulative frequency reaches 425 at the end of the 25-30 class, so the 500th item falls in the 30-35 class. The formula for the median is:

Median = L + ((N/2 − C) / f) × w

Where:
● L = Lower boundary of the median class (30)
● N = Total number of observations (1000)
● C = Cumulative frequency of the class before the median class (425)
● f = Frequency of the median class (303)
● w = Width of the median class (5)

Median = 30 + ((500 − 425) / 303) × 5 = 30 + (375 / 303) ≈ 30 + 1.24 ≈ 31.24

Mode: The mode is the class interval with the highest frequency.

Mode = Class interval with highest frequency = 30-35
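The same grouped-data calculations can be checked with a short Python sketch based on the frequency table above (class width 5, midpoints 12.5 to 47.5):

# Class intervals given as (lower bound, upper bound, frequency)
classes = [(10, 15, 2), (15, 20, 28), (20, 25, 125), (25, 30, 270),
           (30, 35, 303), (35, 40, 197), (40, 45, 65), (45, 50, 10)]

n = sum(f for _, _, f in classes)                              # total frequency (1000)
mean = sum(((lo + hi) / 2) * f for lo, hi, f in classes) / n   # grouped mean

# Locate the median class: the interval containing the (n/2)-th observation
cumulative = 0
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        median = lo + ((n / 2 - cumulative) / f) * (hi - lo)   # grouped median formula
        break
    cumulative += f

mode_class = max(classes, key=lambda c: c[2])[:2]              # class with the highest frequency
print(mean, median, mode_class)                                # 31.225, about 31.24, (30, 35)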
h) Explanation of Univariate, Bivariate, and Multivariate Analysis with Examples and Applications:

1. Univariate Analysis:
○ Definition: Univariate analysis focuses on analyzing a single variable at a time to understand its distribution, central tendency, and variability.
○ Example: Analyzing the distribution of exam scores of students in a class.
○ Applications:
■ Descriptive statistics: Calculating mean, median, mode, and standard deviation.
■ Finance: Analyzing stock prices, returns, and volatility.
■ Healthcare: Studying patient demographics, disease prevalence, and medical test results.
2. Bivariate Analysis:
○ Definition: Bivariate analysis examines the relationship between two variables simultaneously to understand their correlation or association.
○ Example: Investigating the relationship between rainfall and crop yield.
○ Applications:
■ Market research: Analyzing the relationship between advertising expenditure and sales revenue.
■ Social sciences: Studying the correlation between education level and income.
■ Environmental science: Exploring the association between pollution levels and health outcomes.
3. Multivariate Analysis:
○ Definition: Multivariate analysis involves the simultaneous examination of three or more variables to understand complex relationships and interactions.
○ Example: Studying the impact of multiple factors (e.g., income, education, age) on voting behavior.
○ Applications:
■ Predictive modeling: Building regression models to predict sales based on multiple variables.
■ Market segmentation: Identifying customer segments based on demographic, behavioral, and psychographic variables.
■ Epidemiology: Analyzing the joint effects of risk factors on disease incidence and prevalence.

i) Contingency Table and Marginal Distribution:

Contingency Table: A contingency table, also known as a cross-tabulation table, is a tabular representation of the joint distribution of two or more categorical variables. It displays the frequencies or counts of observations that fall into each combination of categories for the variables.

Marginal Distribution: Marginal distribution refers to the distribution of a single variable from a contingency table, obtained by summing or aggregating the counts or frequencies across the other variables. It provides insights into the distribution of individual variables independent of other variables.

Example: Consider a survey conducted to study the relationship between gender and voting preference. The data collected is represented in the contingency table below:

Gender    Democrat    Republican    Independent
Male      150         100           50
Female    200         120           80

Contingency Table:
● The rows represent the categories of the "Gender" variable (Male and Female).
● The columns represent the categories of the "Voting Preference" variable (Democrat, Republican, and Independent).
● The cells contain the frequencies of observations corresponding to each combination of categories.

Marginal Distribution:
● Marginal Distribution of Gender: Summing the counts across columns provides the distribution of gender.
○ Male: 150 + 100 + 50 = 300
○ Female: 200 + 120 + 80 = 400
● Marginal Distribution of Voting Preference: Summing the counts across rows provides the distribution of voting preference.
○ Democrat: 150 + 200 = 350
○ Republican: 100 + 120 = 220
○ Independent: 50 + 80 = 130

Marginal distributions help in understanding the distribution of individual variables in a contingency table, providing valuable insights for further analysis.
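A minimal pandas sketch that rebuilds the same contingency table and marginal totals with pandas.crosstab (the individual records are reconstructed from the counts above):

import pandas as pd

records = ([("Male", "Democrat")] * 150 + [("Male", "Republican")] * 100 +
           [("Male", "Independent")] * 50 + [("Female", "Democrat")] * 200 +
           [("Female", "Republican")] * 120 + [("Female", "Independent")] * 80)
data = pd.DataFrame(records, columns=["Gender", "Voting Preference"])

# margins=True appends the row and column totals, i.e. the marginal distributions
table = pd.crosstab(data["Gender"], data["Voting Preference"], margins=True)
print(table)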
j) Explanation of Data Reduction Techniques: Sampling, Feature Selection, Principal Component Analysis:

1. Sampling:
○ Definition: Sampling involves selecting a subset of data points from a larger population to represent the whole. It aims to reduce the size of the dataset while preserving its essential characteristics.
○ Example: Randomly selecting 10% of customers from a database for a satisfaction survey.
○ Applications:
■ Market research: Conducting surveys on a sample of consumers to make inferences about the entire population.
■ Quality control: Testing a sample of products from a manufacturing batch to ensure consistency.
■ Opinion polling: Surveying a sample of voters to predict election outcomes.
2. Feature Selection:
○ Definition: Feature selection involves choosing a subset of relevant features (variables) from the original dataset while discarding irrelevant or redundant ones. It aims to reduce dimensionality and improve model performance.
○ Example: Selecting the most informative features (e.g., age, income, education) for predicting customer churn in a telecom company.
○ Applications:
■ Machine learning: Identifying key features for building predictive models to improve accuracy and interpretability.
■ Signal processing: Selecting relevant features for pattern recognition and classification tasks.
■ Bioinformatics: Choosing genetic markers for disease diagnosis and prognosis in genomic studies.
3. Principal Component Analysis (PCA):
○ Definition: PCA is a dimensionality reduction technique that transforms the original variables into a new set of orthogonal variables called principal components. It aims to capture the maximum variance in the data with fewer dimensions.
○ Example: Reducing the dimensions of a dataset containing correlated variables (e.g., height, weight, and body mass index) into a smaller set of uncorrelated components.
○ Applications:
■ Image processing: Reducing the dimensionality of image data for compression and recognition tasks.
■ Finance: Analyzing and visualizing financial market data with reduced dimensions.
■ Environmental science: Identifying key factors influencing climate patterns from a large set of meteorological variables.

By applying these data reduction techniques, we can simplify complex datasets, improve computational efficiency, and enhance the interpretability of analysis or modeling results.
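A minimal scikit-learn sketch of PCA on three correlated, made-up measurements (the height, weight and BMI values are illustrative only):

import numpy as np
from sklearn.decomposition import PCA

# height (cm), weight (kg), body mass index: strongly correlated measurements
X = np.array([[170, 65, 22.5],
              [180, 80, 24.7],
              [160, 55, 21.5],
              [175, 72, 23.5],
              [165, 60, 22.0]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project onto the top two principal components
print(pca.explained_variance_ratio_)    # share of variance captured by each component
print(X_reduced.shape)                  # (5, 2)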
Unit 5:

a) Association Rule Mining:

Association rule mining is a data mining technique used to discover interesting relationships, associations, or patterns among variables in large databases. It's commonly used in market basket analysis to identify combinations of items frequently purchased together. Three important terms in association rule mining are:

● Support: Support refers to the frequency of occurrence of a particular itemset in the dataset. It indicates how often the itemset appears in the dataset. Mathematically, it's calculated as the ratio of the number of transactions containing the itemset to the total number of transactions. Higher support values indicate more frequent itemsets.

Support(X) = (Transactions containing X) / (Total transactions)

● Confidence: Confidence measures the reliability or strength of the association between two items in an itemset. It's calculated as the ratio of the number of transactions containing both the antecedent and consequent of a rule to the number of transactions containing the antecedent. High confidence values indicate strong associations between items.

Confidence(X→Y) = Support(X∪Y) / Support(X)

● Lift: Lift measures how much more likely item B is purchased when item A is purchased, compared to its likelihood without the presence of item A. It's calculated as the ratio of the observed support of the itemset to the expected support if the items were independent. Lift values greater than 1 indicate that the two items are positively correlated, values equal to 1 indicate independence, and values less than 1 indicate negative correlation.

Lift(X→Y) = Support(X∪Y) / (Support(X) × Support(Y))

Difference between Hierarchical Clustering and Partitioning Method:

● Hierarchical Clustering: In hierarchical clustering, data is grouped into a tree of clusters, where each node represents a cluster. It can be agglomerative, starting with individual data points as clusters and merging them into larger clusters, or divisive, starting with one cluster containing all data points and recursively splitting it into smaller clusters. Hierarchical clustering doesn't require the number of clusters to be specified beforehand.
● Partitioning Method: In partitioning methods like k-means, the data is partitioned into a predefined number of clusters. Initially, k centroids are randomly chosen, and each data point is assigned to the nearest centroid. Then, centroids are recalculated as the mean of the points assigned to each cluster, and the process iterates until convergence. Unlike hierarchical clustering, the number of clusters (k) needs to be specified beforehand in partitioning methods.

Apriori Algorithm:

Apriori algorithm is a popular algorithm for mining frequent itemsets and generating association rules. Given a dataset of transactions, it works by iteratively finding frequent itemsets with increasing size. Here's how you can apply the Apriori algorithm to the given dataset:

1. Identify Individual Items and Calculate Support: Count the occurrences of each individual item in the dataset and calculate their support.
2. Generate Candidate Itemsets: Generate candidate itemsets of size 2 or more based on the frequent itemsets from the previous iteration.
3. Calculate Support for Candidate Itemsets: Count the occurrences of candidate itemsets in the dataset and calculate their support.
4. Generate Association Rules: For each frequent itemset, generate association rules based on the minimum confidence threshold.
5. Filter Rules Based on Confidence: Keep only those rules that satisfy the minimum confidence threshold.

Here's the calculation for the given dataset:
Minimum support count is 2.
Minimum confidence is 60%.

Itemset      Support
{I1}         6
{I2}         7
{I3}         5
{I4}         2
{I5}         2
{I1, I2}     5
{I1, I3}     3
{I2, I3}     3
{I1, I5}     2
{I2, I4}     1

Association rules:

1. {I1} => {I2} (Support: 5, Confidence: 5/6 = 83.33%)
2. {I2} => {I1} (Support: 5, Confidence: 5/7 = 71.43%)
3. {I2} => {I3} (Support: 3, Confidence: 3/7 = 42.86%)
4. {I3} => {I2} (Support: 3, Confidence: 3/5 = 60%)

With the minimum confidence of 60%, rules 1, 2, and 4 are accepted, while rule 3 is rejected.
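A minimal pure-Python sketch that recomputes the rule confidences above directly from the support counts in the table:

# Support counts taken directly from the table above
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 5,
    frozenset({"I1", "I2"}): 5, frozenset({"I2", "I3"}): 3,
}

def confidence(antecedent, consequent):
    # Confidence(X => Y) = support(X union Y) / support(X)
    return support[frozenset(antecedent | consequent)] / support[frozenset(antecedent)]

rules = [({"I1"}, {"I2"}), ({"I2"}, {"I1"}), ({"I2"}, {"I3"}), ({"I3"}, {"I2"})]
for X, Y in rules:
    print(sorted(X), "=>", sorted(Y), "confidence = {:.2%}".format(confidence(X, Y)))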
d) Bayes Theorem:

Bayes' theorem is a fundamental concept in probability theory that describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It's stated mathematically as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where:
● P(A|B) is the posterior probability of event A occurring given that B is true.
● P(B|A) is the likelihood of B occurring given that A is true.
● P(A) is the prior probability of A occurring independently.
● P(B) is the prior probability of B occurring independently.

Bayes' theorem is widely used in various fields such as statistics, machine learning, and artificial intelligence for tasks like classification, anomaly detection, and probabilistic reasoning. It provides a framework for updating beliefs or hypotheses in the light of new evidence.

e) Difference between Classification and Clustering:

● Classification:
○ Classification is a supervised learning technique where the goal is to categorize input data into predefined classes or labels.
○ In classification, the algorithm learns from labeled data to predict the class labels of new, unseen data.
○ Example: Spam email classification. Given a dataset of emails labeled as spam or not spam, a classification algorithm learns to predict whether new emails are spam or not based on features such as word frequency, sender's address, etc.
● Clustering:
○ Clustering is an unsupervised learning technique where the goal is to group similar data points into clusters based on their inherent characteristics or properties.
○ In clustering, the algorithm discovers the underlying structure or patterns in the data without any predefined class labels.
○ Example: Customer segmentation. Given a dataset of customer attributes like age, income, and purchase history, clustering algorithms can group similar customers together to identify segments for targeted marketing strategies.

f) Logistic Regression:

Logistic regression is a widely used statistical technique for binary classification problems. It's called "logistic" regression because it models the probability of the binary outcome using the logistic function.

Key points about logistic regression:

● It's a parametric model that estimates coefficients to describe the relationship between the independent variables and the log-odds of the dependent variable.
● It's used when the dependent variable is categorical (binary) and the independent variables can be continuous, discrete, or categorical.
● Logistic regression is interpretable, and the coefficients can provide insights into the influence of each independent variable on the probability of the outcome.
● Despite its name, logistic regression is a classification algorithm, not a regression algorithm, as it predicts the probability of a binary outcome.

Example:

Consider a dataset of student exam scores and their corresponding pass/fail status. The goal is to predict whether a student will pass (1) or fail (0) the exam based on their exam scores. We can use logistic regression to build a model that predicts the probability of passing the exam based on the exam scores.

Let's say we have two predictor variables: exam1_score (x1) and exam2_score (x2). The logistic regression model can be represented as:

log(p / (1 − p)) = β0 + β1·x1 + β2·x2

Where:
● p is the probability of the student passing the exam given their exam scores.
● x1 and x2 are the exam1_score and exam2_score respectively.
● β0, β1, and β2 are the coefficients of the model.

The logistic regression model estimates the coefficients β0, β1, and β2 from the training data, and the predicted probability p is used to make predictions. If the predicted probability is greater than a certain threshold (e.g., 0.5), the student is predicted to pass (1); otherwise, they are predicted to fail (0).
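A minimal scikit-learn sketch of the pass/fail model described above; the exam scores and labels are made-up illustrative values, not real data:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[35, 40], [50, 45], [60, 65], [75, 80], [85, 90], [40, 55]])  # exam1_score, exam2_score
y = np.array([0, 0, 1, 1, 1, 0])                                            # 1 = pass, 0 = fail

model = LogisticRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)           # estimated coefficients (beta0, then beta1 and beta2)
print(model.predict_proba([[65, 70]])[0, 1])   # predicted probability of passing for a new student
print(model.predict([[65, 70]]))               # 1 if the probability exceeds 0.5, otherwise 0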
g) Association Rules and Evaluation using Support and Confidence:

Association Rules:

Association rules are patterns or relationships discovered in datasets consisting of transactions or items. They are used in market basket analysis to identify co-occurrence relationships between different items in a transaction. Association rules typically take the form of "if-then" statements, where antecedents imply consequents.

Evaluating Association Rules using Support and Confidence:

● Support: Support measures the frequency of occurrence of an itemset in the dataset. It indicates how often the itemset appears in transactions. Higher support values indicate more frequent itemsets. Mathematically, it's calculated as the ratio of the number of transactions containing the itemset to the total number of transactions.

Support(X→Y) = (Transactions containing both X and Y) / (Total transactions)

● Confidence: Confidence measures the reliability or strength of the association between two itemsets in a rule. It's calculated as the ratio of the number of transactions containing both the antecedent and consequent of a rule to the number of transactions containing the antecedent.

Confidence(X→Y) = Support(X∪Y) / Support(X)

Example:

Consider a dataset of supermarket transactions, and we want to find association rules. Let's say we have the following rule: {Diapers} → {Beer}.

● Support(Diapers → Beer) = 0.2: This means that 20% of transactions contain both diapers and beer.
● Confidence(Diapers → Beer) = 0.8: This means that among the transactions containing diapers, 80% also contain beer.

In this example, a support of 0.2 indicates that the rule is relevant in 20% of transactions, while a confidence of 0.8 indicates that the rule is accurate in 80% of cases where diapers are purchased.

h) Different Formulae for Evaluation of Classification Models:

● Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
● Precision:
Precision = TP / (TP + FP)
● Recall (Sensitivity):
Recall = TP / (TP + FN)
● F1 Score:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
● Specificity:
Specificity = TN / (TN + FP)
● False Positive Rate (FPR):
FPR = FP / (FP + TN)
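A minimal Python sketch that evaluates these formulae for a hypothetical set of confusion-matrix counts (the TP, FP, FN, TN values are made up for illustration):

TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)
fpr = FP / (FP + TN)

print(accuracy, precision, recall, f1_score, specificity, fpr)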


performance of a classification model in terms of
i) Clustering Visitors by Age with K = 2: its ability to correctly classify instances into
different classes, detect true positives, and
To cluster visitors by age into two groups, we can minimize false positives and false negatives.
use the K-means clustering algorithm. Given the
ages: 16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, k) K-Means Partitioning Method with Example:
43, 44, 45, 61, 62, 66.
K-means is a partitioning clustering algorithm
1. Initial centroids: Let's choose two initial used to divide a dataset into K clusters. Here's
centroids, say 20 and 40. how it works:
2. Assign each age to the nearest centroid.
3. Update centroids as the mean of ages in each 1. Choose K initial centroids randomly or
cluster. based on some heuristic.
4. Repeat steps 1 and 2 until convergence. 2. Assign each data point to the nearest
centroid, forming K clusters.
Let's apply these steps: 3. Update the centroids as the mean of data
points in each cluster.
● Initial centroids: 20, 40 4. Repeat steps 2 and 3 until convergence
● Iteration 1: (centroids do not change significantly).
○ Cluster 1: 16, 16, 17, 20, 20, 21, 21, 22, 23, 29
○ Cluster 2: 36, 41, 42, 43, 44, 45, 61, 62, 66 Example:
○ New centroids: 21.4, 49.4
● Iteration 2: Suppose we have the following dataset with two
○ Cluster 1: 16, 16, 17, 20, 20, 21, 21, 22, 23, 29 features (x and y):
○ Cluster 2: 36, 41, 42, 43, 44, 45, 61, 62, 66
Data point x y
○ New centroids: 21.4, 49.4
1 2 3
The clusters are: 2 3 4

3 8 7
● Cluster 1: Ages 16 to 29.
● Cluster 2: Ages 36 to 66. 4 9 8

Classification Evaluation Model using 5 10 7


Confusion Matrix, Recall, Precision, &
Accuracy:
Let's say K = 2. We randomly initialize two
Confusion Matrix: centroids:
● Centroid 1: (2, 3)
Predicted Positive Predicted Negative ● Centroid 2: (8, 7)
Assign data points to the nearest centroid:
Actual True Positive (TP) False Negative (FN) ● Cluster 1: {1, 2} (centroid: (2.5, 3.5))
Positive
● Cluster 2: {3, 4, 5} (centroid: (9, 7.33))

Actual False Positive (FP) True Negative (TN) Update centroids:


Negative ● New centroid for cluster 1: (2.5, 3.5)
● ● New centroid for cluster 2: (9, 7.33)
Recall (Sensitivity):
Repeat until convergence
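A minimal scikit-learn sketch that runs k-means on the same five (x, y) points instead of tracing the iterations by hand:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 4], [8, 7], [9, 8], [10, 7]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment of each data point
print(kmeans.cluster_centers_)  # approximately (2.5, 3.5) and (9.0, 7.33), as in the worked example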
Unit 6:

a) Business Intelligence (BI) applications in Customer Relationship Management (CRM) involve utilizing data analysis and reporting tools to gain insights into customer behavior, preferences, and trends. One example of a BI application in CRM is the analysis of customer purchase history to identify patterns and trends.

By using BI tools, such as data mining algorithms for predictive analytics models, businesses can segment their customers based on buying behavior, demographics, or other relevant factors. For instance, a retail company may use BI to analyze past purchases and identify which products are frequently bought together (known as market basket analysis). This information can then be used to personalize marketing campaigns, offer targeted promotions, or optimize inventory management.

b) Analytical tools play crucial roles in Business Intelligence (BI) by enabling organizations to process, analyze, and visualize data to derive actionable insights. Some key roles of analytical tools in BI include:

1. Data Integration: Analytical tools facilitate the integration of data from multiple sources, such as databases, spreadsheets, and cloud applications, into a single, unified platform. This ensures that organizations have access to a comprehensive dataset for analysis.
2. Data Analysis: Analytical tools offer various techniques for analyzing data, including statistical analysis, data mining, and predictive modeling. These techniques help businesses identify trends, patterns, and relationships within their data, allowing them to make informed decisions.
3. Reporting and Visualization: Analytical tools enable users to create interactive reports and visualizations to communicate insights effectively. Dashboards, charts, and graphs help stakeholders understand complex data quickly and facilitate data-driven decision-making.
4. Performance Monitoring: Analytical tools provide capabilities for monitoring key performance indicators (KPIs) and tracking business performance in real-time. This allows organizations to identify areas of improvement and take proactive measures to address issues.
5. Forecasting and Predictive Analytics: Analytical tools enable organizations to forecast future trends and outcomes based on historical data and predictive analytics models. This helps businesses anticipate changes in the market, demand, or customer behavior, allowing them to plan and strategize accordingly.

c) Business Intelligence (BI) refers to the use of data analysis tools and techniques to transform raw data into meaningful and actionable insights for business decision-making. It involves gathering, storing, analyzing, and presenting data to help organizations understand their performance, identify opportunities, and make informed decisions.

Different tools used for Business Intelligence include:

1. Reporting Tools: These tools enable users to create and generate reports from their data. Examples include Microsoft Power BI, Tableau, and SAP Crystal Reports. They allow users to visualize data through charts, graphs, and tables, making it easier to understand and interpret.
2. OLAP (Online Analytical Processing) Tools: OLAP tools facilitate multidimensional analysis of data, allowing users to explore data from different perspectives. Examples include Microsoft SQL Server Analysis Services and Oracle OLAP. OLAP tools are particularly useful for complex data analysis and ad-hoc querying.
3. Data Mining Tools: Data mining tools use algorithms to discover patterns, trends, and relationships in large datasets. Examples include IBM SPSS Modeler, RapidMiner, and KNIME. These tools are used for predictive analytics, customer segmentation, and anomaly detection.
4. Dashboard and Data Visualization Tools: These tools provide interactive dashboards and visualization capabilities for presenting data in a visually appealing and informative manner. Examples include Tableau, QlikView, and Domo. They help users monitor KPIs, track performance, and gain insights at a glance.
5. ETL (Extract, Transform, Load) Tools: ETL tools are used to extract data from various sources, transform it into a consistent format, and load it into a data warehouse or repository for analysis. Examples include Informatica PowerCenter, Talend, and Microsoft SQL Server Integration Services (SSIS).
6. Data Warehousing Tools: These tools are used to design, build, and manage data warehouses, which serve as central repositories for storing and integrating data from different sources. Examples include Amazon Redshift, Snowflake, and Google BigQuery.
7. Predictive Analytics Tools: Predictive analytics tools use statistical algorithms and machine learning techniques to forecast future trends and outcomes based on historical data. Examples include SAS Predictive Modeling, IBM Watson Analytics, and RapidMiner.

d) Business Intelligence (BI) finds various applications in the telecommunication and banking sectors:

Telecommunication

● Customer Segmentation & Churn Prediction: BI helps segment customers based on usage patterns, demographics, and preferences to identify high-value customers and predict churn. This enables proactive retention strategies.
● Network Optimization: BI analyzes network performance data to identify areas of congestion, outages, or issues. By analyzing historical and real-time data, companies can optimize network infrastructure and improve service quality.
● Revenue Management: BI analyzes revenue streams, pricing structures, and billing data. Understanding customer spending patterns and revenue drivers helps optimize pricing strategies, offer personalized packages, and maximize revenue.

Banking

● Risk Management: BI aggregates and analyzes data from various sources to assess credit risk, market risk, and operational risk. Predictive analytics help assess creditworthiness, detect fraud, and mitigate risks.
● Customer Relationship Management (CRM): BI facilitates customer segmentation, profiling, and targeting. Banks can offer personalized products, cross-sell and upsell services, and enhance customer satisfaction and loyalty.
● Performance Monitoring: BI dashboards and reporting tools enable banks to monitor key performance indicators (KPIs) such as profitability, asset quality, and operational efficiency. Real-time analytics help identify bottlenecks, optimize processes, and make data-driven decisions.

Logistics & Production

Logistics

● Supply Chain Optimization: BI analyzes inventory levels, demand forecasts, and transportation routes to optimize supply chain operations. By identifying inefficiencies, companies can streamline processes, reduce costs, and improve delivery performance.
● Warehouse Management: BI enables monitoring warehouse operations, inventory turnover, and stock levels in real-time. Analyzing data helps optimize warehouse layouts, inventory storage, and order fulfillment processes.
● Route Planning and Optimization: BI analyzes transportation data to optimize route planning, vehicle utilization, and delivery schedules. Predictive analytics help minimize fuel costs, reduce transit times, and enhance customer service.

Production

● Demand Forecasting: BI helps forecast demand by analyzing historical sales data, market trends, and customer preferences. Accurate demand forecasts help optimize production schedules, inventory levels, and resource allocation.
● Quality Control: BI enables monitoring quality metrics, defect rates, and production yield in real-time. Analyzing quality data helps identify root causes of defects, implement corrective actions, and improve product quality.
● Capacity Planning: BI facilitates capacity planning and resource optimization by analyzing production capacity, equipment utilization, and resource availability. Identifying constraints and bottlenecks helps optimize production processes, minimize downtime, and maximize efficiency.

Finance & Marketing

Finance

● Financial Analysis and Reporting: BI helps analyze financial data, generate reports, and gain insights into KPIs like revenue, expenses, and profitability. Visualizing financial metrics helps stakeholders make informed decisions, identify trends, and monitor financial health.
● Risk Management: BI enables assessing and mitigating risks by analyzing credit portfolios, market trends, and regulatory compliance data. Predictive analytics help identify potential risks and implement proactive mitigation strategies.

Marketing

● Customer Segmentation and Targeting: BI enables segmenting customer data based on demographics, behavior, and preferences. By analyzing customer segments, marketers can personalize campaigns, target specific groups, and improve campaign effectiveness.
● Campaign Performance Analysis: BI facilitates analyzing marketing campaign performance by tracking metrics like conversion rates, click-through rates, and return on investment (ROI). By analyzing campaign data in real-time, marketers can optimize strategies, allocate resources effectively, and maximize ROI.

Similarities and Differences Between ERP and BI

Similarities

● Data Integration: Both ERP and BI integrate data from various sources to provide a unified view of business operations.
● Decision Support: Both offer tools for data analysis, reporting, and visualization to support decision-making.
● Improving Efficiency: Both help improve operational efficiency, streamline processes, and optimize resource allocation by providing insights into business performance.

Differences between ERP and BI Systems:

1. Scope and Focus:
● ERP systems: Manage core business processes (finance, HR, inventory, supply chain).
● BI systems: Analyze and interpret data to support decision-making across functions.
2. Real-time vs. Historical Data:
● ERP systems: Real-time transactional data (day-to-day operations).
● BI systems: Historical data to identify trends and insights over time.
3. Functionality:
● ERP systems: Transaction processing, data management, workflow automation (streamline operations).
● BI systems: Data analysis, reporting, visualization (provide insights for strategic decisions).
4. User Base:
● ERP systems: Operational users (finance managers, HR professionals, supply chain managers).
● BI systems: Business analysts, data scientists, decision-makers (analyze data, generate reports, derive insights).

h) The Role of Data Analytics in Business

Data analytics is crucial for extracting valuable insights from large data volumes to drive strategic decision-making and gain a competitive edge. For example, in retail, it can help with:

● Understanding customer preferences
● Optimizing pricing strategies
● Improving inventory management

By analyzing data, retailers can identify trends, forecast demand, and personalize marketing campaigns to enhance customer satisfaction and increase sales.

i) Implementing Business Intelligence Findings

Implementing BI findings involves several steps:

1. Identify Key Objectives: Determine the specific business goals BI aims to address (optimize operations, improve customer satisfaction, increase revenue).
2. Data Collection and Integration: Gather relevant data from various sources (databases, spreadsheets, CRM systems, external sources). Ensure data is cleaned, standardized, and integrated for analysis.
3. Data Analysis and Insights Generation: Utilize BI tools and techniques to analyze data and generate actionable insights (data visualization, statistical analysis, predictive modeling, machine learning algorithms).
4. Interpretation and Decision-Making: Interpret insights in the context of business objectives. Collaborate with stakeholders to understand implications and devise implementation strategies.
5. Implementation Planning: Develop a plan for implementing changes based on BI findings (reallocate resources, redesign processes, launch new initiatives, revise existing strategies).
6. Monitoring and Evaluation: Continuously monitor the implementation of BI-driven initiatives and track their impact on KPIs. Evaluate the effectiveness of strategies and make adjustments as needed.

j) Business Intelligence (BI) Applications in Logistics

BI applications in logistics leverage data analytics to optimize supply chain operations, enhance efficiency, and improve decision-making. BI in logistics enables organizations to:

1. Demand Forecasting: Analyze data to forecast future demand accurately (optimize inventory levels, reduce stockouts, improve customer service).
2. Route Optimization: Utilize BI tools to analyze transportation data (traffic patterns, delivery routes, vehicle utilization) to minimize fuel costs, reduce transit times, and improve delivery efficiency.
3. Warehouse Management: Implement BI solutions for monitoring warehouse operations, inventory levels, and order fulfillment processes. Real-time analytics help optimize warehouse layouts, reduce picking times, and enhance overall efficiency.
4. Supplier Management: Analyze supplier performance metrics (lead times, quality levels, delivery reliability) to identify top performers and optimize supplier relationships. BI insights enable better decision-making in supplier selection and contract negotiations.
5. Risk Management: Utilize BI tools to identify and mitigate risks in the supply chain (disruptions, delays, inventory shortages). Predictive analytics models help proactively manage risks and implement contingency plans to minimize their impact.

k) WEKA and its Use in Business Intelligence (BI)

WEKA (Waikato Environment for Knowledge Analysis) is a popular open-source data mining software tool used in BI for various tasks:

● Data Preprocessing: Provides techniques for data cleaning, attribute selection, normalization, and transformation (prepare data for analysis and improve results).
● Classification and Regression: Offers algorithms for predictive modeling (decision trees, support vector machines (SVM), k-nearest neighbors (k-NN), neural networks).
● Clustering: Includes algorithms for discovering hidden patterns and grouping similar data points together (k-means, hierarchical clustering, expectation-maximization (EM)).
● Association Rule Mining: Supports algorithms for discovering relationships between variables (Apriori, FP-Growth) for market basket analysis, recommendation systems, and identifying cross-selling opportunities.
● Visualization: Provides tools for exploring and interpreting data analysis results (scatter plots, histograms, decision tree diagrams). Visualizations help users understand patterns, trends, and relationships in the data more intuitively.