Bi Endsem Notes

The document discusses the differences between relational and multidimensional data models, highlighting their advantages and disadvantages for reporting and business intelligence. It also covers various types of reports, including lists, crosstabs, statistics, charts, maps, and financial reports, along with methods for data grouping, sorting, filtering, and calculations. Additionally, it addresses data validation techniques to ensure data quality and completeness.

UNIT 3

1. Building reports with relational vs Multidimensional data models


Data models are used for reporting and for business intelligence tools
The output from analytical processing is aggregated and given as an input to reporting
and BI tools.
Relational data models
● A database organized as a collection of relations is known as the relational model.
● A relation is a table of values in which every row is a collection of related values.
● Each row represents a real-world entity.
● Physical storage of data is independent of the way data is logically
organized.
● Concepts related to relational data models:
○ Attribute: a property that defines a table
○ Table: has two components: rows and columns
○ Tuple: a single row of a table is called a tuple
○ Relation schema: the name of the relation together with its attributes
○ Relation instance: a finite set of tuples; it does not contain duplicate tuples
○ Relation key: an attribute or set of attributes that uniquely identifies a row in
a table
○ Attribute domain: the predefined values or scope of an attribute
● Advantages
○ Has structural independence.
○ Complex database navigation can be avoided by using a high-level query
language such as SQL.
● Disadvantages
○ Some relational databases have limits on field lengths.
○ Becomes more complex with large amounts of data as relations become
complicated.

Multidimensional Data Models:-


● Here, data is stored in the form of cubes.
● Data cube enables modeling and viewing of data in multiple dimensions.
● An MDM is defined by dimensions and facts.
● Dimensions are entities. Eg - for a sales DW, dimensions are time, item, location.
● A dimension table describes the attributes of a dimension.

Advantages -
1. Easy to handle and maintain
2. MDM performance is better than the relational model
3. Can be used for complex systems and applications
Disadvantages -
1. Design is complicated because multidimensional databases are dynamic in nature
2. Achieving the end product is complicated
3. The system can become insecure when a security breach occurs

2. Types of Reports –
a. List
● Used to show detailed information from your database.
● E.g. item list,customer list
● Data in list is shown in the form of rows and columns
● Each column shows all the values for a data item in the database.
● Following operations can be performed on the list.
○ Set list properties.
○ Hide columns in list report
○ Create a scrollable list.
○ Use repeaters.
b. Crosstabs Reports
● Also known as matrix reports.
● Used to show relationships between three or more query items.
● Data is shown in rows and columns with information summarized at the intersection
points.
● Operations that can be performed on crosstabs:
○ Work with crosstab nodes and node members
○ Set crosstab properties
○ Create a single-edge crosstab
○ Create a nested crosstab
○ Swap columns and rows
○ Change a list into a crosstab

c. Statistics
● By using statistics, the collected data can be summarized and represented
in such a way that it can be easily understood and actionable insights
can be extracted.
● Statistical reporting strategy uses three basic statistical tests.
○ Descriptive statistics
■ Main objective of descriptive statistics is to demonstrate a
huge portion of the collected data through summary,charts
and tables.
○ Inferential statistics
■ Provide a more detailed and effective statistical data
analysis
○ Psychometric tests
■ Psychometric tests analyze the attributes and performance
of the conducted survey to ensure that the survey data is
reliable and valid
● Statistical reporting tools include factor analysis,cluster analysis,gap
analysis,Z-test and U-test.
● Types of statistical reporting data
○ Categorical data
○ Ordinal data
○ Interval data
○ Ratio data
d. Chart
● Used to present data in a way that is useful to the end user.
● They are visual representations of all types of data which may or may not
be related.
● It can represent large sets of data which makes it easy to understand.
● Combinational charts are those where more than one type of charts are
used to represent data.
● Many different types of charts are available:-
a. Column charts
b. Line charts
c. Pie charts
d. Bar charts
e. Area charts
f. Point charts
g. Scatter chart
h. Bubble chart
i. Quadrant chart
e. Map
● Another approach of data visualization which is used to analyze and
represent the geographically correlated data and present it in the form of
maps.
● Helps to identify insights from data and make proper decisions.
● Different types of maps:-
a. Heat map:- A heat map is used to represent the relationship
between two data variables and provides quantity-wise
information, such as high, medium, low. It is represented using
different colors.
b. Point map
c. Flow map
d. Statistical maps
e. Bubble map
f. Regional map
g. Administrative maps.
f. Financial
● Also known as a financial statement; it is a management tool used for
communicating key financial information.
● Used by organizations to track their performance, report to their investors, and stay
compliant with regulations that require them to follow certain
guidelines.
● The three major types of financial reports are the balance sheet, income
statement and cash flow statement.

Data Grouping & Sorting


Grouping:-
● Process of selecting rows that share common properties.
● Grouping makes any document easy to understand
● Also removes repeated data and summarizes the category and values as per
calculations
● Grouping is based on selected category and conditional parameters. (eg - select
employee in HR dept whose salary > 25000)
● As per conditions, the data is summarized category wise.
● Steps in grouping and sorting:-
a. Add sort operation on selected columns
b. Put all sorted data into groups
c. Add sub-groups.
Sorting
● Sorting data in reports can be done in two ways:
firstly, sort the data source object itself and add groups to the report using group
by, and secondly,
specify how each group should be sorted using the group by & order by.
● Data can be sorted in two orders:
○ Ascending sorting
○ Descending sorting
3. Filtering Reports
● Filters are useful for simplifying large amounts of data and displaying only the data
that users really need to see.
● Filters ensure that the reports contain data specific to the business query.
● To retrieve the desired data, it is important to design the filter correctly.
● Hidden filters can be enabled for additional filtering.
● When a report filter is run, criteria from report filters can be selected to limit
what data is displayed in the report.

4. Adding Calculations to Reports


● Calculations can be performed in a report by using one or more report items
● Eg - to see the result of a 5% increase in salary, the salary column
can be multiplied by 1.05.

● Some more operations are :- Power, square root, % of total, etc.
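
As an illustration (not from the notes), assuming the report data is held in a pandas DataFrame with a made-up Salary column, such calculated report items could look like this:

```python
import pandas as pd

# Hypothetical report query result
report = pd.DataFrame({"Employee": ["A", "B", "C"],
                       "Salary": [30000, 45000, 52000]})

# Calculated report items: a 5% salary increase and a percent-of-total column
report["Salary_with_5pct_raise"] = report["Salary"] * 1.05
report["Pct_of_total"] = report["Salary"] / report["Salary"].sum() * 100
print(report)
```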

5. Conditional formatting
● Helps users to extract interesting data from reports.
● Basically works on changing the appearance of cells by highlighting them in
different colors or format.
● These conditions are user defined rules like comparing with some numerical
values, result of some formula and text matching.
● Conditional formatting options in Excel -
a. Highlight cell rules
b. Top/ Bottom rules
c. Data bars
d. Color scales
e. Icon sets
6. Adding Summary Lines to Reports
● Helps to extract the quick insights from the dataset and also helps for further
analysis of business.
● Number of tools such as excel,PowerBI, Tabelau,Power Query Builder etc are
available.
● Summaries can be applied to detail values and summary values.
● Predefined summary functions include.
○ Total
○ Count
○ Minimum
○ Maximum
○ Average
7. Drill up
● Performs aggregation by ascending the location hierarchy where one or more
dimensions are removed.
● E.g. monthly salary can be aggregated to yearly salary, or a group of districts
can be shown as one state
Drill- down
● It is a dimension expansion technique that can be applied by adding new
dimensions or expanding existing dimensions.
● E.g. states can be drilled down to districts, or yearly salary can be broken down into
monthly salary.
Drill-through capabilities:
● Using this, we can move from one report to another within a session while
maintaining focus on the same data.
● Used to build analytical applications that are bigger than single report
● Drill-through operations consist of a network of linked reports that users can
navigate, retaining context and focus to explore and analyze information.

8. Run or schedule report


● Users with administrative privileges have the ability to schedule a report.
● Scheduled reports start automatically at the defined time.
● They execute in the background and the results can be viewed later.
● To prevent automatic execution, reports can be deactivated
● Reporting users can view the existing report schedules they are associated with
or can edit reports they have created.
9. Different output forms – (To be updated)
a. PDF (Portable Document Format)
i. It's a versatile file format created by Adobe that gives people an easy,
reliable way to present and exchange documents - regardless of the
software, hardware, or operating systems being used by anyone who
views the document.
ii. PDF supports majority stylesheet attributes which helps in full report
formatting.
iii. PDF displays documents consistently irrespective of application
software,hardware and operating systems.
b. Excel
i. The format is also known as XLSX and provides a fast way to deliver
native Excel spreadsheets.
ii. A few limitations of the Excel format: charts are exported as static images, row
heights can change in the report, and column widths are ignored.
c. CSV(Comma Separated Values)
i. Can be opened using variety of spreadsheet software applications and
has .csv extension
d. XML(Extensible Markup Language)
a. They save the data in internal schema format, xmldata.xsd
b. XML format consist of data elements having metadata element and a data
element.
c. Metadata has data item information and data element contains all the
rows and columns
d. XML and CSV formats cannot produce maps, charts that do not have at least one
category or series, or reports that have more than one query defined in
the report.

CSV:

1. Universal compatibility: CSV files are widely supported by various applications,
programming languages, and database systems. They offer a simple and
standardized format for data exchange, allowing seamless integration with
different platforms and systems in the BI ecosystem.
2. Lightweight and efficient: CSV files are lightweight and have a small file size
compared to other formats, making them ideal for handling large volumes of
data in BI. They are efficient for data storage, transfer, and processing, ensuring
faster data retrieval and analysis.
3. Easy to import and export: CSV files can be easily imported into BI tools or
spreadsheet applications like Excel, making it convenient for users to manipulate
and analyze the data. Similarly, exporting data from BI tools to CSV format is
straightforward, enabling users to share data with stakeholders or integrate it
into other systems.
4. Human-readable and editable: CSV files are human-readable, as the data is
presented in plain text with values separated by commas. This readability allows
users to inspect and modify the data using text editors or spreadsheet software.
It facilitates data cleaning, transformation, and customization according to
specific requirements.

PDF:

1. Preserves formatting: PDF files are designed to preserve the formatting and
layout of documents, including images, charts, fonts, and styles. In BI, this
ensures that reports, dashboards, and visualizations are presented consistently,
maintaining their visual integrity across different devices and platforms.
2. Secure and non-editable: PDF files can be encrypted and password-protected,
providing security for sensitive BI data. They are typically non-editable, preventing
unauthorized modifications and preserving the integrity of the information. This
is crucial when sharing reports or distributing information externally.
3. Print-friendly: PDF files are optimized for printing, ensuring that BI reports or
documents can be easily printed without any loss of quality or formatting issues.
This makes it convenient for users who prefer physical copies or need to share
hard copies of the BI output.
4. Cross-platform compatibility: PDF files can be viewed on various devices and
operating systems using free PDF readers. This cross-platform compatibility
ensures that BI reports can be accessed and reviewed by stakeholders,
regardless of the software or hardware they use.

XML:

1. Structure and organization: XML allows for structured representation of data
using user-defined tags, making it suitable for modeling complex and hierarchical
data in BI. It enables logical grouping, organization, and categorization of data
elements, facilitating effective data integration and interpretation.
2. Metadata support: XML supports the inclusion of metadata, which provides
additional context and description about the data. Metadata enhances the
understanding of the data's meaning, source, and characteristics, enabling better
data governance and facilitating effective data management in BI environments.
3. Extensibility and flexibility: XML is highly extensible, allowing users to define their
own custom tags and attributes to suit specific business requirements. This
flexibility enables seamless integration and interoperability between different
systems and applications in the BI landscape.
4. Data transformation and interoperability: XML is commonly used for data
transformation and exchange between heterogeneous systems in BI workflows.
It serves as a common format that enables interoperability and seamless data
integration across platforms, facilitating efficient data sharing and collaboration.

Excel:

1. Advanced data analysis: Excel provides a rich set of features and functions for
data analysis and manipulation. Users can perform complex calculations, apply
formulas, create charts, and perform statistical analysis directly within Excel,
making it a popular choice for in-depth BI analysis.
2. Data visualization: Excel offers a wide range of charting and graphing options,
allowing users to create visual representations of their BI data. These
visualizations can aid in understanding trends, patterns, and relationships,
enabling stakeholders to derive meaningful insights from the data.
3. Formula-driven calculations: Excel supports powerful formula capabilities,
enabling users to perform calculations on BI data dynamically. This allows for
real-time updates and automatic recalculation of values based on changes in
underlying data, facilitating dynamic reporting and analysis.
4. Collaboration and sharing: Excel files can be easily shared and collaborated on
within teams. Multiple users can work on the same Excel file simultaneously,
making it convenient for collaborative BI efforts. Additionally, Excel files can be
saved in cloud storage platforms, allowing for easy access and sharing across
different locations.
UNIT 4
Data validation:
● The quality of input data may prove unsatisfactory due to incompleteness, noise and
inconsistency.
● Hence this data is corrected during data pre-processing by filling in missing
values, smoothing out the noise and correcting inconsistencies.
1. Incomplete data (I SIE)
● There is a possibility that some data were not recorded at source in a systematic way or
it may not be available at the time of transaction of record.
● Techniques to partially correct incomplete data are as follows:
○ Elimination: in classification, if a class label is missing for a
row, that data row can be eliminated. Similarly, if many attributes within a row are
missing, the data row can be eliminated.
○ Inspection: inspect each missing value individually and determine a plausible value
from the possible values of the attribute. It is time consuming for large datasets but is accurate if
skillfully exercised.
○ Identification: a conventional value can be used to code and identify missing
values, so it is not necessary to delete entire records from the
dataset. Example: for a continuous attribute that assumes only positive values, it
is possible to assign the value {-1} to all missing data.
○ Substitution: for missing values, the mean or median of the observed values may be
used as a replacement. Example: in a database with family income, missing values
may be replaced with the average income.
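
A minimal sketch of the identification and substitution techniques above, assuming pandas and a made-up family_income column:

```python
import pandas as pd

df = pd.DataFrame({"family_income": [42000, None, 55000, 61000, None]})

# Identification: code missing values with a conventional value such as -1
coded = df["family_income"].fillna(-1)

# Substitution: replace missing values with the mean of the observed values
imputed = df["family_income"].fillna(df["family_income"].mean())
print(coded.tolist(), imputed.tolist())
```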
2. Data affected by noise .
● A random error or variance in a measured variable is known as noise.
● Noise in a data may be introduced due to:
○ Fault in data collection instruments.
○ Error introduced at data entry by a human or a computer.
● Outliers in the dataset must be identified so that they can be corrected and adjusted
subsequently,or entire records containing them can be removed.
● Various ways to identify outliers.
○ Outlier analysis by clustering
■ Partition dataset into clusters and one can store cluster representation
only i.e. replace all the values of cluster by that one value representing
the cluster.
○ Regression
■ Regression is a statistical measure used to determine the strength of the
relationship between one dependent variable denoted by Y and a series
of independent changing variables.
■ Regression analysis on attribute values can also be used to fill missing values. Two
basic types of regression are linear regression and multiple
regression. The difference between linear and multiple regression is that the
former uses only one independent variable whereas the latter uses two or
more independent variables to predict the outcome.

Data transformation:
● Data warehouses integrating data from multiple sources face a problem of
inconsistency. To deal with this inconsistency, a data transformation process is employed.
● Data transformation techniques are used to normalize or rescale numerical data to make
it more manageable and comparable.
1.Standardization (mzd) like msd
● Standardization is the process of making the entire dataset values have a particular
property.
● Following methods can be used for standardization /normalization.
○ Min-Max
■ Min-max scaling is a data transformation technique that rescales the
values of a numerical feature to a specific range, typically between 0 and
1.
■ This transformation preserves the relative ordering of the data while
ensuring that all values are within the desired range.
■ It achieves this by subtracting the minimum value from each data point
and then dividing it by the difference between the maximum and minimum
values.
x' = (x - min(x)) / (max(x) - min(x))
○ Z-score
■ Transforms a numerical feature by subtracting the mean value and
dividing it by the standard deviation.
x' = (x - mean(x)) / std(x)
○ Decimal scaling
■ Decimal scaling is a data transformation technique that scales down the
values of a numerical feature by shifting the decimal point.
■ The feature is divided by a power of 10 to ensure that the transformed
values lie within a specified range, typically between -1 and 1.
x’=x/10^k
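
A minimal sketch of the three rescaling methods above, assuming NumPy and a small made-up attribute:

```python
import numpy as np

x = np.array([20.0, 35.0, 50.0, 80.0, 95.0])   # hypothetical attribute values

# Min-max scaling to the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization (zero mean, unit standard deviation)
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^k, with k the smallest power of 10
# that brings the maximum absolute value below 1
k = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / 10**k

print(min_max, z_score, decimal_scaled, sep="\n")
```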
2.Feature extraction.
● Data transformation technique used to reduce the dimensionality of a dataset by
selecting or creating a smaller set of features that capture the most important and
relevant information.
● It helps in removing redundant or irrelevant features and focuses on those that
contribute the most to the analysis or prediction task at hand.
● It converts the raw data into compact representation that still retains the essential
characteristics of the original data.
● The extracted features can then be used as input for various machine learning
algorithms or statistical models to perform tasks such as classification, regression, or
clustering.

Data reduction :
● Data reduction refers to the process of reducing the size or complexity of a dataset while
preserving as much relevant information as possible.
● It aims to overcome challenges such as high dimensionality, computational inefficiency,
or noise in the data.
Criteria to determine whether a data reduction technique should be used:
1. Efficiency:- data reduction increases -> efficiency increases
2. Accuracy:- data reduction increases -> accuracy decreases
3. Simplicity:- data reduction increases -> simplicity increases

1. Sampling
● Sampling involves selecting a subset of the original data points from a larger
dataset.
● It is done to reduce the computational burden or to obtain a representative
sample that captures the essential characteristics of the entire dataset.
● Types of sampling
○ Simple random sampling: equal probability of selecting any particular
item.
○ Sampling without replacement: as each item is selected, it is removed
from the population.
○ Sampling with replacement: objects selected for the sample are not
removed from the population. In this technique the same item may be
selected multiple times.
○ Stratified sampling: data is split into partitions and samples are drawn
from each partition randomly.
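
A minimal sketch of these sampling schemes, assuming pandas and a made-up sales table:

```python
import pandas as pd

df = pd.DataFrame({"region": ["N", "N", "S", "S", "S", "E", "E", "W"],
                   "sales":  [10, 12, 7, 9, 11, 14, 13, 8]})

# Simple random sampling without replacement
without_repl = df.sample(n=4, replace=False, random_state=1)

# Sampling with replacement: the same row may be selected more than once
with_repl = df.sample(n=4, replace=True, random_state=1)

# Stratified sampling: draw the same fraction from each partition (region)
stratified = df.groupby("region").sample(frac=0.5, random_state=1)
```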

2. Feature selection
● The objective is to select an optimal number of features to train and build models
that generalize well on data and prevent overfitting.
Feature selection can be divided into 3 main areas:- (FWrE) free

● Filter Methods:
Here, features are selected based on correlation.
It checks relationship of each feature with the response variable to be predicted
Types of methods: - Threshold based method, Statistical tests
● Wrapper Methods:-
These methods try to capture the interaction between multiple features by using
a recursive approach to build multiple models using subsets of features and
select the best feature subset.
Types of methods:- Forward selection, Backward elimination.
● Embedded Methods:-
Combine benefits of filter and wrapper methods
Uses ML models to rank and score feature variables based on their
importance.
Types of methods:- Random forest, Decision trees.
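
A minimal sketch of the three families of methods, assuming scikit-learn and its bundled breast-cancer dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the strongest univariate link to the target
X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a model
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)

# Embedded method: tree-based importance scores rank the features
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
```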

3. Principal component analysis

● Helps to resolve overfitting by reducing data dimensionality.

● Finds the principal components of the data, whose number can be less than or equal to the no. of
attributes.
● Views:- allows viewing the data in a lower dimension (2D -> 1D).
● PC1 has the highest priority, as it captures the largest share of variance.
● Orthogonal property:- PC1 and PC2 (and other PCs) should be independent /
orthogonal to each other.
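
A minimal PCA sketch, assuming scikit-learn and random made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical 5-dimensional data

pca = PCA(n_components=2)              # keep only the first two principal components
X_reduced = pca.fit_transform(X)       # view the data in a lower dimension

# PC1 explains the most variance; PC2 is orthogonal to PC1
print(pca.explained_variance_ratio_)
```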

Data discretization (bcc hd)


1. Binning
Discretization by binning has two approaches
1. Equal-width partitioning
Divides the range into N intervals of equal size (uniform grid).
Bin width = (max value - min value) / N
E.g. consider a set of observed values in the range 0-100.
The data could be placed into 5 bins as follows:
[0-20], [20-40], [40-60], [60-80], [80-100]
2. Equal-frequency partitioning
● The entire range is divided into N intervals, each containing
approximately the same number of samples.
Let us consider the sorted data:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
● Smoothing by bin means: each value in a bin is replaced with the bin's
mean value.
● Smoothing by bin boundaries: the minimum and
maximum values of the bin are taken as the bin boundaries, and each value
is replaced with its nearest boundary value, either the minimum or the maximum.
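
A minimal sketch of equal-width and equal-frequency binning (and smoothing by bin means) using pandas on the sorted data above:

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: 3 bins of equal width
equal_width = pd.cut(values, bins=3)

# Equal-frequency partitioning: 3 bins with roughly the same number of samples
equal_freq = pd.qcut(values, q=3)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = values.groupby(equal_freq).transform("mean")
print(smoothed.tolist())
```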

2. Histogram Analysis
● It replaces data with an alternative, smaller data representation.
● Divide data into buckets and store average (sum) for each bucket.
● A bucket represents attribute-value / frequency pair
● Types of histogram:-
a. Equal-width histograms - divides the range into N intervals of equal size
b. Equal-depth (frequency) partitioning - Divides the range into N intervals,
each containing approx same no. of samples.
3. Cluster Analysis
● Clustering is used to group the elements based on their similarity w/o prior
knowledge of their classes
● No target variable is to be predicted
● Categorisation of clusters:- (HOPE)
a. Exclusive clusters - each element belongs to exactly one cluster
b. Overlapping clusters - an element may belong to more than one cluster
c. Probabilistic clusters - each element belongs to every cluster with a certain probability
d. Hierarchical clusters - clusters are nested within larger clusters, forming a hierarchy

4. Decision Tree Analysis


● Uses top-down splitting strategy.
● Supervised technique that uses class information
● It identifies the optimal splitting points that would determine the bins
● The original variable values are then replaced by the probability returned by the
tree.
● The probability is the same for all the observations within a single bin

5. Correlation Analysis
● Data redundancy can arise when data from multiple sources is considered for integration.
● The X^2 (chi-square) test can be carried out on nominal data to test how strongly
two attributes are related.
● The correlation coefficient and covariance may be used with numeric data; they
measure how the attributes vary together.
Data exploration :
● Highlights the relevant features of each attribute using graphical methods.
● 3 phases :-
1. Univariate analysis:- investigates the properties of each single attribute in the dataset
2. Bivariate analysis:- measures the degree of relationship between pairs of attributes
3. Multivariate analysis:- investigates relationships holding within a subset of
attributes.

1.Univariate analysis :
● Univariate analysis is a statistical analysis technique that focuses on examining and
understanding a single variable at a time.
● It involves studying the characteristics, distribution, and summary statistics of a single
variable to gain insights and draw conclusions about it.
● Weight (kg) of five students of a class: [55, 60, 70, 50, 56] -> only one variable
● Conclusions can be drawn using
○ central tendency measures (mean, median, mode),
○ dispersion or spread of data (range, min, max, variance, standard deviation).
○ frequency distribution tables, histograms, etc
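
A minimal sketch of these univariate summaries for the weights above, using only Python's statistics module:

```python
import statistics

weights = [55, 60, 70, 50, 56]   # weight (kg) of the students above

print("mean   :", statistics.mean(weights))
print("median :", statistics.median(weights))
print("mode   :", statistics.mode(weights))   # most frequent value (here all are unique)
print("range  :", max(weights) - min(weights))
print("stdev  :", statistics.stdev(weights))  # sample standard deviation
```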

A. Graphical analysis of categorical attributes

● Categorical attributes allow for only qualitative classification


● Can be described using graphs where we can visualize the difference in proportions or
% of times a particular value is observed.
Relative frequency = Frequency / n
(n-> sample size)
● Methods to represent categorical data-
a. Line graph: to track changes over short and long periods of time
b. Pie chart: to compare parts of a whole.
c. Bar chart: compare things between different groups or to track changes over
time.

B. Graphical analysis of numerical attributes


● Can be done using box-plots and histograms.
a. Box-Plot
● Used to represent the distribution of data
● The degree to which numeric data tends to spread is called dispersion or
variance of data.
● Quartiles: Q1 (25th percentile), Q3( 75th percentile)
● Inter-Quartile Range: distance between 1st and 3rd quartile
IQR = Q3 - Q1
● Five number summary:- min, Q1, M, Q3, max
To obtain complete summary of distribution

1. Arrange data in ascending order


2. Find the median (M or Q2)
3. Q1 = median of the data points below M
4. Q3 = median of the data points above M
5. IQR = Q3 - Q1
6. Find min and max values
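
A minimal sketch of the five-number summary and the usual 1.5×IQR outlier rule, assuming NumPy:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

five_number_summary = (data.min(), q1, median, q3, data.max())
print(five_number_summary, "IQR =", iqr)

# Box-plot convention: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```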

b. Histograms
● Is a graphical representation of the distribution of single variable
● Provides visual summary of the frequency or count of observations falling into
different intervals or bins
● Provides insights into central tendency, spread, skewness, and outliers
C. Measures of central tendency for numerical attributes

● Central tendency, also known as a measure of central location, describes the central
position within a set of data.
● Different measures of central tendencies.
a. Mean
i. Mostly used with continuous attributes.
ii. Mean is equal to sum of all the values in the data set divided by total no.
of observation in it.
iii. Disadvantage - highly susceptible to outliers. Thus, median is preferred in
such cases.

b. Median
i. Suitable when data is skewed as it is not affected by skewed values
ii. It is the middle score for a dataset, which is in ascending order.
iii. Odd values case - central value
iv. Even values case - average of two central values
v. Used in cases where we have extreme large values
c. Mode
i. It is the most frequently occurring value in the dataset.
ii. It is used for categorical data, where the most common category is to be
known.
iii. Types - Unimodal (1 mode), Bimodal (2 modes), Trimodal (3 modes)
iv. Empirical relation:
Mean - mode = 3 x (mean - median)
d. Midrange
i. Average of the largest and smallest value in dataset

D. Measures of dispersion for numerical attributes,


● Help in understanding the distribution of data
● As data becomes more diverse, the value of measure of dispersion increases
● Measures of dispersion:- (Refined Value System Makes Quality)
a. Range - difference between two extreme observations
Range = X max - X min
b. Variance - the average of the squared deviations of the values from the mean
c. Standard deviation - square root of variance
- Measures the spread about the mean.
- It is zero if and only if all the values are equal.

d. Mean Absolute Deviation-


i. The mean absolute deviation (MAD) is a statistical measure that
quantifies the average absolute difference between each data point in a
dataset and the mean of the dataset.
ii. It provides a measure of the average variability or dispersion of the data
points around the mean.

e. Quartile Deviation -
i. Half the distance between the first and third quartiles (semi-interquartile range)
ii. QD = (Q3 - Q1) / 2, where IQR = Q3 - Q1

E. Identification of outliers for numerical attributes


a. Z-Score / Zero-Mean Normalisation
i. In z-score normalisation, data is normalised based on the mean and std
deviation
b. Box-Plot - already given

2. Bivariate analysis:
● It is an analysis of two variables to determine the relationship between them.
● Used to test hypothesis of association.
● There are 3 cases in bivariate analysis:-
○ Both attributes are numerical
○ One attribute is numerical and other is categorical
○ Both are categorical

A. Graphical analysis
a. Scatter plot
i. Scatter plots can reveal clusters or groupings of data points, indicating a
potential subgroup or pattern within the data.
ii. They can also highlight outliers
iii. Tells relationship between 2 variables : Response variable (Y), Other
independent variable (X)
iv. Example: plotting advertising spend (X) against sales (Y) for a set of stores.
b. Loess Plots
i. Loess curve is used for fitting a smooth curve between two variables
ii. Applies nonparametric smoothing techniques to scatter plots
iii. They fit a smooth curve to data points, capturing underlying trends or
relationship between variables
iv. Also called as local regression.

c. Level Curves
i. Level curves, also known as contour lines or isocontours, are curves on a
two-dimensional surface or map that connect points of equal value of a
particular quantity.
ii. This quantity could be a function, such as temperature, elevation,
pressure, etc
iii. By examining the contour lines, one can observe areas of high or low
values
iv. Level curves are commonly used in topographic maps to represent
elevation or height above sea level.
d. Quantile-Quantile plots
i. Useful to compare the quantile of two sets of numbers
ii. Example: comparing the quantiles of a sample against the quantiles of a normal
distribution to check whether the data is normally distributed.

e. Box-Plots - already given

f. Time series -
i. Time series data is a collection of observations for a single entity at
different intervals of time.
ii. Example of time series analysis:- Rainfall measurement, Stock prices
iii. It is plot of time series data on one axis (Y-axis) against time on the other
axis (X-axis)

B. Measures of correlation for numerical attributes


a. Correlation Coefficient
● If r(p,q) > 0 -> p and q are positively correlated
● If r(p,q) = 0 -> p and q are uncorrelated (no linear relationship)
● If r(p,q) < 0 -> p and q are negatively correlated.

b. Covariance
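
A minimal sketch of both measures for two made-up attributes p and q, assuming NumPy:

```python
import numpy as np

p = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical attribute p
q = np.array([1.0, 3.0, 7.0, 9.0, 12.0])   # hypothetical attribute q

r = np.corrcoef(p, q)[0, 1]   # Pearson correlation coefficient, in [-1, 1]
cov = np.cov(p, q)[0, 1]      # covariance between p and q

print("r =", r, "covariance =", cov)   # r > 0 here, so p and q are positively correlated
```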

C. Contingency tables for categorical attributes


a. Contingency tables, also known as cross-tabulation or crosstab tables, are used
to summarize and display the relationship between two or more categorical
variables.
b. They provide a tabular representation of the frequencies or counts of
observations that fall into various combinations or categories of the variables
being analyzed.
c. A contingency table is structured as a grid or matrix, with rows representing one
categorical variable and columns representing another categorical variable.
d. Each cell of the table contains the count or frequency of observation
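
A minimal sketch, assuming pandas and SciPy, that builds a contingency table from two made-up categorical variables and runs the chi-square test mentioned under correlation analysis:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"gender":   ["M", "F", "F", "M", "F", "M", "F", "M"],
                   "response": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"]})

# Contingency table: counts for each combination of the two categorical variables
table = pd.crosstab(df["gender"], df["response"])
print(table)

# Chi-square test of independence on the nominal data
chi2, p_value, dof, expected = chi2_contingency(table)
print("chi2 =", chi2, "p =", p_value)
```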
3.Multivariate analysis:

● It is the analysis of more than two variables simultaneously, where multiple
variables may together influence a single outcome.
● Eg - sales of a product depend on product category, location, cost of product, etc

a. Graphical analysis
i. Scatter Plot Matrix
1. Scatterplot matrices are a great way to roughly determine if you have a
linear correlation between multiple variables.
2. The variables in the scatterplot matrix are written in a diagonal line from
top left to bottom right.
3. In an example scatterplot matrix of tree measurements (Girth, Height, Volume), it can
be said that there is a correlation between Girth and Volume because the plot looks
like a line; there is probably less of a correlation between Height and Girth.
ii. Star Plot
1. Star plots,sometimes called radar charts or web charts are used to
display multivariate data.
2. Multivariate in this sense refers to having multiple characteristics to
observe.
3. Starplots are often used to display several different observations of the
same type of data.
iii. Spider Web Chart
1. Known as radar chart ,is often used when you want to display data across
several unique dimensions.
2. These dimensions are usually quantitative,and typically range from zero
to maximum value.

b. Measures of correlation for numerical attributes


i. Variance-Covariance Matrix
Additional Points:
Univariate Analysis:
● Provides a basic understanding of individual variables and their properties.
● Helps identify outliers, missing values, and distributional patterns.
● Summarizes and visualizes data using measures of central tendency and
variability.
● Useful for initial data exploration and descriptive statistics.

Bivariate Analysis:
● Explores the relationship between two variables to understand their association.
● Determines the strength, direction, and significance of the relationship.
● Visualizes the relationship using scatter plots, line graphs, or bar charts.
● Helps in hypothesis testing and identifying potential predictors or dependent
variables.

Multivariate Analysis:
● Examines the relationships among multiple variables simultaneously.
● Considers the interdependencies and interactions between variables.
● Provides a deeper understanding of complex relationships and patterns.
● Aids in predicting outcomes, identifying latent factors, and clustering similar
cases.
Unit 5:
Impact of Machine learning in Business Intelligence Process
Classification:

Classification problems
● The Classification algorithm is a Supervised Learning technique that is used to identify
the category of new observations on the basis of training data.
● In Classification, a program learns from the given dataset or observations and then
classifies new observations into a number of classes or groups. Such as, Yes or No, 0 or
1, Spam or Not Spam, cat or dog, etc.
● Classes can be called as targets/labels or categories.
● In the classification algorithm, a discrete output function(y) is mapped to the input
variable(x).

y=f(x), where y = categorical output

There are two types of Classifications:


○ Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
○ Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
Below are some popular use cases of Classification Algorithms:
○ Email Spam Detection
○ Speech Recognition
○ Identifications of Cancer tumor cells.
○ Drugs Classification
○ Biometric Identification, etc.
Evaluation of classification models
1. Accuracy and Error Measure
2. Holdout Method
● In the holdout method, data is divided into a training data set and a testing data
set (usually ⅓ for testing and ⅔ for training).
● The more training data, the better the constructed model; the more test data, the more
accurate the error estimate.

3. Cross Validation
● It helps in obtaining a more reliable estimate of how well the model is
likely to perform on unseen data by simulating the process of training and
testing on multiple different subsets of the data.
● First step: data is split into k subsets of equal size.
● Second step: each subset in turn is used for testing and the remainder for
training.
● The advantage is that all the examples are used for testing and training.
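
A minimal sketch of the holdout split and k-fold cross-validation, assuming scikit-learn and its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout method: roughly 2/3 for training and 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: every example is used for both training and testing
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())
```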

4. ROC Curve
● ROC curve stands for Receiver Operating Characteristics Curve.
● A trade-off between the true positive rate and false positive rate is shown on ROC curve.
● Vertical axis represents the true positive rate and horizontal axis represents the false
positive rate
● The model with perfect accuracy will have an area of 1.0
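
A minimal sketch that computes the points of a ROC curve and its area, assuming scikit-learn and its bundled breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]            # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)    # false and true positive rates
print("area under ROC curve:", roc_auc_score(y_test, probs))   # 1.0 = perfect model
```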

5. Bootstrapping
● The bootstrap sampling method is a resampling method that uses random sampling with
replacement.

● This means that it is very much possible for an already chosen observation to be
chosen again.
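
A minimal sketch of drawing one bootstrap sample, assuming NumPy:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 24, 25, 28])
rng = np.random.default_rng(0)

# Bootstrap sample: random sampling WITH replacement, same size as the original
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)   # some observations may appear more than once
```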

Bayesian methods
Logistic regression
● It is a supervised ML algorithm mostly used for classification problems. It is used for
predicting a categorical dependent variable using a given set of independent
variables.
● Output : probabilistic values which lie between 0 and 1.
● Curve : "S"-shaped logistic function, which predicts two maximum values (0 or 1).
● Mathematical formula - logistic regression uses the sigmoid function, which
can be written as y = 1 / (1 + e^(-x)), where y is the dependent variable
and x is the independent variable.
● The sigmoid function is used to convert the independent variable into an expression
of probability that ranges between 0 and 1.
● Types of logistic regression - Binomial, Multinomial, Ordinal
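
A minimal sketch of the sigmoid function and a fitted logistic regression model, assuming scikit-learn and a made-up hours-studied vs pass/fail dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # Maps any real value to a probability between 0 and 1 (the "S" curve)
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data: hours studied (x) vs pass/fail (y)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[4.5]]))   # probabilities for class 0 and class 1
print(sigmoid(0.0))                 # 0.5, the midpoint of the S-shaped curve
```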

Clustering: Clustering methods


● Clustering is the process of dividing the objects into various groups.
● Similar objects are present within the same cluster/group and dissimilar objects are
present in the different cluster.
● Applications of clustering are as follows
○ Used by Amazon,Netflix,to provide recommendations as per past searches.
○ Social Network Analysis.
○ City Planning
Types of clustering methods:-

1. Partitioning Clustering

Divides the data into non-hierarchical groups. It is also known as the centroid-based method.
The most common example of partitioning clustering is the K-Means Clustering algorithm.

2. Density-Based Clustering

The density-based clustering method connects the highly-dense areas into clusters

3. Distribution Model-Based Clustering

The data is divided based on the probability of how a dataset belongs to a particular distribution.
The grouping is done by assuming some distributions, commonly Gaussian Distribution.

4. Hierarchical Clustering

the dataset is divided into clusters to create a tree-like structure, which is also called a
dendrogram.
5. Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster.

Partition methods
● There are further two types:
○ K-means(Link for example https://youtu.be/CLKW6uWJtTc)
■ K-means clustering is a method of vector quantization.
■ It originates from signal processing.
■ It aims to partition 'n' observations into 'k' clusters where each
observation belongs to the cluster with the nearest mean, which serves as a
prototype of the cluster.
■ The centroid is a point that represents the mean of the parameter values of all the
points in the cluster.
■ Steps of K-means clustering (a minimal code sketch follows after this list):
● First, choose the final number of clusters (k).
● Examine each element and assign it to one of the clusters
depending upon the minimum Euclidean distance.
● Each time an element is added to a cluster, the centroid position
is recalculated. This process goes on until all the elements are
grouped into clusters.

○ K-Medoids or PAM (Partitioning around Medoids) : each


cluster is represented by one of the objects in the cluster.
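
A minimal K-means sketch on made-up 2-D points, assuming scikit-learn (referred to above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # each centroid is the mean of the points in its cluster
```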

Hierarchical methods
● In this technique, the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram.
● There are two types of Hierarchical clustering:- Divisive and Agglomerative
Agglomerative clustering: bottom-up approach. Starts with individual data points as
clusters and moves upwards by grouping similar data points until a single cluster
is obtained.
Divisive clustering: top-down approach. Initially, all objects are in one cluster,
which is then sub-divided until we are left with individual data points.

Evaluation of clustering models.


● Different aspects that may be considered for the validation of clustering
algorithms includes
○ Clustering tendency in the data.
○ Correct number of clusters
○ Quality of clusters
○ Comparing two sets of clusters to find which is better.
● Internal Validation Methods.
○ Here the quality of clustering methods can be evaluated without using external
information.
○ Two types of internal validation metrics can be used.
■ Cohesion: evaluates how closely the elements are related within the cluster.
■ Separation: evaluates the level of separation between the clusters.
● External Validation
○ Associated with supervised learning problems
○ This method requires additional information like external class labels for the
training examples
Silhouette Score
● The silhouette score measures how close each element is to the other elements in its own
cluster compared with the points in its neighboring clusters.
● It is based on the principle of “maximum internal cohesion and maximum cluster
separation” meaning how similar an object is to its own cluster (cohesion) compared to
other clusters (separation).
● It finds the optimal value of k (no. of clusters) during clustering
● The value of silhouette score is bounded between -1 and 1
● Score close to 1: data point assigned correctly to cluster
● Score close to -1 : data point assigned wrongly to cluster
● Score 0 : data point lies between two clusters
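
A minimal sketch of using the silhouette score to pick k, assuming scikit-learn and three made-up blobs of points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),    # hypothetical blob 1
               rng.normal(5, 0.5, (30, 2)),    # hypothetical blob 2
               rng.normal(10, 0.5, (30, 2))])  # hypothetical blob 3

# Try several values of k and keep the one with the highest silhouette score
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```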

Association Rule: Structure of Association Rule,

● Association rules have the general form
I1 -> I2 (where I1 ∩ I2 = ∅)
● The rule states that "given someone has bought the items in the set I1, they are likely to
also buy the items in the set I2".

Support
● The support of an itemset is the percentage of transactions in which the items appear.
● If A=>B,
Then Support (A=>B) = (Tuples containing both A and B) / Total no. of tuples

Confidence
● The confidence or strength of an association rule is the ratio of number of transactions
that contain A and B to the no. of transactions that contain A.
● Confidence (A=>B) = (Tuples containing both A and B) / Tuples containing A

Apriori Algorithm
● The Apriori algorithm uses frequent itemsets to generate association rules and is mainly
designed to work on databases that contain transactions.
● Used to determine how strongly or weakly the two objects are connected.
● Uses breadth-first search and a hash tree structure to generate association rules efficiently.
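
A minimal sketch of the support and confidence measures that Apriori relies on, computed by hand with pandas on a made-up set of transactions (libraries such as mlxtend provide full Apriori implementations):

```python
import pandas as pd

# One-hot encoded transactions: each row is a basket, each column an item
baskets = pd.DataFrame({"bread":  [1, 1, 0, 1, 1],
                        "butter": [1, 1, 0, 0, 1],
                        "milk":   [0, 1, 1, 1, 1]}, dtype=bool)

n = len(baskets)
both = (baskets["bread"] & baskets["butter"]).sum()

# Support(bread -> butter) = transactions containing both / total transactions
support = both / n

# Confidence(bread -> butter) = transactions containing both / transactions containing bread
confidence = both / baskets["bread"].sum()
print("support =", support, "confidence =", confidence)   # 0.6 and 0.75
```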
UNIT 6
Tools for Business Intelligence
● Business Intelligence (BI) tools are software applications that enable organizations to
gather, analyze, and visualize data to gain insights and make informed business
decisions.
● BI tools are used for query and generating reports on data, but they can also combine a
large set of applications for the analysis of data.
● Some of the tools are:-
1. Tableau:-
● One of the most popular and easy-to-use BI tools (a Salesforce product, not a Microsoft one).
● It offers a varied range of graphical representations that are extremely interactive
and pleasing.
● This tool mainly serves two functions: the collection of data and data analysis.
2. Datapine:-
● datapine is a cloud-based BI tool that simplifies data analysis and reporting.
● It offers a drag-and-drop interface to create visually appealing dashboards and
reports.
● datapine supports real-time data processing, collaboration, and provides
advanced features like predictive analytics and data mining.
3. Sisense:-
● the main purpose of the BI tool, Sisense, is to gather, analyze, and visualize
datasets, irrespective of their size.
● intuitive feature that offers a user-friendly drag-and-drop interface, allowing
anyone to use and understand it, including those who are not from an IT
background.
● also provides advanced analytics capabilities and supports data integration from
multiple sources.
4. Power BI:-
● Microsoft product that provides a suite of tools for data analysis and visualization.
● It allows users to connect to various data sources, create interactive dashboards,
and generate insightful reports.
● Power BI offers features like natural language querying, data modeling, and
collaboration options.
Role of analytical tools in BI (mr pev)
● Data exploration
○ Helps to find trends,insights that were previously concealed.
○ Users can select areas for additional study and obtain a deeper grasp of their
data
● Data visualization
○ View data in a variety of visual representations,including graphs,charts and
maps.
○ Simple to find patterns and trends which can be shared with others.
● Analytics for prediction
○ Users can create predictive models that can predict future outcomes using
information from the past.
○ This helps businesses to make better decisions and plan more effectively.
● Data Modelling
○ Build data models to comprehend the connections between various data pieces.
○ Helps companies to find areas of optimization and development.
● Reporting
○ Offer real time insights which helps companies to identify opportunities for
development and reach data-driven decisions.

Case study of Analytical Tools:


1. WEKA (Waikato Environment for Knowledge Analysis)
Introduction:-
● It is a collection of open source ML algorithms
a. Pre-processing (49)
b. Classifiers (76)
c. Clustering (8)
d. Association rule (3)
● Created by researchers at University of Waikato in New Zealand
● It is Java Based
Features
● Platform independent
● Open source and free
● Different Machine learning algorithms for data mining
● Easy to use
● Data preprocessing tools
● Flexibility for scripting experiments
● 3 Graphical user interfaces :- Explorer, KnowledgeFlow, Experimenter
WEKA Explorer:-
● Interactive Data Exploration: visualizations, summary statistics
● Data Preprocessing and Transformation: attribute selection,
normalization, filtering
● Built-in Classification and Regression:
● Visual Model Evaluation: confusion matrices, ROC curves
● Workflow Visualization

WEKA Knowledge Flow:-


● Visual Data Pipelines: data preprocessing and modeling workflows by
connecting components in a drag-and-drop manner
● Real-Time Data Processing
● Integration with External Tools: integration capabilities with external
tools and libraries, such as R and Python
● Monitoring and Visualization: visualizations and monitoring capabilities
to track the progress of data flows
● Scalability and Efficiency: ability to handle large datasets and support
parallel processing

WEKA Experimenter
● Experimental Setup and Configuration:specify datasets,algorithms
● Automated Execution of Experiments:Perform automated execution of
algorithms,cross-validation etc
● Statistical Analysis and Reporting: provides statistical analysis
features, such as significance testing and confidence intervals,
allowing users to assess the performance
● Result Visualization and Comparison: Charts,graphs,scatterplots.
● Experiment Management: allows users to save and load experiment
configurations

2. KNIME (Konstanz Information Miner)


● is an open-source data analytics and integration platform that provides a
comprehensive suite of tools for business intelligence (BI) and data
science.
● The KNIME platform offers a visual workflow interface that allows users to
build data workflows by connecting nodes representing data processing
and analysis steps.
● It supports the entire data analytics lifecycle, including data
preprocessing, transformation, modeling, visualization, and deployment.

Nodes and Workflow:-


● Here, individual tasks are represented by nodes
● Each node is displayed as a colored box with input and output ports
● Nodes can perform any tasks, including reading/ writing files, transforming data,
training model, creating visualizations

● Workflow
○ Series of interconnected nodes define a workflow.
○ Once the workflow is executed, data in the workflow flows from left to
right.
● Component
○ A component in KNIME is a reusable sub-workflow.
○ It is a way to encapsulate a set of nodes and their connections into
a single node that can be easily reused within the same workflow or
across different workflows.
● Metanode
a. It is similar to a component but provides a higher level of
abstraction.
b. A metanode allows you to group nodes together, define a
dedicated configuration interface for the group.

3. Rapid Miner and R


GUI of blank process in rapid miner
● Contains three sections
○ Repository: holds our datasets. We can import our own datasets,
and it also offers public datasets.
○ Operators: include everything we need to build a data mining
process, such as data access, data cleansing and modelling.
○ Parameters: parameters are used to adjust operators.

Features of Rapid Miner:-


● It is a GUI tool which is more user friendly for carrying out data analysis and modelling
tasks.
● Users can easily design workflows for data preparation, modelling and analysis
● Drag-and-drop functionality
● Provides pre-built algorithms for data mining, text mining and predictive modelling
Features of R:-
● It is a software environment and programming language dedicated to statistical
computing and graphics
● Open-source
● Libraries and packages for machine learning, data analysis and visualisation
● It needs more programming knowledge and has a higher learning curve

It is advantageous to combine both Rapid Miner and R.


R-> packages
Rapid miner-> workflows

1. Data analytics
● Data analysis is the process of analyzing and interpreting data which helps in
greater understanding of corporate performance,consumer behavior,market
trends etc.

a. Descriptive analytics
Summarize, compile and describe historical data
b. Diagnostic analytics
Identify patterns and determine factors that led to particular outcome
c. Predictive analytics
Find patterns in data to predict future events using previous data.
d. Prescriptive analytics
Uses data to suggest precise actions that should be taken to obtain a given
result.
2. Business analytics (UIPI)
● Understand Company Performance:
○ Evaluating data on measures such as revenue,customer happiness and
profitability helps organizations to better understand their business..
● Identify trends and performance.
○ BA helps us to find the trends and patterns in data which can be used to
make predictions.
● Improve operations
a. BA help businesses to cut cost, boost productivity, and boost consumer
happiness by pinpointing areas for improvement and optimize their
operations
● Promote innovations
a. BA assists firms in identifying new business prospects and creative
solutions
b. Studies industry trends, consumer behavior, etc.

3. ERP and Business Intelligence

4. BI and operation management


● Operation management involves utilizing the resources,staff,materials and
technology such that the minimal waste occurs
a. BI can be used to track business metrics,output rates and client demand which
will help business organizations identify the inefficiencies and places for
improvement.
b. BI can be utilized to enhance quality assurance and decrease operational
waste.Costs can be cut and customer satisfaction can increase as a result.
c. BI helps businesses in supply chain management and also helps businesses to
find risks and opportunities and make wise decisions.

5. BI in inventory management system


a. BI tools help users to collect, analyze and present data in a variety of different
formats.
b. BI enables firms to optimize inventory levels, lower costs, and boost operational
effectiveness.
c. Helps in understanding inventory data and taking well-informed decisions
d. Help detect slow moving inventory and put plans in place to stop inventory from
becoming obsolete.
e. Help to lower supply chain risks and enhance supply chain efficiency.
f. It can help in speeding up deliveries, cutting down on wait times and maximizing
inventory levels throughout the supply chain.
g. Estimate future demand and examine historical demand pattern

6. BI and human resource management


a. Help in analyzing employee data like performance indicators, retention rates and
absenteeism.
b. Assist managers in developing employee training programmes, identifying top
performers, and developing specialized retention methods.
c. Enhance recruitment and selection procedures
d. Help in identifying areas of improvement like lower time to hire, etc
e. Organizations can manage employee salary and benefits with the help of BI
f. Can compare their remuneration packages(other benefits) to industry norms.
g. Help managers to make data driven decisions to increase employee happiness
and organizational success.

7. BI Applications in CRM
a. BI helps in identifying trends as well as opportunities to boost customer
engagement,retention and satisfaction.
b. BI will be used in CRM as follows (SCAM)
i. Sales analytics.Helps to discover patterns and trends in customer
behavior.
ii. Client segmentation.Helps to divide clients into several groups based on
demographics,past purchasing patterns etc.
iii. Analysis of customer feedback.Helps to analyze customer
reviews,comments and customer care contacts
iv. Marketing analytics.Helps companies in tracking and analyzing the
results of their marketing initiatives.
8. BI Applications in Marketing (2C2P)
a. Customer segmentation:Helps to divide customers into groups depending
on characteristics like behavior,buying habits and demographics
b. Predictive analytics:Helps to forecast customer behavior and find trends
that are likely to result in sales using predictive analytics.
c. Performance metrics:Includes website traffic, conversion rates and social
media involvement.
d. Competitive analysis:Helps to monitor and examine the marketing
initiatives of rival companies ,giving users insightful knowledge into their plans.

9. BI Applications in Logistics and Production (sqp3)


a. Supply Chain Optimization
● Bi provides insights on inventory levels, shipment timelines, etc
● Lowering transportation costs, speeding up delivery time, etc
b. Predictive Maintenance
● Assist in avoiding expensive downtime and reduce maintenance
expenses
c. Quality Control
● Real-time quality control monitoring is possible
● Identify quality problems early on and stop faulty products from reaching
customers
d. Production Planning
● When to schedule production runs and how much inventory to keep on
hand by examining historical data and forecasting and demand trends.
e. Performance metrics
● Monitor KPI (key performance indicators) like on-time delivery rates,
inventory turnover, and production efficiency.

10. Role of BI in Finance (fbroc)


● Financial reporting
Provide reports that describe a company's financial performance (sales, costs,
profits, losses).
● Budgeting and forecasting
Aid in estimating future financial results by analyzing previous data. Thus, allows
financial managers to make precise projections and budget
● Risk management
Identify potential risks and problems that could have influence on financial
performance
● Operational effectiveness
Find inefficiencies in financial operations and processes thus, can streamline
procedures and cut costs
● Customer analysis
Analyze customer data -> develop targeted marketing plans -> customer
satisfaction rises -> revenue grows.

11. BI Applications in Banking (ccrpf)


● Risk Management:Helps to examine the transaction data ,credit scores which
assist banks in identifying and reducing risks.
● Customer Analytics:Helps to understand customer preferences and behavior
which helps banks in providing their clients with more individualized services and
goods
● Compliance:Helps to keep track of data pertaining to compliance issues,BI tools
can assist banks in meeting regulatory requirements.
● Performance Management:Helps to analyze performance such as profitability
,efficiency and customer satisfaction.
● Fraud Detection:With the help of transaction data,customer behavior and other
factors BI helps banks to detect fraud.

12. BI Applications in Telecommunications


● CRM
Help in analyzing customer data to learn more about consumer preferences,
behavior and trends -> retains more customers ->personalize their services
● Network Performance Management
Identify and address problems by monitoring network’s performance in real time.
->network optimisation ->outage mitigation.
● Sales and marketing
Examine customer demographics, usage trends, etc to create customized
marketing strategies.
● Efficiency in Operations
Track and examine operational data (inventory levels, billing procedures, and
staff performance) ->helps in streamlining operations, cut costs, boost
performance.
13. BI in salesforce management
● Performance monitoring
○ Coaching and training programmes can be designed for the team by
identifying potential trouble spots.
● Management of sales pipeline
○ Spot bottlenecks or locations where deals are stalling.
● Forecasting
○ Create precise sales which can help managers in better planning of the
tasks.
● Territory management:
○ Optimize resource allocation
● Management of sales team:
○ Analyze individual performance indicators and spot team-wide patterns.
