Bda Unit 5
UNIT V
Predictive Analytics and Visualizations: Predictive Analytics, Simple linear regression, Multiple
linear regression, Interpretation of regression coefficients, Visualizations, Visual data analysis
techniques, interaction techniques, Systems and application
Time series models use various data inputs at a specific time frequency, such as daily, weekly,
monthly, et cetera. It is common to plot the dependent variable over time to assess the data for
seasonality, trends, and cyclical behavior, which may indicate the need for specific transformations
and model types. Autoregressive (AR), moving average (MA), ARMA, and ARIMA models are
all frequently used time series models. As an example, a call center can use a time series model to
forecast how many calls it will receive per hour at different times of day.
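The AR idea described above can be sketched in a few lines. The following is a minimal illustration, not a production forecasting model: it generates a synthetic AR(1) series with a known coefficient and recovers that coefficient by ordinary least squares, then makes a one-step-ahead forecast (a full ARIMA workflow would use a library such as statsmodels).

```python
import numpy as np

# Synthetic AR(1) series with a known coefficient (values are illustrative).
rng = np.random.default_rng(0)
true_phi, n = 0.7, 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = true_phi * y[t - 1] + rng.normal()

# Regress y[t] on y[t-1] to estimate the AR(1) coefficient.
X = y[:-1]
phi_hat = (X @ y[1:]) / (X @ X)

# One-step-ahead forecast from the last observation.
forecast = phi_hat * y[-1]
```

With 500 observations the least-squares estimate lands close to the true coefficient of 0.7, which is why plotting and checking the series first (for trend and seasonality) matters: the simple AR form only fits data that actually behaves this way.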
Predictive analytics industry use cases
Predictive analytics can be deployed across various industries for different business problems.
Below are a few industry use cases to illustrate how predictive analytics can inform decision-
making within real-world situations.
• Banking: Financial services use machine learning and quantitative tools to predict credit
risk and detect fraud. As an example, BondIT is a company that specializes in fixed-income
asset-management services. Predictive analytics allows them to support dynamic market
changes in real-time in addition to static market constraints. This use of technology allows
it to both customize personal services for clients and to minimize risk.
• Healthcare: Predictive analytics in health care is used to detect and manage the care of
chronically ill patients, as well as to track specific infections such as sepsis. Geisinger
Health used predictive analytics to mine health records to learn more about how sepsis is
diagnosed and treated. Geisinger created a predictive model based on health records for
more than 10,000 patients who had been diagnosed with sepsis in the past. The model
yielded impressive results, correctly predicting patients with a high rate of survival.
• Human resources (HR): HR teams use predictive analytics and employee survey metrics
to match prospective job applicants, reduce employee turnover and increase employee
engagement. This combination of quantitative and qualitative data allows businesses to
reduce their recruiting costs and increase employee satisfaction, which is particularly
useful when labor markets are volatile.
• Marketing and sales: While marketing and sales teams are very familiar with business
intelligence reports to understand historical sales performance, predictive analytics enables
companies to be more proactive in the way that they engage with their clients across the
customer lifecycle. For example, churn predictions can enable sales teams to identify
dissatisfied clients sooner, enabling them to initiate conversations to promote retention.
Marketing teams can leverage predictive data analysis for cross-sell strategies, and this
commonly manifests itself through a recommendation engine on a brand’s website.
• Supply chain: Businesses commonly use predictive analytics to manage product inventory
and set pricing strategies. This type of predictive analysis helps companies meet customer
demand without overstocking warehouses. It also enables companies to assess the cost and
return on their products over time. If one part of a given product becomes more expensive
to import, companies can project the long-term impact on revenue if they do or do not pass
on additional costs to their customer base. For a deeper look at a case study, you can read
more about how FleetPride used this type of data analytics to inform their decision making
on their inventory of parts for excavators and tractor trailers. Past shipping orders enabled
them to plan more precisely to set appropriate supply thresholds based on demand.
To decide whether simple linear regression is appropriate, first compute the correlation coefficient r between X (the independent variable) and Y (the dependent variable). If r is greater than 0.85, choose simple linear regression.
If r < 0.85, apply a transformation to the data to increase the value of "r" and then build a simple linear regression model on the transformed data.
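The r-based decision rule above can be expressed directly. The calorie and weight-gain numbers below are hypothetical, chosen only to illustrate the check:

```python
import numpy as np

# Hypothetical calories/weight-gain sample (illustrative values only).
calories = np.array([1500, 1800, 2000, 2200, 2600, 3000])
weight_gain = np.array([1.1, 1.9, 2.4, 3.0, 3.9, 4.8])

# Pearson correlation coefficient r between X and Y.
r = np.corrcoef(calories, weight_gain)[0, 1]

# Decision rule from the text: r > 0.85 -> use simple linear regression;
# otherwise transform the data (e.g. log) and re-check r.
use_simple_lr = r > 0.85
```

For this near-linear sample, r comes out well above 0.85, so simple linear regression would be selected without any transformation.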
Steps to Implement Simple Linear Regression:
1. Analyze data (analyze scatter plot for linearity)
2. Get sample data for model building
3. Then design a model that explains the data
4. And use the same developed model on the whole population to make predictions.
The simple linear regression equation that represents how an independent variable X is related to a dependent variable Y is Y = β0 + β1X + ε, where β0 is the intercept, β1 is the slope, and ε is the error term.
Example:
Let us understand simple linear regression by considering an example. Suppose we want to predict weight gain based on calories consumed alone, using the data given below.
Now, if we want to predict weight gain when you consume 2500 calories. Firstly, we need to
visualize data by drawing a scatter plot of the data to conclude that calories consumed is the best
independent variable X to predict dependent variable Y.
As r = 0.9910422, which is greater than 0.85, we consider calories consumed as the best independent variable (X) for predicting the dependent variable, weight gain (Y).
Now, try to imagine a straight line drawn in a way that should be close to every data point in the
scatter diagram.
To predict the weight gain for consumption of 2500 calories, you can simply extend the straight line and read off the y value at x = 2,500. This projected y value gives you the approximate weight gain. This straight line is the regression line.
Similarly, we can substitute x = 2500 into the equation of the regression model:
So, weight gain predicted by our simple linear regression model is 4.49Kgs after consumption of
2500 calories.
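The full fit-then-predict procedure can be sketched as follows. The data here are hypothetical, not the dataset used in the text, so the prediction will differ from the 4.49 kg result above:

```python
import numpy as np

# Hypothetical calories/weight-gain data (illustrative; not the dataset
# from the text, so the prediction differs from 4.49 kg).
calories = np.array([1500, 1800, 2000, 2200, 2600, 3000], dtype=float)
weight_gain = np.array([1.1, 1.9, 2.4, 3.0, 3.9, 4.8])

# Least-squares fit of weight_gain = b0 + b1 * calories.
b1, b0 = np.polyfit(calories, weight_gain, deg=1)

# Predict weight gain at 2500 calories by substituting into the line.
predicted = b0 + b1 * 2500
```

`np.polyfit` with `deg=1` returns the slope first and the intercept second; substituting x = 2500 into the fitted line is exactly the "extend the line and read off y" step described above.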
The independent variables can be either continuous (like age and height) or categorical (like gender and occupation). It's important to note that if an independent variable is categorical, you should dummy code it before running the analysis.
Let k represent the number of variables denoted by x1, x2, x3, ……, xk.
For this method, we assume that we have k independent variables x1, . . . , xk that we can set,
then they probabilistically determine an outcome Y.
• β0 is the y-intercept: when xi1 and xi2 are both zero, y equals β0.
• The regression coefficients β1 and β2 represent the change in y resulting from one-unit changes in xi1 and xi2, respectively.
The model is Y = β0 + β1x1 + β2x2 + · · · + βkxk + ε, where ε is the random error term. This is just like simple linear regression, except that k does not have to be 1.
For the i-th observation, we set the independent variables to the values xi1, xi2, . . . , xik and measure a value yi of the random variable Yi.
Here the errors εi are independent random variables, each with mean 0 and the same unknown variance σ2.
Altogether the model for multiple linear regression has k + 2 unknown parameters: β0, β1, . . . , βk, and the error variance σ2.
When k was equal to 1, we found the least squares line ŷ = β̂0 + β̂1x. In general, we seek fitted values
ŷi = β̂0 + β̂1xi1 + β̂2xi2 + · · · + β̂kxik for i = 1, . . . , n
that are close to the actual values yi.
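The least-squares estimates for all k + 1 coefficients can be computed in one step with a design matrix. This is a sketch on synthetic data generated from known coefficients, so we can see that the estimates recover them:

```python
import numpy as np

# Least-squares estimates for y = b0 + b1*x1 + b2*x2 + error (k = 2).
# Data are synthetic, generated from known coefficients for illustration.
rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, n)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # [b0, b1, b2]
```

The column of ones in the design matrix is what turns the intercept β0 into just another coefficient; with 200 observations and small noise, the estimates land close to the true values 3.0, 2.0, and -1.5.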
For your model to be reliable and valid, there are some essential requirements: a linear relationship between the dependent and independent variables, independent errors with constant variance (homoscedasticity), approximately normally distributed errors, and no severe multicollinearity among the predictors.
A simple linear regression can accurately capture a straightforward relationship between two variables. On the other hand, multiple linear regression can capture more complex interactions that require more thought.
A multiple regression model uses more than one independent variable. It does not suffer from the
same limitations as the simple regression equation, and it is thus able to fit curved and non-linear
relationships. The following are the uses of multiple linear regression:
1. Estimating relationships between variables.
2. Prediction or forecasting.
Estimating relationships between variables can be exciting and useful. As with all other
regression models, the multiple regression model assesses relationships among variables in terms
of their ability to predict the value of the dependent variable.
The p-value from the regression table tells us whether or not this regression coefficient is actually
statistically significant. We can see that the p-value for Hours studied is 0.009, which is
statistically significant at an alpha level of 0.05.
Note: The alpha level should be chosen before the regression analysis is conducted – common
choices for the alpha level are 0.01, 0.05, and 0.10.
Related post: An Explanation of P-Values and Statistical Significance
Interpreting the Coefficient of a Categorical Predictor Variable
For a categorical predictor variable, the regression coefficient represents the difference in the predicted value of the response variable between the category for which the predictor variable = 1 and the category for which the predictor variable = 0, holding the other predictors constant.
In this example, Tutor is a categorical predictor variable that can take on two different values:
• 1 = the student used a tutor to prepare for the exam
• 0 = the student did not use a tutor to prepare for the exam
From the regression output, we can see that the regression coefficient for Tutor is 8.34. This means that, on average, a student who used a tutor scored 8.34 points higher on the exam compared to a student who did not use a tutor, assuming the predictor variable Hours studied is held constant.
For example, consider student A who studies for 10 hours and uses a tutor. Also consider student
B who studies for 10 hours and does not use a tutor. According to our regression output, student
A is expected to receive an exam score that is 8.34 points higher than student B.
The p-value from the regression table tells us whether or not this regression coefficient is actually
statistically significant. We can see that the p-value for Tutor is 0.138, which is not statistically
significant at an alpha level of 0.05. This indicates that although students who used a tutor scored
higher on the exam, this difference could have been due to random chance.
Interpreting All of the Coefficients At Once
We can use all of the coefficients in the regression table to create the following estimated
regression equation:
Expected exam score = 48.56 + 2.03*(Hours studied) + 8.34*(Tutor)
Note: Keep in mind that the predictor variable “Tutor” was not statistically significant at alpha
level 0.05, so you may choose to remove this predictor from the model and not use it in the final
estimated regression equation.
Using this estimated regression equation, we can predict the final exam score of a student based
on their total hours studied and whether or not they used a tutor.
For example, a student who studied for 10 hours and used a tutor is expected to receive an exam
score of:
Expected exam score = 48.56 + 2.03*(10) + 8.34*(1) = 77.2
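The estimated equation above translates directly into a small prediction function using the coefficients from the text:

```python
# Estimated regression equation from the text:
# score = 48.56 + 2.03*(hours studied) + 8.34*(tutor), tutor in {0, 1}.
def expected_exam_score(hours_studied, used_tutor):
    return 48.56 + 2.03 * hours_studied + 8.34 * (1 if used_tutor else 0)

# Student who studied 10 hours and used a tutor.
score = expected_exam_score(10, True)  # 77.2
```

A student who studied the same 10 hours without a tutor would be predicted 8.34 points lower, which is exactly the Tutor coefficient's interpretation.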
Considering Correlation When Interpreting Regression Coefficients
It’s important to keep in mind that predictor variables can influence each other in a regression
model. For example, most predictor variables will be at least somewhat related to one another (e.g.
perhaps a student who studies more is also more likely to use a tutor).
This means that regression coefficients will change when different predictor variables are added to or removed from the model.
One good way to see whether or not the correlation between predictor variables is severe enough
to influence the regression model in a serious way is to check the VIF between the predictor
variables. This will tell you whether or not the correlation between predictor variables is a problem
that should be addressed before you decide to interpret the regression coefficients.
If you are running a simple linear regression model with only one predictor, then correlated
predictor variables will not be a problem.
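The VIF check mentioned above can be implemented from first principles: each predictor's VIF is 1/(1 − R²), where R² comes from regressing that predictor on all the others. This sketch uses synthetic data in which two predictors are deliberately correlated:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns (with an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        ss_res = resid @ resid
        ss_tot = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic predictors: x2 is strongly correlated with x1, so both get a
# high VIF (a common rule of thumb flags VIF > 5 or > 10 as problematic).
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
vifs = vif(np.column_stack([x1, x2, x3]))
```

Here the correlated pair x1, x2 receives very large VIF values while the independent x3 stays near 1, matching the intuition that only the correlated predictors are a multicollinearity concern.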
Project managers frequently use Gantt charts and
waterfall charts to illustrate workflows. Data modeling also uses abstraction to represent and better
understand data flow within an enterprise’s information system, making it easier for developers,
business analysts, data architects, and others to understand the relationships in a database or data
warehouse.
Visual discovery
Visual discovery and everyday data viz are more closely aligned with data teams. While visual discovery helps data analysts, data scientists, and other data professionals identify patterns and trends within a dataset, everyday data viz supports the subsequent storytelling after a new insight has been found.
Data visualization
Data visualization is a critical step in the data science process, helping teams and individuals
convey data more effectively to colleagues and decision makers. Teams that manage reporting
systems typically leverage defined template views to monitor performance. However, data
visualization isn’t limited to performance dashboards. For example, while text mining, an analyst
may use a word cloud to capture key concepts, trends, and hidden relationships within this
unstructured data. Alternatively, they may utilize a graph structure to illustrate relationships
between entities in a knowledge graph. There are a number of ways to represent different types of
data, and it’s important to remember that it is a skillset that should extend beyond your core
analytics team.
Types of data visualizations
The earliest forms of data visualization can be traced back to the Egyptians before the 17th century, largely used to assist in navigation. As time progressed, people leveraged data visualizations for broader applications, such as the economic, social, and health disciplines. Perhaps most notably, Edward
Tufte published The Visual Display of Quantitative Information, which
illustrated that individuals could utilize data visualization to present data in a more effective
manner. His book continues to stand the test of time, especially as companies turn to dashboards
to report their performance metrics in real-time. Dashboards are effective data visualization tools
for tracking and visualizing data from multiple data sources, providing visibility into the effects of
specific behaviors by a team or an adjacent one on performance. Dashboards include common
visualization techniques, such as:
• Tables: This consists of rows and columns used to compare variables. Tables can show a
great deal of information in a structured way, but they can also overwhelm users that are
simply looking for high-level trends.
• Pie charts and stacked bar charts: These graphs are divided into sections that represent
parts of a whole. They provide a simple way to organize data and compare the size of each
component to one another.
• Line charts and area charts: These visuals show change in one or more quantities by
plotting a series of data points over time and are frequently used within predictive analytics.
Line graphs utilize lines to demonstrate these changes while area charts connect data points
with line segments, stacking variables on top of one another and using color to distinguish
between variables.
• Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces
between the bars), representing the quantity of data that falls within a particular range. This
visual makes it easy for an end user to identify outliers within a given dataset.
• Scatter plots: These visuals are beneficial in revealing the relationship between two
variables, and they are commonly used within regression data analysis. However, these can
sometimes be confused with bubble charts, which are used to visualize three variables via
the x-axis, the y-axis, and the size of the bubble.
• Heat maps: These graphical representations are helpful in visualizing behavioral
data by location. This can be a location on a map, or even a webpage.
• Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Treemaps are great for comparing the proportions between categories via their area size.
Open source visualization tools
Access to data visualization tools has never been easier. Open source libraries, such as D3.js,
provide a way for analysts to present data in an interactive way, allowing them to engage a broader
audience with new data. Some of the most popular open source visualization libraries include:
• D3.js: It is a front-end JavaScript library for producing dynamic, interactive data
visualizations in web browsers. D3.js uses HTML, CSS, and
SVG to create visual representations of data that can be viewed on any browser. It also
provides features for interactions and animations.
• ECharts: A powerful charting and visualization library that offers an easy way to add
intuitive, interactive, and highly customizable charts to products, research papers,
presentations, etc. ECharts is based on JavaScript and ZRender,
a lightweight canvas library.
• Vega: Vega defines itself as a “visualization grammar,”
providing support to customize visualizations across large datasets which are accessible
from the web.
• deck.gl: It is part of Uber's open source visualization framework suite. deck.gl is a
framework used for exploratory data analysis on big data. It
helps build high-performance GPU-powered visualization on the web.
Data visualization best practices
With so many data visualization tools readily available, there has also been a rise in ineffective
information visualization. Visual communication should be simple and deliberate to ensure that
your data visualization helps your target audience arrive at your intended insight or conclusion.
The following best practices can help ensure your data visualization is useful and clear:
Set the context: It’s important to provide general background information to ground the audience
around why this particular data point is important. For example, if e-mail open rates were
underperforming, we may want to illustrate how a company’s open rate compares to the overall
industry, demonstrating that the company has a problem within this marketing channel. To drive
an action, the audience needs to understand how current performance compares to something
tangible, like a goal, benchmark, or other key performance indicators (KPIs).
Know your audience(s): Think about who your visualization is designed for and then make sure
your data visualization fits their needs. What is that person trying to accomplish? What kind of
questions do they care about? Does your visualization address their concerns? You’ll want the data
that you provide to motivate people to act within their scope of their role. If you’re unsure if the
visualization is clear, present it to one or two people within your target audience to get feedback,
allowing you to make additional edits prior to a large presentation.
Choose an effective visual: Specific visuals are designed for specific types of datasets. For
instance, scatter plots display the relationship between two variables well, while line graphs
display time series data well. Ensure that the visual actually assists the audience in understanding
your main takeaway. Misalignment of charts and data can result in the opposite, confusing your
audience further versus providing clarity.
Keep it simple: Data visualization tools can make it easy to add all sorts of information to your
visual. However, just because you can, it doesn’t mean that you should! In data visualization, you
want to be very deliberate about the additional information that you add to focus user attention.
For example, do you need data labels on every bar in your bar chart? Perhaps you only need one
or two to help illustrate your point. Do you need a variety of colors to communicate your idea?
Are you using colors that are accessible to a wide range of audiences (e.g. accounting for color
blind audiences)? Design your data visualization for maximum impact by eliminating information
that may distract your target audience.
Data visualization is a graphical representation of information and data. By using visual elements
like charts, graphs, and maps, data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data. This blog on data visualization techniques will
help you understand detailed techniques and benefits.
In the world of Big Data, data visualization in Python tools and technologies are essential to
analyze massive amounts of information and make data-driven decisions.
Benefits of good data visualization
Our eyes are drawn to colours and patterns. We can quickly identify red from blue, and a square
from a circle. Our culture is visual, including everything from art and advertisements to TV and
movies.
Data visualization is another form of visual art that grabs our interest and keeps our eyes on the
message. When we see a chart, we quickly see trends and outliers. If we can see something, we
internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a massive spreadsheet
of data and couldn’t see a trend, you know how much more effective a visualization can be. The
uses of Data Visualization as follows.
• A powerful way to explore data with presentable results.
• Its primary use is in the pre-processing portion of the data mining process.
• It supports the data cleaning process by finding incorrect and missing values.
• It is used for variable derivation and selection, i.e., to determine which variables to include in or discard from the analysis.
• It also plays a role in combining categories as part of the data reduction process.
Data Visualization Techniques
• Box plots
• Histograms
• Heat maps
• Charts
• Tree maps
• Word Cloud/Network diagram
Box Plots
A box plot (boxplot) is a standardized way of displaying the distribution of
data based on a five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3),
and “maximum”). It can tell you about your outliers and what their values are. It can also tell you
if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
A box plot is a graph that gives you a good indication of how the values in the data are spread out.
Although box plots may seem primitive in comparison to a histogram or density plot, they have
the advantage of taking up less space, which is useful when comparing distributions between many
groups or datasets. For some distributions/datasets, you will find that you need more information
than the measures of central tendency (median, mean, and mode). You need to have information
on the variability or dispersion of the data.
List of Methods to Visualize Data
• Column Chart: It is also called a vertical bar chart where each category is represented by
a rectangle. The height of the rectangle is proportional to the values that are plotted.
• Bar Graph: It has rectangular bars in which the lengths are proportional to the values
which are represented.
• Stacked Bar Graph: It is a bar style graph that has various components stacked together
so that apart from the bar, the components can also be compared to each other.
• Stacked Column Chart: It is similar to a stacked bar graph; however, the columns are vertical, so the components are stacked vertically rather than horizontally.
• Area Chart: It combines features of the line chart and the bar chart to show how the numeric values of one or more groups change over time, with the area below each line filled in.
• Dual Axis Chart: It combines a column chart and a line chart and then compares the two
variables.
• Line Graph: The data points are connected by straight lines, creating a representation of the changing trend.
• Mekko Chart: It can be called a two-dimensional stacked chart with varying column
widths.
• Pie Chart: It is a chart where various components of a data set are presented in the form
of a pie which represents their proportion in the entire data set.
• Waterfall Chart: With the help of this chart, the increasing effect of sequentially
introduced positive or negative values can be understood.
• Bubble Chart: It is a multi-variable graph that is a hybrid of Scatter Plot and a
Proportional Area Chart.
• Scatter Plot Chart: It is also called a scatter chart or scatter graph. Dots are used to denote
values for two different numeric variables.
• Bullet Graph: It is a variation of a bar graph. A bullet graph is used to swap dashboard
gauges and meters.
• Funnel Chart: The chart determines the flow of users with the help of a business or sales
process.
• Heat Map: It is a technique of data visualization that shows the level of instances as color
in two dimensions.
Five Number Summary of a Box Plot
• Minimum: Q1 − 1.5*IQR (the lower whisker limit)
• First quartile (Q1, 25th percentile): the middle number between the smallest number (not the “minimum”) and the median of the dataset
• Median (50th percentile): the middle value of the dataset
• Third quartile (Q3, 75th percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset
• Maximum: Q3 + 1.5*IQR (the upper whisker limit)
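The five-number summary with Tukey whisker limits can be computed directly from percentiles. This sketch uses a small made-up dataset:

```python
import numpy as np

def five_number_summary(x):
    """Box-plot five-number summary with Tukey whiskers.

    'Minimum' and 'Maximum' here are the whisker limits Q1 - 1.5*IQR and
    Q3 + 1.5*IQR, not the smallest/largest raw values.
    """
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q1, median, q3, q3 + 1.5 * iqr

# Small illustrative dataset.
data = [7, 15, 36, 39, 40, 41]
lo, q1, med, q3, hi = five_number_summary(data)
```

Any raw observation falling outside the `lo`/`hi` whisker limits would be drawn as an outlier point on the box plot, which is how box plots "tell you about your outliers and what their values are."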
Histograms
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar
groups numbers into ranges. Taller bars show that more data falls in that range. A histogram
displays the shape and spread of continuous sample data.
It is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set
of continuous data. This allows the inspection of the data for its underlying distribution (e.g.,
normal distribution), outliers, skewness, etc. It is an accurate representation of the distribution of numerical data, and it relates to only one variable. It uses bins (or buckets): the entire range of values is divided into a series of intervals, and the number of values falling into each interval is counted.
Bins are consecutive, non-overlapping intervals of a variable. As adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
Histograms are based on area, not height of bars
In a histogram, the height of the bar does not necessarily indicate how many occurrences of scores
there were within each bin. It is the product of height multiplied by the width of the bin that
indicates the frequency of occurrences within that bin. One of the reasons that the height of the
bars is often incorrectly assessed as indicating the frequency and not the area of the bar is because
a lot of histograms often have equally spaced bars (bins), and under these circumstances, the height
of the bin does reflect the frequency.
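The area-versus-height distinction shows up directly in how histograms are computed. In this sketch the bins deliberately have unequal widths, so bar heights (densities) differ from raw counts, but the bar areas still sum to 1:

```python
import numpy as np

# With density=True, bar *areas* (height * bin width) sum to 1, which is
# what matters when bins have unequal widths. Data are illustrative.
data = np.array([1, 2, 2, 3, 3, 3, 8, 9, 15, 20], dtype=float)
bins = [0, 5, 10, 20]            # deliberately unequal bin widths
heights, edges = np.histogram(data, bins=bins, density=True)
widths = np.diff(edges)
total_area = (heights * widths).sum()   # 1.0
```

The wide [10, 20] bin holds the same count as the narrow [5, 10) bin, yet its bar is half as tall: reading height alone would understate its frequency, while reading area gives the right answer.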
Histogram Vs Bar Chart
The major difference is that a histogram is only used to plot the frequency of score occurrences in
a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand,
can be used for a lot of other types of variables including ordinal and nominal data sets.
Heat Maps
A heat map is a data visualization technique that uses colour the way a bar graph uses height and width.
If you’re looking at a web page and you want to know which areas get the most attention, a heat
map shows you in a visual way that’s easy to assimilate and make decisions from. It is a graphical
representation of data where the individual values contained in a matrix are represented as colours.
Useful for two purposes: for visualizing correlation tables and for visualizing missing values in
the data. In both cases, the information is conveyed in a two-dimensional table.
Note that heat maps are useful when examining a large number of values, but they are not a
replacement for more precise graphical displays, such as bar charts, because colour differences
cannot be perceived accurately.
Charts
Line Chart
The simplest technique, a line plot is used to plot the relationship or dependence of one variable
on another. To plot the relationship between the two variables, we can simply call the plot function.
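"Simply calling the plot function" looks like this with matplotlib (assumed available; the monthly sales figures are invented for the example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Line chart: illustrative monthly sales plotted against month.
months = np.arange(1, 13)
sales = np.array([12, 14, 13, 17, 19, 22, 21, 25, 24, 27, 30, 33])

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")   # the plot call from the text
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
fig.savefig("line_chart.png")
```

One `plot` call draws the connected line; additional calls on the same axes would overlay further series for comparison.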
Bar Charts
Bar charts are used for comparing the quantities of different categories or groups. Values of a
category are represented with the help of bars and they can be configured with vertical or horizontal
bars, with the length or height of each bar representing the value.
Pie Chart
It is a circular statistical graph which is divided into slices to illustrate numerical proportion. Here the arc length of each slice is proportional to the quantity it represents. As a rule, they are used to compare
the parts of a whole and are most effective when there are limited components and when text and
percentages are included to describe the content. However, they can be difficult to interpret
because the human eye has a hard time estimating areas and comparing visual angles.
Scatter Charts
Another common visualization technique is a scatter plot that is a two-dimensional plot
representing the joint variation of two data items. Each marker (symbols such as dots, squares and
plus signs) represents an observation. The marker position indicates the value for each observation.
When you assign more than two measures, a scatter plot matrix is produced: a series of scatter plots displaying every possible pairing of the measures that are assigned to the visualization. Scatter
plots are used for examining the relationship, or correlations, between X and Y variables.
Bubble Charts
It is a variation of scatter chart in which the data points are replaced with bubbles, and an additional
dimension of data is represented in the size of the bubbles.
Timeline Charts
Timeline charts illustrate events, in chronological order — for example the progress of a project,
advertising campaign, acquisition process — in whatever unit of time the data was recorded — for
example week, month, year, quarter. It shows the chronological sequence of past or future events
on a timescale.
Tree Maps
A treemap is a visualization that displays hierarchically organized data as a set of nested rectangles,
parent elements being tiled with their child elements. The sizes and colours of rectangles are
proportional to the values of the data points they represent. A leaf node rectangle has an area
proportional to the specified dimension of the data. Depending on the choice, the leaf node is
coloured, sized, or both according to chosen attributes. Tree maps make efficient use of space and can thus display thousands of items on the screen simultaneously.
Word Clouds and Network Diagrams for Unstructured Data
The variety of big data brings challenges because semi-structured, and unstructured data require
new visualization techniques. A word cloud visual represents the frequency of a word within a
body of text with its relative size in the cloud. This technique is used on unstructured data as a way
to display high- or low-frequency words.
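The sizing in a word cloud is driven by word frequencies; the frequency counting itself is straightforward (a layout library such as the `wordcloud` package, assumed here, would handle the drawing). The sample text is invented:

```python
from collections import Counter
import re

# Illustrative text; a real word cloud would use a full document corpus.
text = """Big data brings big challenges and big opportunities,
and visualization turns data into insight."""

# Tokenize, lowercase, and drop a few common stopwords before counting.
words = re.findall(r"[a-z']+", text.lower())
stopwords = {"and", "the", "a", "into"}
freq = Counter(w for w in words if w not in stopwords)

top = freq.most_common(2)
```

The most frequent surviving words would be rendered largest in the cloud, directly implementing "frequency of a word … with its relative size."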
Another visualization technique that can be used for semi-structured or unstructured data is the
network diagram. Network diagrams represent relationships as nodes (individual actors within the
network) and ties (relationships between the individuals). They are used in many applications, for
example for analysis of social networks or mapping product sales across geographic areas.
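The nodes-and-ties structure behind a network diagram is just an adjacency list. This sketch builds one from a made-up list of relationships and computes each node's degree (its number of ties), a quick way to spot central actors:

```python
from collections import defaultdict

# Ties are relationships between actors; the names are invented.
ties = [("ana", "ben"), ("ana", "cam"), ("ben", "cam"), ("cam", "dee")]

# Undirected adjacency list: each tie is recorded in both directions.
adjacency = defaultdict(set)
for a, b in ties:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Degree (number of ties) per node.
degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
```

Graph libraries such as networkx build on exactly this structure and add the force-directed layouts typically used to draw social-network diagrams.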
FAQs Related to Data Visualization
• What are the techniques of Visualization?
A: The visualization techniques include Pie and Donut Charts, Histogram Plot, Scatter Plot, Kernel
Density Estimation for Non-Parametric Data, Box and Whisker Plot for Large Data, Word Clouds
and Network Diagrams for Unstructured Data, and Correlation Matrices.
• What are the types of visualization?
A: The various types of visualization include Column Chart, Line Graph, Bar Graph, Stacked Bar
Graph, Dual-Axis Chart, Pie Chart, Mekko Chart, Bubble Chart, Scatter Chart, and Bullet Graph.
• What are the various visualization techniques used in data analysis?
A: Various visualization techniques are used in data analysis, including the Box and Whisker Plot
for Large Data, the Histogram Plot, and Word Clouds and Network Diagrams for Unstructured
Data.
• How do I start visualizing?
A: You need a basic understanding of data and should present it without misleading the audience.
Once you understand the fundamentals, you can take up an online course or tutorial.
• What are the two basic types of data visualization?
A: The two very basic types of data visualization are exploration and explanation.
• Which is the best visualization tool?
A: Some of the best visualization tools include Visme, Tableau, Infogram, Whatagraph, Sisense,
DataBox, ChartBlocks, DataWrapper, etc.
These are some of the visualization techniques used to represent data effectively for better
understanding and interpretation.
What is Interactive Data Visualization?
Interactive data visualization refers to the use of modern data analysis software that enables users
to directly manipulate and explore graphical representations of data. Data visualization uses visual
aids to help analysts efficiently and effectively understand the significance of data. Interactive data
visualization software improves upon this concept by incorporating interaction tools that facilitate
the modification of the parameters of a data visualization, enabling the user to see more detail,
create new insights, generate compelling questions, and capture the full value of the data.
5.6 Interactive Data Visualization Techniques
Deciding what the best interactive data visualization will be for your project depends on your end
goal and the data available. Some common data visualization interactions that will help users
explore their data visualizations include:
• Brushing: Brushing is an interaction in which the mouse controls a paintbrush that directly
changes the color of a plot, either by drawing an outline around points or by using the brush
itself as a pointer. Brushing scatterplots can either be persistent, in which the new
appearance is retained once the brush has been removed, or transient, in which changes
only remain visible while the active plot is enclosed or intersected by the brush. Brushing
is typically used when multiple plots are visible and a linking mechanism exists between
the plots.
• Painting: Painting refers to the use of persistent brushing, followed by subsequent
operations such as touring to compare the groups.
• Identification: Identification, also known as label brushing or mouseover, refers to the
automatic appearance of an identifying label when the cursor hovers over a particular plot
element.
• Scaling: Scaling can be used to change a plot’s aspect ratio, revealing different data
features. Scaling is also commonly used to zoom in on dense regions of a scatter plot.
• Linking: Linking connects selected elements on different plots. One-to-one linking entails
the projection of data on two different plots, in which a point in one plot corresponds to
exactly one point in the other. Elements may also be categorical variables, in which all data
values corresponding to that category are highlighted in all the visible plots. Brushing an
area in one plot will brush all cases in the corresponding category on another plot.
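The brushing and linking interactions above can be sketched in plain Python, independent of any plotting library. The data, the rectangular brush, and the category-based link below are all hypothetical; a real toolkit wires these selections to live plot redraws.

```python
# Hypothetical data: each case has (x, y) coordinates and a category.
cases = [
    {"x": 1.0, "y": 2.0, "cat": "A"},
    {"x": 2.5, "y": 1.5, "cat": "B"},
    {"x": 3.0, "y": 3.5, "cat": "A"},
    {"x": 4.0, "y": 0.5, "cat": "B"},
]

def brush(cases, x0, x1, y0, y1):
    """Return indices of cases enclosed by the brush rectangle."""
    return [i for i, c in enumerate(cases)
            if x0 <= c["x"] <= x1 and y0 <= c["y"] <= y1]

def link_by_category(cases, selected):
    """Linking: highlight every case sharing a category with the selection."""
    cats = {cases[i]["cat"] for i in selected}
    return [i for i, c in enumerate(cases) if c["cat"] in cats]

selected = brush(cases, 0.0, 3.5, 1.0, 4.0)   # indices 0, 1, 2
linked = link_by_category(cases, selected)    # categories A and B: all cases
print(selected, linked)
```

Persistent brushing would store `selected` so the highlight survives after the brush moves on; transient brushing recomputes it on every mouse move.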
How to Create Interactive Data Visualizations
Creating various interactive widgets, bar charts, and plots for data visualization should start with
the three basic attributes of a successful data visualization interaction design - available, accessible,
and actionable. Is there sufficient source data available to meet your data visualization goals? Can
you present this data in an accessible manner so that it is intuitive and comprehensible? Do your
data visualization interactions provide meaningful, actionable insights?
The general framework for an interactive data structure visualization project typically follows
these steps: identify your desired goals, understand the challenges presented by data constraints,
and design a conceptual model in which data can be quickly iterated and reviewed.
With a rough, conceptual model in place, data modeling is leveraged to thoroughly document every
piece of data and related meta-data. This is followed by the design of a user interface and the
development of your design’s core technology, which can be accomplished with a variety of
interactive data visualization tools.
Next it’s time to user test in order to refine compatibility, functionality, security, the user interface,
and performance. Now you are ready to launch to your target audience. Methods for rapid updates
should be built in so that your team can stay up to date with your interactive data visualization.
Some popular libraries for creating your own interactive data visualizations include: Altair, Bokeh,
Celluloid, Matplotlib, nbinteract, Plotly, Pygal, and Seaborn. Libraries are available for Python,
Jupyter, JavaScript, and R interactive data visualizations. Scott Murray’s Interactive Data
Visualization for the Web is one of the most popular educational resources for learning how to
create interactive data visualizations.
Benefits of Interactive Data Visualizations
An interactive data visualization allows users to engage with data in ways not possible with static
graphs. Interactivity is especially well suited to large amounts of data with complex data stories,
providing the ability to identify, isolate, and visualize information in detail. Some major benefits
of interactive data visualizations
include:
• Identify Trends Faster - Most human communication is visual, and the human brain
processes graphics orders of magnitude faster than text. Direct manipulation of
analyzed data via familiar metaphors and digestible imagery makes it easy to understand
and act on valuable information.
• Identify Relationships More Effectively - The ability to narrowly focus on specific metrics
enables users to identify otherwise overlooked cause-and-effect relationships throughout
definable timeframes. This is especially useful in identifying how daily operations affect
an organization’s goals.
• Useful Data Storytelling - Humans best understand a data story when its development over
time is presented in a clear, linear fashion. A visual data story in which users can zoom in
and out, highlight relevant information, filter, and change the parameters promotes better
understanding of the data by presenting multiple viewpoints of the data.
• Simplify Complex Data - A large data set with a complex data story may present itself
visually as a chaotic, intertwined hairball. Incorporating filtering and zooming controls can
help untangle and make these messes of data more manageable, and can help users glean
better insights.
Static vs Interactive Data Visualization
A static data visualization is one that does not incorporate any interaction capabilities and does not
change with time, such as an infographic focused on a specific data story from a single viewpoint.
As static visualizations offer no tools to adjust the final result, such as the filtering and zooming
tools of interactive designs, it is essential to give careful consideration to what data is being
displayed.
A static visualization is more suited for less complex data stories, building relationships between
concepts, and conveying a predetermined view than encouraging exploration and increasing user
autonomy. Static designs are also significantly less expensive to build than interactive designs.
Deciding whether to build a static or interactive data visualization depends on customer preference,
data story complexity, and ROI.
The graphical depiction of information and data is known as data visualisation. Data visualisation
tools make it easy to view and comprehend trends, outliers, and patterns in data by utilising visual
components like charts, graphs, and maps.
It provides insights on one or more pages or screens to help you keep track of events or activities
at a glance. Unlike an infographic, which displays a static graphical representation, a dashboard
displays real-time data by extracting complicated data points from massive data sets.
An interactive dashboard allows you to quickly sort, filter, and dive into many sorts of data. Data
science approaches may be used to quickly understand what is occurring, why it is occurring, and
what will occur next.
Different applications of data visualisation
1. Healthcare Industries
A dashboard that visualises a patient's history can help a current or new doctor understand the
patient's health, and in an emergency it can point to faster care facilities based on the illness.
Instead of sifting through hundreds of pages of information, data visualisation can help in
finding trends.
Health care is a time-consuming procedure, and the majority of that time is spent evaluating prior
reports. Data visualisation provides a superior selling point by boosting response time: it gives
metrics that make analysis easier, resulting in a faster reaction time.
2. Business intelligence
When compared to local options, cloud connection can provide the cost-effective “heavy lifting”
of processor-intensive analytics, allowing users to see bigger volumes of data from numerous
sources to help speed up decision-making.
Because such systems can be diverse, comprised of multiple components, and may use their own
data storage and interfaces for access to stored data, additional integrated tools, such as those
geared toward business intelligence (BI), help provide a cohesive view of an organization's entire
data system (e.g., web services, databases, historians, etc.).
Multiple datasets can be correlated using analytics/BI tools, which allow for searches using a
common set of filters and/or parameters. The acquired data may then be displayed in a standardised
manner using these technologies, giving logical "shaping" and better comparison grounds for end
users.
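The common-filter idea can be sketched in plain Python: one set of filter parameters applied uniformly across otherwise heterogeneous datasets. The records and field names below are made up for illustration.

```python
# Hypothetical records from two separate systems.
web_events = [
    {"region": "EU", "month": "2024-01", "visits": 120},
    {"region": "US", "month": "2024-01", "visits": 300},
    {"region": "EU", "month": "2024-02", "visits": 150},
]
sales = [
    {"region": "EU", "month": "2024-01", "revenue": 900},
    {"region": "US", "month": "2024-02", "revenue": 1100},
]

def apply_filters(records, **params):
    """Apply one common set of filter parameters to any dataset."""
    return [r for r in records
            if all(r.get(k) == v for k, v in params.items())]

# The same filter shapes both datasets for side-by-side comparison.
common = {"region": "EU", "month": "2024-01"}
print(apply_filters(web_events, **common))
print(apply_filters(sales, **common))
```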
3. Military
For the military it is a matter of life and death: taking the appropriate action requires clarity of
data from which actionable insights can be drawn.
The adversary is present in the field today, as well as posing a danger through digital warfare
and cybersecurity. It is critical to collect data from a variety of sources, both structured and
unstructured. The volume of data is enormous, and data visualisation technologies are essential for
rapid delivery of accurate information in the most condensed form feasible. A greater grasp of past
data allows for more accurate forecasting.
Dynamic Data Visualization aids in a better knowledge of geography and climate, resulting in a
more effective approach. The cost of military equipment and tools is extremely significant; with
bar and pie charts, analysing current inventories and making purchases as needed is simple.
4. Finance Industries
Data visualisation tools are becoming a requirement for the financial sector: they help in exploring
and explaining customer data, understanding consumer behaviour, maintaining a clear flow of
information, and improving the efficiency of decision making.
For associated organisations and businesses, data visualisation helps reveal patterns, which
supports better investment strategies. It also highlights the most recent trends, improving business
prospects.
5. Data science
Data scientists generally create visualisations for their personal use or to communicate
information to a small group of people. Visualization libraries for the specified programming
languages and tools are used to create the visual representations.
Open source programming languages, such as Python, and proprietary tools built for
complicated data analysis are commonly used by data scientists and academics. These data
scientists and researchers use data visualisation to better comprehend data sets and spot patterns
and trends that might otherwise go undiscovered.
6. Marketing
In marketing analytics, data visualisation is a boon. Visuals and reports can be used to analyse
patterns and trends, such as sales analysis, market research analysis, customer analysis, defect
analysis, cost analysis, and forecasting. These analyses serve as a foundation for marketing and
sales.
Visual aids can help your audience grasp your main message by engaging them visually. The
major advantage of visualising data is that it can communicate a point faster than a dense
spreadsheet.
In B2B firms, data-driven yearly reports and presentations often fail to meet the needs of the
people viewing them because they do not engage the audience in a meaningful or memorable
manner. If you present your facts as visual statistics, your audience will be more interested in
them and more inclined to act on your findings.
7. Food delivery apps
When you place an order for food on your phone, it is given to the nearest delivery person. There
is a lot of math involved here, such as the distance between the delivery executive's present position
and the restaurant, as well as the time it takes to get to the customer's location.
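The distance math mentioned above can be sketched with the haversine formula, which gives the great-circle ("as the crow flies") distance between two coordinates. The coordinates below are hypothetical, and a real delivery app would use road-network distances and travel times instead.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in kilometres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical coordinates: delivery executive, restaurant, customer.
rider, restaurant, customer = (12.97, 77.59), (12.98, 77.60), (12.95, 77.62)
pickup = haversine_km(*rider, *restaurant)
dropoff = haversine_km(*restaurant, *customer)
print(round(pickup + dropoff, 2), "km to ride in total")
```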
Customer orders, delivery location, GPS service, tweets, social media messages, verbal
comments, pictures, videos, reviews, comparative analyses, blogs, and updates have all become
common ways of data transmission.
With the help of this data, users can obtain figures on average wait times, delivery experiences,
customer service, meal taste, menu options, loyalty and reward point programmes, and product
stock and inventory.
8. Real estate business
Brokers and agents seldom have the time to undertake in-depth research and analysis on their
own. Common data visualisation applications include showing a buyer or seller comparable
home prices in their neighbourhood on a map, illustrating average time on the market, creating a
sense of urgency among prospective buyers, managing sellers' expectations, and attracting
viewers to your social media sites.