BA - Unit 5

Lesson 6
Predictive and Textual Analytics
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Simple Linear Regression Models
6.4 Confidence and Prediction Intervals
6.5 Multiple Linear Regression
6.6 Interpretation of Regression Coefficients
6.7 Heteroscedasticity and Multi-Collinearity
6.8 Basics of Textual Data Analysis
6.9 Methods and Techniques of Textual Analysis
6.10 Summary
6.11 Answers to In-Text Questions
6.12 Self-Assessment Questions
6.13 References
6.14 Suggested Readings
6.2 Introduction
In this lesson we will focus on two important analytics for business:
Predictive and textual, each having its own importance and techniques.
Predictive analytics, often referred to as advanced analytics, is closely
linked with business intelligence, giving organizations actionable insights
for better decision-making and planning. For instance, if an organization
wants to know how much profit it will make a few years from now based on
current trends in sales, customer demographics, and regional performance,
it can benefit from predictive analytics. Predictive analytics uses techniques such
as data mining and artificial intelligence to predict outcomes like future
profit or other factors that may be critical to the success of the
organization. At its core, predictive analytics is a process that makes informed
predictions about future events based on historical data. Analysts and data
scientists apply statistical models, regression techniques, and machine
learning algorithms to identify trends in historical data, so businesses
can predict risks and trends and prepare for future events.
Predictive analytics is changing industries because it facilitates
data-driven decision-making and operational efficiency. In marketing, it
helps businesses understand customer behavior, segment leads,
and target high-value prospects. Retailers apply predictive analytics
to personalize shopping experiences, optimize pricing strategies,
and plan merchandising. Manufacturers use predictive analytics to
monitor machine performance, avoid equipment failures,
and streamline logistics. Fraud detection, credit scoring, churn analysis,
and risk assessment are some of its benefits for the financial sector. Its
applications in healthcare include personalized care, resource allocation,
and identification of high-risk patients for timely interventions. As tools
get better, predictive analytics continues revolutionizing industries with
smarter insights and better outcomes.
Textual analysis is the systematic examination and interpretation of textual
data in order to draw out meaningful insights, patterns, and trends. Different
techniques and methods are involved in processing unstructured text, which
includes various forms of documents, social media posts, and customer
reviews. The aim of textual analysis is to transform raw text into a form
that is accessible and useful as input to decision-making
processes or for research. This can mean identifying key words, categorizing
text by topic, or analysing the sentiment associated with a particular piece
of text. Textual analysis is the backbone of a number of fields, such
as marketing, social science, and business intelligence, to name only a few,
garnering valuable insights from huge streams of unstructured text.
6.3 Simple Linear Regression Models
A simple linear regression model describes the relationship between a single independent variable x and a dependent variable y as

y = β₀ + β₁x + ε

where β₀ and β₁ are model coefficients which are unknown constants, and ε is the random error term.
We can build a simple linear regression model in R using the following steps:
1. The very first step is to prepare the data. For this, we need to ensure
that the dataset is clean and contains no missing values for the
variables involved. We load the data into R using functions
like read.csv() or read.table().
2. Once the data is loaded, we visualize it using a scatter plot.
3. We then fit the regression model using the lm() function and
use the summary() function to understand the details of the model.
4. Finally, we can make predictions using the predict() function.
We have already studied the reading techniques in previous lessons. We
will focus on linear regression functions in this section.
The lm() function in R is a built-in function for fitting linear models,
covering both simple and multiple linear regression. It estimates the relationship
between a dependent variable and one or more independent variables using
the method of least squares.
Code Window 1
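A minimal sketch of such a regression script, assuming the built-in mtcars dataset and an illustrative horsepower value of 150:

# Load the built-in mtcars dataset
data(mtcars)

# Step 1: visualize horsepower against mileage
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower (hp)", ylab = "Miles per gallon (mpg)",
     main = "mpg vs. hp")

# Step 2: fit a simple linear regression predicting mpg from hp
model <- lm(mpg ~ hp, data = mtcars)
summary(model)

# Step 3: add the fitted regression line to the scatter plot
abline(model, col = "blue", lwd = 2)

# Step 4: predict mpg for a car with 150 horsepower (illustrative value)
predict(model, newdata = data.frame(hp = 150))

# Step 5: plot residuals to check for patterns in the errors
plot(mtcars$hp, residuals(model),
     xlab = "Horsepower (hp)", ylab = "Residuals")
abline(h = 0, lty = 2)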
In code window 1 we have shown the code for linear regression on the
built-in dataset “mtcars” available in R. The goal here is to predict how
changes in horsepower ‘hp’ influence miles per gallon ‘mpg’. So, in this
example we will examine information regarding “mpg” and “hp”. Since
the data is already prepared and formatted, we visualize it using a
scatter plot. This will help us analyse whether there is a pattern or trend
in the data that is worth exploring further.
Next, we fit a simple linear regression model using the formula
“lm(mpg ~ hp, data = mtcars)”, which predicts mpg based on hp. After
fitting the model, we enhance our scatter plot by adding a regression line
using the abline() function to show the model’s predicted relationship. To
make the model useful, we use the predict() function to estimate mpg for
a specific horsepower value. This prediction tells us what fuel efficiency
we might expect for such a car. Finally, we examine the model’s accuracy
by plotting its residuals—the differences between the observed and
predicted values. This step helps us check if there are any patterns in
the errors, which could indicate that the model is missing some critical
information.
6.4 Confidence and Prediction Intervals
A confidence interval (CI) estimates the range within which the true mean of the response variable is likely to lie for a given value of the independent variable.
Code Window 2
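A minimal sketch of this computation; the object name model_wt is an assumed identifier:

# Fit a simple linear regression of mpg on car weight (wt)
model_wt <- lm(mpg ~ wt, data = mtcars)

# 95% confidence interval for the mean mpg of cars with weight = 3.0
predict(model_wt, newdata = data.frame(wt = 3.0),
        interval = "confidence", level = 0.95)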
In the code above, a simple linear regression model is fitted to the mtcars
dataset using lm(mpg ~ wt), where mpg is the dependent variable, and
wt is the independent variable that is the weight of the car. The model
predicts how car weight influences fuel efficiency. Then we have used
the predict function to compute a confidence interval for the mean
response (mpg) for a car with weight = 3.0. To do this, we specify interval
= “confidence” and level = 0.95. The function will compute the range
in which the average mileage for cars weighing 3.0 will fall, with 95%
confidence. This computation gives the lower and upper bound of the
mean mpg, thus making it possible to evaluate the precision and reliability
of the estimated mean value.
A prediction interval (PI) estimates the interval in which the actual value
of the dependent variable (y) is likely to lie for a given x. Unlike
confidence intervals, prediction intervals also include the variability of
the residual error. Thus, a prediction interval gives a range for an
individual data point rather than for the mean response. For example,
considering the same regression model as above, a prediction interval for a
car with weight 3.0 might indicate that its mileage lies between 16.0 and
22.0 mpg, with 95% confidence. R code for the same is given in code window 3.
Code Window 3
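A minimal sketch, reusing the model_wt object assumed above:

# 95% prediction interval for an individual car with weight = 3.0
predict(model_wt, newdata = data.frame(wt = 3.0),
        interval = "prediction", level = 0.95)

The output reports fit, lwr, and upr columns; here lwr and upr bound the prediction for an individual car rather than the mean.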
Hence, confidence intervals are narrower in range than prediction intervals
since they only consider the uncertainty in the estimate of the mean of
the response variable. Prediction intervals are wider because they encompass
the variability of individual observations about the regression line as well
as the uncertainty in the mean.
The visualization code and output for the confidence and prediction
intervals are shown in code window 4.
Code Window 4
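A minimal sketch of one way to produce this plot, again assuming the model_wt object; the shaded bands are drawn here with polygon() over a grid of weights:

# Grid of weights spanning the observed range
wt_grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt),
                               length.out = 100))

# Confidence and prediction intervals along the grid
conf_band <- predict(model_wt, newdata = wt_grid, interval = "confidence")
pred_band <- predict(model_wt, newdata = wt_grid, interval = "prediction")

# Scatter plot of weight vs. mileage
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Shaded prediction interval (red) and confidence interval (green);
# polygon() traces along the lower bound and back along the upper bound
polygon(c(wt_grid$wt, rev(wt_grid$wt)),
        c(pred_band[, "lwr"], rev(pred_band[, "upr"])),
        col = rgb(1, 0, 0, 0.2), border = NA)
polygon(c(wt_grid$wt, rev(wt_grid$wt)),
        c(conf_band[, "lwr"], rev(conf_band[, "upr"])),
        col = rgb(0, 1, 0, 0.3), border = NA)

# Fitted regression line in blue with width 2
abline(model_wt, col = "blue", lwd = 2)

# Legend in the top-right corner
legend("topright",
       legend = c("Regression line", "Confidence interval",
                  "Prediction interval"),
       col = c("blue", "green", "red"), lwd = 2)

Semi-transparent fills (the fourth argument of rgb()) keep the data points visible beneath both bands.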
In this code we have visualized car weight against mileage for all the
cars in the mtcars dataset. The plot function is used to make a scatter
plot of weight vs. mileage; the abline function plots the fitted regression
line in blue with a line width of 2, representing the linear relationship
between the two variables. Two shaded areas are added to the plot to
represent the confidence intervals and prediction intervals. The confidence
interval is represented by a green shaded area around the regression line
that indicates the range within which the mean predicted values lie for a
given car weight. Similarly, the prediction interval is represented as a red
shaded area, which is the range in which individual predictions for new
data points are likely to fall. Finally, a legend is added to the top-right
corner of the plot to distinguish the regression line, confidence interval,
and prediction interval. The legend uses different colors and labels to
make the plot easy to interpret, with blue for the regression line, green
for the confidence interval, and red for the prediction interval.
Thus, both intervals serve analysts, decision-makers, and researchers in
quantifying uncertainty, improving predictions, and drawing appropriate
conclusions from regression models.
6.5 Multiple Linear Regression
In multiple linear regression (MLR), the dependent variable y is modeled as a linear function of several independent variables:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

where β₀ is the intercept term that represents the value of y when all independent variables are 0, the multiple independent variables are represented by x₁, x₂, …, xₙ, the coefficients β₁, β₂, …, βₙ measure their respective effects on y, and ε is the error term.
For an MLR model to give valid results, we need to fulfil a few
assumptions, which are stated below:
1. The relationship between the independent and dependent variables
must be linear.
2. The errors must be independent of one another.
3. The errors must have a constant variance (homoscedasticity).
4. The errors should be approximately normally distributed.
5. The independent variables should not be highly correlated with each
other (no multicollinearity).
Code Window 5
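A minimal sketch, assuming the predictors discussed below (car weight, horsepower, and number of cylinders):

# Fit a multiple linear regression of mpg on weight, horsepower, and cylinders
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Coefficient estimates, R-squared, F-statistic, and p-values
summary(mlr_model)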
Once the model has been fitted, the interpretation of the coefficients is
of great importance. In each case, the coefficient reflects the change in
the dependent variable for a one-unit change in the respective indepen-
dent variable, with all other variables held constant. For example, if the
coefficient for car weight is -0.1, then an increase in car weight of one
unit would reduce mpg by 0.1, assuming horsepower and cylinders are
held constant.
To assess the performance of an MLR model, various statistics like
R-squared, the F-statistic, and p-values can be used. Multiple linear
regression is a powerful technique for modelling and understanding the
relationship between a dependent variable and several independent variables.
The key is the fact that each coefficient reflects the effect of the
corresponding variable, but only after controlling for the influence of the
other variables. For example, β₁ represents the change in mileage with
respect to a change in car weight, holding the other variables (horsepower
and cylinders) constant, while β₂ shows the change in mileage for each
additional unit of horsepower, again holding weight and cylinders constant.
6.7 Heteroscedasticity and Multi-Collinearity
Heteroscedasticity occurs when the variance of the error terms is not
constant across observations, which violates one of the assumptions of
linear regression. All error variances should be approximately constant
at all independent variable levels.
Heteroscedasticity affects the standard errors of the coefficients, biasing
test statistics. The model may be statistically significant when it shouldn’t
be, or vice versa. Residual plots help analysts detect heteroscedasticity.
If the residuals plot as a random cloud around zero, everything is fine.
If they fan out or form a pattern as the independent variable increases,
heteroscedasticity may be present. A potential method to address
heteroscedasticity is to take a log transformation of the dependent
variable, or to use weighted least squares regression, in which more weight
is placed on observations with less variability. To detect heteroscedasticity
in your regression model, you can use residual plots; the code is shown in
code window 6.
Code Window 6
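A minimal sketch of a residual plot, assuming the mlr_model object fitted in code window 5:

# Plot residuals against fitted values to inspect for heteroscedasticity
plot(fitted(mlr_model), residuals(mlr_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted Values")

# Horizontal reference line; a fan shape around it suggests
# non-constant error variance
abline(h = 0, lty = 2)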
If you observe heteroscedasticity in the residual plot, one way to address
it is by applying a log transformation to the dependent variable. The code
is shown in code window 7.
Code Window 7
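A minimal sketch of the log-transformation remedy, assuming the same predictors as before:

# Refit the model with a log-transformed dependent variable
log_model <- lm(log(mpg) ~ wt + hp + cyl, data = mtcars)

# Re-examine the residual plot; the fan pattern should be reduced
plot(fitted(log_model), residuals(log_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)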
Multicollinearity occurs when two or more independent variables in a
regression model are highly correlated with each other. In simple terms,
the independent variables are stepping on each other’s toes and giving
redundant information, thereby making it difficult to separate the
individual effect of each variable on the dependent variable. Consider,
for example, an attempt to forecast a person’s income given his or her
level of education and years of work experience. Since there is a strong tendency
for people with higher levels of education to have more work experience,
the two variables can be highly correlated. This kind of multicollinearity
renders it difficult to distinguish the contribution of each variable
independently to the prediction of income. The regression coefficients can
become unstable and generate widely fluctuating estimates with very
minor changes in the data, thus resulting in an unreliable model.
Analysts commonly use the Variance Inflation Factor (VIF) as a measure
to detect multicollinearity for each independent variable. A high VIF,
usually above 10, means that the variable is highly correlated with at
least one of the other predictors in the model. In case of multicollinearity,
the easy solution is to delete one of the correlated variables or to combine
them into a single, more meaningful variable, for instance by calculating an
index or an average. The code is shown in code window 8.
Code Window 8
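A minimal sketch of a VIF check; the vif() function used here comes from the car package, one common choice for this purpose:

# Variance Inflation Factor for each predictor
library(car)  # provides vif()

mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
vif(mlr_model)  # values above 10 indicate strong multicollinearity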
Both heteroscedasticity and multicollinearity can reduce the effectiveness
of a regression model. Heteroscedasticity interferes with the estimation
of the standard errors of the coefficients, which may lead to misleading
significance tests, while multicollinearity prevents the assessment of the
effects of each predictor. Knowledge of these issues and their detection
and resolution is crucial in developing more reliable and interpretable
regression models.
IN-TEXT QUESTIONS
1. What type of regression model uses one independent variable?
2. What interval is used to estimate the range within which the
true value of a parameter lies?
3. What term describes the condition when the variance of errors
is not constant?
4. What is the term for the relationship between predictors and
the outcome in regression?
5. What type of regression issue arises from high correlations
between predictors?
Code Window 9
6.9.2 Categorization
Categorization refers to the process of assigning text to predefined categories or
labels based on the content. This method is applied in many applications,
including email filtering (spam vs. non-spam), document classification
(business, sports, tech), and sentiment analysis (positive, negative, neutral).
The idea is to map a given text document to one or more categories that
best represent the content. Techniques of categorization involve supervised
learning models, including Naive Bayes, Support Vector Machines
(SVM), and Logistic Regression. Such models require labeled training
data to learn how to classify new, unseen data. Once trained, the model
can predict the category of a new document based on the patterns learned
from the training set. In R, one can do text categorization by creating
a Document-Term Matrix (DTM) and using classification models such as
Naive Bayes. The sample code is shown in code window 10.
Code Window 10
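A minimal sketch of DTM-based categorization, assuming the tm and e1071 packages and a small, hypothetical set of labelled messages:

library(tm)     # for building the Document-Term Matrix
library(e1071)  # for the naiveBayes() classifier

# Hypothetical training texts labelled spam vs. ham
texts  <- c("win a free prize now", "meeting at 10 tomorrow",
            "free cash offer inside", "lunch with the team")
labels <- factor(c("spam", "ham", "spam", "ham"))

# Build the corpus and the Document-Term Matrix
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm    <- DocumentTermMatrix(corpus)

# Train a Naive Bayes classifier on the term counts
train <- as.data.frame(as.matrix(dtm))
model <- naiveBayes(train, labels)

# Classify a new, unseen message using the training vocabulary
new_corpus <- VCorpus(VectorSource("claim your free prize"))
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(dtm)))
predict(model, as.data.frame(as.matrix(new_dtm)))

Restricting the new document's DTM to the training dictionary ensures its columns line up with the features the classifier was trained on.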
Code Window 11
IN-TEXT QUESTIONS
6. What technique is used to classify text into predefined categories?
7. What statistical tool helps identify sentiment in text?
8. What analysis method breaks down text into meaningful elements
for extraction?
6.10 Summary
This lesson introduces essential concepts in regression analysis, starting
from simple and multiple linear regression models. It shows the role of
confidence and prediction intervals in statistical prediction and teaches
how to interpret regression coefficients. It also addresses two common
challenges: heteroscedasticity and multicollinearity. The second half of
the lesson covers textual data analysis techniques such as text mining,
categorization, and sentiment analysis, giving an overview of how such
techniques can be applied to extract insights from unstructured text
data. The practical use of R throughout ensures students are well-
equipped to not only carry out statistical analyses but also to perform
text analysis.
6.13 References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction
to statistical learning: With applications in R. Springer.
Fox, J. (2016). Applied regression analysis and generalized linear
models (3rd ed.). Sage Publications.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy
approach. O’Reilly Media.