
LESSON 6

Predictive and Textual Analytics
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Simple Linear Regression Models
6.4 Confidence and Prediction Intervals
6.5 Multiple Linear Regression
6.6 Interpretation of Regression Coefficients
6.7 Heteroscedasticity and Multi-Collinearity
6.8 Basics of Textual Data Analysis
6.9 Methods and Techniques of Textual Analysis
6.10 Summary
6.11 Answers to In-Text Questions
6.12 Self-Assessment Questions
6.13 References
6.14 Suggested Readings

6.1 Learning Objectives


After reading this lesson, students will be able to:
Explain the differences between simple and multiple linear regression.
Describe the purpose of confidence and prediction intervals.
Evaluate the impact of heteroscedasticity and multicollinearity on regression models.
Define and analyze the performance of text mining techniques.
PAGE 119
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Business Analytics.indd 119 10-Jan-25 3:52:12 PM


BUSINESS ANALYTICS

Notes
6.2 Introduction
In this lesson we will focus on two important types of analytics for business:
predictive and textual, each having its own importance and techniques.
Predictive analytics, often referred to as advanced analytics, is very closely
linked with business intelligence, giving organizations actionable insights
for better decision-making and planning. For instance, if an organization
wants to know how much profit it will make a few years later based on
current trends in sales, customer demographics, and regional performance,
it can benefit from predictive analytics. It uses techniques such
as data mining and artificial intelligence to predict outcomes like future
profit or other factors that may be critical to the success of the
organization. At its core, predictive analytics is a process that makes informed
predictions about future events based on historical data. Analysts and data
scientists apply statistical models, regression techniques, and machine
learning algorithms to identify trends in historical data, so businesses
can predict risks and trends and prepare for future events.
Predictive analytics is changing industries because it can facilitate
data-driven decision-making and operational efficiency. In marketing, it
aids businesses in understanding customer behavior, segmenting leads,
and targeting high-value prospects. Retailers apply predictive analytics
in order to personalize shopping experiences, optimize pricing strategies,
and merchandising plans. Manufacturing uses predictive analytics to
enhance the monitoring of machine performance, avoid equipment failures,
and smooth logistics. Fraud detection, credit scoring, churn analysis,
and risk assessment are some of the benefits for the financial sector. Its
applications in healthcare include personalized care, resource allocation,
and identification of high-risk patients for timely interventions. As tools
get better, predictive analytics continues revolutionizing industries with
smarter insights and better outcomes.
Textual analysis is the systematic examination and interpretation of textual
data in order to draw meaningful insights, patterns, and trends. Different
techniques and methods are involved in processing unstructured text,
including various forms of documents, social media posts, and customer
reviews. The aim of textual analysis is to transform raw text into input
that is accessible and useful for decision-making



PREDICTIVE AND TEXTUAL ANALYTICS

processes or for research. This can mean identifying trending words,
categorizing text by topic, or analyzing the sentiment associated with a
particular piece of text. Textual analysis is the backbone of a number of
fields, such as marketing, social science, and business intelligence, to
name only a few, garnering valuable insights from huge streams of
unstructured text.

6.3 Simple Linear Regression Models


We examined relationships between variables in Lesson 5. Simple linear
regression is a statistical learning method used to examine or predict
the quantitative relationship between two continuous variables: an
independent variable called the predictor (X) and a dependent variable
called the response (Y). This method helps us model the linear
relationship between the variables and make predictions, assuming there
is an approximately linear relationship between the independent variable
X and the dependent variable Y. Mathematically, we can write this
linear relationship as:

Y ≈ β0 + β1X

where β0 (the intercept) and β1 (the slope) are model coefficients which are unknown constants.

We can build a simple linear regression model in R using the following steps:
The first step is to prepare the data. For this we need to ensure
that the data set is clean and contains no missing values for the
variables involved. We load the data into R using functions
like read.csv() or read.table().
Once the data is loaded, we visualize it using a scatter plot.
After that we fit the regression model using the lm() function, and
then use the summary() function to understand the details of
the model.
We can also make predictions using the predict() function.
We have already studied the reading techniques in previous lessons. We
will focus on linear regression functions in this section.
The lm() function in R is a built-in function for fitting linear models,
both simple and multiple linear regression. It estimates the relationship




between a dependent variable and one or more independent variables


using the method of least squares.
Let us consider the code given in code window 1 that applies linear
regression on the mtcars dataset.

Code Window 1
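The printed code window is an image and is not reproduced in this text. A minimal R sketch consistent with the description that follows might be (the value hp = 150 is an arbitrary illustrative choice):

```r
# Load the built-in mtcars dataset
data(mtcars)

# Visualize the data with a scatter plot
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower (hp)", ylab = "Miles per gallon (mpg)",
     main = "mpg vs. hp")

# Fit a simple linear regression model predicting mpg from hp
model <- lm(mpg ~ hp, data = mtcars)
summary(model)  # coefficients, R-squared, p-values

# Add the fitted regression line to the scatter plot
abline(model, col = "red", lwd = 2)

# Predict mpg for a specific horsepower value
predict(model, newdata = data.frame(hp = 150))

# Plot residuals to check for leftover patterns
plot(mtcars$hp, resid(model),
     xlab = "Horsepower (hp)", ylab = "Residuals")
abline(h = 0, lty = 2)
```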
In code window 1 we have shown the code for linear regression on the
built-in dataset “mtcars” available in R. The goal here is to predict how




changes in horsepower ‘hp’ influence miles per gallon ‘mpg’. So, in this
example we will examine information regarding “mpg” and “hp”. Since
the data is already prepared and formatted, we visualize the data using
scatter plot. This will help us analyse whether there is a pattern or trend
in the data that is worth exploring further.
Next, we try to fit a simple linear regression model by using the formula
“lm(mpg ~ hp, data = mtcars)”, this predicts mpg based on hp. After
fitting the model, we enhance our scatter plot by adding a regression line
using the abline() function to show the model’s predicted relationship. To
make the model useful, we use the predict() function to estimate mpg for
a specific horsepower value. This prediction tells us what fuel efficiency
we might expect for such a car. Finally, we examine the model’s
accuracy by plotting its residuals, the differences between the observed and
predicted values. This step helps us check if there are any patterns in
the errors, which could indicate that the model is missing some critical
information.

6.4 Confidence and Prediction Intervals


In predictive analysis, confidence intervals and prediction intervals are
two critical tools that can be used to quantify the uncertainty surrounding
any statistical estimates or predictions. Both of them give a quantitative
indication of ranges in which the true values are expected to lie, yet
they have different uses. They play an important role in interpreting and
evaluating the regression model. They provide insight into the accuracy
of parameter estimates and the range within which individual predictions
are likely to fall.
In the case of simple linear regression, a confidence interval (CI) is used to
estimate the range within which the actual mean of the dependent variable
(y) lies for a given value of the independent variable (x). For example,
assume that you have fitted a regression model to predict the mileage (mpg)
of a car given its weight (wt). A 95% confidence interval might indicate
that the mean mileage for cars whose weight is 3.0 lies between 18.5 and
20.0 mpg. Code for this is given in code window 2.




Code Window 2
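The code window itself is an image; a likely reconstruction in R, following the description below, is:

```r
# Fit the model: mpg as a function of car weight
model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence interval for the mean mpg at weight = 3.0
predict(model, newdata = data.frame(wt = 3.0),
        interval = "confidence", level = 0.95)
# Returns the fitted value plus lwr and upr bounds for the mean response
```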
In the code above, a simple linear regression model is fitted to the mtcars
dataset using lm(mpg ~ wt), where mpg is the dependent variable, and
wt is the independent variable that is the weight of the car. The model
predicts how car weight influences fuel efficiency. Then we have used
the predict function to compute a confidence interval for the mean
response (mpg) for a car with weight = 3.0. To do this, specify interval
= “confidence” and level = 0.95. The function will compute the range
in which the average mileage for cars weighing 3.0 will fall, with 95%
confidence. This computation gives the lower and upper bound of the
mean mpg, thus making it possible to evaluate the precision and reliability
of the estimated mean value.
A prediction interval (PI) estimates the interval in which the actual value
of the dependent variable (y) is likely to lie for a given x. Unlike CIs,
prediction intervals also include the variability of the residual error (ε).
Thus, a prediction interval gives a range for an individual data point rather than the
mean response. For example, let’s consider the same regression model
as above, a prediction interval for a car with weight 3.0 might indicate
that its mileage lies between 16.0 and 22.0 mpg, with 95% confidence.
R code for same is given in code window 3.

Code Window 3
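Again the code window is an image; the corresponding R call, as a sketch, simply swaps the interval type:

```r
# Same model as before
model <- lm(mpg ~ wt, data = mtcars)

# 95% prediction interval for an individual car with weight = 3.0
predict(model, newdata = data.frame(wt = 3.0),
        interval = "prediction", level = 0.95)
# The lwr/upr range is wider than that of the confidence interval
```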



Hence, confidence intervals are narrower in range than prediction intervals
since they only consider the uncertainty of the estimate of the mean of
response variable. Prediction intervals are larger because they encompass
the variability of individual observations about the regression line as well
as the uncertainty in the mean.
The visualization code and output of confidence and prediction interval
is shown in code window 4.

Code Window 4
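The visualization code is not reproduced; one possible R sketch matching the description below (blue regression line, green confidence band, red prediction band, legend in the top-right corner) uses polygon() for the shaded areas:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Grid of weights spanning the observed range
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)
newdata <- data.frame(wt = wt_grid)

ci <- predict(model, newdata, interval = "confidence", level = 0.95)
pi <- predict(model, newdata, interval = "prediction", level = 0.95)

# Scatter plot of weight vs. mileage
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Prediction interval as a red shaded band (drawn first, since it is wider)
polygon(c(wt_grid, rev(wt_grid)), c(pi[, "lwr"], rev(pi[, "upr"])),
        col = rgb(1, 0, 0, 0.2), border = NA)

# Confidence interval as a green shaded band
polygon(c(wt_grid, rev(wt_grid)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = rgb(0, 1, 0, 0.3), border = NA)

# Fitted regression line in blue
abline(model, col = "blue", lwd = 2)

legend("topright",
       legend = c("Regression line", "Confidence interval", "Prediction interval"),
       col = c("blue", "green", "red"), lwd = c(2, 8, 8))
```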
In this code we have visualized car weight against mileage for all the
cars in the mtcars dataset. The plot function is used to make a scatter




plot of weight vs. mileage, and the abline function plots the fitted
regression line in blue with a width of 2, representing the linear relationship
between the two variables. Two shaded areas are added to the plot to
represent the confidence intervals and prediction intervals. The confidence
interval is represented by a green shaded area around the regression line
that indicates the range within which the mean predicted values lie for a
given car weight. Similarly, the prediction interval is represented as a red
shaded area, which is the range in which individual predictions for new
data points are likely to fall. Finally, a legend is added to the top-right
corner of the plot to distinguish the regression line, confidence interval,
and prediction interval. The legend uses different colors and labels to
make the plot easy to interpret, with blue for the regression line, green
for the confidence interval, and red for the prediction interval.
Thus both intervals serve analysts, decision-makers, and researchers in
quantifying uncertainty, improving predictions, and drawing appropriate
conclusions from regression models.

6.5 Multiple Linear Regression


Multiple Linear Regression (MLR) is just an extension of simple linear
regression that models the relationship between two or more independent
variables and a dependent variable. In MLR, the dependent variable is
predicted using a linear combination of multiple independent variables.
This method can be very helpful when we want to understand the
influence of several independent factors on a single outcome or target variable.
The mathematical equation for MLR is:

y = β0 + β1x1 + β2x2 + … + βnxn + ε

where y is the dependent variable, β0 is the intercept term that represents
the value of y when all independent variables are 0, x1, x2, …, xn are the
independent variables, β1, β2, …, βn are their coefficients, and ε is the
error term.

For an MLR model to give valid results we need to fulfil a few
assumptions that are stated below:
The relationship between the independent and dependent variables
must be linear.




Observations must be independent.


Variance of the error must be constant across the ranges of independent
variables i.e. homoscedasticity.
Residuals or error terms must be normally distributed.
Independent variables should not have strong mutual correlations,
i.e. no multicollinearity.
The R code is given in code window 5.

Code Window 5
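The code window is an image; a minimal R sketch of an MLR fit consistent with the example variables used in this section (weight, horsepower, cylinders) might be:

```r
# Multiple linear regression: mpg predicted from weight, horsepower, cylinders
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)  # coefficients, R-squared, F-statistic, p-values

# Predict mpg for a hypothetical car (illustrative values)
predict(model, newdata = data.frame(wt = 3.0, hp = 120, cyl = 6))
```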
Once the model has been fitted, the interpretation of the coefficients is
of great importance. In each case, the coefficient reflects the change in
the dependent variable for a one-unit change in the respective
independent variable, with all other variables held constant. For example, if the
coefficient for car weight is -0.1, then an increase in car weight of one
unit would reduce mpg by 0.1, assuming horsepower and cylinders are
held constant.
To assess the performance of MLR, various statistics like R-squared, the
F-statistic, and p-values can be used. Multiple Linear Regression is a powerful
technique for modelling and understanding the relationship between a




dependent variable and multiple independent variables. By fitting a model


and interpreting the coefficients, we can make predictions and assess the
effect of each predictor. Proper model evaluation, diagnostics, and check-
ing assumptions are necessary to ensure the reliability and validity of
the results. Multiple linear regression is widely applied in various fields,
such as economics, finance, marketing, and engineering. It applies where
complex relations between variables need to be understood and predicted.

6.6 Interpretation of Regression Coefficients


When working with a regression model, coefficients hold very important
meaning. They determine how each independent variable causes the
dependent variable to move; in simple terms, they describe the relationship
of the predictor variables to the outcome we are trying to predict. Let
us understand this concept with the example of predicting car mileage
from car weight, using the equation of linear regression given
in section 6.3.
The intercept β0 would tell you the expected mileage for a car weighing
zero pounds; in reality the weight of a car is never zero, so the intercept
is not directly meaningful here, but it does describe the starting point of
the line. The slope β1 represents the change in y corresponding to every
one-unit change in x, i.e. how much the mileage is expected to increase
or decrease for every additional unit of car weight. The direction of this
relationship is indicated by the positive or negative sign of the coefficient.
The multiple linear regression interpretation is more complex, but the
same idea is still there. Suppose we have a model with more than one
predictor, such as predicting car mileage using weight, horsepower, and
the number of cylinders. The model might look something like this:

mpg = β0 + β1·weight + β2·horsepower + β3·cylinders + ε

The key is that each coefficient reflects the effect of the corresponding
variable, but only after controlling for the influence of the other
predictors: β1 represents the change in mileage with respect to a change
in car weight, holding horsepower and cylinders constant, while β2 shows
the change in mileage for each additional unit of horsepower, again
holding weight and cylinders constant.




Understanding these coefficients is important because it helps us
interpret the model in the context of the real-world problem we’re trying to
solve. Are the results meaningful? Do they make sense in the context of
what we know about the subject matter? For example, in the car mileage
example, it would be reasonable to expect that heavier cars generally
have lower mileage, and so a negative coefficient for weight might align
with our expectations. It is also critical to note that coefficients
alone don’t tell everything. They have to be seen in the context of other
statistical tests, like p-values and confidence intervals, to determine their
reliability. A coefficient with a very high p-value may indicate that the
corresponding predictor may not be as important as we assumed, and its
influence on the dependent variable may not be statistically significant.
Ultimately, the interpretation of regression coefficients is about under-
standing how the variables in your model relate to each other, and how
changes in one predictor may influence the outcome. From there, you
can make more informed decisions, predictions, and conclusions through
careful examination of these relationships.

6.7 Heteroscedasticity and Multi-Collinearity


As we have explained in the previous section, when building regression
models, apart from the relationship between the variables, it is important to
guarantee that certain assumptions of the model are met. For instance,
two of these assumptions are homoscedasticity (which will be contrasted
with heteroscedasticity), and no multicollinearity. These are all important
because violations of assumptions can impact the reliability of your model
and lead one to incorrect conclusions.
Heteroscedasticity is defined as the case when the variation of errors
(i.e. the differences between observed and predicted values) varies with
the levels of the independent variables. In simple words, it means that
whenever the value of the independent variable is changed, the spread
or dispersion of the residuals also varies.
Consider the example of predicting house prices from square footage.
Heteroscedasticity occurs if the error variance for predictions on larger
houses is greater than for smaller houses. The errors, or rather residuals,
for large houses are spread further out than they are for smaller houses, violating




the assumptions of linear regression. All error variances should be
approximately constant at all independent variable levels.
Heteroscedasticity affects the standard errors of the coefficients, biasing
test statistics. The model may be statistically significant when it shouldn’t
be, or vice versa. Residual plots help analysts detect heteroscedasticity.
If the residuals plot as a random cloud around zero, everything is fine.
If they fan out or form a pattern as the independent variable increases,
heteroscedasticity may be present. A potential method to address
heteroscedasticity is to take a log transformation of the dependent variable, or
use weighted least squares regression in which more weight is placed
on observations with less variability. To detect heteroscedasticity in your
regression model, you can use residual plots, the code is shown in code
window 6.

Code Window 6
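The code window is an image; a minimal residual-plot sketch in R might be:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Residuals vs. fitted values: a random cloud around zero suggests
# homoscedasticity; a fan or funnel shape suggests heteroscedasticity
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2, col = "red")
```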




If you observe heteroscedasticity in the residual plot, one way to address
it is by applying a log transformation to the dependent variable. The code
is shown in code window 7.

Code Window 7
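The code window is an image; a sketch of the log-transformation remedy might be:

```r
# Refit the model with a log-transformed dependent variable
log_model <- lm(log(mpg) ~ wt, data = mtcars)

# Re-examine the residual plot for a more constant spread
plot(fitted(log_model), resid(log_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals after log transformation")
abline(h = 0, lty = 2, col = "red")
```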
Multicollinearity occurs when two or more independent variables in a
regression model are highly correlated with each other. In simple terms,
the independent variables are stepping on each other’s toes, giving
redundant information and making it impossible to separate the
individual effect of each variable on the dependent variable. For example,
suppose we attempt to forecast a person’s income given his or her level of
education and years of work experience. Since people with higher levels
of education tend to have more work experience, the two variables can be
highly correlated. This kind of multicollinearity makes it difficult to
distinguish the independent contribution of each variable to the prediction
of income. The regression coefficients can
become unstable and generate widely fluctuating estimates with very
minor changes in data thus resulting in an unreliable model.
Analysts commonly use the Variance Inflation Factor (VIF) as a measure
to detect multicollinearity for each independent variable. A high VIF,
usually above 10, means that the variable is highly correlated with at
least one of the other predictors in the model. In case of multicollinearity,




the easy solution is to delete one of the correlated variables or combine
them into one more meaningful variable, for instance by calculating an
index or an average. The code is shown in code window 8.

Code Window 8
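The code window is an image; a sketch using the vif() function from the car package (an assumption, as the lesson does not name the package) might be:

```r
# install.packages("car")   # if not already installed
library(car)

model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Variance Inflation Factor for each predictor;
# values above 10 indicate strong multicollinearity
vif(model)
```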
Both heteroscedasticity and multicollinearity can reduce the effectiveness
of a regression model. Heteroscedasticity interferes with the estimation
of the standard errors of the coefficients, which may lead to misleading
significance tests, while multicollinearity prevents the assessment of the
effects of each predictor. Knowledge of these issues and their detection
and resolution is crucial in developing more reliable and interpretable
regression models.
IN-TEXT QUESTIONS
1. What type of regression model uses one independent variable?
2. What interval is used to estimate the range within which the
true value of a parameter lies?
3. What term describes the condition when the variance of errors
is not constant?
4. What is the term for the relationship between predictors and
the outcome in regression?
5. What type of regression issue arises from high correlations
between predictors?

6.8 Basics of Textual Data Analysis


Textual analysis includes information extraction from text, such as emails,
reviews, or posts on social media. It is commonly applied in tasks such




as sentiment analysis, keyword extraction, and text classification. Basic
steps of text analysis are:
Text Preprocessing, i.e. cleaning and preparing the text for the
analysis process, for example converting to lowercase and removing stop
words such as “the” and “is”, punctuation, and special characters.
Tokenization, i.e. dividing a text into smaller tokens or units
such as words, phrases, etc.
Text Representation, i.e. converting text into a form that is ready
to be analysed, for instance a frequency count of words or a
bag-of-words representation.

6.9 Methods and Techniques of Textual Analysis


Three key methods of textual analysis are text mining, categorization,
and sentiment analysis. Text mining provides the fundamental tools for
cleaning and extracting features from raw text; categorization helps classify
text into predefined categories; and sentiment analysis gives insights into
the emotional tone of the text. Together, these methods unlock valuable
insights from large volumes of unstructured text data, making it more
useful for business, research, and decision-making purposes.

6.9.1 Text Mining


The process of extracting useful information and knowledge from unstructured
text is called text mining. It involves transforming text data into a
structured format that can be analyzed further. Text mining allows discovery of
patterns, relationships, and trends within large collections of text: books,
articles, reviews, social media posts, and many more. Text mining
commonly includes text preprocessing, tokenization, word frequency analysis,
and also more complex techniques such as topic modeling and clustering.
In text mining, the main objective is to draw out patterns and gain deeper
insight into the content. For example, we could apply text mining to the
analysis of customer reviews in order to understand common complaints,
frequently mentioned features, or overall satisfaction. R has an extensive
list of libraries, such as tm, tidytext, and textTinyR, that make the process
of text mining easier with functions for cleaning text, tokenization, and
much more. Code window 9 shows sample code for text mining.




Code Window 9
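The code window is an image; a sketch using the tm package, with made-up example reviews, might be:

```r
# install.packages("tm")   # if not already installed
library(tm)

# A small illustrative set of customer reviews (made-up data)
docs <- c("Great product, fast delivery",
          "Terrible support, product broke",
          "Fast delivery and great support")

# Build and clean the corpus
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Document-Term Matrix and word frequencies
dtm <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq)
```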

6.9.2 Categorization
Categorization refers to the process of assigning text to predefined
categories or labels based on its content. This method is applied in many applications,
including email filtering (spam vs. non-spam), document classification
(business, sports, tech), and sentiment analysis (positive, negative, neutral).
The idea is to map a given text document to one or more categories that
best represent the content. Categorization techniques involve supervised
learning models including Naive Bayes, Support Vector Machines
(SVM), and Logistic Regression. Such models require labeled training
data to learn how to classify new, unseen data. Once trained, the model
can predict the category of a new document based on the patterns learned
from the training set. In R, one can do text categorization by creating
a Document-Term Matrix (DTM) and using classification models such as
Naive Bayes. The sample code is shown in code window 10.

Code Window 10
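The code window is an image; a sketch using tm and the naiveBayes() classifier from e1071, on made-up labelled data, might be (word counts are converted to presence/absence factors so that the classifier treats them as categorical):

```r
# install.packages(c("tm", "e1071"))   # if not already installed
library(tm)
library(e1071)

# Tiny labelled training set (made-up data)
texts  <- c("win a free prize now", "meeting at noon tomorrow",
            "free cash offer", "project status update")
labels <- factor(c("spam", "ham", "spam", "ham"))

# Build a Document-Term Matrix
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# Convert counts to presence/absence factors
train <- as.data.frame(as.matrix(dtm) > 0)
train[] <- lapply(train, factor, levels = c(FALSE, TRUE))

# Train a Naive Bayes classifier and predict on the training documents
nb_model <- naiveBayes(train, labels, laplace = 1)
predict(nb_model, train)
```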




6.9.3 Sentiment Analysis


Sentiment analysis is the process of determining the emotional tone or
sentiment behind a piece of text. The aim is to classify text as expressing
a positive, negative, or neutral sentiment. This technique is widely used
for analyzing customer feedback, product reviews, social media posts,
and other forms of text to gauge public opinion or sentiment about a
particular topic.
There are two main approaches in sentiment analysis:
Lexicon-based Methods: These use pre-defined dictionaries of words
with positive, negative, or neutral sentiments. Words within the text are
matched up against the dictionary, and their sentiment is determined based
on the frequency or intensity of positive or negative words.
Machine Learning-based Approaches: These involve training a model
on labelled text data, wherein the sentiment is known in advance, and then
applying that model to classify new text. Techniques like Naive Bayes,
Support Vector Machines, and deep learning can be applied here.
R provides libraries like syuzhet, tidytext, and sentimentr to perform
sentiment analysis. The sample code is given in code window 11.

Code Window 11
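The code window is an image; a sketch using the syuzhet package, on made-up example sentences, might be:

```r
# install.packages("syuzhet")   # if not already installed
library(syuzhet)

reviews <- c("I absolutely love this product!",
             "This is the worst purchase I have ever made.",
             "It arrived on time.")

# Lexicon-based sentiment scores: positive values indicate positive
# sentiment, negative values negative sentiment, near zero neutral
scores <- get_sentiment(reviews, method = "syuzhet")
scores

# Map numeric scores to sentiment labels
ifelse(scores > 0, "positive", ifelse(scores < 0, "negative", "neutral"))
```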

IN-TEXT QUESTIONS
6. What technique is used to classify text into predefined categories?
7. What statistical tool helps identify sentiment in text?
8. What analysis method breaks down text into meaningful elements
for extraction?



6.10 Summary
This lesson introduces essential concepts in regression analysis, starting
from simple and multiple linear regression models. It shows the role of
confidence and prediction intervals in statistical prediction and teaches
how to interpret regression coefficients. It also addresses two common
challenges: heteroscedasticity and multicollinearity. The second half of
the lesson covers textual data analysis techniques such as text mining,
categorization, and sentiment analysis, outlining an overview of how such
techniques can be applied in the extraction of insights from unstructured
text data. The practical use of R ensures students are well-equipped
not only to carry out statistical analyses but also to perform
text analysis.

6.11 Answers to In-Text Questions


1. Simple
2. Confidence
3. Heteroscedasticity
4. Coefficients
5. Multicollinearity
6. Categorization
7. Sentiment
8. Text Mining

6.12 Self-Assessment Questions


1. Explain the difference between a confidence interval and a prediction
interval in regression analysis.
2. Describe how multicollinearity can affect the results of a multiple
linear regression model.
3. How does sentiment analysis help in analyzing customer feedback?
4. Explain heteroscedasticity in a regression model.



6.13 References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction
to statistical learning: With applications in R. Springer.
Fox, J. (2016). Applied regression analysis and generalized linear
models (3rd ed.). Sage Publications.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy
approach. O’Reilly Media.

6.14 Suggested Readings


Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of
statistical learning: Data mining, inference, and prediction (2nd
ed.). Springer.
Wickham, H., & Grolemund, G. (2017). R for data science: Import,
tidy, transform, visualize, and model data. O’Reilly Media.
Provost, F., & Fawcett, T. (2013). Data science for business: What
you need to know about data mining and data-analytic thinking.
O’Reilly Media.

