Ba Notes Short
Definition:
Data Science is the study of data using tools, algorithms, and machine learning techniques to
extract knowledge and insights from both structured and unstructured data.
Importance:
Example:
Banks use data science to detect fraud by analyzing spending patterns and alerting unusual
activities.
3. What are the main classifications of analytics? Provide examples for each.
1. Descriptive Analytics
o Tells what happened.
o Example: Monthly sales reports.
2. Diagnostic Analytics
o Tells why it happened.
o Example: Drop in sales due to low customer satisfaction.
3. Predictive Analytics
o Forecasts what is likely to happen.
o Example: Predicting next month’s demand using previous data.
4. Prescriptive Analytics
o Suggests actions to take.
o Example: Recommending optimal inventory levels based on forecasts.
Example:
Amazon uses data analytics to recommend products and manage logistics efficiently.
Example:
Google processes more than 3.5 billion searches per day – a classic example of big data.
Would you like me to continue with the next five questions (6–10)?
You said:
yes
ChatGPT said:
Great! Here are the answers for Questions 6–10, written clearly and simply for 15-mark
answers:
2. Retail:
4. Manufacturing:
2. Data Integration:
• Data from various sources (social media, sensors, etc.) is hard to merge.
• Need for experts in data science, statistics, and tools like R, Python.
6. Infrastructure Costs:
Definition:
Predictive modeling uses historical data to make future predictions using statistical and
machine learning techniques.
1. Identifies Patterns:
o Finds trends in past data to build predictive models.
2. Improves Accuracy:
o Uses algorithms (like regression, decision trees) to predict future events.
3. Optimizes Decisions:
o Helps in decision-making by forecasting demand, sales, or risks.
4. Applications:
o Marketing: Predicting customer churn.
o Finance: Forecasting stock prices.
o Retail: Demand forecasting.
o Healthcare: Predicting disease outbreaks.
1. Personalized Recommendations:
2. Real-Time Support:
4. Targeted Marketing:
Example:
Spotify uses Big Data to create personalized playlists for users.
1. Data Privacy:
• Organizations must protect personal data and comply with laws (e.g., GDPR).
2. Consent:
3. Bias in Algorithms:
4. Transparency:
5. Data Security:
6. Responsible Use:
• Data should be used for genuine and beneficial purposes, not manipulation.
Example:
Facebook faced criticism for misuse of user data during political campaigns.
Would you like me to continue with Questions 11–15 (Data Preparation section)?
You said:
yes
ChatGPT said:
Here are the answers to Questions 11–15 from Unit 2: Data Preparation, Summarization,
and Visualization using Spreadsheets — explained simply and suitable for 15 marks:
11. What is data preparation, and why is it important?
Definition:
Data preparation is the process of collecting, cleaning, organizing, and transforming raw data
into a structured format suitable for analysis.
Importance:
Example:
Removing empty rows and correcting spelling errors in customer feedback before analysis.
1. Remove Duplicates:
o Delete repeated rows or entries.
2. Handle Missing Values:
o Fill them with the mean or median, or remove the affected rows, depending on the case.
3. Correct Inconsistencies:
o Fix issues like "USA" and "United States" being treated as separate.
4. Standardize Formats:
o Dates, numbers, text should follow a consistent format.
5. Check for Outliers:
o Unusually high or low values should be investigated.
6. Spell Checks & Text Corrections:
o Especially important in survey or textual data.
7. Filter Irrelevant Data:
o Remove columns or rows that don’t add value to the analysis.
Sorting:
• Arranges data in ascending or descending order.
• Helps find top or bottom performers easily.
• Example: Sorting sales from highest to lowest.
Filtering:
Benefits:
Definition:
Conditional formatting highlights cells based on certain rules or values using colors, icons, or
data bars.
Uses:
Examples:
Definition:
"Text to Column" is an Excel feature that splits text in one column into multiple columns
based on a separator like comma, space, or tab.
Steps:
You said:
yes
ChatGPT said:
Here are the answers for Questions 16–20 from Unit 2 (continued) — explained clearly for
15-mark answers:
Tips:
Example:
If a customer survey has repeated entries from the same person, removing duplicates ensures
fair analysis.
17. Define data validation and its purpose.
Definition:
Data validation is a feature in spreadsheets used to control what type of data can be entered in
a cell.
Purpose:
1. Prevents Data Entry Errors – e.g., restricts a cell to only numbers between 1 and
100.
2. Maintains Data Consistency – ensures formats and values are uniform.
3. Guides Users – with drop-down lists or input messages.
4. Reduces Cleaning Work – correct data is entered from the beginning.
Examples:
Definition:
An outlier is a data point that is significantly different from other values in a dataset.
• They can skew results, affect averages, or indicate data entry errors or unique cases.
How to Identify:
1. Visual Methods:
o Box Plot: Values outside whiskers are outliers.
o Scatter Plot: Isolated points far from the cluster.
2. Statistical Methods:
o Z-Score: If >3 or <-3, it’s likely an outlier.
o IQR (Interquartile Range):
Outliers = Values below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.
3. Using Excel:
o Use formulas or create box plots with chart tools.
Example:
A salary of ₹10 lakh in a dataset of salaries around ₹30,000–₹50,000 is an outlier.
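The two statistical rules above can be tried out in a few lines of base R. This is only a sketch: the salary figures are invented for illustration, and the cut-offs (1.5×IQR and |z| > 3) are the conventional ones quoted in the notes.

```r
# Hypothetical monthly salaries (in rupees); the last value is deliberately extreme.
salary <- c(30000, 32000, 35000, 38000, 40000, 42000, 45000, 50000, 1000000)

# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1 <- quantile(salary, 0.25, names = FALSE)
q3 <- quantile(salary, 0.75, names = FALSE)
iqr <- q3 - q1
iqr_outlier <- salary < (q1 - 1.5 * iqr) | salary > (q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean.
z <- (salary - mean(salary)) / sd(salary)
z_outlier <- abs(z) > 3

salary[iqr_outlier]   # values flagged by the IQR rule
any(z_outlier)        # whether the Z-score rule flags anything
```

In this small sample the IQR rule flags the ₹10 lakh salary but the Z-score rule does not, because a single extreme value inflates both the mean and the standard deviation. With very few observations the IQR rule is usually the safer check.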
Definition:
Covariance measures how two variables change together.
Importance:
1. Relationship Detection:
o Shows if variables move in the same or opposite direction.
2. Basis for Correlation & Regression:
o Covariance is used to calculate correlation.
3. Decision-Making:
o In finance, covariance helps understand how stock returns move together.
4. Predictive Modeling:
o Helps in selecting variables for multivariate analysis.
Example:
If advertising spend and sales have positive covariance, higher ad spend tends to go with higher sales (an association, not proof that ads cause the sales).
Definition:
A correlation matrix is a table showing correlation coefficients between many variables.
Key Features:
Role in Analysis:
Example:
In a sales dataset, a correlation matrix can show how sales relate to advertising, discounts, or
social media engagement.
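As a rough illustration, a correlation matrix for a small sales dataset can be produced with base R's cor(). The data frame and its column names below are assumptions made up for this sketch.

```r
# Hypothetical sales dataset; all numbers are invented for illustration.
sales_data <- data.frame(
  sales       = c(120, 150, 170, 200, 230, 260),
  advertising = c(10, 14, 15, 20, 24, 27),
  discount    = c(5, 3, 6, 2, 4, 1)
)

# cor() on a numeric data frame returns the full correlation matrix.
corr_matrix <- cor(sales_data)
round(corr_matrix, 2)
```

The diagonal is always 1 (each variable with itself) and the matrix is symmetric, so only the values above or below the diagonal need to be read.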
You said:
yes
ChatGPT said:
Here are the answers for Questions 21–25 from Unit 2 (continued) — written simply and
clearly for 15-mark answers:
21. How can missing values be found and handled in data preparation?
1. Manual Check:
o Look for blank cells in Excel.
2. Use Formulas:
o =ISBLANK(A2) returns TRUE for empty cells.
3. Filters:
o Use filters to show only empty rows.
4. In R or Python:
o Use is.na() in R, or isnull() / isna() from the pandas library in Python, to detect missing values.
1. Deletion:
o Remove rows or columns with many missing values.
2. Imputation:
o Replace missing values with:
▪ Mean, median, or mode.
▪ Previous/next values.
3. Prediction:
o Use machine learning to guess missing values.
4. Flagging:
o Create a new column indicating missing data.
Importance:
• Unhandled missing data can lead to inaccurate analysis or errors in calculations.
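The detection and handling options listed above might look like this in base R. The data frame is hypothetical, and the median is used for imputation only as one reasonable choice.

```r
# Hypothetical scores with two missing entries.
scores <- data.frame(Name = c("A", "B", "C", "D", "E"),
                     Score = c(80, NA, 70, NA, 90))

# Detection: count missing values per column.
colSums(is.na(scores))

# Flagging: record which rows were missing before changing anything.
scores$Score_missing <- is.na(scores$Score)

# Imputation: fill missing scores with the median of the observed ones.
scores$Score_filled <- scores$Score
scores$Score_filled[is.na(scores$Score)] <- median(scores$Score, na.rm = TRUE)

# Deletion: alternatively, keep only rows with an observed Score.
complete_rows <- scores[!is.na(scores$Score), ]
```

Keeping the flag column means the analysis can later check whether imputed rows behave differently from observed ones.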
Definition:
Data summarisation is the process of condensing large datasets into meaningful information
using statistical summaries.
1. Central Tendency:
o Mean, Median, Mode.
2. Dispersion:
o Range, Variance, Standard Deviation.
3. Frequencies:
o Count how often each value appears.
4. Group-wise Summary:
o Average sales by product category.
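A minimal sketch of these four kinds of summary in base R, using a made-up sales table (the categories and amounts are assumptions for illustration):

```r
# Hypothetical store sales by product category.
sales <- data.frame(
  category = c("Toys", "Toys", "Books", "Books", "Books", "Games"),
  amount   = c(100, 140, 60, 80, 100, 200)
)

# Central tendency and dispersion for one column.
mean(sales$amount)
median(sales$amount)
sd(sales$amount)

# Frequencies: how often each category appears.
table(sales$category)

# Group-wise summary: average sales by category.
avg_by_category <- aggregate(amount ~ category, data = sales, FUN = mean)
avg_by_category
```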
Benefits:
Example:
In a list of sales from 100 stores:
23. Compare scatter plots, line charts, and histograms. When should each be
used?
Chart Type | Purpose | Best Used For
Scatter Plot | Show relationship between two variables | Finding patterns, trends, or outliers
Line Chart | Display data over time | Time series data like sales, stock prices
Histogram | Show frequency distribution | Understanding how data is spread (e.g., marks)
Examples:
Definition:
A pivot table is a tool in Excel used to summarize, group, and analyze large datasets easily.
Features:
• Drag-and-drop interface.
• Summarizes data using count, average, sum, etc.
• Can display row-wise and column-wise views.
Applications:
1. Sales Summary:
o Total sales by region, product, or month.
2. Employee Data:
o Number of employees in each department.
3. Customer Data:
o Count of orders by customer type.
Benefits:
25. What are pivot charts, and how do they differ from interactive
dashboards?
Pivot Charts:
Interactive Dashboards:
Differences:
Feature | Pivot Chart | Interactive Dashboard
Based on Pivot Table | Yes | May use multiple data sources
Interactivity | Limited (linked to one table) | High (filters, drill-downs, dynamic views)
Tools Used | Excel | Excel (advanced), Power BI, Tableau
Use Case | Simple summaries | Full visual reports with controls
You said:
yes
ChatGPT said:
Here are the answers for Questions 26–30 from Unit 2 (last) and Unit 3 (starting) —
simple and well-explained for 15-mark answers:
Definition:
Data visualization is the process of representing data through charts, graphs, maps, and
dashboards to make information easier to understand.
Importance:
Examples of Visualizations:
27. What are the main advantages of using R for data analysis?
6. Community Support:
Example:
A company can use R to build a regression model to predict next quarter’s sales.
Step 1: Download R
• Open R or RStudio.
• Use command:
R
install.packages("package_name")
• Use:
R
library(package_name)
Example:
R
install.packages("ggplot2")
library(ggplot2)
Note: Packages need internet access during installation but not every time they are used.
29. Explain the process of importing data from spreadsheet files into R.
CSV Example:
R
data <- read.csv("C:/Users/Name/Documents/data.csv")
Excel Example:
R
install.packages("readxl")
library(readxl)
data <- read_excel("C:/Users/Name/Documents/data.xlsx")
Things to Remember:
• Set correct file path.
• Use setwd() to set working directory, if needed.
• Check first few rows with head(data)
Creating Variables:
Examples:
R
x <- 10 # numeric
name <- "Rahul" # character
is_active <- TRUE # logical
marks <- c(90, 85, 70) # vector
Tip:
Use str() to check the structure of any variable.
You said:
yes
ChatGPT said:
Here are the answers for Questions 31–35 from Unit 3: Getting Started with R —
explained clearly for 15-mark answers:
Example in Code:
R
install.packages("dplyr") # installs the package
library(dplyr) # loads it into memory
Summary:
a) Vectors
R
v <- c(1, 2, 3)
b) Matrices
• 2D structure with rows and columns, all elements are of same type.
R
m <- matrix(1:6, nrow=2)
c) Arrays
R
a <- array(1:8, dim = c(2,2,2))
d) Lists
e) Factors
R
f <- factor(c("Male", "Female", "Male"))
f) Data Frames
R
df <- data.frame(Name=c("A", "B"), Score=c(80, 90))
1. To CSV File:
R
write.csv(dataframe_name, "filename.csv", row.names = FALSE)
R
install.packages("writexl")
library(writexl)
write_xlsx(dataframe_name, "filename.xlsx")
3. To Text File:
R
write.table(dataframe_name, "file.txt", sep = "\t", row.names = FALSE)
Example:
R
students <- data.frame(Name=c("Aman", "Ravi"), Marks=c(90, 85))
write.csv(students, "students.csv", row.names = FALSE)
34. Why is R considered suitable for statistical computing and graphics?
Reasons:
Example:
Creating a histogram, fitting a regression model, and exporting the result in one script.
35. Explain the difference between scalar and vector data structures in R.
Key Point:
In R, even a single number (like x <- 10) is treated as a vector of length 1, not a separate
scalar type as in other languages.
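The key point can be checked directly in the R console. A short sketch:

```r
x <- 10            # looks like a scalar
v <- c(1, 2, 3)    # an explicit vector

is.vector(x)       # TRUE: x is a numeric vector
length(x)          # 1
length(v)          # 3

# Arithmetic is vectorised, so the single value is recycled across v.
v * x
```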
You said:
yes
ChatGPT said:
Here are the answers for Questions 36–40 from Unit 3: Getting Started with R — written
clearly and simply for 15-mark answers:
Definition:
A data frame in R is a table-like structure where each column can have different data
types (numeric, character, logical, etc.), and each row represents an observation.
Importance:
1. Real-World Representation:
o Mirrors how data is stored in Excel or databases.
2. Flexible:
o Columns can hold different types (names, scores, dates).
3. Easy to Use:
o Works well with R functions for filtering, summarizing, and plotting.
4. Used in Most Packages:
o Data frames are the default format for modeling and visualization in R.
5. Supports Tidy Data:
o Ideal for organizing and cleaning datasets.
Example:
R
students <- data.frame(Name = c("Aman", "Riya"), Age = c(21, 22), Score = c(85, 90))
Usage:
37. What are factors in R, and how are they used for categorical data
representation?
Definition:
Factors are variables in R used to store categorical data such as labels or groupings.
Uses:
1. Categorical Representation:
o Converts strings like "Male", "Female" into factor levels.
2. Memory Efficient:
o Stores categories as integers with labels, which saves space.
3. Useful in Modeling:
o Many models in R require factors for categorical input.
How to Create:
R
gender <- factor(c("Male", "Female", "Female", "Male"))
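To see what "integers with labels" means in practice, the factor can be inspected with a few base R functions. A small sketch:

```r
gender <- factor(c("Male", "Female", "Female", "Male"))

levels(gender)       # distinct categories, sorted alphabetically by default
as.integer(gender)   # underlying integer codes
table(gender)        # counts per category
```

Because levels are sorted alphabetically, "Female" gets code 1 and "Male" gets code 2 here, regardless of which appears first in the data.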
Ordered Factors:
R
satisfaction <- factor(c("Low", "High", "Medium"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)
Matrix:
R
m <- matrix(1:6, nrow=2, ncol=3)
Array:
R
a <- array(1:8, dim = c(2, 2, 2)) # 3D array
Key Differences:
R
df1 <- data.frame(Name = c("Aman"), Score = c(85))
df2 <- data.frame(Name = c("Riya"), Score = c(90))
combined <- rbind(df1, df2)
R
df1 <- data.frame(Name = c("Aman", "Riya"))
df2 <- data.frame(Score = c(85, 90))
combined <- cbind(df1, df2)
R
df1 <- data.frame(ID = c(1, 2), Name = c("A", "B"))
df2 <- data.frame(ID = c(1, 2), Marks = c(80, 90))
merged <- merge(df1, df2, by = "ID")
Methods to Handle:
R
clean_data <- na.omit(data) # removes rows with NA
R
data$Score[is.na(data$Score)] <- mean(data$Score, na.rm = TRUE)
• Models and analysis functions often throw errors or give wrong results when NA
values are present.
You said:
yes
ChatGPT said:
Here are the answers for Questions 41–45 from Unit 3: Getting Started with R —
explained simply and clearly for 15-mark answers:
1. ggplot2
o For creating beautiful and customizable plots and graphs.
o Example: Bar charts, scatter plots.
2. dplyr
o For data manipulation: filtering, selecting, summarizing.
o Example: filter(), mutate(), group_by() functions.
3. tidyr
o For data cleaning and reshaping.
o Example: pivot_longer(), pivot_wider() (the older gather() and spread() still work but are superseded).
4. readr
o For reading CSV and text files efficiently.
o Example: read_csv().
5. readxl / openxlsx
o For reading and writing Excel files.
o Example: read_excel().
6. caret
o For machine learning and model training.
7. lubridate
o For handling dates and times easily.
8. stringr
o For working with strings and text data.
9. shiny
o For building interactive web apps using R.
10. data.table
o For handling large datasets faster than base R.
Significance:
Proper understanding of R syntax is essential to write clean, efficient, and error-free code.
43. What are the advantages of using R over other programming languages
for statistical analysis?
Example:
R is preferred in academics and research institutions for its accuracy and built-in statistical
models.
Definition:
A list in R is a collection of different types of objects — vectors, data frames, functions, etc.
Creating a List:
R
my_list <- list(name = "Riya", marks = c(90, 85, 80), pass = TRUE)
Manipulations:
1. Access Elements:
o my_list$name or my_list[["name"]]
2. Add Elements:
o my_list$age <- 21
3. Remove Elements:
o my_list$pass <- NULL
4. Update Elements:
o my_list$marks <- c(95, 90, 88)
5. Loop Through List:
R
for (item in my_list) {
  print(item)
}
6. Combine Lists:
o c(list1, list2)
Use Case:
Storing mixed information about students — name (text), scores (numeric), and pass status
(logical).
R
install.packages("ggplot2")
R
library(ggplot2)
To Update a Package:
R
update.packages()
R
install.packages("devtools")
library(devtools)
install_github("author/package")
Tips:
You said:
yes
ChatGPT said:
Here are the answers for Questions 46–50 from Unit 3 ending and Unit 4 beginning —
written simply for 15-mark answers:
Definition:
Exporting data in R means saving the results (data frames, summaries, etc.) to external files
like .csv, .txt, or .xlsx.
Significance:
1. Sharing Results:
o Easily share outputs with others who use Excel or other tools.
2. Record-Keeping:
o Save analysis results for future reference or reporting.
3. Interoperability:
o Use R-generated data in other software (Power BI, Tableau, Python).
4. Automation:
o Automatically export updated reports regularly.
Export Methods:
• To CSV:
R
write.csv(dataframe, "filename.csv", row.names = FALSE)
R
install.packages("writexl")
library(writexl)
write_xlsx(dataframe, "output.xlsx")
Example:
After analyzing sales data in R, a company exports the result to a .csv file and sends it to the
finance team for review.
Histogram Example:
R
hist(data$Marks)
R
barplot(table(data$Department))
Definition:
A box plot (box-and-whisker plot) is a graph that displays the distribution of a dataset using
five summary statistics.
Key Components:
Insights Provided:
Example:
R
boxplot(data$Income)
49. Explain the use of line graphs and scatter plots in visualizing data.
Line Graphs:
R
plot(data$Month, data$Sales, type = "l")
Scatter Plots:
R
plot(data$AdSpend, data$Sales)
Comparison:
Definition:
Measures of central tendency describe the center or average of a dataset. They give a single
value that summarizes the entire dataset.
Main Types:
1. Mean (Average):
o Sum of values ÷ total number.
o Example: Average marks of students.
2. Median:
o Middle value when data is sorted.
o Better for skewed data (e.g., income).
3. Mode:
o Most frequent value.
o Useful for categorical data like product color.
Applications:
Example in R:
R
mean(data$Salary)
median(data$Salary)
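Base R has mean() and median() but no built-in function for the statistical mode (mode() reports an object's storage type instead). One possible workaround is sketched below; stat_mode is a hypothetical helper name and the salary values are invented.

```r
# stat_mode is a hypothetical helper name; base R's mode() reports the
# storage type of an object, not the most frequent value.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]  # returns all ties, as character
}

salary <- c(30000, 35000, 35000, 40000, 90000)
mean(salary)
median(salary)
stat_mode(salary)
```

Here the mean is pulled upward by the one high salary, while the median and mode stay at the typical value, which is why the median is preferred for skewed data such as income.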
You said:
yes
ChatGPT said:
Here are the answers for Questions 51–55 from Unit 4: Descriptive Statistics Using R —
simplified and clear for 15-mark answers:
Definition:
Measures of dispersion show how spread out or varied the data values are around the central
value (mean or median).
Common Measures:
1. Range:
o Difference between maximum and minimum values.
o Simple but affected by outliers.
2. Variance:
o Average squared deviation from the mean.
o Shows overall variability.
3. Standard Deviation (SD):
o Square root of variance.
o Expressed in same units as data, easy to interpret.
4. Interquartile Range (IQR):
o Difference between 3rd quartile (Q3) and 1st quartile (Q1).
o Measures middle 50% spread, less affected by outliers.
Significance:
Example:
In exam scores, a low SD means most students scored close to the average.
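Each of these measures has a one-line base R equivalent. The scores below are made up for illustration. Note that R's var() and sd() use the sample formula (dividing by n - 1), which differs slightly from a plain average of squared deviations.

```r
scores <- c(55, 60, 62, 65, 68, 70, 95)   # hypothetical exam scores

diff(range(scores))   # range: max minus min
var(scores)           # sample variance (divides by n - 1)
sd(scores)            # standard deviation = sqrt(variance)
IQR(scores)           # Q3 - Q1
```

The single high score of 95 stretches the range considerably but has much less effect on the IQR.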
Definition:
Covariance measures how two variables change together.
Role:
Limitations:
Magnitude of covariance depends on variable units, so it’s hard to interpret alone.
Correlation:
• Measures the strength and direction of the linear relationship between two variables.
• Values range from -1 to +1:
o +1 means perfect positive correlation.
o 0 means no linear correlation.
o -1 means perfect negative correlation.
Example:
Height and weight have a positive correlation indicating taller people tend to weigh more.
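The link between covariance and correlation, and the unit problem noted under Limitations, can be seen in a short base R sketch. The height and weight figures are invented.

```r
height <- c(150, 160, 165, 172, 180)   # hypothetical heights in cm
weight <- c(50, 58, 63, 70, 80)        # hypothetical weights in kg

cov_hw <- cov(height, weight)
cor_hw <- cor(height, weight)

# Correlation is covariance rescaled by both standard deviations.
cov_hw / (sd(height) * sd(weight))

# Changing units changes covariance but leaves correlation unchanged.
cov(height / 100, weight)   # height in metres
cor(height / 100, weight)
```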
Definition:
The coefficient of determination (denoted as R²) measures how well a regression model
explains the variability of the dependent variable.
• R² ranges from 0 to 1.
• Closer to 1 means the model explains most of the variation.
Example:
If a model predicts sales with R² = 0.85, it means 85% of sales variation is explained by the
model.
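In R, R² is reported by summary() on a fitted lm model. The sketch below uses made-up advertising and sales figures and also recomputes the value from its definition, 1 - SS_residual / SS_total, as a check.

```r
# Hypothetical advertising and sales figures.
ads   <- c(10, 12, 15, 18, 20, 25)
sales <- c(600, 640, 640, 690, 700, 760)

fit <- lm(sales ~ ads)
r2 <- summary(fit)$r.squared
r2

# The same value from the definition: 1 - SS_residual / SS_total.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sales - mean(sales))^2)
1 - ss_res / ss_tot
```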
Common Visualizations:
1. Histograms:
o Show distribution of a single variable.
o hist(data$variable)
2. Box Plots:
o Visualize spread, median, and outliers.
o boxplot(data$variable)
3. Bar Charts:
o Show frequencies for categories.
o barplot(table(data$category))
4. Scatter Plots:
o Show relationships between two variables.
o plot(data$x, data$y)
5. Summary Tables:
o Use summary(data) for central tendency and dispersion.
Using Packages:
• ggplot2:
For advanced, customizable plots.
Example:
R
ggplot(data, aes(x=variable)) + geom_histogram()
You said:
yes
ChatGPT said:
Here are the answers for Questions 56–60 from Unit 4: Descriptive Statistics Using R —
simple and clear for 15-mark answers:
56. Compare the use of mean, median, and mode as measures of central
tendency.
Example:
57. What is the interquartile range, and how is it represented visually using
box plots?
IQR = Q3 – Q1
Visual Representation:
Purpose:
IQR helps identify spread and outliers.
Standard Deviation (SD): Measures how far data points typically lie from the mean (it is the square root of the variance).
Calculate in R:
R
sd(data$variable)
Steps Internally:
Interpretation:
Note:
Magnitude depends on units, so use correlation for standardized interpretation.
60. Discuss the steps to create a scatter plot in R for visualizing relationships
between variables.
Steps:
1. Prepare Data:
Have two numeric variables.
2. Basic Scatter Plot:
Use plot() function.
R
plot(data$x, data$y, main = "Scatter Plot", xlab = "X Variable", ylab = "Y Variable")
3. Add a Regression Line:
R
abline(lm(y ~ x, data=data), col="red")
4. Using ggplot2:
R
library(ggplot2)
ggplot(data, aes(x=x, y=y)) + geom_point() + geom_smooth(method="lm")
Purpose:
Scatter plots help identify trends, clusters, or outliers between two variables.
You said:
yes
ChatGPT said:
Here are the answers for Questions 61–65 from Unit 4: Descriptive Statistics Using R —
simplified and clear for 15-mark answers:
61. Describe the significance of correlation coefficient in understanding
variable dependencies.
Definition:
The correlation coefficient (r) measures the strength and direction of a linear relationship
between two variables, ranging from -1 to +1.
Significance:
Why Important:
• The coefficient of determination, R², shows how well a model explains the variance in
the dependent variable.
• An R² close to 1 means the model fits data well.
• Used to assess model accuracy and compare models.
R
barplot(height = table(data$category))
Customization:
• Add colors:
R
barplot(table(data$category), col = c("red", "blue", "green"))
• Add labels:
R
barplot(table(data$category), main = "Category Counts", xlab =
"Categories", ylab = "Frequency")
• Horizontal bars:
R
barplot(table(data$category), horiz = TRUE)
64. What are the steps involved in creating histograms in R for frequency
distribution analysis?
R
hist(data$variable, breaks = 10, col = "blue", main = "Histogram",
xlab = "Values")
65. Discuss the advantages of using R for visualizing descriptive statistics over
traditional methods.
Would you like me to continue with Questions 66–70 from Unit 5 next?
You said:
yes
ChatGPT said:
Here are the answers for Questions 66–70 from Unit 5: Predictive and Textual Analytics
— simple and clear for 15-mark answers:
66. What is a simple linear regression model? Illustrate with an example.
Definition:
A simple linear regression model predicts the value of one variable (dependent) based on
another variable (independent) by fitting a straight line.
Formula:
$Y = \beta_0 + \beta_1 X + \epsilon$
Example:
Predicting sales based on advertising spend:
Sales = 500 + 10 × Advertising Spend
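A model like this can be fitted in R with lm(). The sketch below simulates data from the equation above (intercept 500, slope 10) plus some random noise. The ad_spend values and the noise level are assumptions made for illustration.

```r
# Simulated data: sales generated as 500 + 10 * ad_spend with a little
# random noise added (noise level chosen arbitrarily for this sketch).
set.seed(1)
ad_spend <- c(5, 10, 15, 20, 25, 30, 35, 40)
sales <- 500 + 10 * ad_spend + rnorm(length(ad_spend), sd = 5)

fit <- lm(sales ~ ad_spend)
coef(fit)   # estimated intercept (beta_0) and slope (beta_1)

# Predict sales for a new advertising spend of 50.
predicted <- predict(fit, newdata = data.frame(ad_spend = 50))
predicted
```

The estimated coefficients should land close to 500 and 10, and the prediction close to 1000, though not exactly because of the added noise.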
Difference:
Definition:
Multiple linear regression models the relationship between one dependent variable and two or
more independent variables.
Formula:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$
Applications:
• Heteroscedasticity:
When the variance of errors is not constant across observations, violating regression
assumptions, leading to inefficient estimates.
• Multi-collinearity:
Occurs when independent variables are highly correlated, making it difficult to
estimate individual effects.
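One common way to check for multi-collinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The base R sketch below uses simulated data. The helper name vif_one and all variables are hypothetical, and a VIF above roughly 5 to 10 is only a rule of thumb.

```r
set.seed(42)
n <- 100
price <- rnorm(n, mean = 50, sd = 5)
advertising <- rnorm(n, mean = 20, sd = 4)
# promo is built to be almost a copy of advertising (deliberate collinearity).
promo <- advertising + rnorm(n, sd = 0.5)

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2).
# vif_one is a hypothetical helper written for this sketch.
vif_one <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_price <- vif_one(price, data.frame(advertising, promo))
vif_promo <- vif_one(promo, data.frame(price, advertising))
c(price = vif_price, promo = vif_promo)
```

The VIF for price stays near 1, while the VIF for promo is very large because it is nearly determined by advertising.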
Effects:
You said:
yes
ChatGPT said:
Here are the answers for Questions 71–75 from Unit 5: Predictive and Textual Analytics
— simplified and clear for 15-mark answers:
Importance:
• Huge amounts of data from social media, reviews, emails, etc., are text-based.
• Helps businesses understand customer opinions and market trends.
• Enables automation of tasks like classification and summarization.
• Supports decision-making by revealing insights hidden in text.
• R provides packages like tm, textclean, tidytext, and quanteda for processing and
analyzing text.
• Functions include tokenization, removing stop words, stemming, and building term-
document matrices.
• Supports sentiment analysis and topic modeling.
• Enables visualization of word clouds and frequency distributions.
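To show what tokenization, stop-word removal, and term counting actually do, here is a sketch in base R only. It does not use tm or tidytext, which handle these steps more thoroughly. The reviews and the stop-word list are made up.

```r
# Two made-up customer reviews.
reviews <- c("The delivery was fast and the product is great",
             "Great price, but the delivery was slow")

# Tokenization: lowercase, strip punctuation, split on whitespace.
clean <- gsub("[[:punct:]]", "", tolower(reviews))
tokens <- unlist(strsplit(clean, "\\s+"))

# Stop-word removal with a tiny hand-picked list (real lists are much longer).
stop_words <- c("the", "was", "and", "is", "but")
tokens <- tokens[!tokens %in% stop_words]

# Term frequencies, most common first.
term_freq <- sort(table(tokens), decreasing = TRUE)
term_freq
```

A table like term_freq is the kind of input a word cloud or frequency bar chart is drawn from.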
Text Mining is the process of extracting useful patterns and knowledge from text data using
computational techniques.
Role in Analytics:
You said:
yes
ChatGPT said:
Here are the answers for Questions 76–80 from Unit 5: Predictive and Textual Analytics
— clear and simple for 15-mark answers:
Definition:
Sentiment analysis is the process of determining whether text expresses positive, negative, or
neutral feelings.
Applications in Business:
77. What are the assumptions underlying simple linear regression models?
• Identify the impact of multiple factors on an outcome (e.g., sales influenced by price,
advertising, and season).
• Predict future values based on multiple inputs.
• Optimize resources by understanding which variables have the strongest effect.
• Support budgeting, marketing, and operational planning with data-driven insights.
• Heteroscedasticity occurs when the variability of errors changes across values of the
independent variable.
• It violates the constant-variance assumption: coefficient estimates stay unbiased, but standard errors and significance tests become unreliable.
• Indicates that the model's predictions may be more accurate for some data ranges than
others.
• Requires corrective measures like transformation or robust standard errors.
You said:
yes
ChatGPT said:
Here are the answers for Questions 81–85 from Unit 5: Predictive and Textual Analytics
— simple and clear for 15-mark answers:
Steps:
1. Text Preprocessing:
o Tokenization, removing stopwords, stemming.
2. Feature Extraction:
o Bag of Words, TF-IDF, or word embeddings.
3. Classification:
o Use machine learning models or lexicon-based methods to classify sentiment.
4. Evaluation:
o Measure accuracy, precision, recall of the sentiment model.
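The four steps can be sketched end to end with a lexicon-based classifier in base R. Everything here is an assumption for illustration: the lexicon is a tiny invented word list, score_sentiment is a hypothetical helper, and the "actual" labels are assigned by hand.

```r
# Tiny hypothetical lexicon; real lexicons contain thousands of scored words.
lexicon <- c(great = 1, fast = 1, good = 1, slow = -1, bad = -1, broken = -1)

# score_sentiment is a hypothetical helper written for this sketch.
score_sentiment <- function(text) {
  # Preprocessing: lowercase, strip punctuation, tokenize on whitespace.
  words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+"))
  # Classification: sum the lexicon scores; unknown words contribute nothing.
  total <- sum(lexicon[words], na.rm = TRUE)
  if (total > 0) "positive" else if (total < 0) "negative" else "neutral"
}

reviews <- c("Great product and fast delivery",
             "The screen arrived broken, bad packaging",
             "It is a phone")
predicted <- vapply(reviews, score_sentiment, character(1), USE.NAMES = FALSE)

# Evaluation against hand-assigned labels.
actual <- c("positive", "negative", "neutral")
accuracy <- mean(predicted == actual)
accuracy
```

A lexicon approach is easy to explain but misses negation and sarcasm ("not good"), which is where machine learning models trained on labelled text usually do better.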
Significance:
85. What are the common techniques for categorizing textual data?
You said:
yes
ChatGPT said:
Here are the answers for Questions 86–90 from Unit 5: Predictive and Textual Analytics
— simplified and clear for 15-mark answers:
• Word Clouds: Show frequently used words with size representing frequency.
• Bar Charts: Display counts of top terms or categories.
• Heatmaps: Visualize relationships or frequencies between words.
• Network Graphs: Show connections between keywords or topics.
• Sentiment Graphs: Display sentiment trends over time.
• Topic Models Visualization: Show distribution of topics across documents.
Tools:
R packages like wordcloud, ggplot2, and igraph help create these visuals.
87. Discuss the challenges of analyzing unstructured textual data.
90. Discuss how predictive analytics can be integrated with textual analytics
for actionable insights.
• Use textual analytics to extract features from unstructured data like reviews or social
media posts.
• Combine these features with numerical data in predictive models to improve
accuracy.
• Examples:
o Predict customer churn by analyzing complaint texts along with usage data.
o Forecast sales by incorporating social media sentiment.
• This integration helps businesses make better decisions by considering both
quantitative and qualitative data.
That completes all 90 questions! If you want, I can help you with summaries, revision notes,
or anything else for your Business Analytics exam. Just let me know!
You said:
Awesome! Let’s make your exam prep smooth and effective. Here’s what I suggest for your
Business Analytics exam revision:
1. Summary Notes
I’ll prepare concise summary notes for each Unit so you can quickly revise key concepts.
4. Study Plan
You said:
Perfect! I’ll prepare concise summary notes for each unit covering key concepts, important
points, and easy-to-remember explanations. Here’s the plan:
• Data Science: Extracting insights from data using tools and algorithms.
• Data Analytics vs Data Analysis: Analytics is the broader discipline, covering the tools and techniques; analysis is the detailed examination of a specific dataset.
• Types of Analytics:
o Descriptive (What happened?)
o Diagnostic (Why?)
o Predictive (What will happen?)
o Prescriptive (What to do?)
• Big Data Characteristics: Volume, Velocity, Variety, Veracity, Value.
• Applications: Marketing, finance, healthcare, manufacturing, customer experience.
• Challenges: Data quality, integration, privacy, skill shortage.
• Ethical Considerations: Privacy, bias, transparency, consent.