2025 - Course Kit & Lesson Plan - Business Analytics For Decision Making
COURSE KIT
Trimester – III
2024-2026 Batch
COURSE INSTRUCTOR
Dr Shaheen
Associate Professor
CONTENTS
Evaluation Criteria
Continuous Evaluation 1 – Online Course (NPTEL)
Continuous Evaluation 2 – Project Work
Continuous Evaluation 3 – Case Analysis
Rubric Matrix for Continuous Assessments
Glossary of Terms
Question Bank
Journals
Analytics Magazine.
Data Science and Business Analytics.
Harvard Business Review (HBR).
Information Systems Research.
Journal of Business Analytics.
Journal of Business Intelligence Research.
MIT Sloan Management Review.
Value Addition Tools
Excel: Widely used for basic data analysis, statistics, and visualization.
Python (with libraries such as Pandas, NumPy, and SciPy): For advanced data analysis, manipulation, and efficient handling of large datasets.
Plotly: A Python-based library for creating interactive graphs and visualizations, especially useful for web-based data visualizations.
R: Useful for statistical analysis and graphical representations.
ggplot2 (R): A data visualization package in R for creating elegant and complex plots based on the "Grammar of Graphics."
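As a quick, hedged illustration of how two of the tools listed above (Pandas and Plotly) might be used together, the sketch below summarizes and charts a small made-up sales table; the column names and figures are purely illustrative, not part of the course material.

import pandas as pd
import plotly.express as px  # Plotly's high-level plotting interface

# Illustrative data only; in practice this could come from an Excel/CSV export.
sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "revenue": [120, 95, 140, 110],
})

# Basic descriptive statistics (the kind of summary often done in Excel).
print(sales["revenue"].describe())

# Interactive bar chart rendered in the browser.
fig = px.bar(sales, x="region", y="revenue", title="Revenue by Region")
fig.show()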
SESSION PLAN (24 Sessions)
Session No. | Unit | Topic of the Session | Teaching Pedagogy | CO Mapping | Mode of Continuous Evaluation Assessment
1 | I | Introduction to Business Analytics: Analytics Landscape, Framework for Data-Driven Decision Making | Lecture + Case Study | CO1, CO2 | Class Participation
2 | I | Roadmap for Analytics Capability Building | Lecture + Group Discussion | CO2, CO4 | Class Participation + Reflection Paper
3 | I | Challenges in Data-Driven Decision Making and Future Trends | Lecture | CO1, CO3 | In-Class Activity
4 | I | Foundations of Data Science – Data Types & Scales of Variable Measurement | Lecture + Hands-on Session | CO1, CO2 | Class Participation + Quiz
5 | I | Feature Engineering & Functional Applications of Business Analytics | Lecture + Practical Case Studies | CO2, CO3 | Project Work Progress Review
6 | I | Widely Used Analytical Tools in Business Analytics (Excel, R, Tableau) | Demonstration + Tool Walkthrough | CO1, CO4 | Hands-on Project
7 | I | Ethics in Business Analytics | Lecture + Debate on Ethical Issues | CO3, CO4 | Class Participation + Discussion Notes
8 | I | Review of Unit I: Consolidating Key Concepts | Recap + Q&A | CO1, CO2, CO3 | Review Test
9 | II | Introduction to Python and Jupyter Notebooks | Lecture + Python Setup Walkthrough | CO1, CO2 | Quiz + Hands-on Exercise
10 | II | Basic Programming Concepts and Syntax in Python | Lecture + Coding Demo | CO2, CO3 | Code Submission
11 | II | Core Libraries in Python (Pandas, NumPy) | Lecture + Code Exercises | CO2, CO3 | Code Submission + Peer Review
12 | II | Map and Filter Functions in Python | Hands-on Session | CO3, CO4 | Code Implementation
13 | II | Data Collection and Wrangling | Practical Exercise + Discussion | CO2, CO3 | Project Progress Review
14 | II | Data Visualization Techniques in Python (Matplotlib, Seaborn) | Demonstration + Hands-on Session | CO3, CO4 | Visualization Assignment
15 | II | Feature Engineering & Selection in Python | Case Study + Coding Exercise | CO3, CO4 | Code Submission
16 | II | Feature Scaling, Feature Selection Techniques | Practical Example + Group Discussion | CO3, CO4 | Class Activity + Assignment
17 | III | Building Models in Python for Business Analytics | Lecture + Hands-on Implementation | CO3, CO5 | Model Building Assignment
18 | III | Model Evaluation and Tuning (Hyper-parameter Tuning) | Lecture + Practical Example | CO5, CO6 | Quiz + Hands-on Exercise
19 | III | Model Interpretation and Deployment | Lecture + Hands-on Session | CO5, CO6 | Deployment Plan Presentation
20 | III | Exploratory Data Analysis (EDA) and Diagnostic Analysis | Workshop + Group Work | CO3, CO4 | Group Discussion + EDA Report
21 | III | Steps in Building a Regression Model | Lecture + Example Walkthrough | CO4, CO5 | Regression Model Assignment
22 | III | Building Simple and Multiple Regression Models | Practical Session + Discussion | CO5, CO6 | Regression Model Code Submission
23 | III | Binary Logistic Regression, ROC and AUC | Lecture + Case Study | CO5, CO6 | Logistic Regression Assignment
24 | III | Regularization Techniques (L1, L2) | Practical Exercise + Discussion | CO5, CO6 | Regularization Implementation
25 | III | Review and Wrap-up of Unit III | Recap + Q&A | CO5, CO6 | Final Project Review & Feedback
CO-PO-PSO SET MAPPINGS
Programme Outcomes
PO1 Graduates would exhibit clarity of thought in expressing their views.
PO2 Graduates will have the ability to communicate effectively across diverse channels.
PO3 Graduates will be able to flesh out key decision points when confronted with a business problem.
PO4 Graduates will have the capacity to formulate strategies in the functional areas of management.
PO5 Graduates would be able to analyze the health of an organization by perusing its MIS reports/financial statements.
PO6 Graduates would be able to analyze the health of an organization by perusing its MIS reports/financial statements.
PO7 Graduates would demonstrate a hunger for challenging assignments.
PO8 Graduates would display an empathetic attitude to alleviate societal problems.
Evaluation Criteria
The students will be evaluated based on Continuous Evaluation (CE) components (CE1, CE2, & CE3), Mid-term, and End-term examinations. Continuous Evaluation components include active participation in class, presentations, assignments, case study discussions, etc.
As this is a 3-credit course, 15 marks are allocated to Continuous Evaluation. The student's performance will be evaluated based on timely completion of the continuous evaluation components. The marks are allocated as follows:
1. Continuous Evaluation: 15 Marks
2. Mid-Examination: 15 Marks
3. End-Examination: 30 Marks
Total: 60 Marks
Rubric Matrix for Continuous Assessments
Criteria and weightage (out of 5) for each assessment:
CE 1: Online Tests/Quizzes
Correctness of Answers (Weightage 2): Correctness of answers for questions (multiple choice, short answer, etc.).
Conceptual Understanding (Weightage 1): Demonstrates understanding of key concepts tested throughout the course.
Timeliness of Submission (Weightage 1): Submission of assignments within the deadlines.
Clarity and Precision of Responses (Weightage 1): Responses are clear, concise, and directly address the question.
CE 2: Project Work
Research and Analysis (Weightage 2): Depth and quality of research and analysis conducted in the project.
Practical Application (Weightage 1): Effective application of theoretical knowledge to real-world scenarios.
Presentation and Structure (Weightage 1): Clear, logical organization of project report and presentation.
Creativity and Innovation (Weightage 1): Originality and innovative approach in presenting ideas.
CE 3: Case Analysis
Identification of Key Issues (Weightage 2): Ability to identify and articulate the core issues of the case effectively.
Critical Thinking and Solutions (Weightage 2): Depth of analysis and quality of proposed solutions to the issues identified.
Clarity of Presentation (Weightage 1): Logical structure, clarity, and organization in presenting the analysis.
The detailed rubric below provides a framework for assessing student performance in the three continuous assessments.
CE 1: Online Tests/Quizzes

Knowledge & Comprehension
Excellent (5): Demonstrates a thorough understanding of course concepts with high accuracy. Answers all questions correctly and comprehensively.
Very Good (4): Demonstrates good understanding of course concepts with minor errors. Answers most questions correctly and with some depth.
Good (3): Demonstrates basic understanding of course concepts with some inaccuracies. Answers most questions correctly but lacks depth.
Fair (2): Limited understanding with several inaccuracies; answers some questions incorrectly.
Needs Improvement (1): Lacks understanding; answers many questions incorrectly or not at all.

Application & Analysis
Excellent (5): Applies concepts effectively to solve problems, demonstrating critical thinking and problem-solving skills.
Very Good (4): Applies concepts with some accuracy and basic critical thinking and problem-solving skills.
Good (3): Applies concepts with limited accuracy; shows difficulty applying them to solve problems.
Fair (2): Unable to apply concepts effectively; lacks understanding of how to use them.
Needs Improvement (1): Lacks application and critical thinking skills.

CE 2: Project Work

Research & Analysis
Excellent (5): Thorough and in-depth research, with insightful conclusions based on strong analysis.
Very Good (4): Good research and analysis with minor limitations; relevant conclusions drawn.
Good (3): Basic research and analysis; limited conclusions; some data interpretation limitations.
Fair (2): Limited research and analysis; inaccurate or irrelevant conclusions drawn.
Needs Improvement (1): No research or analysis; fails to draw meaningful conclusions.

Project Execution
Excellent (5): Project is well-organized, executed efficiently, and delivered on time. Demonstrates excellent project management skills.
Very Good (4): Project is well-organized and executed effectively with minor delays. Demonstrates good project management skills.
Good (3): Project is adequately organized and executed with some delays. Demonstrates basic project management skills.
Fair (2): Project is poorly organized and executed with significant delays. Demonstrates poor project management skills.
Needs Improvement (1): Project is poorly organized and executed, with significant delays and missed deadlines.

Presentation & Communication
Excellent (5): Project report and presentation are well-written, clear, concise, and engaging. Demonstrates excellent communication and presentation skills.
Very Good (4): Project report and presentation are well-written and clear. Demonstrates good communication and presentation skills.
Good (3): Project report and presentation are somewhat well-written with minor errors. Demonstrates adequate communication skills.
Fair (2): Project report and presentation are poorly written and difficult to understand. Demonstrates poor communication skills.
Needs Improvement (1): Project report and presentation are poorly written and difficult to understand. Demonstrates very poor communication skills.

CE 3: Case Analysis

Case Understanding
Excellent (5): Demonstrates a deep understanding of the case study, identifying key issues and challenges accurately.
Very Good (4): Demonstrates good understanding of the case study, identifying key issues with some minor inaccuracies.
Good (3): Demonstrates basic understanding of the case study, missing some key issues.
Fair (2): Limited understanding of the case study, with significant inaccuracies in identifying key issues.
Needs Improvement (1): Lacks understanding of the case study and fails to identify key issues and challenges.

Problem Identification & Analysis
Excellent (5): Accurately identifies key problems and challenges in the case study. Provides insightful analysis and supports conclusions with strong evidence.
Very Good (4): Identifies key problems and challenges in the case study with some minor inaccuracies. Provides good analysis with some supporting evidence.
Good (3): Identifies some key problems but may miss important ones. Provides basic analysis with limited supporting evidence.
Fair (2): Fails to accurately identify key problems and challenges. Provides limited or no analysis.
Needs Improvement (1): Unable to identify key problems or challenges in the case study.

Solution Development & Recommendations
Excellent (5): Develops creative and insightful solutions to the identified problems. Recommendations are well-supported, realistic, and actionable.
Very Good (4): Develops feasible solutions to the identified problems. Recommendations are supported with reasonable evidence.
Good (3): Develops basic solutions to the identified problems with some limitations. Recommendations may lack clarity or depth.
Fair (2): Develops limited or unrealistic solutions to the identified problems. Recommendations are not well-supported.
Needs Improvement (1): Fails to develop any meaningful solutions or recommendations to the identified problems.
QUESTION BANK (Unit-wise)
QNo. Question BTL Level
Unit I
1 Define 'Feature Engineering' and give one example. BTL 1
2 Explain the term 'Ethics in Business Analytics'. BTL 1
3 List the different types of data scales used in business analytics. BTL 1
4 Name three widely used analytical tools in business analytics. BTL 1
5 What are the challenges faced in data-driven decision making? BTL 1
6 What are the fundamental types of data used in analytics? BTL 1
7 What are the key components of the analytics landscape? BTL 1
8 What is the primary goal of building analytics capabilities in an organization? BTL 1
Describe how business analytics can be applied to solve real-world management
9 BTL 2
problems. Provide an example.
Discuss the role of widely used analytical tools in supporting decision-making in
10 BTL 2
business.
11 Explain the challenges in building an analytics capability within an organization. BTL 2
Explain the framework for data-driven decision making and its significance in
12 BTL 2
business analytics.
How can business analytics be ethically applied in decision-making? Discuss
13 BTL 2
potential ethical issues that may arise.
How does feature engineering enhance the predictive power of a data model? Provide
14 BTL 2
an example.
Interpret the concept of 'Roadmap for Analytics Capability Building' and how it
15 BTL 2
impacts an organization’s ability to leverage data analytics.
What is the relationship between data types and scales of measurement? How do they
16 BTL 2
influence data analysis?
Apply the ethics in business analytics by analysing a case where data could be
17 BTL 3
misused for business advantage. How would you propose a solution?
18 Demonstrate how feature engineering optimizes machine learning outcomes. BTL 3
Describe a data collection framework incorporating various data types and scales for a
19 BTL 3
specific business problem.
Given a dataset with customer purchase behaviour, apply the appropriate feature
20 BTL 3
engineering techniques to improve its quality for predictive modelling.
Given a management decision problem (e.g., marketing strategy, operational
21 efficiency), choose the appropriate data types and scales to collect, analyse, and BTL 3
present findings to support decision-making.
How can you apply business analytics to improve decision-making in human resource
22 BTL 3
management?
How can you apply ethical principles to ensure unbiased outcomes in predictive
23 BTL 3
analytics?
How can you apply the knowledge of variable measurement scales in designing
24 BTL 3
surveys for market research?
How would you apply feature engineering techniques to improve the performance of
25 BTL 3
a predictive model?
26 How would you apply tools like Python to conduct predictive analytics in finance? BTL 3
In a business scenario where data-driven decisions could impact stakeholders,
27 describe how ethical considerations should guide the analytics process and its BTL 3
outcomes.
Using a set of business metrics, identify the challenges a company might face in using
28 BTL 3
data-driven decision making to improve customer satisfaction.
Using one widely used analytical tool, propose how you would analyse sales data to
29 BTL 3
forecast demand for the next quarter in a manufacturing firm.
You are tasked with improving decision-making in a retail company. Use the
30 framework for data-driven decision making to propose a roadmap for integrating BTL 3
analytics into business processes.
You have been given the task of introducing business analytics into a traditional
31 organization. Apply your knowledge of analytics capabilities and outline steps to BTL 3
create an analytics roadmap for the organization.
Analyze a scenario where a company is struggling to integrate analytics into its
32 decision-making processes. What data types and scales would be most relevant to BTL 4
their analytics efforts?
Analyze the differences between commonly used analytical tools in terms of
33 BTL 4
functionality and ease of use.
Analyze the differences between nominal, ordinal, interval, and ratio scales in the
34 BTL 4
context of business applications.
Analyze the ethical dilemmas businesses face when using customer data for analytics
35 BTL 4
purposes.
36 Analyze the impact of business analytics on marketing strategies and outcomes. BTL 4
Analyze the impact of data quality issues on the effectiveness of data-driven decision
37 BTL 4
making in a company. How would you address these issues?
Analyze the role of feature engineering in handling imbalanced datasets in customer
38 BTL 4
segmentation.
Compare and contrast the challenges faced by companies in building analytics
39 BTL 4
capability in different industries (e.g., manufacturing vs. retail).
Examine the ethical implications of using customer data for predictive analytics in a
40 BTL 4
retail company. How would you assess the risks of misuse?
Given a case study where a company has started implementing business analytics,
41 BTL 4
analyse the strengths and weaknesses in their data-driven decision-making process.
Given a dataset with multiple variables, analyse how feature engineering can be
42 BTL 4
applied to improve the predictive model's accuracy.
In a case where an organization has adopted an analytical tool but faces resistance,
43 BTL 4
analyse the reasons for the resistance and suggest steps to overcome it.
Assess the ethical considerations in applying machine learning algorithms to
44 BTL 5
customer data. What guidelines would you propose to ensure responsible use of data?
Assess the impact of inaccurate or biased data on business analytics outcomes, and
45 BTL 5
provide recommendations for mitigating these issues in decision-making processes.
Based on an analysis of a company’s data analytics maturity, evaluate the next steps
46 BTL 5
they should take to further enhance their data-driven decision-making capabilities.
Critically evaluate the choice of analytical tools in a project. How do you decide
47 BTL 5
which tool is the best fit for a specific business problem?
Evaluate the effectiveness of an analytics capability-building roadmap for a global
48 BTL 5
company versus a local startup. Which would be more challenging and why?
Evaluate the role of feature engineering in a business analytics project. What criteria
49 would you use to determine whether feature engineering has added value to the BTL 5
model?
50 Given a management scenario where data-driven decisions are crucial, evaluate BTL 5
whether the company’s approach to data ethics aligns with industry standards. What
improvements would you suggest?
Create a business analytics framework that a new company can adopt to guide data-
51 BTL 6
driven decisions in a competitive market. What key components would you include?
Create a detailed proposal for integrating widely used analytical tools into the
52 decision-making processes of an organization. How would you ensure these tools BTL 6
align with the company’s goals and strategies?
Design a comprehensive analytics roadmap for a company looking to move from
53 traditional decision-making methods to a fully data-driven approach. What steps BTL 6
should be included in the roadmap?
Design an analytics-based solution for supply chain optimization in a manufacturing
54 BTL 6
company.
Design an ethical guidelines framework for the use of business analytics in customer
55 segmentation. How would you ensure the framework is adhered to in the company’s BTL 6
data practices?
Develop a business case for the use of business analytics in improving operational
56 efficiency in a supply chain. What key performance indicators would you track, and BTL 6
how would you measure success?
Develop an innovative solution using business analytics to improve customer
57 retention for a retail company. How would you collect, analyze, and act on the BTL 6
relevant data?
Propose a framework for ethical governance in business analytics to ensure
58 BTL 6
compliance and trust.
Propose a new feature engineering strategy for improving the predictive accuracy of a
59 BTL 6
model that forecasts product demand. Justify your choice of methods and techniques.
Unit II
60 Define the concept of feature scaling and why it is important in data analysis. BTL 1
61 Explain the basic data types in R. BTL 1
List three core libraries in Python that are commonly used for data analysis and
62 BTL 1
manipulation.
63 Name two types of data that can be handled during feature engineering. BTL 1
64 What are the basic components of a Python program? BTL 1
65 What are the different data types in Python used for data processing? BTL 1
66 What are the main assumptions of logistic regression? BTL 1
67 What is data wrangling, and why is it a crucial step in the data processing pipeline? BTL 1
68 What is Jupyter Notebook, and why is it commonly used in Python for data analysis? BTL 1
69 What is the difference between map and filter functions in Python? BTL 1
70 What is the primary function of the pandas library in Python? BTL 1
71 What is the purpose of feature selection in machine learning? BTL 1
72 Describe the importance of power analysis in experimental design. BTL 2
Describe the process of feature extraction and engineering. Why is it essential to
73 BTL 2
perform these tasks when working with machine learning models?
Describe the purpose of feature scaling and how it affects the performance of machine
74 BTL 2
learning algorithms like k-NN and linear regression.
Discuss how Python libraries like pandas and numpy contribute to data wrangling.
75 BTL 2
Provide an example of a common data wrangling task.
76 Discuss the key functions used to create a dataset in R. BTL 2
Discuss the role of data collection in the context of business analytics. How does data
77 BTL 2
collection influence the quality of the insights derived from the data?
Explain how Python's filter function differs from map, and provide an example where
78 BTL 2
filter would be more appropriate in a data processing task.
Explain the concept of data visualization and its importance in the data analysis
79 BTL 2
process. Provide an example of a visualization you might use for a sales dataset.
80 Explain the difference between data.frame() and tibble() in R. BTL 2
Explain the difference between feature engineering for numeric data, categorical data,
81 BTL 2
and text data.
82 Explain the difference between simple, multiple, and polynomial regression models. BTL 2
83 Explain the impact of missing values on data analysis results. BTL 2
Explain the role of Jupyter Notebooks in Python-based data analysis projects and its
84 BTL 2
advantages over traditional coding environments.
85 Explain the significance of Python in data analysis and visualization. BTL 2
How does feature selection help improve the performance of a model? Describe a
86 BTL 2
scenario where feature selection would be particularly useful.
How does the map function in Python work? Provide an example where it would be
87 BTL 2
useful in a data analysis pipeline.
88 What are the key objectives of exploratory data analysis? BTL 2
Apply feature selection techniques to a dataset with multiple features. Justify which
89 features should be selected and which should be discarded based on their relevance to BTL 3
the model.
Demonstrate how to use Python’s filter function to extract records from a dataset
90 BTL 3
where the sales value exceeds a specific threshold.
Given a dataset with customer purchase data, perform feature extraction and
91 engineering on both categorical and numeric features to prepare the data for BTL 3
modelling. Explain your approach.
Given a dataset with numeric and categorical features, apply feature scaling
92 BTL 3
techniques (e.g., standardization or normalization) to pre-process the data.
Given a dataset with sales and customer information, perform data wrangling steps
93 such as handling missing data, removing duplicates, and converting data types using BTL 3
Python.
94 How would you create a dataset in Python for analysing sales data? BTL 3
95 How would you create a scatter plot in Python with customized labels and colours? BTL 3
96 How would you create summary statistics and visualizations for EDA in Python? BTL 3
97 How would you fit and validate a multiple regression model in Python using lm()? BTL 3
98 How would you handle missing values using imputation techniques in Python? BTL 3
99 How would you install and load a package in Python to begin your analysis? BTL 3
100 How would you use Python to diagnose and validate a logistic regression model? BTL 3
101 How would you use Python to identify and remove duplicate records in a dataset? BTL 3
In a dataset with multiple continuous variables, apply feature scaling (e.g., Min-Max
102 Scaling) and describe how the scaling process influences the model's ability to learn BTL 3
patterns.
Use Python’s matplotlib or seaborn library to visualize the relationship between
103 product sales and advertising spend in a dataset. Provide insights based on the BTL 3
visualization.
Using Jupyter Notebooks, demonstrate how you would clean a dataset with missing
104 BTL 3
values and outliers, and provide a summary of the cleaned data.
Using Python, implement feature engineering for a text-based dataset (e.g., customer
105 BTL 3
reviews) by extracting useful features such as word count and sentiment.
106 What are the key steps in cleaning raw data before analysis? BTL 3
What are the steps to calculate measures of central tendency (mean, median, mode) in
107 BTL 3
Python?
Write a Python function using map to convert a list of strings (e.g., product names) to
108 BTL 3
uppercase.
Analyze the effectiveness of advanced graphs like boxplots and heatmaps in
109 BTL 4
identifying trends.
110 Analyze the effectiveness of logistic regression in predicting binary outcomes. BTL 4
111 Analyze the effects of multicollinearity in a regression model and propose methods to BTL 4
address it.
112 Analyze the impact of outliers on regression analysis. BTL 4
Analyze the impact of using different feature scaling techniques (e.g., normalization
113 vs. standardization) on a machine learning model’s performance. How would the BTL 4
choice of scaling method influence the results?
Analyze the role of feature selection in improving the performance of a machine
114 learning model. Given a dataset, how would you determine which features are BTL 4
irrelevant or redundant?
115 Compare and contrast the use of variance and standard deviation in analyzing data. BTL 4
Examine the process of feature engineering for text data. How would you extract
116 useful features from raw text (e.g., customer reviews) to enhance a machine learning BTL 4
model’s ability to predict customer satisfaction?
Given a dataset with different types of features (e.g., numeric, categorical, text),
117 analyze how you would combine these features to improve model accuracy. What BTL 4
techniques would you use to pre-process these features for model training?
Given a dataset with multiple missing values and categorical data, analyze the steps
118 you would take to clean and prepare the data for analysis using Python. How would BTL 4
you handle missing data for categorical vs. numerical variables?
Identify and describe the key differences between basic Python plotting functions and
119 BTL 4
ggplot2.
You are tasked with performing data wrangling on a large dataset with inconsistent
formatting, missing values, and outliers. Analyze how you would approach each of
120 BTL 4
these issues using Python and provide specific examples of functions or libraries you
would use.
After performing feature selection on a dataset, evaluate how the reduced feature set
121 affects model performance. How would you judge whether the feature selection has BTL 5
improved or degraded the model's predictive power?
Based on your analysis of a data wrangling process, evaluate the potential drawbacks
and risks of handling missing values by imputation versus removing rows with
122 BTL 5
missing data. Which method would be more suitable for a dataset with a high
proportion of missing values?
Critically evaluate the use of the map and filter functions in Python for data pre-
123 processing. What are the advantages and limitations of these functions when handling BTL 5
large datasets?
Evaluate the effectiveness of various feature engineering methods for categorical data
124 (e.g., one-hot encoding vs. label encoding). In what scenarios would one method be BTL 5
preferred over the other?
Evaluate the performance of a data visualization approach for exploring relationships
125 between multiple variables in a business analytics project. How would you assess BTL 5
whether the visualizations provide actionable insights?
Given a machine learning model’s performance on a dataset, evaluate how feature
126 scaling might improve its accuracy. Under what conditions would scaling not improve BTL 5
the performance of a model?
Create a Python program that uses the map and filter functions to clean a large
127 dataset. Describe how you would use these functions to transform categorical BTL 6
variables and filter out irrelevant data before applying machine learning models.
Create an interactive data visualization dashboard using Python that can be used by
business analysts to explore sales data, identify trends, and make informed decisions.
128 BTL 6
What libraries would you use, and how would you structure the dashboard to
maximize usability and insight generation?
Design a complete data processing pipeline in Python, from data collection to feature
129 engineering and selection. Provide a step-by-step approach for handling a dataset that BTL 6
includes missing values, outliers, and mixed data types (numeric and categorical).
130 Design a feature selection process that evaluates the relevance of features based on BTL 6
correlation with the target variable. How would you use Python to automate the
process of selecting the most significant features for a predictive model?
Develop a machine learning model using Python, incorporating feature scaling and
131 selection as part of the pre-processing pipeline. Explain how you would optimize this BTL 6
model by evaluating different feature engineering and scaling techniques.
Propose a feature engineering strategy for a dataset containing both numerical and
textual data. How would you handle the pre-processing and feature extraction for both
132 BTL 6
types of data, and what methods would you use to integrate them into a unified
feature set for model training?
Unit III
133 Explain the importance of exploratory data analysis (EDA) in the process of building machine learning models. BTL 2
134 Define binary logistic regression and its application in business analytics. BTL 1
Explain what ROC (Receiver Operating Characteristic) curve represents in a model
135 BTL 1
evaluation.
136 List the steps involved in building a regression model. BTL 1
What are the key diagnostic tools used to assess the performance of a regression
137 BTL 1
model?
138 What are the key differences between L1 and L2 regularization techniques? BTL 1
139 What are the main features of Jupyter Notebooks that make it ideal for data analysis? BTL 1
140 What are the primary purposes of the NumPy and Pandas libraries in Python? BTL 1
141 What does AUC (Area Under the Curve) represent in a binary classification model? BTL 1
What is a Gain and Lift Chart, and how is it useful in evaluating classification
142 BTL 1
models?
What is regularization in machine learning, and what are the two types of
143 BTL 1
regularization techniques?
144 What is the difference between simple and multiple regression models? BTL 1
145 What is the purpose of model evaluation in the context of business analytics? BTL 1
Describe the role of visualization in exploratory data analysis and how it can aid in
146 BTL 2
identifying patterns or trends in the data.
Explain the concept of "optimal classification cut-off" in binary logistic regression
147 BTL 2
and why it is important in model evaluation.
148 How can hyper-parameter tuning enhance model performance? BTL 2
149 How does diagnostic analysis help in understanding the performance of a model? BTL 2
How does regularization (L1 and L2) help in reducing overfitting in machine learning
150 BTL 2
models?
How does the ROC curve assist in evaluating a binary classification model? What
151 BTL 2
does it mean when the curve is closer to the top-left corner?
Interpret the concept of a Gain and Lift Chart and explain how it helps in evaluating
152 BTL 2
the performance of a classification model.
153 What are the differences between mutable and immutable data types in Python? BTL 2
154 What is the difference between L1 (Lasso) and L2 (Ridge) regularization techniques? BTL 2
What is the difference between simple regression and multiple regression? How do
155 BTL 2
you decide which one to use?
What is the process of building a regression model? Explain the steps involved from
156 BTL 2
data collection to model deployment.
157 What is the significance of model tuning in improving model accuracy? BTL 2
After building a binary logistic regression model, evaluate its performance using
158 ROC and AUC. Find the optimal classification threshold and assess the impact of BTL 3
changing the threshold on the classification results.
Build a multiple regression model to predict sales based on advertising spend, price,
159 and competition. Apply model diagnostics to check for multicollinearity and BTL 3
heteroscedasticity.
Design a Python script to visualize and perform exploratory data analysis (EDA) on a
160 dataset. Use different types of plots (e.g., histograms, scatter plots, box plots) to BTL 3
uncover trends and patterns in the data.
Given a dataset with sales data, perform exploratory data analysis (EDA) using
161 BTL 3
Python. Visualize key relationships between features and describe your findings.
Given a dataset with several predictors, apply regularization (L1 and L2) to prevent
162 overfitting in a multiple regression model. Compare the performance of both BTL 3
regularization techniques and provide recommendations.
163 How can the map() function be applied to transform a list of strings into integers? BTL 3
How can you use descriptive statistics to summarize a dataset before applying
164 BTL 3
machine learning models?
How would you create a histogram to visualize the distribution of a numerical
165 BTL 3
variable using seaborn?
166 How would you handle missing values and duplicates in a dataset using Python? BTL 3
How would you use a gain chart to evaluate the effectiveness of a marketing
167 BTL 3
campaign?
How would you use Markdown in Jupyter Notebooks to document your data analysis
168 BTL 3
process?
How would you use the confusion matrix to evaluate the performance of a binary
169 BTL 3
logistic regression model?
170 How would you use the matplotlib library to create a line chart in Python? BTL 3
171 How would you write a Python function to calculate the factorial of a number? BTL 3
Using a dataset with binary outcomes (e.g., success/failure), implement a Gain and
172 Lift Chart. Analyze the chart to determine how well your classification model BTL 3
performs in predicting the outcomes.
Using a dataset with customer data, build and evaluate a binary logistic regression
173 model to predict customer churn. Include ROC and AUC analysis, and find the BTL 3
optimal classification cut-off.
174 What are the key steps involved in building and validating a regression model? BTL 3
175 What are the key steps involved in pre-processing a dataset for machine learning? BTL 3
You are given a dataset for a loan approval model. Build a multiple regression model
176 to predict loan approval status and use diagnostics to check for issues such as BTL 3
multicollinearity and residuals.
You are tasked with building a simple linear regression model to predict house prices
177 based on square footage. Apply the steps in building the model, evaluate its BTL 3
performance, and interpret the results.
You have built a regression model to predict employee performance. Conduct model
178 tuning and hyper-parameter optimization to improve the accuracy. Describe the steps BTL 3
you took to achieve this.
After performing exploratory data analysis (EDA) on a dataset, analyze how outliers
179 and missing values can influence the results of your regression and classification BTL 4
models.
Analyze how feature selection and feature engineering can improve model
180 performance in a logistic regression problem. What methods would you apply for BTL 4
both categorical and continuous features?
Analyze the advantages of using Jupyter Notebooks over traditional IDEs for Python
181 BTL 4
programming.
Analyze the effect of multicollinearity in a multiple regression model. How would
182 BTL 4
you detect multicollinearity, and what steps would you take to address it?
Analyze the effectiveness of combining statistical summaries with visualizations
183 BTL 4
during exploratory data analysis.
184 Analyze the impact of feature scaling on the performance of regression models. BTL 4
185 Analyze the impact of regularization (L1 vs. L2) on a regression model's coefficients. BTL 4
How does regularization affect the model’s bias and variance?
186 Analyze the residual plots of a regression model to assess its goodness of fit. BTL 4
Analyze the role of loops and conditionals in solving iterative programming
187 BTL 4
problems.
Analyze the trade-offs between using a simple regression model and a multiple
188 regression model. How would the addition of more features affect the model’s BTL 4
performance and interpretability?
Compare the effectiveness of scatter plots and box plots in identifying patterns in
189 BTL 4
data.
190 Compare the functionalities of Pandas and NumPy when working with tabular data. BTL 4
191 Compare the usage of map() and filter() functions for handling large datasets. BTL 4
Compare the use of gain charts and lift charts in determining the efficiency of a
192 BTL 4
predictive model.
Given a binary logistic regression model, analyze the relationship between the ROC
193 BTL 4
curve and AUC. What do these metrics tell you about the model’s performance?
Given a model that shows signs of overfitting, analyze the impact of tuning hyper-
194 parameters such as the learning rate or regularization strength on improving the BTL 4
model's generalization ability.
How would you interpret the p-value and R-squared value in a multiple regression
195 BTL 4
model summary?
Based on the model’s performance, evaluate the use of feature scaling before applying
196 a regression or classification model. How does scaling impact model accuracy and BTL 5
convergence speed?
Critically evaluate the use of Gain and Lift charts for assessing classification models.
197 How would you determine whether your model is performing well based on these BTL 5
charts?
Evaluate a model's performance after applying hyper-parameter tuning. How would
198 you assess whether the tuned model has improved over the original model in terms of BTL 5
accuracy, precision, and recall?
Evaluate a regression model's diagnostics (e.g., residual plots, R-squared value). What
199 metrics would you use to determine if the model fits the data well, and what actions BTL 5
would you take if the diagnostics indicate problems?
Evaluate the effect of different regularization methods (L1 vs. L2) on model
200 performance. In which scenarios would L1 regularization be preferred over L2, and BTL 5
why?
Evaluate the importance of diagnostic analysis in model building. How can you use
201 diagnostics to improve a model's generalizability and avoid overfitting or under- BTL 5
fitting?
Evaluate the performance of a binary logistic regression model using the ROC curve
202 and AUC. What thresholds would you set for classification, and how do these choices BTL 5
impact the model’s precision and recall?
Create a comprehensive model-building pipeline that includes the following stages:
203 data pre-processing, model selection, hyper-parameter tuning, and model evaluation. BTL 6
Justify your choices of methods and tools at each stage.
Create a Python script that builds, tunes, and evaluates a regression or classification
204 model. Include steps for model diagnostics and feature selection, and explain how the BTL 6
script will automate the model evaluation process.
Create a regression model to predict employee performance based on several
205 predictors. Apply diagnostic analysis to validate the model's assumptions and improve BTL 6
its accuracy.
Create an automated process for model monitoring and retraining after deployment.
206 How would you design this system to account for model drift and ensure that the BTL 6
model continues to deliver accurate predictions over time?
207 Design a logistic regression model to predict customer churn, and propose strategies BTL 6
for model tuning and hyper-parameter optimization. Include steps for evaluating the
model using AUC, ROC, and Gain/Lift charts.
Design an approach for creating a classification model using a dataset with
208 imbalanced classes. How would you balance the data, select appropriate evaluation BTL 6
metrics, and ensure the model generalizes well to new data?
Develop a business case for deploying a machine learning model that predicts product
209 demand based on historical sales data. Describe how you would integrate the model BTL 6
into an organization’s decision-making process.
Develop a strategy to improve a multiple regression model's performance using
210 feature engineering and regularization. Include the methods you would use to handle BTL 6
categorical variables, missing data, and scaling.
Propose a new framework for building and deploying machine learning models that
incorporates data collection, pre-processing, model training, evaluation, and
211 BTL 6
monitoring. How would you ensure the model’s long-term performance and
scalability?
Propose a solution for addressing overfitting in a machine learning model that uses
212 many predictors. What steps would you take to reduce complexity while maintaining BTL 6
model accuracy (e.g., regularization, feature selection)?
INSTITUTE OF PUBLIC ENTERPRISE
Shameerpet Campus
Hyderabad – 500101
POST GRADUATE DIPLOMA IN MANAGEMENT
Mid Trimester Examinations: February 2025
Section – I (4 X 1 = 4 Marks)
Q.NO QUESTION BTL CO
Q. 1
Q. 2
Q. 3
Q. 4
Q. 6a
OR
Q. 6b
OR
Q. 7b
Q. 8a
OR
Q. 8b
Q. 9a
OR
Q. 9b
Q. 10a
OR
Q. 10b
Section – III (2 X 6 = 12 Marks)
UNIT I
Analytics Landscape:
A comprehensive view of how analytics is utilized to support data-driven decision-making across various
domains. While the specific terminology and structure may vary slightly, the typical framework includes the
following key components:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics
5. Cognitive/AI-Driven Analytics
6. Operational Analytics
7. Strategic Analytics
This landscape illustrates how analytics is a multi-faceted domain that integrates data, tools, and
methodologies to drive decisions at various levels, from operational to strategic, across diverse industries.
Framework for Data-Driven Decision Making:
A structured framework to guide data-driven decision-making processes, which outlines five key steps:
1. Business Question: Clearly define the specific business problem or question that needs to be addressed.
2. Analysis Plan: Develop a detailed plan outlining the analytical approach, including hypotheses to test and
methodologies to employ.
3. Data Collection: Gather relevant data from appropriate sources, ensuring its quality and relevance to the
analysis.
4. Insights Derivation: Analyse the collected data to extract meaningful insights, identify patterns, and
validate hypotheses.
5. Recommendations: Formulate actionable recommendations based on the derived insights to inform
decision-making.
Roadmap for Analytics Capability Building:
This roadmap emphasizes a structured approach to integrating analytics into business processes to enhance data-driven decision-making. The key steps include:
1. Define Objectives and Goals:
o Clearly articulate the organization's strategic objectives and how analytics can support these
goals.
o Identify specific areas where analytics can provide value, such as improving customer
insights, optimizing operations, or enhancing product development.
2. Assess Current Capabilities:
o Evaluate the existing analytics infrastructure, including technology, data quality, and human
resources.
o Determine the organization's maturity level in terms of data management and analytical skills.
3. Develop a Strategic Analytics Plan:
o Create a comprehensive plan that outlines the steps needed to build or enhance analytics
capabilities.
o Set measurable targets and timelines for achieving analytics objectives.
4. Invest in Technology and Tools:
o Acquire the necessary analytics tools and platforms that align with the organization's needs.
o Ensure scalability and integration capabilities with existing systems.
5. Build a Skilled Analytics Team:
o Recruit and train personnel with expertise in data analysis, statistics, and domain-specific
knowledge.
o Foster a culture of continuous learning and development in analytics.
6. Establish Data Governance and Management:
o Implement policies and procedures to ensure data quality, security, and privacy.
o Define data ownership and stewardship roles within the organization.
7. Promote a Data-Driven Culture:
o Encourage decision-making based on data insights across all levels of the organization.
o Provide training and resources to help employees understand and utilize analytics in their
roles.
8. Implement Analytics Solutions:
o Deploy analytics projects that address identified business needs.
o Use pilot projects to demonstrate value and refine approaches before broader implementation.
9. Monitor and Evaluate Performance:
o Regularly assess the effectiveness of analytics initiatives against predefined metrics.
o Gather feedback to identify areas for improvement and to inform future analytics strategies.
10. Scale and Innovate:
o Expand successful analytics initiatives to other areas of the organization.
o Stay abreast of emerging analytics trends and technologies to maintain a competitive edge.
By following this roadmap, organizations can systematically develop their analytics capabilities, leading to
more informed decision-making and improved business outcomes.
Challenges in Data-Driven Decision Making and Future Trends:
This topic covers several challenges organizations face in implementing data-driven decision-making and offers insights into future trends in business analytics.
By addressing these challenges and staying abreast of emerging trends, organizations can effectively
leverage data-driven decision-making to achieve their strategic objectives.
Types of Data:
Categorization of data based on structure, source, and use case.
1. Based on data format:
3. Based on nature:
5. Based on Use:
6. Based on Content:
Feature Engineering:
Feature engineering involves creating new variables or modifying existing ones to enhance the performance
of predictive models. This process is crucial for improving model accuracy and uncovering hidden patterns
within the data.
Key Aspects of Feature Engineering:
1. Data Transformation:
o Applying mathematical functions to variables, such as logarithmic or square root
transformations, to stabilize variance or normalize distributions.
2. Interaction Features:
o Creating new features by combining two or more variables to capture interactions that may
influence the target outcome.
3. Binning:
o Grouping continuous variables into discrete bins or categories to reduce noise and handle
non-linear relationships.
4. Encoding Categorical Variables:
o Converting categorical data into numerical format using techniques like one-hot encoding or
label encoding to make them suitable for machine learning algorithms.
5. Handling Missing Values:
o Imputing missing data with appropriate values or creating indicator variables to flag missing
entries.
6. Scaling and Normalization:
o Adjusting the range of variables to ensure they contribute equally to the analysis, especially
important for distance-based algorithms.
7. Date and Time Feature Extraction:
o Deriving new features from date and time variables, such as day of the week, month, or time
of day, to capture temporal patterns.
By meticulously engineering features, analysts can significantly enhance the predictive power of their
models, leading to more accurate and reliable data-driven decisions.
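As a minimal sketch of some of the aspects listed above (transformation, binning, encoding of categorical variables, scaling, and date feature extraction), the Python example below uses pandas, NumPy, and scikit-learn on a small hypothetical customer table; all column names and values are assumptions made purely for illustration.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer data for illustration only.
df = pd.DataFrame({
    "income": [25000, 54000, 98000, 150000],
    "age": [22, 35, 47, 61],
    "city": ["Hyderabad", "Mumbai", "Delhi", "Hyderabad"],
    "last_purchase": pd.to_datetime(["2025-01-05", "2025-02-14", "2025-03-01", "2025-03-21"]),
})

# Data transformation: log transform to stabilize variance in income.
df["log_income"] = np.log(df["income"])

# Binning: group age into discrete bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])

# Encoding categorical variables: one-hot encode city.
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Scaling: bring income into the 0-1 range.
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]])

# Date feature extraction: day of the week of the last purchase.
df["purchase_dow"] = df["last_purchase"].dt.day_name()

print(df.head())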
Functional Applications of Business Analytics in Management:
The functional applications of business analytics in management span across various departments and
domains within an organization. These applications enable better decision-making, improve efficiency, and
drive strategic objectives. Below is an overview of key areas where business analytics plays a pivotal role:
1. Marketing Analytics
Customer Segmentation: Identifying and grouping customers based on behavior, preferences, and
demographics.
Campaign Performance: Measuring the effectiveness of marketing campaigns through KPIs like ROI
and conversion rates.
Personalization: Leveraging data to tailor marketing messages and product recommendations.
Churn Analysis: Predicting customer attrition and implementing retention strategies.
2. Financial Analytics
Budgeting and Forecasting: Using historical data and predictive models to create accurate financial
projections.
Risk Management: Identifying and mitigating financial risks through stress testing and scenario
analysis.
Profitability Analysis: Evaluating profitability at product, customer, or segment levels.
Fraud Detection: Employing machine learning algorithms to detect anomalies and fraudulent
activities.
5. Strategic Management
Market Trends Analysis: Monitoring market dynamics and competitive landscapes for informed
strategy formulation.
Scenario Planning: Evaluating potential outcomes of strategic decisions through simulation models.
Mergers and Acquisitions: Conducting due diligence and valuing target companies based on financial
and operational data.
KPI Monitoring: Developing dashboards to track organizational performance against strategic
objectives.
Customer Feedback Analysis: Extracting insights from surveys, reviews, and social media for service
improvement.
Loyalty Programs: Designing and optimizing loyalty initiatives to enhance customer retention.
System Performance: Monitoring IT systems for performance optimization and downtime reduction.
Data Management: Enhancing data governance and ensuring data quality for better decision-making.
By leveraging business analytics across these functional areas, organizations can achieve greater operational
efficiency, improve decision-making accuracy, and maintain a competitive edge in the market.
Widely used Analytical Tools:
Marketing Analytics Tools
Google Analytics: Web analytics tool to track and analyze website traffic and user behavior.
HubSpot: Inbound marketing platform for campaign performance, lead tracking, and customer
insights.
Tableau: Visualization tool to analyze customer segmentation, campaign effectiveness, and churn
patterns.
Adobe Analytics: Advanced analytics for understanding customer journeys and optimizing marketing
strategies.
Microsoft Power BI: Business intelligence tool to track financial KPIs and generate real-time
insights.
QuickBooks: Accounting software for small and medium businesses to manage financial data and
budgeting.
Alteryx: Analytics platform for financial data preparation, risk modeling, and profitability analysis.
JDA (now Blue Yonder): Advanced supply chain solutions for logistics and transportation
management.
Qlik Sense: Business intelligence tool for process improvement and operational analytics.
SAP Integrated Business Planning: Comprehensive platform for demand planning and supply chain
optimization.
Balanced Scorecard Software: Framework for monitoring KPI performance across strategic
objectives.
Microsoft Excel: Widely used for scenario planning and simulation modeling.
Domo: Analytics and BI platform for tracking organizational performance.
MetricStream: GRC (Governance, Risk, and Compliance) platform to manage regulatory adherence
and operational risks.
SAS Risk Management: Comprehensive tool for identifying, measuring, and mitigating financial
risks.
Splunk: Tool for monitoring and managing IT and operational risks.
IBM QRadar: Security information and event management (SIEM) tool for threat analysis.
Snowflake: Cloud-based data platform for managing large datasets and ensuring data quality.
Tableau Data Management: Enhances governance, quality, and scalability of IT systems.
By utilizing these tools, organizations can streamline their analytics processes, extract actionable insights,
and make informed decisions to achieve their strategic objectives.
Ethics in Business Analytics:
Ethics in Business Analytics is a crucial aspect of using data to drive decisions in any organization. As the
reliance on data-driven insights and algorithms increases, ensuring ethical practices in business analytics
becomes vital to avoid harmful consequences. Ethical concerns can arise at various stages of data analysis,
including data collection, analysis, and decision-making. Here are the key areas where ethics play a
significant role in business analytics:
1. Data Privacy and Protection
Informed Consent: Businesses must obtain explicit consent from individuals before collecting or
using their data, particularly in sensitive areas like healthcare or personal information.
Data Minimization: Collect only the data that is necessary for the intended purpose to reduce
exposure and minimize risks.
Compliance with Regulations: Adhering to laws and regulations such as GDPR (General Data
Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), and CCPA
(California Consumer Privacy Act) is crucial to protect consumer privacy.
Data Anonymization: Personal data should be anonymized or de-identified to reduce the risk of
misuse.
2. Transparency in Algorithms
Explainability: Algorithms and models used in decision-making should be transparent and
interpretable. Business stakeholders should be able to understand how decisions are made by the
system.
Fairness: Avoid using algorithms that unintentionally favor certain groups or individuals, leading to
biased outcomes. Fairness checks should be applied to ensure equitable results across different
demographics.
Bias Detection: Regularly audit algorithms for biases (e.g., racial, gender, or socio-economic biases)
that may distort outcomes. Model developers should ensure that their models do not perpetuate
societal inequalities.
In conclusion, ethics in business analytics is about balancing the potential benefits of data-driven decision-
making with the responsibility to protect privacy, avoid bias, and ensure transparency and fairness.
Organizations must adopt ethical frameworks and practices to ensure that their analytics initiatives create
positive value without causing harm to individuals, communities, or society at large.
UNIT II
Introduction to Python
Python was developed by Guido van Rossum and first released in 1991.
Python is a high-level programming language that combines features of procedural programming languages like C with those of object-oriented programming languages like Java.
FEATURES OF PYTHON
Simple
Python is a simple programming language because it uses English-like sentences in its programs.
Easy to learn
Python uses very few keywords. Its programs use very simple structure.
Open source
Python can be freely downloaded from www.python.org website. Its source code can be read, modified and
can be used in programs as desired by the programmers.
High level language
High-level languages use English words to develop programs and are easy to learn and use. Like COBOL, PHP, or Java, Python also uses English words in its programs, and hence it is called a high-level programming language.
Dynamically typed
In Python, we need not declare variables or their types. Depending on the value stored in a variable, the Python interpreter internally infers its datatype.
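For example, the same variable name can hold values of different types at different times, and the interpreter infers the type each time:

x = 10           # x currently refers to an int
print(type(x))   # <class 'int'>

x = "analytics"  # the same name can later refer to a str
print(type(x))   # <class 'str'>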
Platform independent
Python programs are not dependent on any particular computer or operating system. We can use Python on Unix, Linux, Windows, Macintosh, Solaris, OS/2, Amiga, AROS, AS/400, and almost all other operating systems. This makes Python an ideal programming language for any network or Internet application.
Portable
When a program yields the same result on any computer in the world, it is called a portable program. Python programs give the same result everywhere since they are platform independent.
Embeddable
Several applications already developed in Python can be integrated into programs written in other languages like C, C++, Delphi, PHP, Java, and .NET. This means programmers can use these applications to their advantage in various software projects.
Huge library
Python has a large library that contains modules which can be used on any operating system.
Scripting language
A scripting language uses an interpreter to translate the source code into machine code on the fly (while running). Generally, scripting languages perform supporting tasks for a bigger application or software. Python is considered a scripting language because it is interpreted and is often used on the Internet to support other software.
Database connectivity
A database is software that stores and manipulates data. Python provides interfaces to connect its programs to all major databases like Oracle, Sybase, SQL Server, or MySQL.
Scalable
A program would be scalable if it could be moved to another Operating system or hardware and take full
advantage of the new environment in terms of performance.
Core Libraries in Python
The huge library of Python contains several small applications (or small packages) which are already
developed and immediately available to programmers. These libraries are called ‘batteries included’. Some
interesting batteries or packages are given here:
argparse is a package that represents command-line parsing library.
boto is Amazon web services library.
CherryPy is an object-oriented HTTP framework.
cryptography offers cryptographic techniques for the programmers
Fiona reads and writes geospatial data files.
jellyfish is a library for doing approximate and phonetic matching of strings.
matplotlib is a library for creating 2D plots, charts and graphs.
mysql-connector-python is a driver written in Python to connect to MySQL database.
numpy is a package for processing arrays of single or multidimensional type.
pandas is a package for powerful data structures for data analysis, time series and statistics.
Pillow is a Python imaging library.
pyquery represents jquery-like library for Python.
scipy is the scientific library to do scientific and engineering calculations.
Sphinx is the Python documentation generator.
sympy is a package for Computer algebra system (CAS) in Python.
w3lib is a library of web related functions.
whoosh contains fast and pure Python full text indexing, search and spell checking library.
To know the entire list of packages included in Python, one can visit:
https://www.pythonanywhere.com/batteries_included/
Python Virtual Machine (PVM) is a software that contains an interpreter that converts the byte code into
machine code.
PVM is most often called Python interpreter. The PVM of PyPy contains a compiler in addition to the
interpreter. This compiler is called Just In Time (JIT) compiler which is useful to speed up execution of the
Python program.
Step 4) Then click on ‘Just Me’ radio button for installing your individual copy.
Step 5) It will show a default directory to install. Click on ‘Next’.
Step 6) In the next screen, select the checkbox ‘Create start menu shortcuts’. Also, unselect other
checkboxes.
Step 7) The installation starts in the next screen. We should wait for the installation to complete.
Step 10) In the final screen, do not check the checkboxes and then click on “Finish”.
Note: Once the installation is completed, we can find a new folder by the name “Anaconda3(64-bit)” created
in the Windows 10 applications list, which can be seen by pressing the Windows “Start” button. When we click on this
folder, we can find several icons including “Jupyter Notebook” and “Spyder”.
Step 5) We can enter code in the next cell and so on. In this manner, we can run the program as blocks of
code, one block at a time. When input is required, it will wait for your input to enter, as shown in the
following screen. The blue box around the cell indicates command mode.
Step 6) Type the program in the cells and run each cell to see the results produced by each cell.
Note: To save the program, click on Floppy symbol below the “File” menu. Click on “Insert” to insert a new
cell either above or below the current cell. The programs in Jupyter are saved with the extension “.ipynb”
which indicates Interactive Python Notebook file. This file stores the program and other contents in the form
of JSON (JavaScript Object Notation). Click on ‘Logout’ to terminate Jupyter. Then close the server window
also.
Step 7) To reopen the program, first enter into Jupyter Notebook Home Page. In the “Files” tab, find out the
program named “first.ipynb” and click on it to open it in another page.
Step 8) Similarly, to delete the file, first select it and then click on the Delete Bin symbol.
RUNNING A PYTHON PROGRAM
Running a Python program can be done from 3 environments: 1. Command line window 2. IDLE graphics
window 3. System prompt
In IDLE window, click on help -> ‘Python Docs’ or F1 button to get documentation help.
Save a Python program in IDLE and reopen it and run it.
COMMENTS (2 types)
# single line comments
“”” or ‘’’ multi line comments
Docstrings
If we write strings inside “”” or ‘’’ and if these strings are written as first statements in a module, function,
class or a method, then these strings are called documentation strings or docstrings. These docstrings are
useful to create an API documentation file from a Python program. An API (Application Programming
Interface) documentation file is a text file or html file that contains description of all the features of a
software, language or a product.
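Ex (a minimal illustrative sketch): a docstring written as the first statement of a function can be read back through the __doc__ attribute or the help() function.
def add(a, b):
    """Return the sum of a and b."""  # this string is the docstring
    return a + b
print(add.__doc__) # displays: Return the sum of a and b.
help(add) # displays the function header along with its docstring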
DATATYPES
A datatype represents the type of data stored into a variable (or memory).
Built-in datatypes
The built-in datatypes are of 5 types:
None Type
Numeric types
Sequences
Sets
Mappings
NOTE:
Binary numbers are represented by a prefix 0b or 0B. Ex: 0b10011001
Hexadecimal numbers are represented by a prefix 0x or 0X. Ex: 0X11f9c
Octal numbers are represented by a prefix 0o or 0O. Ex: 0o145.
bool type: represents any of the two boolean values, True or False.
Ex: a = 10>5 # here a is treated as bool type variable.
print(a) #displays True
NOTE:
1. To convert a float number into integer, we can use int() function. Ex: int(num)
2. To convert an integer into float, we can use float() function.
3. bin() converts a number into binary. Ex: bin(num)
4. oct() converts a number into octal.
5. hex() converts a number into hexadecimal.
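Ex (illustrative values):
num = 9.99
print(int(num)) # 9 - the fractional part is truncated
print(float(7)) # 7.0
print(bin(10)) # 0b1010
print(oct(10)) # 0o12
print(hex(255)) # 0xff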
STRINGS
str datatype: represents string datatype. A string is enclosed in single quotes or double quotes.
Ex: s1 = “Welcome”
s2 = ‘Welcome’
A string occupying multiple lines can be inserted into triple single quotes or triple double quotes.
Ex: s1 = ‘’’ This is a special training on
Python programming that
gives insights into Python language.
‘’’
To display a string with single quotes.
Ex: s2 = “””This is a book ‘on Core Python’ programming”””
To find length of a string, use len() function.
Ex: s3 = ‘Core Python’
n = len(s3)
print(n) -> 11
We can do indexing, slicing and repetition of strings.
Ex: s = “Welcome to Core Python”
print(s) -> Welcome to Core Python
print(s[0]) -> W
print(s[0:7]) -> Welcome
print(s[:7]) -> Welcome
print(s[1:7:2]) -> ecm
print(s[-1]) -> n
print(s[-3:-1]) -> ho
print(s[1]*3) -> eee
print(s*2) -> Welcome to Core PythonWelcome to Core Python
Remove spaces using rstrip(), lstrip(), strip() methods.
Ex: name = “ Vijay Kumar “
print(name.strip())
We can find substring position in a string using find() method. It returns -1 if not found.
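Ex (illustrative):
s = 'Core Python'
print(s.find('Python')) # 5 - position where the substring starts
print(s.find('Java')) # -1 - substring not found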
CHARACTERS
There is no datatype to represent a single character in Python. Characters are part of str datatype.
Ex:
str = "Hello"
print(str[0])
H
for i in str: print(i)
H
e
l
l
o
bytearray datatype: same as bytes type but its elements can be modified.
arr = [10,20,55,100,99]
x=bytearray(arr)
x[0]=11
x[1]=21
for i in x: print(i)
11
21
55
100
99
NOTE:
Indexing, slicing and repetition are possible on bytes and bytearray objects. The difference is that the elements of a bytes object cannot be modified, whereas the elements of a bytearray can be modified.
LISTS
A list is similar to an array that can store a group of elements. A list can store different types of elements and
can grow dynamically in memory. A list is represented by square braces [ ]. List elements can be modified.
Ex:
lst = [10, 20, 'Ajay', -99.5]
print(lst[2])
Ajay
To create an empty list.
lst = [] # then we can append elements to this list as lst.append(‘Vinay’)
NOTE:
Indexing, slicing and repetition are possible on lists.
print(lst[1])
20
print(lst[-3:-1])
[20, 'Ajay']
lst = lst*2
print(lst)
[10, 20, 'Ajay', -99.5, 10, 20, 'Ajay', -99.5]
We can use len() function to find the no. of elements in the list.
n = len(lst) -> 4
The del statement deletes an element at a particular position.
del lst[1] -> deletes 20
remove() will remove a particular element. clear() will delete all elements from the list.
lst.remove(‘Ajay’)
lst.clear()
We can update the list elements by assignment.
lst[0] = ‘Vinod’
lst[1:3] = 10, 15
max() and min() functions return the biggest and smallest elements.
max(lst)
min(lst)
TUPLES
A tuple is similar to a list but its elements cannot be modified. A tuple is represented by parentheses ( ).
Indexing, slicing and repetition are possible on tuples also.
Ex:
tpl=( ) # creates an empty tuple
tpl=(10, ) # with only one element – comma needed after the element
tpl = (10, 20, -30, "Raju")
print(tpl)
(10, 20, -30, 'Raju')
tpl[0]=-11 # error
print(tpl[0:2])
(10, 20)
tpl = tpl*2
print(tpl)
(10, 20, -30, 'Raju', 10, 20, -30, 'Raju')
NOTE: len(), count(), index(), max(), min() functions are same in case of tuples also.
We cannot use append(), extend(), insert(), remove(), clear() methods on tuples.
To sort the elements of a tuple, we can use the sorted() function. It returns the sorted elements as a new list; the elements should be of comparable types.
sorted(tpl) # sorts all elements into ascending order
sorted(tpl, reverse=True) # sorts all elements into descending order
To convert a list into tuple, we can use tuple() method.
tpl = tuple(lst)
RANGE DATATYPE
range represents a sequence of numbers. The numbers in the range cannot be modified. Generally, range is
used to repeat a for loop for a specified number of times.
Ex: we can create a range object that stores from 0 to 4 as:
r = range(5)
print(r[0]) -> 0
for i in r: print(i)
0
1
2
3
4
Ex: we can also mention step value as:
r = range(0, 10, 2)
for i in r: print(i)
0
2
4
6
8
r1 = range(50, 40, -2)
for i in r1: print(i)
50
48
46
44
42
SETS
A set datatype represents an unordered collection of elements. A set does not accept duplicate elements, whereas
a list accepts duplicate elements. A set is written using curly braces { }. Its elements can be modified.
s = {1, 2, 3, "Vijaya"}
print(s)
{1, 2, 3, 'Vijaya'}
NOTE: Indexing, slicing and repetition are not allowed in case of a set.
To add elements into a set, we should use update() method as:
s.update([4, 5])
print(s)
{1, 2, 3, 4, 5, 'Vijaya'}
To remove elements from a set, we can use remove() method as:
s.remove(5)
print(s)
{1, 2, 3, 4, 'Vijaya'}
A frozenset datatype is same as set type but its elements cannot be modified.
Ex:
s = {1, 2, -1, 'Akhil'} -> this is a set
s1 = frozenset(s) -> convert it into frozenset
for i in s1: print(i)
1
2
Akhil
-1
NOTE: update() or remove() methods will not work on frozenset.
MAPPING DATATYPES
A map indicates elements in the form of key – value pairs. When key is given, we can retrieve the associated
value. A dict datatype (dictionary) is an example for a ‘map’.
d = {10: 'kamal', 11:'Subbu', 12:'Sanjana'}
print(d)
{10: 'kamal', 11: 'Subbu', 12: 'Sanjana'}
keys() method gives keys and values() method returns values from a dictionary.
k = d.keys()
for i in k: print(i)
10
11
12
for i in d.values(): print(i)
kamal
Subbu
Sanjana
To display value upon giving key, we can use as:
Ex: d = {10: 'kamal', 11:'Subbu', 12:'Sanjana'}
d[10] gives ‘kamal’
To create an empty dictionary, we can use as:
d = {}
Later, we can store the key and values into d, as:
d[10] = ‘Kamal’
d[11] = ‘Pranav’
We can update the value of a key, as: d[key] = newvalue.
Ex: d[10] = ‘Subhash’
We can delete a key and corresponding value, using del function.
Ex: del d[11] will delete a key with 11 and its value also.
PYTHON AUTOMATICALLY KNOWS ABOUT THE DATATYPE
The datatype of the variable is decided depending on the value assigned. To know the datatype of the
variable, we can use type() function.
Ex:
x = 15 #int type
print(type(x))
<class 'int'>
x = 'A' #str type
print(type(x))
<class 'str'>
x = 1.5 #float type
print(type(x))
<class 'float'>
x = "Hello" #str type
print(type(x))
<class 'str'>
x = [1,2,3,4]
print(type(x))
<class 'list'>
x = (1,2,3,4)
print(type(x))
<class 'tuple'>
x = {1,2,3,4}
print(type(x))
<class 'set'>
Literals in Python
A literal is a constant value that is stored into a variable in a program.
a = 15
Here, ‘a’ is the variable into which the constant value ‘15’ is stored. Hence, the value 15 is called ‘literal’.
Since 15 indicates integer value, it is called ‘integer literal’.
Ex: a = ‘Srinu’ → here ‘Srinu’ is called string literal.
Ex: a = True → here, True is called Boolean type literal.
User-defined datatypes
The datatypes which are created by the programmers are called ‘user-defined’ datatypes. For example, an
array, a class, or a module is user-defined datatypes. We will discuss about these datatypes in the later
chapters.
Constants in Python
A constant is similar to variable but its value cannot be modified or changed in the course of the program
execution. For example, the value of pi (approximately 22/7) is a constant. By convention, constants are written in capital letters, as in PI.
Assignment operators
To assign right side value to a left side variable.
Operator   Example    Meaning
=          z = x+y    Assignment operator, i.e. x+y is stored into z.
+=         z += x     Addition assignment operator, i.e. z = z+x.
-=         z -= x     Subtraction assignment operator, i.e. z = z-x.
*=         z *= x     Multiplication assignment operator, i.e. z = z*x.
/=         z /= x     Division assignment operator, i.e. z = z/x.
%=         z %= x     Modulus assignment operator, i.e. z = z%x.
**=        z **= y    Exponentiation assignment operator, i.e. z = z**y.
//=        z //= y    Floor division assignment operator, i.e. z = z//y.
Ex:
a=b=c=5
print(a,b,c)
5 5 5
a,b,c=1,2,'Hello'
print(a,b,c)
1 2 Hello
x = [10,11,12]
a,b,c = 1.5, x, -1
print(a,b,c)
1.5 [10, 11, 12] -1
Ex:
1<2<3<4 will give True
1<2>3<4 will give False
Logical operators
Logical operators are useful to construct compound conditions. A compound condition is a combination of
more than one simple condition. 0 is False, any other number is True.
x = 1, y = 2
Operator   Example    Meaning                                                                          Result
and        x and y    And operator. If x is False, it returns x; otherwise it returns y.               2
or         x or y     Or operator. If x is False, it returns y; otherwise it returns x.                1
not        not x      Not operator. If x is False, it returns True; if x is True, it returns False.    False
Ex:
x=1; y=2; z=3
if(x<y or y>z):
print('Yes')
else:
print('No') -> displays Yes
Boolean operators
Boolean operators act upon ‘bool’ type values and they provide ‘bool’ type result. So the result will be again
either True or False.
x = True, y = False
Operator   Example    Meaning                                                                                Result
and        x and y    Boolean and operator. If both x and y are True, it returns True; otherwise False.      False
or         x or y     Boolean or operator. If either x or y is True, it returns True; otherwise False.       True
not        not x      Boolean not operator. If x is True, it returns False; otherwise True.                  False
INPUT AND OUTPUT
print() function for output
Example                                                   Output
print()                                                   (a blank line)
print("Hai")                                              Hai
print("This is the \nfirst line")                         This is the
                                                          first line
print("This is the \\nfirst line")                        This is the \nfirst line
print('Hai'*3)                                            HaiHaiHai
print('City='+"Hyderabad")                                City=Hyderabad
print(a, b)                                               1 2
print(a, b, sep=",")                                      1,2
print(a, b, sep='-----')                                  1-----2
print("Hello")                                            Hello
print("Dear")                                             Dear
print("Hello", end=''); print("Dear", end='')             HelloDear
a=2; print('You typed', a, 'as input')                    You typed 2 as input
%i, %f, %c, %s can be used as format strings. With name='Linda' and sal=12000.50:
print('Hai', name, 'Your salary is', sal)                 Hai Linda Your salary is 12000.5
print('Hai %s, Your salary is %.2f' % (name, sal))        Hai Linda, Your salary is 12000.50
print('Hai {}, Your salary is {}'.format(name, sal))      Hai Linda, Your salary is 12000.5
print('Hai {0}, Your salary is {1}'.format(name, sal))    Hai Linda, Your salary is 12000.5
print('Hai {1}, Your salary is {0}'.format(name, sal))    Hai 12000.5, Your salary is Linda
Positional arguments
These are the arguments passed to a function in correct positional order. Here, the number of arguments and
their positions in the function definition should match exactly with the number and position of the argument
in the function call
def attach(s1, s2): # function definition
    print(s1, s2)
attach('New', 'York') # positional arguments
Keyword arguments
Keyword arguments are arguments that identify the parameters by their names.
def grocery(item, price): # function definition
    print(item, price)
grocery(item='Sugar', price=50.75) # keyword arguments
Default arguments
We can mention some default value for the function parameters in the definition.
def grocery(item, price=40.00): # default argument is price
    print(item, price)
grocery(item='Sugar') # default value for price is used
ARRAYS
To work with arrays, we use numpy (numerical python) package.
For complete help on numpy: https://docs.scipy.org/doc/numpy/reference/
An array is an object that stores a group of elements (or values) of the same datatype. Unlike Python lists, a NumPy
array has a fixed size once created; operations that appear to grow or shrink an array actually build a new array.
NOTE: We can use for loops to display the individual elements of the array.
To work with numpy, we should import that module, as:
import numpy
import numpy as np
from numpy import *
Single dimensional (or 1D ) arrays
A 1D array contains one row or one column of elements. For example, the marks of a student in 5 subjects.
Creating single dimensional arrays
Creating arrays in numpy can be done in several ways. Some of the important ways are:
Using array() function
Using linspace() function
Using logspace() function
Using arange() function
Using zeros() and ones() functions.
Ex:
numpy.sort(arr)
numpy.max(arr)
numpy.sqrt(arr)
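Ex (a minimal sketch of the creation functions listed above; the values are illustrative):
import numpy as np
a = np.array([10, 20, 30, 40, 50]) # from a Python list
b = np.arange(1, 10, 2) # 1 3 5 7 9
c = np.linspace(0, 1, 5) # 5 equally spaced values from 0 to 1
d = np.logspace(1, 3, 3) # 10, 100, 1000
e = np.zeros(4) # 0. 0. 0. 0.
f = np.ones(4, dtype=int) # 1 1 1 1
print(np.sort(a)) # sorted copy of a
print(np.max(a)) # 50
print(np.sqrt(a)) # square root of each element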
Aliasing the arrays
If ‘a’ is an array, we can assign it to ‘b’, as:
b=a
This is a simple assignment that does not make any new copy of the array ‘a’. It means, ‘b’ is not a new
array and memory is not allocated to ‘b’. Also, elements from ‘a’ are not copied into ‘b’ since there is no
memory for ‘b’. Then how to understand this assignment statement? We should understand that we are
giving a new name ‘b’ to the same array referred by ‘a’. It means the names ‘a’ and ‘b’ are referencing same
array. This is called ‘aliasing’.
‘Aliasing’ is not ‘copying’. Aliasing means giving another name to the existing object. Hence, any
modifications to the alias object will reflect in the existing object and vice versa.
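Ex (illustrative):
import numpy as np
a = np.array([10, 20, 30, 40])
b = a # aliasing: 'b' is another name for the same array
b[0] = 99 # modifying through 'b' ...
print(a) # [99 20 30 40] ... is visible through 'a' as well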
Viewing and Copying arrays
We can create another array that is based on an existing array. This is done by the view() method. This method
creates a new array object that contains the same elements found in the existing array. However, the view shares
the same underlying data as the original array. If the newly created array is modified, the original array will also
be modified, since the elements in both the arrays act like mirror images.
We can create a view of ‘a’ as:
b = a.view()
Viewing is nothing but copying only. But it is called ‘shallow copying’ as the elements in the view when
modified will also modify the elements in the original array. So, both the arrays will act as one and the same.
Suppose we want both the arrays to be independent and modifying one array should not affect another array,
we should go for ‘deep copying’. This is done with the help of copy() method. This method makes a
complete copy of an existing array and its elements. When the newly created array is modified, it will not
affect the existing array or vice versa. There will not be any connection between the elements of the two
arrays.
We can create a copy of ’a’ as:
b = a.copy()
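Ex (illustrative):
import numpy as np
a = np.array([1, 2, 3, 4])
v = a.view() # shallow copy: shares the same data
v[0] = 100
print(a) # [100 2 3 4] - the change is reflected in 'a'
c = a.copy() # deep copy: independent data
c[1] = 200
print(a) # [100 2 3 4] - 'a' is not affected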
Multi-dimensional arrays (2D, 3D, etc)
They represent more than one row and more than one column of elements. For example, marks obtained by a
group of students each in five subjects.
Creating multi-dimensional arrays
We can create multi dimensional arrays in the following ways:
Using array() function
Using ones() and zeroes() functions
Using eye() function
Using reshape() function discussed earlier
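Ex (a minimal sketch of the creation functions listed above):
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]]) # 2 rows and 3 columns
b = np.zeros((2, 3)) # 2x3 array of zeros
c = np.ones((3, 3), dtype=int) # 3x3 array of ones
d = np.eye(3) # 3x3 identity matrix
e = np.arange(6).reshape(2, 3) # 0 to 5 reshaped into 2 rows and 3 columns
print(a.shape) # (2, 3)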
INTRODUCTION TO OOPS
Features of OOPS
1. classes and objects
2. encapsulation
3. abstraction
4. inheritance
5. polymorphism
Self variable
‘self’ is a default variable that contains the memory address of the instance of the current class. So, we can
use ‘self’ to refer to all the instance variables and instance methods.
Constructor
A constructor is a special method that is used to initialize the instance variables of a class. In the constructor,
we create the instance variables and initialize them with some starting values. The first parameter of the
constructor will be ‘self’ variable that contains the memory address of the instance.
A constructor may or may not have parameters.
Ex:
def __init__(self): # default constructor
self.name = ‘Vishnu’
self.marks = 900
Ex:
def __init__(self, n = ‘’, m=0): # parameterized constructor with 2 parameters
self.name = n
self.marks = m
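Ex (a minimal complete class using the parameterized constructor shown above; the names are illustrative):
class Student:
    def __init__(self, n='', m=0): # parameterized constructor
        self.name = n # instance variables
        self.marks = m
    def display(self): # instance method
        print(self.name, self.marks)
s1 = Student('Vishnu', 900) # the constructor is called automatically
s1.display() # Vishnu 900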
Types of variables
The variables which are written inside a class are of 2 types:
Instance variables
Class variables or Static variables
Instance variables are the variables whose separate copy is created in every instance (or object). Instance
variables are defined and initialized using a constructor with ‘self’ parameter. Also, to access instance
variables, we need instance methods with ‘self’ as first parameter. Instance variables can be accessed as:
obj.var
Unlike instance variables, class variables are the variables whose single copy is available to all the instances
of the class. If we modify the copy of class variable in an instance, it will modify all the copies in the other
instances. A class method contains first parameter by default as ‘cls’ with which we can access the class
variables. For example, to refer to the class variable ‘x’, we can use ‘cls.x’.
NOTE: class variables are also called ‘static variables’. class methods are marked with the decorator
@classmethod .
NOTE: instance variables can be accessed as: obj.var or classname.var
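Ex (an illustrative sketch showing one shared class variable and separate instance variables):
class Sample:
    x = 10 # class (static) variable
    def __init__(self, y):
        self.y = y # instance variable
    @classmethod
    def modify_x(cls, value): # class method receives 'cls'
        cls.x = value
s1 = Sample(1)
s2 = Sample(2)
Sample.modify_x(99) # changes the single shared copy
print(s1.x, s2.x) # 99 99
print(s1.y, s2.y) # 1 2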
Namespaces
A namespace represents a memory block where names are mapped (or linked) to objects. A class maintains
its own namespace, called ‘class namespace’. In the class namespace, the names are mapped to class
variables. Similarly, every instance (object) will have its own name space, called ‘instance namespace’. In
the instance namespace, the names are mapped to instance variables.
When we modify a class variable in the class namespace, its modified value is available to all instances.
When we modify a class variable in the instance namespace, then it is confined to only that instance. Its
modified value will not be available to other instances.
Types of methods
By this time, we got some knowledge about the methods written in a class. The purpose of a method is to
process the variables provided in the class or in the method. We already know that the variables declared in
the class are called class variables (or static variables) and the variables declared in the constructor are called
instance variables. We can classify the methods in the following 3 types:
Instance methods
(a) Accessor methods
(b) Mutator methods
Class methods
Static methods
An instance method acts on instance variables. There are two types of instance methods:
1. Accessor methods: They read the instance variables but do not modify them. They are also called getter
methods.
2. Mutator methods: They not only read but also modify the instance variables. They are also called setter
methods.
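Ex (an illustrative sketch of getter and setter methods; the class and attribute names are assumptions):
class Student:
    def __init__(self):
        self.marks = 0
    def set_marks(self, marks): # mutator (setter) method
        self.marks = marks
    def get_marks(self): # accessor (getter) method
        return self.marks
s = Student()
s.set_marks(850)
print(s.get_marks()) # 850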
PROGRAMS
4. Create getter and setter methods for a Manager with name and salary instance variables.
Static methods
Static methods are used when some processing is related to the class as a whole but does not need any particular
instance to perform the work. For example, setting environment variables, counting the number of instances of the
class, or changing an attribute in another class are tasks related to the class itself. Such tasks are handled by static
methods. Static methods are written with the decorator @staticmethod above them and are called in the form
classname.method().
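Ex (an illustrative sketch that counts the number of instances using a static method):
class Myclass:
    n = 0 # class variable to count instances
    def __init__(self):
        Myclass.n = Myclass.n + 1
    @staticmethod
    def noObjects(): # static method: no 'self' or 'cls' parameter
        print('Number of instances created:', Myclass.n)
obj1 = Myclass()
obj2 = Myclass()
Myclass.noObjects() # Number of instances created: 2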
Inner classes
Writing a class within another class is called inner class or nested class. For example, if we write class B
inside class A, then B is called inner class or nested class. Inner classes are useful when we want to sub
group the data of a class.
Encapsulation
Bundling up of data and methods as a single unit is called ‘encapsulation’. A class is an example for
encapsulation.
Abstraction
Hiding unnecessary data from the user is called ‘abstraction’. By default all the members of a class are
‘public’ in Python. So they are available outside the class. To make a variable private, we use double
underscore before the variable. Then it cannot be accessed from outside of the class. To access it from
outside the class, we should use: obj._Classname__var. This is called name mangling.
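Ex (an illustrative sketch of a private variable and name mangling):
class Account:
    def __init__(self):
        self.__balance = 1000 # private variable (double underscore)
acc = Account()
# print(acc.__balance) # AttributeError: cannot be accessed directly
print(acc._Account__balance) # 1000 - accessed through name mangling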
Inheritance
Creating new classes from existing classes in such a way that all the features of the existing classes are
available to the newly created classes – is called ‘inheritance’. The existing class is called ‘base class’ or
‘super class’. The newly created class is called ‘sub class’ or ‘derived class’.
Sub class object contains a copy of the super class object. The advantage of inheritance is ‘reusability’ of
code, which reduces development effort and time.
Syntax: class Subclass(Baseclass):
Constructors in inheritance
In the previous programs, we have inherited the Student class from the Teacher class. All the methods and
the variables in those methods of the Teacher class (base class) are accessible to the Student class (sub
class). The constructors of the base class are also accessible to the sub class.
When the programmer writes a constructor in the sub class, then only the sub class constructor will get
executed. In this case, super class constructor is not executed. That means, the sub class constructor is
replacing the super class constructor. This is called constructor overriding.
super() method
super() is a built-in method which is useful to call the super class constructor or methods from the sub class.
super().__init__() # call super class constructor
super().__init__(arguments) # call super class constructor and pass arguments
super().method() # call super class method
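Ex (a minimal sketch of the Teacher and Student classes referred to above; the attributes are illustrative):
class Teacher:
    def __init__(self, name):
        self.name = name
    def display(self):
        print('Name:', self.name)
class Student(Teacher): # Student is derived from Teacher
    def __init__(self, name, marks):
        super().__init__(name) # call the super class constructor
        self.marks = marks
    def display(self):
        super().display() # call the super class method
        print('Marks:', self.marks)
s = Student('Vishnu', 900)
s.display() # Name: Vishnu, then Marks: 900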
Types of inheritance
There are two types:
1. Single inheritance: deriving sub class from a single super class.
Syntax: class Subclass(Baseclass):
2. Multiple inheritance: deriving sub class from more than one super class.
Syntax: class Subclass(Baseclass1, Baseclass2, … ):
NOTE: ‘object’ is the super class for all classes in Python.
Polymorphism
poly + morphos = many + forms
If something exists in various forms, it is called ‘Polymorphism’. If an operator or method performs various
tasks, it is called polymorphism.
Ex:
Duck typing: Calling a method on any object without knowing the type (class) of the object.
Operator overloading: same operator performing more than one task.
Method overloading: same method performing more than one task.
Method overriding: executing only sub class method in the place of super class method.
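Ex (an illustrative sketch of operator overloading; the class name is an assumption):
class Book:
    def __init__(self, pages):
        self.pages = pages
    def __add__(self, other): # overloading the + operator
        return Book(self.pages + other.pages)
b1 = Book(100)
b2 = Book(150)
b3 = b1 + b2 # internally calls b1.__add__(b2)
print(b3.pages) # 250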
ABSTRACT CLASSES AND INTERFACES
An abstract method is a method whose action is redefined in the sub classes as per the requirement of the
objects. Generally abstract methods are written without body since their body will be defined in the sub
classes
anyhow. But it is possible to write an abstract method with body also. To mark a method as abstract, we
should use the decorator @abstractmethod. On the other hand, a concrete method is a method with body.
An abstract class is a class that generally contains some abstract methods. PVM cannot create objects to an
abstract class.
Once an abstract class is written, we should create sub classes and all the abstract methods should be
implemented (body should be written) in the sub classes. Then, it is possible to create objects to the sub
classes.
A meta class is a class that defines the behavior of other classes. Any abstract class should be derived from
the meta class ABC that belongs to ‘abc’ module. So import this module, as:
from abc import ABC, abstractmethod
(or) from abc import *
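Ex (a minimal sketch of an abstract class and its sub class; the names are illustrative):
from abc import ABC, abstractmethod
class Shape(ABC): # abstract class derived from ABC
    @abstractmethod
    def area(self): # abstract method - no body
        pass
class Circle(Shape): # sub class implements the abstract method
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.14159 * self.r * self.r
c = Circle(2) # objects can be created only for the sub class
print(c.area()) # 12.56636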
Interfaces in Python
We learned that an abstract class is a class which contains some abstract methods as well as concrete
methods also. Imagine there is a class that contains only abstract methods and there are no concrete methods.
It becomes an interface. This means an interface is an abstract class but it contains only abstract methods.
None of the methods in the interface will have body. Only method headers will be written in the interface.
So an interface can be defined as a specification of method headers. Since, we write only abstract methods in
the interface, there is possibility for providing different implementations (body) for those abstract methods
depending on the requirements of objects. In Python, we have to use abstract classes as interfaces.
Since an interface contains methods without body, it is not possible to create objects to an interface. In this
case, we can create sub classes where we can implement all the methods of the interface. Since the sub
classes will have all the methods with body, it is possible to create objects to the sub classes. The flexibility
lies in the fact that every sub class can provide its own implementation for the abstract methods of the
interface.
EXCEPTIONS
An exception is a runtime error which can be handled by the programmer. That means if the programmer can
guess an error in the program and he can do something to eliminate the harm caused by that error, then it is
called an ‘exception’. If the programmer cannot do anything in case of an error, then it is called an ‘error’
and not an exception.
All exceptions are represented as classes in Python. The exceptions which are already available in Python
are called ‘built-in’ exceptions. The base class for all built-in exceptions is ‘BaseException’ class. From
BaseException class, the sub class ‘Exception’ is derived. In Python 3, almost all built-in errors are derived as sub
classes of the ‘Exception’ class, and all warning categories are derived from the ‘Warning’ class (which is itself a
sub class of Exception). An error should be compulsorily handled, otherwise the program will not execute. A
warning represents a caution and even though it is not handled, the program will execute. So, warnings can
be neglected but errors cannot be neglected.
Just like the exceptions which are already available in Python language, a programmer can also create his
own exceptions, called ‘user-defined’ exceptions. When the programmer wants to create his own exception
class, he should derive his class from the ‘Exception’ class and not from the ‘BaseException’ class.
Exception handling
The purpose of handling the errors is to make the program robust. The word ‘robust’ means ‘strong’. A
robust program does not terminate in the middle. Also, when there is an error in the program, it will display
an appropriate message to the user and continue execution. Designing the programs in this way is needed in
any software development. To handle exceptions, the programmer should perform the following 3 tasks:
Step 1: The programmer should observe the statements in his program where there may be a possibility of
exceptions. Such statements should be written inside a ‘try’ block. A try block looks like as follows:
try:
statements
The greatness of try block is that even if some exception arises inside it, the program will not be terminated.
When PVM understands that there is an exception, it jumps into an ‘except’ block.
Step 2: The programmer should write the ‘except’ block where he should display the exception details to the
user. This helps the user to understand that there is some error in the program. The programmer should also
display a message regarding what can be done to avoid this error. Except block looks like as follows:
except exceptionname:
statements # these statements form handler
The statements written inside an except block are called ‘handlers’ since they handle the situation when the
exception occurs.
Step 3: Lastly, the programmer should perform clean up actions like closing the files and terminating any
other processes which are running. The programmer should write this code in the finally block. Finally block
looks like as follows:
finally:
statements
The specialty of finally block is that the statements inside the finally block are executed irrespective of
whether there is an exception or not. This ensures that all the opened files are properly closed and all the
running processes are properly terminated. So, the data in the files will not be corrupted and the user is at the
safe-side.
However, the complete exception handling syntax will be in the following format:
try:
statements
except Exception1:
handler1
except Exception2:
handler2
else:
statements
finally:
statements
‘try’ block contains the statements where there may be one or more exceptions. The subsequent ‘except’
blocks handle these exceptions. When ‘Exception1’ occurs, ‘handler1’ statements are executed. When
‘Exception2’ occurs, ‘handler2’ statements are executed, and so forth. If no exception is raised, the statements
inside the ‘else’ block are executed. Whether an exception occurs or not, the code inside the ‘finally’ block is
always executed.
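Ex (an illustrative sketch of the complete format):
try:
    a = int(input('Enter a number: '))
    b = int(input('Enter another number: '))
    c = a / b # may raise ZeroDivisionError
except ZeroDivisionError:
    print('Division by zero is not allowed') # handler
except ValueError:
    print('Please enter only numbers') # handler
else:
    print('The result is', c) # runs only when no exception occurs
finally:
    print('End of the program') # always executed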
FILES IN PYTHON
A file represents storage of data. A file stores data permanently so that it is available to all the programs.
Types of files in Python
In Python, there are 2 types of files. They are:
Text files
Binary files
Text files store the data in the form of characters. For example, if we store employee name “Ganesh”, it will
be stored as 6 characters and the employee salary 8900.75 is stored as 7 characters. Normally, text files are
used to store characters or strings.
Binary files store entire data in the form of bytes, i.e. a group of 8 bits each. For example, a character is
stored as a byte and an integer is stored in the form of 8 bytes (on a 64 bit machine). When the data is
retrieved from the binary file, the programmer can retrieve the data as bytes. Binary files can be used to store
text, images, audio and video.
Opening a file
We should use open() function to open a file. This function accepts ‘filename’ and ‘open mode’ in which to
open the file.
filehandler = open(“file name”, “open mode”, buffering)
Ex: f = open(“myfile.txt”, “w”)
Here, the ‘file name’ represents a name on which the data is stored. We can use any name to reflect the
actual data. For example, we can use ‘empdata’ as file name to represent the employee data. The file ‘open
mode’ represents the purpose of opening the file. The following table specifies the file open modes and their
meanings.
File open mode    Description
w      To write data into a file. If any data is already present in the file, it is deleted and the new data is stored.
r      To read data from a file. The file pointer is positioned at the beginning of the file.
a      To append data to a file. Appending means adding at the end of the existing data. The file pointer is placed at the end of the file. If the file does not exist, a new file is created for writing data.
w+     To write and read data of a file. The previous data in the file will be deleted.
r+     To read and write data into a file. The previous data in the file will not be deleted. The file pointer is placed at the beginning of the file.
a+     To append and read data of a file. The file pointer will be at the end of the file if the file exists. If the file does not exist, a new file is created for reading and writing.
x      To open the file in exclusive creation mode. The file creation fails if the file already exists.
The above Table represents file open modes for text files. If we attach ‘b’ for them, they represent modes for
binary files. For example, wb, rb, ab, w+b, r+b, a+b are the modes for binary files.
A buffer represents a temporary block of memory. ‘buffering’ is an optional integer used to set the size of
the buffer for the file. If we do not mention any buffering integer, then the default buffer size used is 4096 or
8192 bytes.
Closing a file
A file which is opened should be closed using close() method as:
f.close()
Files with characters
To write a group of characters (string), we use: f.write(str)
To read a group of characters (string), we use: str = f.read()
PROGRAMS
25. Create a file and store a group of chars.
26. Read the chars from the file.
Files with strings
To write a group of strings into a file, we need a loop that repeats: f.write(str+”\n”)
To read all strings from a file, we can use: str = f.read()
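Ex (an illustrative sketch; the file name and the strings are assumptions):
# writing a group of strings into the file
f = open('names.txt', 'w')
for name in ['Ganesh', 'Anil', 'Gaurav']:
    f.write(name + '\n')
f.close()
# reading all the strings back from the file
f = open('names.txt', 'r')
str = f.read()
print(str)
f.close()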
Knowing whether a file exists or not
The operating system (os) module has a sub module by the name ‘path’ that contains a method isfile(). This
method can be used to know whether a file that we are opening really exists or not. For example,
os.path.isfile(fname) gives True if the file exists otherwise False. We can use it as:
import os, sys
if os.path.isfile(fname): # if file exists,
    f = open(fname, 'r') # open it
else:
    print(fname + ' does not exist')
    sys.exit() # terminate the program
with statement
‘with’ statement can be used while opening a file. The advantage of with statement is that it will take care of
closing a file which is opened by it. Hence, we need not close the file explicitly. In case of an exception also,
‘with’ statement will close the file before the exception is handled. The format of using ‘with’ is:
with open(“filename”, “openmode”) as fileobject:
Ex: writing into a file
# with statement to open a file
with open('sample.txt', 'w') as f:
f.write('I am a learner\n')
f.write('Python is attractive\n')
Ex: reading from a file
# using with statement to open a file
with open('sample.txt', 'r') as f:
for line in f:
print(line)
Data Science
To work with data science, we need the following packages to be installed:
C:\> pip install pandas
C:\> pip install xlrd //to extract data from Excel sheets
C:\> pip install matplotlib
Data plays an important role in our lives. For example, a chain of hospitals contains data related to the medical
reports and prescriptions of its patients. A bank holds the transaction details of thousands of customers. Share
market data captures minute-to-minute changes in share values. In this way, the entire world revolves around huge
amounts of data.
Every piece of data is precious as it may affect the business organization which is using that data. So, we
need some mechanism to store that data. Moreover, data may come from various sources. For example in a
business organization, we may get data from Sales department, Purchase department, Production department,
etc. Such data is stored in a system called ‘data warehouse’. We can imagine data warehouse as a central
repository of integrated data from different sources.
Once the data is stored, we should be able to retrieve it based on specific requirements. A business company
may want to know how much it spent in the last 6 months on purchasing raw material, or how many items were
found defective in its production unit. Such data cannot be easily retrieved from the huge
data available in the data warehouse. We have to retrieve the data as per the needs of the business
organization. This is called data analysis or data analytics where the data that is retrieved will be analyzed to
answer the questions raised by the management of the organization. A person who does data analysis is
called ‘data analyst’.
Once the data is analyzed, it is the duty of the IT professional to present the results in the form of pictures or
graphs so that the management will be able to understand it easily. Such graphs will also help them to
forecast the future of their company. This is called data visualization. The primary goal of data visualization
is to communicate information clearly and efficiently using statistical graphs, plots and diagrams.
Data science is a term used for techniques to extract information from the data warehouse, analyze them and
present the necessary data to the business organization in order to arrive at important conclusions and
decisions. A person who is involved in this work is called ‘data scientist’. We can find important differences
between the roles of data scientist and data analyst in following table:
Data Scientist | Data Analyst
Formulates the questions that will help a business organization and then proceeds to solve them. | Receives questions from the business team and provides answers to them.
Has strong data visualization skills and the ability to convert data into a business story. | Simply analyzes the data and provides the information requested by the team.
Needs perfection in mathematics, statistics and programming languages like Python and R. | Needs perfection in data warehousing, big data concepts, SQL and business intelligence.
Estimates the unknown information from the known data. | Looks at the known data from a new perspective.
Please see the following sample data in the excel file: empdata.xlsx.
CREATING DATA FRAMES
is possible from csv files, excel files, python dictionaries, tuples list, json data etc.
Creating data frame from .csv file
>>> import pandas as pd
>>> df = pd.read_csv("f:\\python\\PANDAS\\empdata.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
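A data frame can be created from the Excel file in the same way (a sketch, assuming empdata.xlsx has the same columns; the path shown is only illustrative):
>>> df = pd.read_excel("f:\\python\\PANDAS\\empdata.xlsx")
>>> df.head()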
Data Wrangling
Data wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw
data into a format that is suitable for analysis. Python is a popular language for data wrangling due to its
powerful libraries and tools. Below is an overview of the key steps and libraries used in data wrangling with
Python:
Key Steps in Data Wrangling
1. Data Collection: Gather data from various sources (e.g., CSV files, databases, APIs, web scraping).
2. Data Cleaning: Handle missing values, remove duplicates, correct inconsistencies, and fix errors.
3. Data Transformation: Reshape, aggregate, or filter data to make it suitable for analysis.
4. Data Integration: Combine data from multiple sources.
5. Data Validation: Ensure data quality and consistency.
6. Data Export: Save the cleaned and transformed data into a usable format (e.g., CSV, Excel,
database).
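A minimal sketch that walks through a few of these steps with pandas (the file and column names are illustrative, not taken from the text above):
import pandas as pd
df = pd.read_csv('raw_data.csv') # 1. collect the data
df = df.drop_duplicates() # 2. remove duplicate rows
df = df.dropna(subset=['sal']) # 2. drop rows where 'sal' is missing
df['ename'] = df['ename'].str.strip() # 2. fix inconsistent spacing in names
df.to_csv('clean_data.csv', index=False) # 6. export the cleaned data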
Python Libraries for Data Wrangling
1. Pandas: The most widely used library for data manipulation and analysis.
o Key features: DataFrames, handling missing data, merging datasets, reshaping data.
o Example: import pandas as pd
2. NumPy: Used for numerical computations and handling arrays.
o Example: import numpy as np
3. OpenPyXL: For working with Excel files.
o Example: from openpyxl import Workbook
4. SQLAlchemy: For interacting with databases.
o Example: from sqlalchemy import create_engine
5. BeautifulSoup and Requests: For web scraping and collecting data from websites.
o Example: from bs4 import BeautifulSoup, import requests
6. PySpark: For handling large-scale data wrangling tasks in distributed environments.
Data Transformation
# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Concatenate DataFrames
df_concat = pd.concat([df1, df2], axis=0)
Exporting Data
# Save the cleaned data frame to a CSV file
df.to_csv('cleaned_data.csv', index=False)
Visualizing Data
DATA VISUALIZATION USING MATPLOTLIB
Complete reference is available at:
https://matplotlib.org/api/pyplot_summary.html
CREATE DATAFRAME FROM DICTIONARY
>>> empdata = {"empid": [1001, 1002, 1003, 1004, 1005, 1006],
"ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta", "Hema Chandra", "Laxmi Prasanna", "Anant
Nag"],
"sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
"doj": ["10-10-2000", "3-20-2002", "3-3-2002", "9-10-2000", "10-8-2000", "9-9-1999"]}
>>> import pandas as pd
>>> df = pd.DataFrame(empdata)
TAKE ONLY THE COLUMNS TO PLOT
>>> x = df['empid']
>>> y = df['sal']
DRAW THE BAR GRAPH
Bar chart shows data in the form of bars. It is useful for comparing values.
>>> import matplotlib.pyplot as plt
>>> plt.bar(x,y)
<Container object of 6 artists>
>>> plt.xlabel('employee id nos')
Text(0.5,0,'employee id nos')
>>> plt.ylabel('employee salaries')
Text(0,0.5,'employee salaries')
>>> plt.title('XYZ COMPANY')
Text(0.5,1,'XYZ COMPANY')
>>> plt.legend()
>>> plt.show()
Feature Engineering
Feature engineering is the process of creating new features or transforming existing ones to better represent
the underlying problem and improve model performance.
Common Techniques for Feature Engineering
Handling Missing Values:
o Fill with mean, median, or mode.
o Use advanced techniques like KNN imputation.
df['column'].fillna(df['column'].mean(), inplace=True)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])
# Min-Max Scaling
minmax_scaler = MinMaxScaler()
df['scaled_column'] = minmax_scaler.fit_transform(df[['column']])
Binning:
Convert continuous variables into discrete bins.
df['binned_column'] = pd.cut(df['continuous_column'], bins=5)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['text_column'])
Feature Selection
Feature selection involves identifying and selecting the most relevant features to improve model
performance and reduce overfitting.
Common Techniques for Feature Selection
1. Filter Methods:
o Use statistical measures to select features.
o Examples: Correlation, Chi-Square, Mutual Information.
# Correlation-based feature selection
correlation_matrix = df.corr()
relevant_features = correlation_matrix['target'].abs().sort_values(ascending=False)
Wrapper Methods:
Use a machine learning model to evaluate feature subsets.
Examples: Recursive Feature Elimination (RFE).
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)
selected_features = X.columns[rfe.support_]
Embedded Methods:
Features are selected as part of the model training process.
Examples: Lasso Regression, Decision Trees.
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]
Dimensionality Reduction:
Reduce the number of features while preserving information.
Examples: PCA, t-SNE.
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
Feature Engineering and Selection Workflow
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
# Load data
df = pd.read_csv('data.csv')
# Feature Engineering
# Handle missing values
df['column'].fillna(df['column'].mean(), inplace=True)
# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])
# Scaling
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])
# Feature Selection
X = df.drop('target', axis=1)
y = df['target']
Feature Engineering
Feature engineering involves creating new features or transforming existing ones to improve model
performance.
Common Techniques for Feature Engineering
Handling Missing Values:
o Fill missing values with mean, median, mode, or use advanced techniques like KNN
imputation.
Encoding Categorical Variables:
One-Hot Encoding: Convert categorical variables into binary columns.
Label Encoding: Convert categories into numerical labels.
Scaling and Normalization:
Standardization: Scale features to have zero mean and unit variance.
Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1).
Creating Interaction Features:
Combine two or more features to create new ones.
Binning:
Convert continuous variables into discrete bins.
Date/Time Feature Extraction:
Extract useful information from date/time columns (e.g., day, month, year).
Polynomial Features:
Create polynomial combinations of features.
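The workflow example below covers missing values, encoding and scaling; the following sketch illustrates the date/time extraction and polynomial feature techniques from the list above (column names such as 'doj' and 'sal' are assumptions):
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# Date/Time feature extraction
df['doj'] = pd.to_datetime(df['doj'])
df['join_year'] = df['doj'].dt.year
df['join_month'] = df['doj'].dt.month
# Polynomial features of degree 2 from two numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['sal', 'join_year']])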
Example: Feature Extraction and Engineering Workflow
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Load data
df = pd.read_csv('data.csv')
# Feature Engineering
# Handle missing values
df['column'].fillna(df['column'].mean(), inplace=True)
# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])
# Scaling
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])
# Feature Extraction
X = df.drop('target', axis=1)
y = df['target']
Feature Engineering on Numeric Data, Categorical Data, Text Data, & Image Data
Feature engineering is the process of transforming raw data into meaningful features that improve the
performance of machine learning models. The techniques used depend on the type of data (numeric,
categorical, text, or image). Below is a detailed guide on feature engineering for each type of data:
Feature Scaling and Feature Selection are two important preprocessing techniques in machine learning and
data analysis. They play a crucial role in improving model performance, reducing computational complexity,
and ensuring better interpretability of the data.
Feature Scaling
Feature scaling is the process of normalizing or standardizing the range of independent variables (features)
in the dataset. This is particularly important for algorithms that are sensitive to the magnitude of the data,
such as distance-based algorithms or gradient descent-based optimization.
Why is Feature Scaling Important?
Ensures that all features contribute equally to the model.
Improves convergence speed for optimization algorithms (e.g., gradient descent).
Prevents features with larger magnitudes from dominating those with smaller magnitudes.
Common Techniques for Feature Scaling
1. Normalization (Min-Max Scaling):
o Scales features to a fixed range, usually [0, 1].
o Formula: X_scaled = (X − X_min) / (X_max − X_min)
o Suitable for algorithms like neural networks and k-nearest neighbors (KNN).
2. Standardization (Z-score Normalization):
o Scales features to have a mean of 0 and a standard deviation of 1.
o Formula: X_scaled = (X − μ) / σ, where μ is the mean and σ is the standard deviation.
o Suitable for algorithms like linear regression, logistic regression, and support vector machines
(SVM).
3. Robust Scaling:
o Uses the median and interquartile range (IQR) to scale features, making it robust to outliers.
o Formula: X_scaled = (X − median) / IQR
4. Max Abs Scaling:
o Scales each feature by its maximum absolute value.
o Suitable for sparse data.
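A minimal sketch of these four scaling techniques with scikit-learn (the data frame and column name are illustrative):
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler
df['minmax_scaled'] = MinMaxScaler().fit_transform(df[['column']]) # normalization to [0, 1]
df['standard_scaled'] = StandardScaler().fit_transform(df[['column']]) # z-score standardization
df['robust_scaled'] = RobustScaler().fit_transform(df[['column']]) # median and IQR based
df['maxabs_scaled'] = MaxAbsScaler().fit_transform(df[['column']]) # divide by maximum absolute value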
Feature Selection
Feature selection is the process of selecting a subset of relevant features (variables) to use in model
construction. It helps reduce overfitting, improve model interpretability, and decrease computational costs.
Why is Feature Selection Important?
Reduces the dimensionality of the dataset, which can improve model performance.
Removes irrelevant or redundant features, reducing noise.
Speeds up training and inference times.
Common Techniques for Feature Selection
1. Filter Methods:
o Select features based on statistical measures (e.g., correlation, mutual information, chi-square).
o Examples:
Correlation coefficient for linear relationships.
Mutual information for non-linear relationships.
Chi-square test for categorical features.
2. Wrapper Methods:
o Use a machine learning model to evaluate the performance of subsets of features.
o Examples:
Forward Selection: Start with no features and add one at a time.
Backward Elimination: Start with all features and remove one at a time.
Recursive Feature Elimination (RFE): Recursively removes the least important
features.
3. Embedded Methods:
o Perform feature selection as part of the model training process.
o Examples:
Lasso (L1 regularization): Penalizes less important features by shrinking their
coefficients to zero.
Ridge (L2 regularization): Reduces the impact of less important features but does not
eliminate them.
Tree-based methods: Feature importance scores from decision trees, random forests,
or gradient boosting.
4. Dimensionality Reduction:
o Transform features into a lower-dimensional space.
o Examples:
Principal Component Analysis (PCA): Reduces dimensions while preserving variance.
Linear Discriminant Analysis (LDA): Reduces dimensions while preserving class
separability.
t-SNE and UMAP: Non-linear dimensionality reduction for visualization.
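A short filter-method sketch using SelectKBest from scikit-learn (X and y are assumed to be the feature matrix and target used in the earlier examples):
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5) # keep the 5 highest scoring features
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]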
Unit-3
Building Machine Learning Models:
In today’s rapidly evolving business landscape, data has become one of the most valuable
assets. Machine learning (ML), a subset of artificial intelligence (AI), is revolutionizing how
businesses derive insights, optimize processes, and enhance decision-making. Understanding
how to build and apply machine learning models is not only a technical skill but also a
strategic advantage.
Machine learning is the process of developing algorithms that enable computers to learn from
data and improve their performance over time without being explicitly programmed. Unlike
traditional programming, where rules are hard-coded, ML models identify patterns and
relationships in data, enabling them to make predictions or decisions.
In a business context, machine learning applications range from predictive analytics and
customer segmentation to supply chain optimization and fraud detection. For Management
students, understanding ML is vital for leveraging data-driven strategies to create value.
Steps to Build Machine Learning Models
Building an ML model involves a structured process. Here’s an overview of the key steps:
3. Finance
Credit scoring: Evaluating creditworthiness of borrowers.
Fraud detection: Identifying unusual patterns in transactions.
Portfolio management: Automating investment strategies.
4. Human Resources
Talent acquisition: Screening candidates using resume parsing and scoring algorithms.
Employee retention: Predicting attrition risks and devising retention strategies.
Performance evaluation: Analysing employee productivity metrics.
Classification Metrics:
o Accuracy: Measures the percentage of correctly predicted instances but may
be misleading for imbalanced datasets.
o Precision and Recall: Precision focuses on the correctness of positive
predictions, while recall measures the model’s ability to capture all actual
positives. These are often combined into the F1-score for a balanced
assessment.
o ROC-AUC: Evaluates the trade-off between true positive and false positive
rates across different thresholds.
Regression Metrics:
o Mean Absolute Error (MAE): Represents the average absolute difference
between predicted and actual values.
o Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): Highlight
larger errors more significantly, providing insights into model performance.
o R-squared: Indicates the proportion of variance in the target variable explained
by the model.
Clustering Metrics:
o Silhouette Score: Measures the quality of clustering based on intra-cluster and
inter-cluster distances.
o Dunn Index: Evaluates cluster compactness and separation.
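A minimal sketch of computing several of these metrics with scikit-learn (y_test and y_pred are assumed to be the actual and predicted class labels of a binary classifier, y_prob the predicted probabilities, and y_true_reg and y_pred_reg the actual and predicted values of a regression model):
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_prob)) # needs predicted probabilities, not labels
print(mean_absolute_error(y_true_reg, y_pred_reg)) # MAE
print(mean_squared_error(y_true_reg, y_pred_reg)) # MSE
print(r2_score(y_true_reg, y_pred_reg)) # R-squared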
3. Cross-Validation
Cross-validation ensures that the model’s performance is consistent across different subsets
of the data. Techniques like k-fold cross-validation divide the dataset into k parts, training the
model on k-1 parts and testing it on the remaining part. This approach minimizes overfitting
and provides a more robust performance estimate.
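A minimal k-fold cross-validation sketch with scikit-learn (the model choice and the data X, y are illustrative):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validation
print(scores.mean(), scores.std()) # average performance and its variability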
4. Bias-Variance Trade-off
Understanding the bias-variance trade-off is vital for model evaluation. High bias indicates
underfitting, where the model is too simple to capture the complexity of the data. High
variance suggests overfitting, where the model performs well on training data but poorly on
unseen data. Balancing these factors is key to building reliable models.
5. Explainability and Interpretability
In a business setting, it is crucial to understand how a model makes decisions. Techniques
like feature importance, SHAP (Shapley Additive Explanations), and LIME (Local
Interpretable Model-agnostic Explanations) help interpret model outputs. Explainability
builds trust with stakeholders and ensures compliance with ethical and regulatory standards.
6. Ethical Considerations
Management students must recognize the ethical implications of deploying machine learning
models. Models can perpetuate biases present in training data, leading to unfair or
discriminatory outcomes. Regular audits and fairness metrics, such as demographic parity or
equalized odds, help ensure ethical deployment.
Business Context in Model Evaluation
Evaluating machine learning models in a business context goes beyond technical metrics.
Management professionals should consider the following factors:
Alignment with Business Goals: The chosen model should directly support
organizational objectives, such as improving customer retention, optimizing supply
chains, or increasing profitability.
Cost-Benefit Analysis: Evaluate the financial implications of deploying the model.
For instance, in fraud detection, the cost of false negatives (missed fraud cases) may
outweigh the cost of false positives.
Scalability and Deployment: Ensure that the model can handle increasing data
volumes and integrate seamlessly with existing systems.
Stakeholder Buy-In: Clear communication of the model’s benefits and limitations to
non-technical stakeholders is essential for successful implementation.
1. Grid Search:
o A systematic approach where all possible combinations of hyperparameters
are tested.
o Example: For a decision tree, hyperparameters like maximum depth and
minimum samples per leaf can be varied systematically.
2. Random Search:
o Instead of testing all combinations, a random subset of hyperparameter values
is selected.
o This method is often faster and can yield comparable results to grid search.
3. Bayesian Optimization:
o Uses probabilistic models to predict the best hyperparameters.
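For example, grid search and random search over a decision tree's hyper-parameters can be sketched with scikit-learn as below; the parameter grid and dataset are illustrative placeholders.

# Minimal sketch: grid search and random search over decision-tree
# hyper-parameters with scikit-learn.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}

# Grid search: tries every combination in param_grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("Grid search best params  :", grid.best_params_)

# Random search: samples a subset of combinations (often much faster)
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                          n_iter=6, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)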
3. Ethical Considerations:
o Bias in data can be amplified during tuning. Business leaders must ensure that
tuning processes do not inadvertently propagate discrimination.
4. Scalability:
o Models tuned on small datasets may not perform well when scaled. Businesses
should simulate real-world scenarios to validate tuning outcomes.
Real-World Examples:
Netflix: Netflix’s recommendation engine uses tuned ML models to suggest content,
directly driving user engagement and retention.
Amazon: Tuning ML models for demand forecasting allows Amazon to optimize its
supply chain, reducing costs and improving delivery times.
Healthcare: Hospitals use tuned ML models for patient outcome predictions,
enabling better resource planning and treatment strategies.
1. Feature Importance:
o Identifies which features influence the model's predictions the most.
3. Overfitting:
o Misinterpreted models may show patterns that are only relevant to training
data and not real-world scenarios.
B. Deployment Process
1. Selecting a Deployment Strategy:
o Batch Processing: Predictions are made on a schedule (e.g., generating daily
sales forecasts).
o Real-Time Processing: Predictions occur instantly (e.g., fraud detection in
online payments).
2. Choosing an Infrastructure:
o Cloud Platforms: AWS, Google Cloud Platform (GCP), and Microsoft Azure
provide scalable deployment environments.
o On-Premises: Useful for businesses with specific security or compliance
needs.
3. Packaging the Model: Convert the model into a deployable format using tools like
Docker for containerization.
4. API Development: Expose the model as a REST API or gRPC service so other
applications can interact with it.
5. Monitoring and Maintenance: Track model performance in production and retrain it
when necessary to prevent degradation.
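As a rough sketch of step 4 above, a saved model can be exposed as a REST API with Flask. The file name "model.pkl", the /predict route, and the JSON layout are assumptions made purely for illustration.

# Minimal sketch: exposing a trained model as a REST API with Flask.
# Assumes a fitted scikit-learn model has already been saved to "model.pkl".
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()              # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)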
o Uncover hidden patterns that may reveal customer behaviour, market trends,
or operational inefficiencies.
3. Assess Assumptions and Hypotheses:
o Validate existing business hypotheses with preliminary insights.
2. Descriptive Statistics:
o Calculate measures like mean, median, mode, variance, and standard deviation
to understand central tendencies and variability.
o Use frequency distributions to assess categorical data.
3. Visualization:
o Use histograms, bar charts, and boxplots to visualize distributions and detect
outliers.
o Create scatterplots and heatmaps to identify correlations or relationships
between variables.
o Employ time series plots for trend analysis in metrics like revenue or customer
acquisition.
4. Segment Analysis:
o Divide the data into meaningful segments, such as customer demographics or
product categories, to identify key differentiators.
o Analyse performance across these segments to find growth opportunities.
5. Correlation Analysis:
o Use correlation matrices to evaluate how variables are related, identifying key
drivers of success (e.g., customer satisfaction linked to sales).
6. Hypothesis Testing:
o Perform statistical tests like t-tests, chi-square tests, or ANOVA to validate
relationships or patterns.
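Many of these EDA steps can be carried out in a few lines of pandas and SciPy. The sketch below uses a tiny made-up DataFrame whose column names ("region", "sales", "ad_spend") are purely illustrative.

# Minimal sketch: descriptive statistics, segment analysis, correlation,
# and a simple hypothesis test with pandas and SciPy (placeholder data).
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "region":   ["North", "North", "South", "South", "East", "East"],
    "sales":    [120, 135, 90, 95, 150, 160],
    "ad_spend": [10, 12, 7, 8, 14, 15],
})

print(df.describe())                          # mean, std, quartiles for numeric columns
print(df.groupby("region")["sales"].mean())   # segment analysis by region
print(df[["sales", "ad_spend"]].corr())       # correlation matrix

# Simple hypothesis test: do the North and East regions differ in mean sales?
north = df.loc[df["region"] == "North", "sales"]
east = df.loc[df["region"] == "East", "sales"]
print(stats.ttest_ind(north, east))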
2. Mitigating Risks:
o Spot potential issues, such as customer churn, through anomaly detection.
3. Strategic Planning:
o Use insights to set measurable goals and benchmarks.
4. Improving Efficiency:
o Streamline operations by identifying bottlenecks or inefficiencies.
5. Driving Innovation:
o Identify patterns that suggest emerging trends, enabling businesses to innovate
ahead of competitors.
2. Bivariate Analysis
Bivariate analysis examines the relationship between two variables, uncovering
correlations and dependencies. Common visualization methods include:
o Scatter Plots: Show relationships between two continuous variables, helping
to identify linear or non-linear correlations.
o Line Graphs: Ideal for exploring trends over time, such as monthly sales
figures.
o Heatmaps: Represent correlations or interactions between variables, often
used in market segmentation studies.
Example: A finance team can use scatter plots to explore the correlation between marketing
spend and revenue growth.
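For instance, a scatter plot and a correlation heatmap can be produced with seaborn as sketched below; seaborn's built-in "tips" dataset is used only as stand-in data.

# Minimal sketch: bivariate visualizations with seaborn/matplotlib.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

# Scatter plot: relationship between two continuous variables
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.title("Tip vs. total bill")
plt.show()

# Heatmap of the correlation matrix for the numeric columns
corr = tips.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()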
3. Multivariate Analysis
Multivariate analysis involves exploring relationships among three or more variables
simultaneously. Visualization methods include:
o Bubble Charts: Add a third dimension to scatter plots using bubble size to
represent another variable.
o Parallel Coordinate Plots: Visualize multi-dimensional data by plotting
variables as parallel axes, highlighting patterns across different dimensions.
o Clustered Heatmaps: Combine clustering algorithms with heatmaps to
segment data into meaningful groups.
Example: A telecom company analysing customer churn can use clustered heatmaps to group
customers based on demographics and usage patterns.
4. Geospatial Visualization
Geospatial visualization is critical for businesses that deal with location-based data.
Common methods include:
o Choropleth Maps: Display data intensity across geographic regions, such as
sales density by state.
o Point Maps: Plot individual data points on a map, useful for tracking store
locations or delivery routes.
o Density Maps: Highlight areas with high concentrations of activity, such as
customer demand in urban regions.
Example: A logistics company can use density maps to optimize delivery routes in high-
demand areas.
5. Time-Series Analysis
Time-series analysis focuses on exploring temporal data to identify trends,
seasonality, and anomalies. Key visualization techniques include:
o Line Charts: Depict trends over time, ideal for tracking performance metrics
like revenue or customer growth.
o Seasonal Decomposition Plots: Break down time-series data into trend,
seasonal, and residual components.
o Area Charts: Emphasize cumulative trends over time, useful for visualizing
market share growth.
Example: An e-commerce platform can analyse seasonal variations in sales using line charts
to plan inventory and marketing strategies.
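As a sketch of these ideas in Python, the snippet below builds a synthetic monthly revenue series (placeholder data) and decomposes it into trend, seasonal, and residual components with statsmodels.

# Minimal sketch: line chart and seasonal decomposition of a monthly series.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly revenue with an upward trend and yearly seasonality
idx = pd.date_range("2021-01-01", periods=36, freq="MS")
revenue = pd.Series(100 + np.arange(36) * 2
                    + 10 * np.sin(np.arange(36) * 2 * np.pi / 12), index=idx)

revenue.plot(title="Monthly revenue")          # simple line chart of the trend
plt.show()

result = seasonal_decompose(revenue, model="additive", period=12)
result.plot()                                   # trend, seasonal, residual components
plt.show()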
6. Interactive Dashboards
Interactive dashboards integrate multiple visualization types into a single interface,
allowing users to explore data dynamically. Features such as drill-downs, filters,
and real-time updates enable deeper analysis.
o Tools like Tableau, Power BI, and Google Data Studio are widely used for
creating interactive dashboards.
o Dashboards allow users to customize views based on specific business
questions, fostering collaborative decision-making.
Example: A sales manager can use an interactive dashboard to monitor regional performance
and adjust strategies accordingly.
1. Learning Outcome:
In regression analysis, we basically attempt to predict the value of one variable from known values of another variable. The variable that is used to estimate the variable of interest is known as the "independent variable" or "explanatory variable", and the variable which we are trying to predict is termed the "dependent variable" or "explained variable". Usually, the dependent variable is denoted by Y and the independent variable by X.
It may be noted here that the terms "dependent" and "independent" refer to the mathematical or functional meaning of dependence; they do not imply that there is necessarily any cause-and-effect relationship between the variables. It simply means that estimates of the values of the dependent variable Y may be obtained for given values of the independent variable X from a mathematical function involving X and Y. In that sense, the values of Y are dependent upon the values of X. The variable X may or may not cause the variation in the variable Y. For instance, while estimating demand for an FMCG product from figures on sales promotion expenditure, demand is generally considered the dependent variable. However, there may or may not be a causal relationship between these two variables, in the sense that changes in sales promotion cause changes in demand. In fact, in a few cases, the cause-and-effect relationship may be just the opposite of what appears to be the obvious one.
This analysis is termed "simple" because there is only one independent variable, and "linear" because it assumes a linear relationship between the dependent and independent variables. This means that the average relationship may be expressed by a straight-line equation, Y = a + bX, which is also called the "regression line". In this regression line, Y is the dependent variable, X is the independent variable, and a and b are constants. It may be viewed in the following figure.
[Figure: the regression line Y = a + bX]
It is easy to understand from the above regression equation that for every unit increase in the independent variable X, the value of the dependent variable Y increases by "b", the slope of the equation. For example, consider:
Y = 2 + 3X
When X = 0; Y = 2
When X = 1; Y = 5
When X = 2; Y = 8
When X = 3; Y = 11
Thus, it is apparent from the above that as the value of X increases by 1, the value of Y increases by 3, which is equal to the slope of the given equation.
Regression analysis is a specialized branch of statistical theory and is of immense use in almost all scientific disciplines. It is particularly useful in economics, where it is predominantly used to estimate the relationships among the various economic variables that shape economic life. Its applications extend to almost all the natural, physical and social sciences. It attempts to accomplish mainly the following:
(1) With the help of the regression line, which signifies an average relationship between two variables, one may predict the unknown value of the dependent variable for given values of the independent variable;
(2) The regression line or prediction line used for estimation never gives one hundred percent correct estimates for the given values of the independent variable. There is always some difference between the actual value and the estimated value of the dependent variable, known as "estimation error". Regression analysis also computes this error, as the standard error of estimation, and thereby reveals the accuracy of prediction. The amount of error depends upon the spread of the scatter diagram, which is prepared by plotting the actual observations of the Y and X variables on a graph. As the figures below illustrate, the estimation error is larger when the actual observations are more spread out and smaller when they are less spread out.
[Figures: two scatter diagrams - more spread of observations, more error of prediction; less spread of observations, less error of prediction]
(3) Regression analysis also depicts the relationship or association between the variables (i.e. r, Pearson's coefficient of correlation). The square of r is called the Coefficient of Determination (r²), which assesses the proportion of variance in the dependent variable that has been accounted for by the regression equation. In general, the greater the value of r², the better the fit and the more useful the regression equation as a predictive instrument. For example, an r² of 0.70 may be interpreted as follows: out of the total variation (100 percent) in Y, 70 percent is explained by X in the suggested regression equation/model.
A statistical model is a set of mathematical formulas and assumptions related to a real-world situation. We wish to develop our simple regression model in such a way that it explains the process underlying our data as far as possible. Since it is almost impossible for any model to explain everything, owing to the inherent uncertainty of the real world, we will always have some remaining errors, which occur due to many unknown outside factors affecting the process generating our data.
A good statistical model is parsimonious: it uses only a few mathematical terms that explain the real situation as far as possible. The model attempts to capture the systematic behaviour of our data set and leaves out the factors that are nonsystematic and cannot be predicted/estimated. The following figure explains a well-defined statistical model.
Data = Statistical model (systematic component) + Random errors
The errors (ε), also termed residuals, are that part of our data-generating process that cannot be estimated by the model because it is not systematic. Hence, the errors (ε) constitute the random component of the model. A good statistical model thus splits our data process into two components: a systematic component, which is well explained by the mathematical terms contained in the model, and a purely random component, whose source is unknown and which therefore cannot be estimated by the model. The effectiveness of a statistical model depends upon the amount of error associated with it: the smaller the error, the more effective the model, and the larger the error, the less effective the model.
It is assumed that the errors are normally distributed with mean zero and constant variance:
ε ~ N(0, σ²)
In the figure below, one may observe that all errors are identically distributed and centred on the regression line; the mean of the error distribution is zero and its variance is constant.
[Figure: error distributions centred on the regression line, with mean = 0 and variance = constant]
Secondly, the errors/residuals are independent of (uncorrelated with) one another. This can be checked by making a scatter diagram of the errors. If the errors depict any pattern, they are correlated with one another; otherwise, they are independent of each other. In the following figure, the errors are randomly distributed, i.e. they are uncorrelated with each other.
Now, under the above two assumptions, we may attempt to develop a model to illustrate our data or actual situation. We may propose a simple linear regression model for explaining the relationship between two variables and estimate the parameters of the model from our sample data. After fitting the model to our data set, we consider the errors/residuals that are not explained by the model. Having obtained the random component (errors), we analyse it to determine whether it contains any systematic component or not. If it does, then we may re-evaluate our proposed model and, if feasible, adjust it to include the systematic component found in the errors/residuals.
Otherwise, we may have to reject the model and try a new one. Here, it is important to note that the random component (errors/residuals) must not contain any systematic component of our data set; it should be purely random. Only then can we use this model for explaining the relationship between the variables, for prediction, and for controlling a variable.
We may write the proposed simple linear regression model as:
Y = α + β X + ε ............ (i)
In this model, Y is the dependent variable, which we wish to explain/predict for given values of X, the independent variable (predictor). α and β are population model parameters, where α is the population equivalent of the sample intercept "a". Similarly, β is the parameter analogous to "b", the slope of the sample regression line. In our model, ε is the error term that curtails the predictive strength of the proposed model.
A careful look reveals that the above model contains two components: first, a systematic (nonrandom) component, which is the line (Y = α + β X) itself, and secondly, a pure random component, the error term ε. The systematic component is the equation for the mean of Y, given X. We may represent the conditional mean of Y, given X, by E(Y) as below:
E(Y) = α + β X ............ (ii)
By comparing equations (i) and (ii), we can see that each value of Y comprises the average Y for the given value of X (which lies on a straight line), plus a random error. As X increases, the average value of Y also increases, assuming a positive slope of the line (or decreases, if the slope is negative). The actual value of Y is equal to the average Y conditional on X, plus a random error ε. Thus, for a given value of X,
Y = E(Y) + ε = α + β X + ε
This model may be used only if a true straight-line relationship exists between the variables X and Y. If the relationship is not a straight line, then we need to use some other model.
Until now, we have described the population model, which is assumed to describe the true relationship between X and Y. Now we wish to get an idea about this unknown relationship in the population and estimate it from sample information. For this, we take a random sample of observations on the two variables X and Y, and then compute the parameters a and b of the sample regression line, which are analogous to the population parameters α and β. This is done with the help of the method of least squares, discussed in the next section.
In the above section we described a simple linear regression model. Now we wish to compute the parameters of the proposed model so as to estimate the value of Y for given values of X. For our model to be effective, we wish to keep the random component of the model at a minimum. For this, we use the "method of least squares", which fits the line so that the sum of squared errors (the random component) is minimized; one consequence is that the average of the estimated values of Y for the given values of X equals the average of the actual values of Y. Under the standard assumptions, this method yields the best linear unbiased estimators (BLUE) of the parameters.
Now we will use Ŷ to denote the individual estimated values, which lie on the estimation line for a given X. The best-fitted estimation line will be as follows:
Ŷi = a + b Xi
where i = 1, 2, 3, …, n indexes the actual observations. Here, Ŷ1 is the first fitted (estimated) value corresponding to X1, i.e. the value of Y1 without the error ε1, and so on for all i = 1, 2, 3, …, n. If we do not know the actual value of Y, this is the fitted value which we predict from the estimated regression line for a given X. Thus, ε1 is the first residual, the distance from the first observation to the fitted regression line; ε2 is the second, and so on up to εn, the nth error. The residuals εi are taken as estimates of the true population errors.
Thus,
Error (Ʃ εi) = Ʃ (Y – Ŷ)
It is noteworthy that summing the individual differences to compute the total error is not a reliable way to judge the goodness of fit of an estimation line, because positive and negative values cancel each other out. Similarly, adding absolute values does not give a clear impression of the goodness of fit, as the sum of absolute values does not stress the magnitude of the errors. Therefore, we prefer to take the sum of squares of the individual errors (SSE).
The method of least squares minimizes the SSE (the random component). This leads to two linear (normal) equations for computing the two parameters of the model:
Ʃ Y = n·a + b·Ʃ X .......... (1)
Ʃ X·Y = a·Ʃ X + b·Ʃ X² .......... (2)
By solving these two equations, we can find the values of a and b, and this is done in such a way that the SSE between the actual and estimated values of Y is always at a minimum. Solving them gives:
b = (Ʃ XY − n·X̄·Ȳ) / (Ʃ X² − n·X̄²)
a = Ȳ − b·X̄
where
a = Y-intercept
b = slope of the estimation line
Ȳ = mean of the values of Y
X̄ = mean of the values of X
The population parameters α and β are estimated by the values of "a" and "b" respectively, calculated as above.
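The formulas for a and b above can be computed directly in Python; the x and y arrays below are illustrative placeholder data, not figures from this text.

# Minimal sketch: least-squares estimates a and b from the formulas above.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # placeholder independent variable
y = np.array([3, 5, 7, 9, 11], dtype=float)  # placeholder dependent variable
n = len(x)

b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
print(f"Estimation line: Y-hat = {a:.2f} + {b:.2f} X")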
Here, we also need to measure the significance of the parameters a and b so that we can decide whether they should be retained in the proposed model or not. For that, we may consider the null and alternative hypotheses as follows:
H0: a = 0
Ha: a ≠ 0
This means that if the value of "a" is significantly different (higher or lower) from zero, we may reject the null hypothesis. This implies that the value of "a" is significant and must be included in the estimation line. This is typically tested with a t-test.
If the value of "a" is found to be significant, then at the 5% level of significance we can say that the value of the equivalent population parameter α would lie in the range {a ± t × S.E.(a)}.
This means if value of ‘b’ is significantly different (higher or lower) from zero, we may reject
the null hypothesis. This implies that value of ‘b’ is significant and there is a linear
relationship between X and Y. therefore, it must be included in estimation line.
If value of ‘b’ is found to be significant, then at 5% level of significance we can say that the
value of equivalent population parameter β would be in this range {b ± S.E. (t)}.
After having calculated the values of the parameters a and b by the aforementioned method, we may claim that we have obtained our estimation equation, which is supposed to be the best-fit line. We can check this in a very simple way. As per the following figure, we may plot the actual sample observations and the estimation line (Ŷi = a + b Xi) on a graph. According to one of the mathematical properties of a line fitted by the method of least squares, the sum of the individual positive and negative errors must equal zero:
Ʃ Error = Ʃ d = Ʃ (Y − Ŷ) = 0
Using this property, we may check the sum of the individual errors in our case; if it equals zero, it implies that we have not committed any serious mathematical mistake in finding the estimation line and that this is the best-fit line.
In contrast, the standard error of estimate (Se) measures the dispersion/variation of the actual observations around the regression line. It can be computed as:
Se = √[ Ʃ(Y − Ŷ)² / (n − 2) ]
where n is the number of observations. An equivalent computational form is:
Se = √[ (ƩY² − aƩY − bƩXY) / (n − 2) ]
It is interesting to note that the sum of squared deviations is divided by (n − 2) instead of n. Because the values of the parameters a and b are computed from the actual data points, and the same points are then used in estimating the line, we lose 2 degrees of freedom.
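Continuing the same sketch, the standard error of estimate Se can be computed from the residuals as follows (again with placeholder data, not values from this text).

# Minimal sketch: standard error of estimate Se for a fitted line.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)   # placeholder data
y = np.array([3, 5, 8, 9, 11], dtype=float)  # placeholder data
n = len(x)

b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # divide by n - 2 degrees of freedom
print(f"Standard error of estimate Se = {se:.3f}")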
If we assume that all observations are normally distributed around the line, then, as the figure suggests, about 68% of them lie within ±1 Se of the estimated line. It is easy to see that if the second assumption (constant variance) does not hold, then the standard error at one point on the line could differ from the standard error at another point.
The standard error calculated as above is a good instrument for constructing a prediction interval around an estimated value Ŷ, within which the actual value of Y is expected to lie. As noted above, we may be about 68% confident that the actual value of Y lies within ±1 Se of the estimated line. Since the prediction interval is based on a normal distribution of the data points, a larger sample size (n ≥ 30) is required; for small samples we cannot get accurate intervals. One may keep in mind that these prediction intervals are only approximate.
7. Summary:
Regression analysis is an important technique for developing a model that estimates the value of a dependent variable from at least one independent variable. In simple linear regression analysis, there is one dependent and one independent variable. The variable whose value is to be estimated is termed the dependent variable, and the variable used for prediction is called the independent variable. The analysis signifies an average relationship between the two variables; it assumes a linear relationship and is expressed as a straight-line equation. To test the goodness of fit of our proposed model, we may look at the following:
(a) Make a scatter plot of the given observations and fit the regression line. If the observed data points are highly scattered around the line, then our model may not be a good fit.
(b) Coefficient of Determination (r²): if it is greater than 60%, the model may be considered a good fit.
(c) Check the significance of the parameters a and b using a t-test. If any parameter is found not significant (that is, statistically indistinguishable from zero), it should not be included in our model.
(d) We may plot the residuals/errors against the independent variable to check the assumptions of independence of the errors and of constant variation around the regression line.
(e) When running our regression model on a computer, the printout usually contains an analysis of variance (ANOVA) table with an F-test of the regression model. Among other things, this table reports the SSE, that component of the total variation that cannot be explained by our regression model and therefore usually termed the unexplained variation. If the value of F is found to be significant, it means that the variables X and Y are linearly related and our model is a good fit.
Thus, by examining our regression model against the above criteria, we may assess its goodness of fit.
*******
Case: Mr. Atal Sharma, a psychologist for the Hero group, has designed a test to show the danger of over-supervising workers. He selects eight workers from the assembly line, who are given a series of complicated tasks to perform. During their performance, they are continuously interrupted by their superiors "assisting" them to complete the work. Upon completion of the work, all workers are given a test designed to measure the worker's hostility towards the superior (a high score equals low hostility). Their corresponding scores on the hostility test are given below:
In this problem, the worker's test score is the dependent variable Y, whose value is to be estimated, and the number of times he was interrupted is the independent variable X. To propose an estimation line, we first plot the given data points using MS Excel, as below:
[Figure: MS Excel scatter plot of test score (Y, 0-70) against number of times interrupted (X, 0-30), with a fitted linear trend line.]
From the scatter diagram above, we can easily observe that there is a linear relationship between the two variables. We may therefore propose an estimation line:
Ŷ = a + b X
with
b = (Ʃ XY − n·X̄·Ȳ) / (Ʃ X² − n·X̄²) and a = Ȳ − b·X̄
Substituting the case data gives:
Ŷ = 70.5 − 2.8 X
Next, we calculate the Coefficient of Determination (r²) to test the goodness of fit of the estimation line. From the output below, r² = 0.9858, which is very high; therefore, our estimation line appears to be a good fit.
We also check the significance of the parameters of our estimation line. For this, we can examine the MS Excel output shown below:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.9928
R Square            0.9858
Adjusted R Square   0.9834
Standard Error      2.3805
Observations        8

ANOVA
             df    SS      MS          F          Significance F
Regression   1     2352    2352.0000   415.0588   0.0000
Residual     6     34      5.6667
Total        7     2386

             Coefficients   Standard Error   t Stat      P-value
Intercept    70.5           2.2267           31.66075    0.0000
X            -2.8           0.1374           -20.373     0.0000
Thus, both coefficients are significant, which shows that our model is a good fit.
Next, in the ANOVA table, the value of F = 415.0588 (p < 0.05) is also found to be highly significant. This shows that the two variables are strongly linearly related, and again that our model is a good fit.
Now we can predict the expected test score if a worker is interrupted 18 times, using our estimation line:
Test Score = 70.5 − 2.8 × (number of times interrupted)
Test Score = 70.5 − 2.8 × 18 = 20.1
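The same analysis can be reproduced in Python with statsmodels. Since the eight observations of the case are not reproduced in this text, the arrays below are placeholders to be replaced with the actual case data; only with those values will the output match the MS Excel summary above.

# Minimal sketch: simple linear regression with statsmodels.
# Replace the placeholder arrays with the eight observations from the case.
import numpy as np
import statsmodels.api as sm

x = np.array([2, 5, 8, 11, 14, 17, 20, 23], dtype=float)   # placeholder interruption counts
y = np.array([64, 55, 49, 40, 33, 25, 15, 7], dtype=float)  # placeholder test scores

X = sm.add_constant(x)             # adds the intercept term
model = sm.OLS(y, X).fit()
print(model.summary())             # coefficients, r-squared, t-stats, ANOVA F

# Predict the expected test score for a worker interrupted 18 times
print(model.predict(np.array([[1.0, 18.0]])))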
Multiple Regression

(Data: body fat and waist size for 250 male subjects; units are %body fat and inches; collected in the United States in the 1990s for scientific research.)

In Chapter 27 we tried to predict the percent body fat of male subjects from their waist size, and we did pretty well. The R² of 67.8% says that we accounted for almost 68% of the variability in %body fat by knowing only the waist size. We completed the analysis by performing hypothesis tests on the coefficients and looking at the residuals.

But that remaining 32% of the variance has been bugging us. Couldn't we do a better job of accounting for %body fat if we weren't limited to a single predictor? In the full data set there were 15 other measurements on the 250 men. We might be able to use other predictor variables to help us account for that leftover variation that wasn't accounted for by waist size.

What about height? Does height help to predict %body fat? Men with the same waist size can vary from short and corpulent to tall and emaciated. Knowing a man has a 50-inch waist tells us that he's likely to carry a lot of body fat. If we found out that he was 7 feet tall, that might change our impression of his body type. Knowing his height as well as his waist size might help us to make a more accurate prediction.
Just Do It

Does a regression with two predictors even make sense? It does, and that's fortunate because the world is too complex a place for simple linear regression alone to model it. A regression with two or more predictor variables is called a multiple regression. (When we need to note the difference, a regression on a single predictor is called a simple regression.) We'd never try to find a regression by hand, and even calculators aren't really up to the task. This is a job for a statistics program on a computer. If you know how to find the regression of %body fat on waist size with a statistics package, you can usually just add height to the list of predictors without having to think hard about how to do it.
For simple regression we found the least squares solution, the one whose coefficients made the sum of the squared residuals as small as possible. For multiple regression, we'll do the same thing, but this time with more coefficients. Remarkably enough, we can still solve this problem. Even better, a statistics package can find the coefficients of the least squares model easily.

A note on terminology: when we have two or more predictors and fit a linear model by least squares, we are formally said to fit a least squares linear multiple regression. Most folks just call it "multiple regression." You may also see the abbreviation OLS used with this kind of analysis; it stands for ordinary least squares.

Here's a typical example of a multiple regression table:

Dependent variable is: Pct BF
R-squared = 71.3%   R-squared (adjusted) = 71.1%
s = 4.460 with 250 − 3 = 247 degrees of freedom

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   −3.10088      7.686       −0.403    0.6870
Waist        1.77309      0.0716      24.8      ≤0.0001
Height      −0.60154      0.1099      −5.47     ≤0.0001
You should recognize most of the numbers in this table. Most of them mean what you expect them to.

R² gives the fraction of the variability of %body fat accounted for by the multiple regression model. (With waist alone predicting %body fat, the R² was 67.8%.) The multiple regression model accounts for 71.3% of the variability in %body fat. We shouldn't be surprised that R² has gone up. It was the hope of accounting for some of that leftover variability that led us to try a second predictor.

The standard deviation of the residuals is still denoted s (or sometimes se, to distinguish it from the standard deviation of y).

The degrees of freedom calculation follows our rule of thumb: the degrees of freedom is the number of observations (250) minus one for each coefficient estimated; for this model, 3.

For each predictor we have a coefficient, its standard error, a t-ratio, and the corresponding P-value. As with simple regression, the t-ratio measures how many standard errors the coefficient is away from 0. So, using a Student's t-model, we can use its P-value to test the null hypothesis that the true value of the coefficient is 0.

Using the coefficients from this table, we can write the regression model:

predicted %body fat = −3.10 + 1.77 waist − 0.60 height

As before, we define the residuals as

residuals = %body fat − predicted %body fat

We've fit this model with the same least squares principle: the sum of the squared residuals is as small as possible for any choice of coefficients.
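In Python, a regression like this can be fitted with statsmodels' formula interface. The sketch below uses a tiny illustrative DataFrame (the column names and values are placeholders, not the actual 250-man data set), so its output will not reproduce the table above.

# Minimal sketch: multiple regression with the statsmodels formula API.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pct_bf": [12.3, 27.9, 18.4, 10.1, 22.5, 31.2, 15.8, 20.4],  # placeholder %body fat
    "waist":  [32.0, 40.5, 36.0, 31.5, 38.0, 42.0, 34.0, 37.0],  # placeholder waist (in.)
    "height": [71.0, 68.0, 70.0, 74.0, 69.0, 66.0, 72.0, 70.5],  # placeholder height (in.)
})

model = smf.ols("pct_bf ~ waist + height", data=df).fit()
print(model.summary())        # coefficients, SEs, t-ratios, R-squared, adjusted R-squared
print(model.conf_int())       # confidence intervals for each coefficient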
Why is multiple regression worth learning? First, the coefficients in a multiple regression are often misinterpreted; we'll show some examples to help make their meaning clear. Second, multiple regression is an extraordinarily versatile calculation, underlying many widely used Statistics methods. A sound understanding of the multiple regression model will help you to understand these other applications. Third, multiple regression offers our first glimpse into statistical models that use more than two quantitative variables. The real world is complex. Simple models of the kind we've seen so far are a great start, but often they're just not detailed enough to be useful for understanding, predicting, and decision making. Models that use several variables can be a big step toward realistic and useful modeling of complex phenomena and relationships.
[Figure: scatterplot of %body fat against height (in.).]
It doesn’t look like height tells us much about %body fat. You just
can’t tell much about a man’s %body fat from his height. Or can you?
Remember, in the multiple regression model, the coefficient of height
was 20.60, had a t-ratio of 25.47, and had a very small P-value. So it
did contribute to the multiple regression model. How could that be?
The answer is that the multiple regression coefficient of height
takes account of the other predictor, waist size, in the regression
model.
To understand the difference, let’s think about all men whose waist
size is about 37 inches—right in the middle of our sample. If we think
only about these men, what do we expect the relationship between
height and %body fat to be? Now a negative association makes sense
because taller men probably have less body fat than shorter men who
have the same waist size. Let’s look at the plot:
[Figure: scatterplot of %body fat against height (in.), with the men whose waist sizes are between 36 and 38 inches highlighted.]
Here we’ve highlighted the men with waist sizes between 36 and 38
inches. Overall, there’s little relationship between %body fat and
height, as we can see from the full set of points. But when we focus on
particular waist sizes, there is a relationship between body fat and
height. This relationship is conditional because we’ve restricted our
set to only those men within a certain range of waist sizes. For men
with that waist size, an extra inch of height is associated with a
decrease of about 0.60% in body fat. If that relationship is consistent
for each waist size, then the multiple regression coefficient will
estimate it. The simple regression co- efficient simply couldn’t see it.
We’ve picked one particular waist size to highlight. How could we
look at the relationship between %body fat and height conditioned on
all waist sizes at the same time? Once again, residuals come to the
rescue.
We plot the residuals of %body fat after a regression on waist size
As their name reminds us,
against the residuals of height after regressing it on waist size. This
residuals are what’s left display is called a partial re- gression plot. It shows us just what we
over after we fit a model. asked for: the relationship of %body fat to height after removing the
That lets us remove the linear effects of waist size.
effects of some variables.
The residuals are what’s
left.
[Figure: partial regression plot of %body fat residuals against height residuals (in.).]
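A partial regression plot can be built by hand in Python: regress y on the other predictor, regress the predictor of interest on the other predictor, and plot the two sets of residuals against each other. The sketch below reuses the small illustrative DataFrame from the earlier sketch (placeholder values, not the real study data).

# Minimal sketch: partial regression plot for "height", controlling for "waist".
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pct_bf": [12.3, 27.9, 18.4, 10.1, 22.5, 31.2, 15.8, 20.4],  # placeholder values
    "waist":  [32.0, 40.5, 36.0, 31.5, 38.0, 42.0, 34.0, 37.0],
    "height": [71.0, 68.0, 70.0, 74.0, 69.0, 66.0, 72.0, 70.5],
})

res_y = smf.ols("pct_bf ~ waist", data=df).fit().resid   # %body fat with waist removed
res_x = smf.ols("height ~ waist", data=df).fit().resid   # height with waist removed

plt.scatter(res_x, res_y)
plt.xlabel("Height residuals (after waist)")
plt.ylabel("%Body fat residuals (after waist)")
plt.title("Partial regression plot for height")
plt.show()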
Linearity Assumption

We are fitting a linear model. (By linear we mean that each x appears simply multiplied by its coefficient and added to the model. No x appears in an exponent or some other more complicated function. That means that as we move along any x-variable, our prediction for y will change at a constant rate, given by the coefficient, if nothing else changes.) For that to be the right kind of model, we need an underlying linear relationship. But now we're thinking about several predictors. To see whether the assumption is reasonable, we'll check the Straight Enough Condition for each of the predictors.

Straight Enough Condition: Scatterplots of y against each of the predictors are reasonably straight. As we have seen with height in the body fat example, the scatterplots need not show a strong (or any!) slope; we just check that there isn't a bend or other nonlinearity. For the %body fat data, the scatterplot is beautifully linear in waist, as we saw in Chapter 27. For height, we saw no relationship at all, but at least there was no bend.

As we did in simple regression, it's a good idea to check the residuals for linearity after we fit the model. It's good practice to plot the residuals against the
predicted values and check for patterns, especially for bends or other nonlinearities. (We'll watch for other things in this plot as well.)

Check the residual plot (part 1): the residuals should appear to have no pattern with respect to the predicted values.

If we're willing to assume that the multiple regression model is reasonable, we can fit the regression model by least squares. But we must check the other assumptions and conditions before we can interpret the model or test any hypotheses.
Independence Assumption

As with simple regression, the errors in the true underlying regression model must be independent of each other. As usual, there's no way to be sure that the Independence Assumption is true. Fortunately, even though there can be many predictor variables, there is only one response variable and only one set of errors. The Independence Assumption concerns the errors, so we check the corresponding conditions on the residuals.

Check the residual plot (part 2): the residuals should appear to be randomly scattered and show no patterns or clumps when plotted against the predicted values.

Randomization Condition: The data should arise from a random sample or randomized experiment. Randomization assures us that the data are representative of some identifiable population. If you can't identify the population, you can't interpret the regression model or any hypothesis tests because they are about a regression model for that population. Regression methods are often applied to data that were not collected with randomization. Regression models fit to such data may still do a good job of modeling the data at hand, but without some reason to believe that the data are representative of a particular population, you should be reluctant to believe that the model generalizes to other situations.

We also check displays of the regression residuals for evidence of patterns, trends, or clumping, any of which would suggest a failure of independence. In the special case when one of the x-variables is related to time, be sure that the residuals do not have a pattern when plotted against that variable.

The %body fat data were collected on a sample of men. The men were not related in any way, so we can be pretty sure that their measurements are independent.
[Figure: residuals plotted against waist (in.) and against height (in.).]
Normality Assumption

We assume that the errors around the idealized regression model at any specified values of the x-variables follow a Normal model. We need this assumption so that we can use a Student's t-model for inference. As with other times when we've used Student's t, we'll settle for the residuals satisfying the Nearly Normal Condition: check a histogram of the residuals (the distribution should be unimodal and symmetric), or check a Normal probability plot to see whether it is straight.

For the residuals, this is the same set of conditions we had for simple regression. Look at a histogram or Normal probability plot of the residuals. The histogram of residuals in the %body fat regression certainly looks nearly Normal, and the Normal probability plot is fairly straight. And, as we have said before, the Normality Assumption becomes less important as the sample size grows.

[Figure 29.5: histogram of the residuals from the %body fat regression.]

Let's summarize all the checks of conditions that we've made and the order that we've made them:

1. Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable.
2. If the scatterplots are straight enough (that is, if it looks like the regression model is plausible), fit a multiple regression model to the data. (Otherwise, either stop or consider re-expressing an x- or the y-variable.)
3. Find the residuals and predicted values.
4. Make a scatterplot of the residuals against the predicted values. This plot should look patternless. Check in particular for any bend (which would suggest that the data weren't all that straight after all) and for any thickening. If there's a bend, and especially if the plot thickens, consider re-expressing the y-variable and starting over.
5. Think about how the data were collected. Was suitable randomization used? Are the data representative of some identifiable population? If the data are measured over time, check for evidence of patterns that might suggest they're not independent by plotting the residuals against time to look for patterns.
6. If the conditions check out this far, feel free to interpret the regression model and use it for prediction. If you want to investigate a particular coefficient, make a partial regression plot for that coefficient.
7. If you wish to test hypotheses about the coefficients or about the overall regression, then make a histogram and Normal probability plot of the residuals to check the Nearly Normal Condition.
Multiple Regression

Let's try finding and interpreting a multiple regression model for the body fat data.

[Figure: residuals (%body fat) plotted against predicted values (%body fat).]

✔ Nearly Normal Condition: A histogram of the residuals is unimodal and symmetric. (Strictly, we need the Nearly Normal Condition only if we want to do inference.)

[Figure: histogram of the residuals (%body fat).]
The ANOVA table and fitted model are:

Source       Sum of Squares   DF    Mean Square   F-ratio   P-value
Regression   12216.6          2     6108.28       307       <0.0001
Residual     4912.26          247   19.8877

predicted %body fat = −3.10 + 1.77 waist − 0.60 height

The R² for the regression is 71.3%. Waist size and height together account for about 71% of the variation in %body fat among men. The regression equation indicates that each inch in waist size is associated with about a 1.77 increase in %body fat among men who are of a particular height. Each inch of height is associated with a decrease in %body fat of about 0.60 among men with a particular waist size.

The standard errors for the slopes of 0.07 (waist) and 0.11 (height) are both small compared with the slopes themselves, so it looks like the coefficient estimates are fairly precise. The residuals have a standard deviation of 4.46%, which gives an indication of how precisely we can predict %body fat with this model.
Note: Most regression tables include a P-value for the F-statistic, but there's almost never a need to perform this particular test in a multiple regression. Usually we just glance at the F-statistic to see that it's reasonably far from 1.0, the value it would have if the true coefficients were really all zero.
The t-tests for the individual coefficients look like what we did for the slope of a simple regression. For each coefficient we test

H0: βj = 0

against the (two-sided) alternative that it isn't zero. The regression table gives a standard error for each coefficient and the ratio of the estimated coefficient to its standard error. If the assumptions and conditions are met (and now we need the Nearly Normal Condition), these ratios follow a Student's t-distribution:

t(n−k−1) = (bj − 0) / SE(bj)

How many degrees of freedom? We have a rule of thumb and it works here. The degrees of freedom is the number of data values minus the number of coefficients estimated (in this case, counting the intercept term). For our regression on two predictors, that's n − 3. You shouldn't have to look up the t-values. Almost every regression report includes the corresponding P-values.
We can build a confidence interval in the usual way, as an estimate ± a margin of error. As always, the margin of error is just the product of the standard error and a critical value. Here the critical value comes from the t-distribution on n − k − 1 degrees of freedom. So a confidence interval for βj is

bj ± t*(n−k−1) × SE(bj)

The tricky parts of these tests are that the standard errors of the coefficients now require harder calculations (so we leave it to the technology) and that the meaning of a coefficient, as we have seen, depends on all the other predictors in the multiple regression model.
That last bit is important. If we fail to reject the null hypothesis for a multiple regression coefficient, it does not mean that the corresponding predictor variable has no linear relationship to y. It means that the corresponding predictor contributes nothing to modeling y after allowing for all the other predictors.
The multiple regression model is

y = β0 + β1x1 + … + βkxk + ε

It looks like each βj tells us the effect of its associated predictor, xj, on the response variable, y. But that is not so. This is, without a doubt, the most common error that people make with multiple regression. (When the regression model grows by including a new predictor, all the coefficients are likely to change; that can help us understand what those coefficients mean.)

• It is possible for there to be no simple relationship between y and xj, and yet bj in a multiple regression can be significantly different from 0. We saw this happen for the coefficient of height in our example.

• It is also possible for there to be a strong two-variable relationship between y and xj, and yet bj in a multiple regression can be almost 0 with a large P-value, so that we must retain the null hypothesis that the true coefficient is zero. If we're trying to model the horsepower of a car, using both its weight and its engine size, it may turn out that the coefficient for engine size is nearly 0. That doesn't mean that engine size isn't important for understanding horsepower. It simply means that after allowing for the weight of the car, the engine size doesn't give much additional information.
• It is even possible for there to be a significant linear relationship between y and xj in one direction, and yet bj can be of the opposite sign and strongly significant in a multiple regression. More expensive cars tend to be bigger, and since bigger cars have worse fuel efficiency, the price of a car has a slightly negative association with fuel efficiency. But in a multiple regression of fuel efficiency on weight and price, the coefficient of price may be positive. If so, it means that among cars of the same weight, more expensive cars have better fuel efficiency. The simple regression on price, though, has the opposite direction because, overall, more expensive cars are bigger. This switch in sign may seem a little strange at first, but it's not really a contradiction at all. It's due to the change in the meaning of the coefficient of price when it is in a multiple regression rather than a simple regression.

So we'll say it once more: the coefficient of xj in a multiple regression depends as much on the other predictors as it does on xj. Remember that when you interpret a multiple regression model.
Notes: The infant mortality data are available from the Kids Count section of the Annie E. Casey Foundation, and are all for 1999. In the interest of complete honesty, we should point out that the original data include the District of Columbia, but it proved to be an outlier on several of the variables, so we've restricted attention to the 50 states here.
[Figure 29.6: scatterplot matrix of Infant Mortality, Child Death Rate, HS Dropout Rate, Low Birth Weight, Teen Births, and Teen Deaths. The vertical and horizontal axes are consistent across rows and down columns; the diagonal cells may hold Normal probability plots (as they do here), histograms, or just the names of the variables. Scatterplot matrices are a great way to check the Straight Enough Condition and to check for simple outliers.]
The individual scatterplots show at a glance that each of the relationships is straight enough for regression. There are no obvious bends, clumping, or outliers. And the plots don't thicken. So it looks like we can examine some multiple regression models with inference.
Let’s try to model infant mortality with all of the available predictors.
[Figures: residuals plotted against predicted values (deaths/10,000 live births) and a histogram of the residuals for the infant mortality regression.]
Adjusted R²

You may have noticed that the full regression tables shown in this chapter include another statistic we haven't discussed. It is called adjusted R² and sometimes appears in computer output as R²(adjusted). The adjusted R² statistic is a rough attempt to adjust for the simple fact that when we add another predictor to a multiple regression, the R² can't go down and will most likely get larger. Only if we were to add a predictor whose coefficient turned out to be exactly zero would the R² remain the same. This fact makes it difficult to compare alternative regression models that have different numbers of predictors.

We can write a formula for R² using the sums of squares in the ANOVA table portion of the regression output table:

R² = SSRegression / (SSRegression + SSResidual) = SSRegression / SSTotal

Adjusted R² simply substitutes the corresponding mean squares for the SS's:

R²adj = MSRegression / MSTotal

Because the mean squares are sums of squares divided by their degrees of freedom, they are adjusted for the number of predictors in the model. As a result, the adjusted R² value won't necessarily increase when a new predictor is added to the multiple regression model. That's fine. But adjusted R² no longer tells the fraction of variability accounted for by the model, and it isn't even bounded by 0 and 100%, so it can be awkward to interpret.

Comparing alternative regression models is a challenge, especially when they have different numbers of predictors. The search for a summary statistic to help us choose among such models continues.
Note: With several predictors we can wander beyond the data because of the combination of values, even when the individual values are not extraordinary. For example, both 28-inch waists and 76-inch heights can be found in men in the body fat study, but a single individual with both these measurements would not be at all typical. The model we fit is probably not appropriate for predicting the %body fat for such a tall and skinny individual.
● Interpret each coefficient only after allowing for the linear effects of the other predictors. The sign of a variable can change depending on which other predictors are in or out of the model. For example, in the regression model for infant mortality, the coefficient of high school dropout rate was negative and its P-value was fairly small, but the simple association between dropout rate and infant mortality is positive. (Check the plot matrix.)

● If a coefficient's t-statistic is not significant, don't interpret it at all. You can't be sure that the value of the corresponding parameter in the underlying regression model isn't really zero.
What Can Go Wrong?

● Don't fit a linear regression to data that aren't straight. This is the most fundamental regression assumption. If the relationship between the x's and y isn't approximately linear, there's no sense in fitting a linear model to it. What we mean by "linear" is a model of the form we have been writing for the regression. When we have two predictors, this is the equation of a plane, which is linear in the sense of being flat in all directions. With more predictors, the geometry is harder to visualize, but the simple structure of the model is consistent; the predicted values change consistently with equal size changes in any predictor. Usually we're satisfied when plots of y against each of the x's are straight enough. We'll also check a scatterplot of the residuals against the predicted values for signs of nonlinearity.
● Watch out for the plot thickening. The estimate of the error standard deviation shows up in all the inference formulas. If se changes with x, these estimates won't make sense. The most common check is a plot of the residuals against the predicted values. If plots of residuals against several of the predictors all show a thickening, and especially if they also show a bend, then consider re-expressing y. If the scatterplot against only one predictor shows thickening, consider re-expressing that predictor.

● Make sure the errors are nearly Normal. All of our inferences require that the true errors be modeled well by a Normal model. Check the histogram and Normal probability plot of the residuals to see whether this assumption looks reasonable.

● Watch out for high-influence points and outliers. We always have to be on the lookout for a few points that have undue influence on our model, and regression is certainly no exception. Partial regression plots are a good place to look for influential points and to understand how they affect each of the coefficients.
CONNECTIONS

We would never consider a regression analysis without first making scatterplots. The aspects of scatterplots that we always look for (their direction, shape, and scatter) relate directly to regression.

Regression inference is connected to just about every inference method we have seen for measured data. The assumption that the spread of data about the line is constant is essentially the same as the assumption of equal variances required for the pooled-t methods. Our use of all the residuals together to estimate their standard deviation is a form of pooling.

Of course, the ANOVA table in the regression output connects to our consideration of ANOVA in Chapter 28. This, too, is not coincidental. Multiple regression, ANOVA, pooled t-tests, and inference for means are all part of a more general statistical model known as the General Linear Model (GLM).
TERMS

Multiple regression: A linear regression with two or more predictors whose coefficients are found to minimize the sum of the squared residuals is a least squares linear multiple regression. It is usually just called a multiple regression. When the distinction is needed, a least squares linear regression with a single predictor is called a simple regression. The multiple regression model is
y = β0 + β1x1 + … + βkxk + ε

Least squares: We still fit multiple regression models by choosing the coefficients that make the sum of the squared residuals as small as possible. This is called the method of least squares.

Partial regression plot: The partial regression plot for a specified coefficient is a display that helps in understanding the meaning of that coefficient in a multiple regression. It has a slope equal to the coefficient value and shows the influences of each case on that value. A partial regression plot for a specified x displays the residuals when y is regressed on the other predictors against the residuals when the specified x is regressed on the other predictors.
29-20 Part VI I • Inferenc e When Variables Are Related
Assumptions for inference in regression (and conditions to check for some of them):
● Linearity. Check that the scatterplots of y against each x are straight enough and that the scatterplot of residuals against predicted values has no obvious pattern. (If we find the relationships straight enough, we may fit the regression model to find residuals for further checking.)
● Independent errors. Think about the nature of the data. Check a residual plot. Any evident pattern in the residuals can call the assumption of independence into question.
● Constant variance. Check that the scatterplots show consistent spread across the ranges of the x-variables and that the residual plot has constant variance too. A common problem is increasing spread with increasing predicted values—the plot thickens!
● Normality of the residuals. Check a histogram or a Normal probability plot of the residuals.
ANOVA: The Analysis of Variance table that is ordinarily part of the multiple regression results offers an F-test of the null hypothesis that the overall regression is no improvement over just modeling y with its mean:
H0: β1 = β2 = … = βk = 0.
If this null hypothesis is not rejected, then you should not proceed to test the individual coefficients.
t-ratios for the coefficients: The t-ratios for the coefficients can be used to test the null hypotheses that the true value of each coefficient is zero against the alternative that it is not.
Scatterplot matrix: A scatterplot matrix displays scatterplots for all pairs of a collection of variables, arranged so that all the plots in a row have the same variable displayed on their y-axis and all plots in a column have the same variable on their x-axis. Usually, the diagonal holds a display of a single variable, such as a histogram or Normal probability plot, and identifies the variable in its row and column.
Adjusted R²: An adjustment to the R² statistic that attempts to allow for the number of predictors in the model. It is sometimes used when comparing regression models with different numbers of predictors.
• Know how to use the ANOVA F-test to check that the overall regression model is better than just using the mean of y.
• Know how to test the standard hypotheses that each regression coefficient is really zero. Be able to state the null and alternative hypotheses. Know where to find the relevant numbers in standard computer regression output.
DATA DESK
• Select Y- and X-variable icons.
• From the Calc menu, choose Regression.
• Data Desk displays the regression table.
• Select plots of residuals from the Regression table’s HyperView menu.
Comments
You can change the regression by dragging the icon of another variable over either the Y- or an X-variable name in the table and dropping it there. You can add a predictor by dragging its icon into that part of the table. The regression will recompute automatically.
EXCEL
• From the Tools menu, select Data Analysis.
• Select Regression from the Analysis Tools list.
• Click the OK button.
• Enter the data range holding the Y-variable in the box labeled “Y-range.”
• Enter the range of cells holding the X-variables in the box labeled “X-range.”
• Select the New Worksheet Ply option.
• Select Residuals options. Click the OK button.
Comments
The Y and X ranges do not need to be in the same rows of the spreadsheet, although they must cover the same number of cells. But it is a good idea to arrange your data in parallel columns as in a data table. The X-variables must be in adjacent columns. No cells in the data range may hold non-numeric values. Although the dialog offers a Normal probability plot of the residuals, the data analysis add-in does not make a correct probability plot, so don’t use this option.
JMP
• From the Analyze menu select Fit Model.
• Specify the response, Y. Assign the predictors, X, in the Construct Model Effects dialog box.
• Click on Run Model.
Comments
JMP chooses a regression analysis when the response variable is “Continuous.” The predictors can be any combination of quantitative or categorical. If you get a different analysis, check the variable types.
MINITAB
• Choose Regression from the Stat menu.
• Choose Regression... from the Regression submenu.
• In the Regression dialog, assign the Y-variable to the Response box and assign the X-variables to the Predictors box.
• Click the Graphs button.
• In the Regression-Graphs dialog, select Standardized residuals, and check Normal plot of residuals and Residuals versus fits.
• Click the OK button to return to the Regression dialog.
• To specify displays, click Graphs, and check the displays you want.
• Click the OK button to return to the Regression dialog.
• Click the OK button to compute the regression.
SPSS
• Choose Regression from the Analyze menu.
• Choose Linear from the Regression submenu.
• When the Linear Regression dialog appears, select the Y-variable and move it to the dependent target. Then move the X-variables to the independent target.
• Click the Plots button.
• In the Linear Regression Plots dialog, choose to plot the *SRESIDs against the *ZPRED values.
• Click the Continue button to return to the Linear Regression dialog.
• Click the OK button to compute the regression.
TI-83/84 Plus
Comments
You need a special program to compute a multiple
regression on the TI-83.
TI-89
EXERCISES
1. Interpretations. A regression performed to predict selling price of houses found the equation
predicted price = 169328 + 35.3 area + 0.718 lotsize − 6543 age
where price is in dollars, area is in square feet, lotsize is in square feet, and age is in years. The R² is 92%. One of the interpretations below is correct. Which is it? Explain what’s wrong with the others.
a) Each year a house ages it is worth $6543 less.
b) Every extra square foot of area is associated with an additional $35.30 in average price, for houses with a given lotsize and age.
c) Every dollar in price means lotsize increases 0.718 square feet.
d) This model fits 92% of the data points exactly.
2. More interpretations. A household appliance manufacturer wants to analyze the relationship between total sales and the company’s three primary means of advertising (television, magazines, and radio). All values were in millions of dollars. They found the regression equation
predicted sales = 250 + 6.75 TV + 3.5 radio + 2.3 magazines.
One of the interpretations below is correct. Which is it? Explain what’s wrong with the others.
a) If they did no advertising, their income would be $250 million.
b) Every million dollars spent on radio makes sales increase $3.5 million, all other things being equal.
c) Every million dollars spent on magazines increases TV spending $2.3 million.
d) Sales increase on average about $6.75 million for each million spent on TV, after allowing for the effects of the other kinds of advertising.
3. Predicting final exams. How well do exams given during the semester predict performance on the final? One class had three tests during the semester. Computer output of the regression gives
Dependent variable is Final
s = 13.46   R-Sq = 77.7%   R-Sq(adj) = 74.1%
Analysis of Variance
Source       DF   SS        MS       F       P-value
Regression   3    11961.8   3987.3   22.02   <0.0001
Error        19   3440.8    181.1
Total        22   15402.6
these are size of the house (square feet), lot size, and number of bathrooms. Information for a random sample of homes for sale in the Statesboro, GA, area was obtained from the Internet. Regression output modeling the asking price with square footage and number of bathrooms gave the following result:
Dependent Variable is: Price
s = 67013   R-Sq = 71.1%   R-Sq(adj) = 64.6%
Predictor    Coeff      SE(Coeff)   T       P-value
Intercept    −152037    85619       −1.78   0.110
Baths        9530       40826       0.23    0.821
Sq ft        139.87     46.67       3.00    0.015
[Plot: residuals vs. predicted values]
b) Discuss the residuals and what they say about the assumptions and conditions for this regression.
[Histogram of residuals (points) and Normal probability plot of the residuals]
(Note that the axes of the Normal probability plot are swapped relative to the plots we’ve made in the text. We only care about the pattern of this plot, so it shouldn’t affect your interpretation.) Examine these plots and discuss whether the assumptions and conditions for the multiple regression seem reasonable.
[Normal probability plot: %Body Fat vs. Normal Score]
Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    −31.4830      11.54       −2.73     0.0068
Waist        2.31848       0.1820      12.7      <0.0001
Height       −0.224932     0.1583      −1.42     0.1567
Weight       −0.100572     0.0310      −3.25     0.0013
c) Interpret the slope for weight. How can the coefficient for weight in this model be negative when its coefficient was positive in the simple regression model?
d) What does the P-value for height mean in this regression? (Perform the hypothesis test.)
T 12. Breakfast cereals. We saw in Chapter 8 that the calorie content of a breakfast cereal is linearly associated with its sugar content. Is that the whole story? Here’s the output of a regression model that regresses calories for each serving on its protein(g), fat(g), fiber(g), carbohydrate(g), and sugars(g) content.
Dependent variable is: calories
R-squared = 84.5%   R-squared (adjusted) = 83.4%
s = 7.947 with 77 − 6 = 71 degrees of freedom
Source       Sum of Squares   df   Mean Square   F-ratio
Regression   24367.5          5    4873.50       77.2
Residual     4484.45          71   63.1613
Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    20.2454       5.984       3.38      0.0012
Protein      5.69540       1.072       5.32      <0.0001
Fat          8.35958       1.033       8.09      <0.0001
Fiber        −1.02018      0.4835      −2.11     0.0384
Carbo        2.93570       0.2601      11.3      <0.0001
Sugars       3.31849       0.2501      13.3      <0.0001
Assuming that the conditions for multiple regression are met,
a) What is the regression equation?
b) Do you think this model would do a reasonably good job at predicting calories? Explain.
c) To check the conditions, what plots of the data might you want to examine?
d) What does the coefficient of fat mean in this model?
T 13. Body fat again. Chest size might be a good predictor of body fat. Here’s a scatterplot of %body fat vs. chest size.
[Scatterplot: %Body Fat vs. chest size]
A regression of %body fat on chest size gives the following equation:
Dependent variable is: Pct BF
R-squared = 49.1%   R-squared (adjusted) = 48.9%
s = 5.930 with 250 − 2 = 248 degrees of freedom
Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    −52.7122      4.654       −11.3     <0.0001
Chest        0.712720      0.0461      15.5      <0.0001
a) Is the slope of %body fat on chest size statistically distinguishable from 0? (Perform a hypothesis test.)
b) What does the answer in part a mean about the relationship between %body fat and chest size?
We saw before that the slopes of both waist size and height are statistically significant when entered into a multiple regression equation. What happens if we add chest size to that regression? Here is the output from a regression on all three variables:
Dependent variable is: Pct BF
R-squared = 72.2%   R-squared (adjusted) = 71.9%
s = 4.399 with 250 − 4 = 246 degrees of freedom
Source       Sum of Squares   df    Mean Square   F-ratio   P
Regression   12368.9          3     4122.98       213       <0.0001
Residual     4759.87          246   19.3491
Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    2.07220       7.802       0.266     0.7908
Waist        2.19939       0.1675      13.1      <0.0001
Height       −0.561058     0.1094      −5.13     <0.0001
Chest        −0.233531     0.0832      −2.81     0.0054
c) Interpret the coefficient for chest.
d) Would you consider removing any of the variables from this regression model? Why or why not?
T 14. Grades. The table below shows the five scores from an introductory Statistics course. Find a model for predicting final exam score by trying all possible models with two predictor variables. Which model would you choose? Be sure to check the conditions for multiple regression.
Name          Final   Midterm 1   Midterm 2   Project   Homework
Timothy F.    117     82          30          10.5      61
Karen E.      183     96          68          11.3      72
Verena Z.     124     57          82          11.3      69
Jonathan A.   177     89          92          10.5      84
Source       Sum of Squares   df   Mean Square   F-ratio   P-value
Regression   11211.1          3    3737.05       15.1      <0.0001
Residual     17583.5          71   247.655
J = Sensitivity + Specificity − 1
Maximizing this statistic involves finding the threshold where the sum of sensitivity and
specificity is maximized, which balances the trade-off between false positives and false
negatives.
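As an illustration, and assuming scikit-learn is available, the threshold that maximizes Youden's J can be read directly off the ROC curve. The arrays y_true and y_scores below are hypothetical labels and predicted probabilities, not data from the text.

import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities from some classifier.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.65, 0.7, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
j = tpr - fpr                # Youden's J = sensitivity + specificity - 1 = TPR - FPR
best = int(np.argmax(j))
print("Threshold maximizing Youden's J:", thresholds[best], "J =", round(j[best], 3))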
d. Cost-Based Optimization
In real-world applications, the costs of false positives and false negatives are rarely equal. For
example, in medical diagnosis, a false negative may have far more serious consequences than
a false positive. To account for this, a cost-sensitive approach can be used, where the
threshold is chosen to minimize the expected cost of misclassifications. This can be done by
assigning different weights to false positives and false negatives, and adjusting the threshold
to minimize the weighted sum of these errors.
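A minimal sketch of this idea follows, assuming scikit-learn and using hypothetical labels, probabilities, and unit costs (the 1:5 cost ratio is purely illustrative).

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical inputs: true labels, predicted probabilities, and unit costs.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.65, 0.7, 0.2, 0.9])
COST_FP, COST_FN = 1.0, 5.0          # a false negative assumed five times as costly

def expected_cost(threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print("Cost-minimizing threshold:", round(best, 2), "total cost:", min(costs))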
e. ROC Curve and AUC
By analysing the ROC curve, you can visually identify the threshold that best balances true
positives and false positives. A common approach is to choose the threshold at which the sum
of sensitivity and specificity is maximized, or where the point on the ROC curve is closest to
the top-left corner, representing the ideal trade-off between true positives and false positives.
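The "closest to the top-left corner" rule can be coded just as directly; this short sketch again assumes scikit-learn and reuses the same style of hypothetical arrays as above.

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])       # hypothetical labels
y_scores = np.array([0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.65, 0.7, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
# Euclidean distance of each ROC point (fpr, tpr) from the ideal corner (0, 1).
dist = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
print("Threshold closest to (0, 1):", thresholds[int(np.argmin(dist))])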
f. Precision-Recall Trade-Off
For imbalanced datasets, where one class is much more frequent than the other, precision-
recall curves often provide more meaningful insights than ROC curves. By adjusting the cut-
off to achieve the desired balance of precision and recall, one can optimize model
performance for rare events, such as fraud detection or disease diagnosis.
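For example, with scikit-learn's precision_recall_curve one can sweep the thresholds and pick the one giving the best F1 score (or the lowest threshold that meets a target recall). The arrays below are again hypothetical and deliberately imbalanced.

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])          # imbalanced, hypothetical
y_prob = np.array([0.05, 0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.7, 0.35, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# precision and recall have one more element than thresholds; drop the final point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = int(np.argmax(f1))
print("Threshold maximizing F1:", thresholds[best], "F1 =", round(f1[best], 3))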
Evaluating the Impact of Cut-Off Selection
The selection of the classification cut-off must be evaluated carefully, as it directly affects the model's performance and the costs associated with misclassifications. A poor choice of cut-off can make the classifier either overly conservative or overly liberal, depending on whether the threshold is too stringent or too lenient.
In practice, the following factors should be considered when evaluating the impact of cut-off
selection:
Class Distribution: If the dataset is imbalanced, focusing on the minority class with a
higher recall may be more important than optimizing for accuracy.
Application Requirements: Different applications may have different priorities. For
example, in fraud detection, catching as many fraudulent transactions as possible
(high recall) may be more important than minimizing false positives.
Evaluation Metrics: Choose evaluation metrics that align with the problem's
objectives. The importance of precision, recall, or F1 score should be considered in
the context of the problem's real-world costs.
Cross-Validation: Always test the cut-off using cross-validation to ensure that the
chosen threshold generalizes well to unseen data.
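One way to do the cross-validated check above, sketched here under the assumption that scikit-learn is available, is to obtain out-of-fold probabilities with cross_val_predict and then evaluate the candidate threshold on those held-out predictions. The data set, model, and the 0.3 threshold are all illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

# Hypothetical imbalanced data set and model.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# Out-of-fold predicted probabilities: every observation is scored by a model
# that never saw it during training.
oof_prob = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

threshold = 0.3                      # candidate cut-off chosen on other grounds
y_pred = (oof_prob >= threshold).astype(int)
print("CV precision:", round(precision_score(y, y_pred), 3),
      "CV recall:", round(recall_score(y, y_pred), 3))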
The gain chart and lift chart are two measures of the benefit of using a model, commonly applied in business contexts such as target marketing. They are not restricted to marketing analysis and can also be used in other domains such as risk modelling and supply chain analytics. In other words, gain and lift charts are two approaches used when solving classification problems with imbalanced data sets.
Example: In target marketing campaigns, customer response rates are usually very low (in many cases fewer than 1% of customers respond). The organization incurs a cost for each customer contact and therefore wants to minimize the cost of the campaign while still achieving the desired response level from customers.
In logistic regression, the gain chart and lift chart are measures that help organizations understand the benefit of using the model, so that it can be deployed more effectively and efficiently.
The gain and lift charts are obtained using the following steps:
1. Predict the probability Y = 1 (positive) using the LR model and arrange the observations in decreasing order of predicted probability [i.e., P(Y = 1)].
2. Divide the data set into deciles. Calculate the number of positives (Y = 1) in each decile and the cumulative number of positives up to each decile.
3. Gain is the ratio of the cumulative number of positive observations up to a decile to the total number of positive observations in the data. The gain chart plots gain on the vertical axis against the decile on the horizontal axis.
4. Lift is the ratio of the number of positive observations up to decile i using the model to the expected number of positives up to decile i under a random model. The lift chart plots lift on the vertical axis against the corresponding decile on the horizontal axis.
Gain chart: gain = ratio of the cumulative number of positive responses up to a decile to the total number of positive responses in the data. [Gain chart illustration]
Lift chart: lift = ratio of the number of positive responses up to decile i using the model to the expected number of positives up to decile i under a random model. [Lift chart illustration]
Cumulative gains and lift charts are visual aids for measuring model performance. Each chart consists of a curve (the lift curve in the lift chart, the gain curve in the gain chart) and a baseline representing a random model. The greater the area between the curve and the baseline, the better the model.
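A compact pandas sketch of the four-step procedure above follows. The predicted probabilities and outcomes are synthetic placeholders; in a real application they would come from the fitted logistic regression.

import numpy as np
import pandas as pd

# Hypothetical predicted probabilities P(Y = 1) and actual outcomes.
rng = np.random.default_rng(42)
prob = rng.uniform(size=1000)
actual = rng.binomial(1, prob * 0.3)          # responses correlated with the score

df = pd.DataFrame({"prob": prob, "actual": actual})
df = df.sort_values("prob", ascending=False).reset_index(drop=True)      # step 1
df["decile"] = pd.qcut(df.index, 10, labels=list(range(1, 11)))          # step 2

by_decile = df.groupby("decile", observed=True)["actual"].sum()
cum_pos = by_decile.cumsum()
gain = cum_pos / df["actual"].sum()                                      # step 3
random_cum = df["actual"].sum() * np.arange(1, 11) / 10                  # random-model baseline
lift = cum_pos.to_numpy() / random_cum                                   # step 4

print(pd.DataFrame({"cum_gain": gain.to_numpy(), "lift": lift}, index=range(1, 11)))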
In ordinary least squares estimation, the model coefficients are chosen to minimize the sum of squared errors,
SSE = Σ (yᵢ − ŷᵢ)², summed over i = 1, …, n.
However, this approach may lead to overfitting, where the model becomes too complex and
captures noise in the data, leading to poor generalization on unseen data. Regularization aims
to mitigate this issue by introducing a penalty term that discourages overly complex models.
L1 Regularization (Lasso)
L1 regularization introduces a penalty term proportional to the sum of the absolute values of
the model coefficients.
The effect of the L1 penalty is that it drives some coefficients to exactly zero: the penalty causes the optimization process to shrink some coefficients all the way to zero, effectively excluding those features from the model. This characteristic makes L1 regularization useful for feature selection, as it automatically selects a subset of relevant features by eliminating irrelevant ones.
Disadvantages of L1 Regularization:
Instability: Lasso (L1 regularization) can be unstable if the number of observations is
much smaller than the number of features, leading to high variance in the model
coefficients.
Non-differentiability: The L1 penalty function is not differentiable at zero, which
can complicate the optimization process, though this issue is usually addressed by
using optimization algorithms such as coordinate descent.
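As a concrete sketch (the text does not name an implementation, so scikit-learn's Lasso estimator is assumed here), the objective being minimized is the sum of squared errors plus alpha times the sum of the absolute coefficient values. With irrelevant features in the data, several coefficients come out exactly zero; the data are synthetic placeholders.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical data: only the first two of eight features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_scaled = StandardScaler().fit_transform(X)   # regularization assumes comparable scales
lasso = Lasso(alpha=0.1).fit(X_scaled, y)      # alpha controls the strength of the L1 penalty
print("Lasso coefficients:", np.round(lasso.coef_, 3))
# Expect the coefficients of the irrelevant features to be driven to exactly 0.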
L2 Regularization (Ridge)
L2 regularization, on the other hand, penalizes the sum of the squared values of the model
coefficients.
The L2 penalty encourages the model to keep the coefficients small, but unlike L1
regularization, it does not set coefficients to zero. Instead, it shrinks the coefficients towards
zero but keeps them non-zero, which typically results in a model where all features are
retained but their influence is reduced.
Key Features of L2 Regularization:
Shrinkage: L2 regularization reduces the magnitude of coefficients without
eliminating them entirely. This leads to a more stable model, especially when the
features are highly correlated.
Stability: L2 regularization is less likely to lead to extreme values in the model
coefficients, which contributes to its stability, particularly in the presence of
multicollinearity (correlation between features).
No Feature Selection: Unlike L1 regularization, L2 regularization does not set
coefficients to zero, meaning that all features are retained in the model, albeit with
reduced influence.
Disadvantages of L2 Regularization:
No Sparsity: L2 regularization does not perform feature selection, so it may not be as
effective in high-dimensional settings where some features are irrelevant.
Limited Interpretability: Since all features are retained, L2 regularization can lead
to more complex models that are harder to interpret compared to L1 regularized
models.
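For contrast, a Ridge fit on the same kind of synthetic data (again a sketch assuming scikit-learn; the Ridge objective adds alpha times the sum of squared coefficients to the squared-error loss) shrinks every coefficient but leaves none exactly zero.

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Same hypothetical setup: only the first two of eight features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_scaled = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
print("Ridge coefficients:", np.round(ridge.coef_, 3))   # all shrunk, but non-zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))   # irrelevant ones exactly zero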
Choosing Between L1 and L2 Regularization
The choice between L1 and L2 regularization depends on the specific goals and
characteristics of the dataset: