
POST GRADUATE DIPLOMA IN MANAGEMENT

BUSINESS ANALYTICS FOR DECISION MAKING

COURSE KIT

Trimester – III

2024-2026 Batch

COURSE INSTRUCTOR

Dr Shaheen
Associate Professor

INSTITUTE OF PUBLIC ENTERPRISE


HYDERABAD, 500101

CONTENTS
1. Course Syllabus (pages 3-4)

2. Session Plan (pages 5-6)

3. CO-PO-PSO Set Mappings (pages 7-8)

4. Evaluation Criteria: Continuous Evaluation 1 – Online Course (NPTEL); Continuous Evaluation 2 – Project Work; Continuous Evaluation 3 – Case Analysis; Rubric Matrix for Continuous Assessments (pages 9-12)

5. Glossary of Terms (pages 13-16)

6. Question Bank (pages 17-25)

7. Question Paper Template – Mid-Term Examination and End-Term Examination (pages 26-28)

8. Reference Material (pages 29-174)

COURSE SYLLABUS
Course Code: 24PC304 | BUSINESS ANALYTICS FOR DECISION MAKING | Credits: 3
Course Objectives
 Develop a comprehensive understanding of business analytics.
 Develop proficiency in Python for business analytics.
 Apply analytical techniques ethically to real-world business problems.
Course Outcomes
The students would be able to:
CO 1: Demonstrate the ability to organize, compare, transform, and summarize business data to derive meaningful insights.
CO 2: Apply analytical tools and techniques to effectively analyze and interpret business data, leading to the development of data-driven strategies.
CO 3: Clearly and effectively communicate data insights to stakeholders using visualizations, reports, and presentations.
CO 4: Cultivate a commitment to ethical decision-making, ensuring choices that positively impact organizations and stakeholders.
Syllabus
Unit – I: Introduction to Business Analytics: Analytics Landscape; Framework for Data-Driven Decision Making; Roadmap for Analytics Capability Building; Challenges in Data-Driven Decision Making and Future; Foundations of Data Science – Data Types and Scales of Variable Measurement, Feature Engineering; Functional Applications of Business Analytics in Management; Widely Used Analytical Tools; Ethics in Business Analytics.

Unit – II: Introduction to Python – Introduction to Jupyter Notebooks; Basic Programming Concepts and Syntax; Core Libraries in Python; Map and Filter; Processing, Wrangling, and Visualizing Data – Data Collection, Data Description, Data Wrangling, Data Visualization; Feature Engineering and Selection – Feature Extraction and Engineering, Feature Engineering on Numeric Data, Categorical Data, Text Data, & Image Data, Feature Scaling, Feature Selection.

Unit – III: Building, Tuning, and Deploying Models – Building Models, Model Evaluation, Tuning, Interpretation, & Deployment; Exploratory Data Analysis; Diagnostic Analysis; Exploration of Data using Visualization; Steps in Building a Regression Model; Building Simple and Multiple Regression Models – Model Diagnostics; Binary Logistic Regression – Model Diagnostics, ROC and AUC, Finding Optimal Classification Cut-Off; Gain and Lift Chart; Regularization – L1 and L2.
Suggested Readings
 Kumar, U. Dinesh. Business Analytics: The Science of Data-Driven Decision Making. 2nd ed. New Delhi: Wiley India Pvt. Ltd., 2024. ISBN 978-93-5424-619-7.
 Laha, Arnab K. How to Make the Right Decision. Gurgaon: Random House Publishers India Pvt. Ltd., 2015. ISBN 978-81-8400-162-4.
 Motwani, Bharti. Data Analytics Using Python. 1st ed. New Delhi: Wiley India Pvt. Ltd., 2022. ISBN 978-81-265-0295-0.
 Pradhan, Manaranjan, and U. Dinesh Kumar. Machine Learning Using Python. 2nd ed. Birmingham: Packt Publishing, 2020. ISBN 978-81-265-7990-7.
 Sarkar, Dipanjan, Raghav Bali, and Tushar Sharma. Practical Machine Learning with Python: A Problem-Solver's Guide to Building Real-World Intelligent Systems. Apress, 2019. ISBN 978-1-484-24049-6.
Cases
 Besbes, Omar, Daizhuo Chen, and Robert L. Phillips. Nomis Solutions (B). Harvard Business
Publishing Education. October 10, 2014. CU125-PDF-ENG. Length: 5 pages.
 Datar, Srikant M., and Caitlin N. Bowler. Predicting Purchasing Behavior at PriceMart (B). Harvard
Business Publishing Education. August 23, 2018. 119026-PDF-ENG. Length: 11 pages.
 Rahul Kumar, and Dinesh Kumar Unnikrishnan. HR Analytics at ScaleneWorks: Behavioral
Modeling to Predict Renege. Harvard Business Publishing Education. January 18, 2016. Length: 12
pages.
 Sriram TK, Shailaja Grover, Satyabala Hariharan, and Dinesh Kumar Unnikrishnan. Package
Pricing at Mission Hospital. Harvard Business Publishing Education. July 21, 2015. IMB527-PDF-
ENG, Length: 9 pages.
 Unnikrishnan, Dinesh Kumar, and Kshitiz Ranjan. Pricing of Players in the Indian Premier League.
Harvard Business Publishing Education. August 1, 2012. IMB379-PDF-ENG. Length: 16 pages.
Suggested Online Resources
 NPTEL course on Python for Data Science – https://onlinecourses.nptel.ac.in/noc25_cs60/preview
  Instructor: Prof. Ragunathan Rengasamy, IIT Madras
  Course duration: 20-01-2025 to 14-02-2025 (full term, 4-week course)
  Last date for enrollment (extended): February 3, 2025, 11 p.m.
  Assignment due dates:
   Week 1: 05 February 2025, 23:59 IST
   Week 2: 05 February 2025, 23:59 IST
   Week 3: 12 February 2025, 23:59 IST
   Week 4: Yet to be announced
  Certification exam date: March 22, 2025

Journals
 Analytics Magazine.
 Data Science and Business Analytics.
 Harvard Business Review (HBR).
 Information Systems Research.
 Journal of Business Analytics.
 Journal of Business Intelligence Research.
 MIT Sloan Management Review.
Value Addition Tools
 Excel: Widely used for basic data analysis, statistics, and visualization.
 Python (with libraries such as Pandas, NumPy, and SciPy): For advanced data analysis, manipulation, and efficient handling of large datasets. A brief illustrative sketch follows this list.
 Plotly: A Python-based library for creating interactive graphs and visualizations, especially useful for web-based data visualization.
 R: Useful for statistical analysis and graphical representation. ggplot2 is a data visualization package in R for creating elegant and complex plots based on the "Grammar of Graphics."
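To give a feel for how these tools combine in practice, here is a minimal, illustrative Python sketch (not part of the prescribed material): pandas handles loading and summarising the data, and Plotly Express renders an interactive chart. The file name sales.csv and the columns ad_spend, sales, and region are hypothetical placeholders.

```python
# Minimal sketch: pandas for wrangling and summarising, Plotly Express for an interactive chart.
# "sales.csv" and its columns (ad_spend, sales, region) are hypothetical placeholders.
import pandas as pd
import plotly.express as px

df = pd.read_csv("sales.csv")                        # load the raw data
df = df.dropna(subset=["ad_spend", "sales"])         # drop rows missing key fields
print(df.groupby("region")["sales"].describe())      # quick descriptive statistics by region

# Interactive scatter plot of advertising spend vs. sales, coloured by region
fig = px.scatter(df, x="ad_spend", y="sales", color="region",
                 title="Sales vs. Advertising Spend")
fig.show()
```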
SESSION PLAN (24 Sessions)

Session | Unit | Topic of the Session | Mode of Teaching / Pedagogy | CO Mapping | Continuous Evaluation / Assessment
1 | I | Introduction to Business Analytics: Analytics Landscape, Framework for Data-Driven Decision Making | Lecture + Case Study | CO1, CO2 | Class Participation
2 | I | Roadmap for Analytics Capability Building | Lecture + Group Discussion | CO2, CO4 | Class Participation + Reflection Paper
3 | I | Challenges in Data-Driven Decision Making and Future Trends | Lecture | CO1, CO3 | In-Class Activity
4 | I | Foundations of Data Science – Data Types & Scales of Variable Measurement | Lecture + Hands-on Session | CO1, CO2 | Class Participation + Quiz
5 | I | Feature Engineering & Functional Applications of Business Analytics | Lecture + Practical Case Studies | CO2, CO3 | Project Work Progress Review
6 | I | Widely Used Analytical Tools in Business Analytics | Demonstration + Tool Walkthrough (Excel, R, Tableau) | CO1, CO4 | Hands-on Project
7 | I | Ethics in Business Analytics | Lecture + Debate on Ethical Issues | CO3, CO4 | Class Participation + Discussion Notes
8 | I | Review of Unit I: Consolidating Key Concepts | Recap + Q&A | CO1, CO2, CO3 | Review Test
9 | II | Introduction to Python and Jupyter Notebooks | Lecture + Python Setup Walkthrough | CO1, CO2 | Quiz + Hands-on Exercise
10 | II | Basic Programming Concepts and Syntax in Python | Lecture + Coding Demo | CO2, CO3 | Code Submission
11 | II | Core Libraries in Python (Pandas, NumPy) | Lecture + Code Exercises | CO2, CO3 | Code Submission + Peer Review
12 | II | Map and Filter Functions in Python | Hands-on Session | CO3, CO4 | Code Implementation
13 | II | Data Collection and Wrangling | Practical Exercise + Discussion | CO2, CO3 | Project Progress Review
14 | II | Data Visualization Techniques in Python (Matplotlib, Seaborn) | Demonstration + Hands-on Session | CO3, CO4 | Visualization Assignment
15 | II | Feature Engineering & Selection in Python | Case Study + Coding Exercise | CO3, CO4 | Code Submission
16 | II | Feature Scaling, Feature Selection Techniques | Practical Example + Group Discussion | CO3, CO4 | Class Activity + Assignment
17 | III | Building Models in Business Analytics | Lecture + Hands-on Python Implementation | CO3, CO5 | Model Building Assignment
18 | III | Model Evaluation and Tuning (Hyper-parameter Tuning) | Lecture + Practical Example | CO5, CO6 | Quiz + Hands-on Exercise
19 | III | Model Interpretation and Deployment | Lecture + Hands-on Session | CO5, CO6 | Deployment Plan Presentation
20 | III | Exploratory Data Analysis (EDA) and Diagnostic Analysis | Workshop + Group Work | CO3, CO4 | Group Discussion + EDA Report
21 | III | Steps in Building a Regression Model | Lecture + Example Walkthrough | CO4, CO5 | Regression Model Assignment
22 | III | Building Simple and Multiple Regression Models | Practical Session + Discussion | CO5, CO6 | Regression Model Code Submission
23 | III | Binary Logistic Regression, ROC and AUC | Lecture + Case Study | CO5, CO6 | Logistic Regression Assignment
24 | III | Regularization Techniques (L1, L2) | Practical Exercise + Discussion | CO5, CO6 | Regularization Implementation
25 | III | Review and Wrap-up of Unit III | Recap + Q&A | CO5, CO6 | Final Project Review & Feedback
CO-PO-PSO SET MAPPINGS

Programme Outcomes
PO1: Graduates would exhibit clarity of thought in expressing their views.
PO2: Graduates will have the ability to communicate effectively across diverse channels.
PO3: Graduates will be able to flesh out key decision points when confronted with a business problem.
PO4: Graduates will have the capacity to formulate strategies in the functional areas of management.
PO5: Graduates would be able to analyze the health of an organization by perusing its MIS reports/financial statements.
PO6: Graduates would be able to analyze the health of an organization by perusing its MIS reports/financial statements.
PO7: Graduates would demonstrate a hunger for challenging assignments.
PO8: Graduates would display an empathetic attitude to alleviate societal problems.

Program Specific Outcomes (PSOs)


PSO1: The PGDM graduates will demonstrate proficiency in analyzing complex business scenarios, formulating strategies, and communicating recommendations effectively across diverse mediums. They will exhibit clarity of thought, critical thinking skills, and the ability to assess organizational health through comprehensive analysis of financial statements and management information systems reports. Graduates will be adept at identifying key decision points, evaluating alternatives, and proposing viable solutions to address business challenges. They will showcase a hunger for challenging assignments, a commitment to continuous learning, and the capacity to thrive in dynamic and demanding professional environments.

PSO2: PGDM graduates will cultivate a deep understanding of the interplay between businesses and society, recognizing the pivotal role organizations play in addressing societal problems. They will develop an empathetic attitude, coupled with a strong ethical framework, enabling them to approach complex issues with compassion and a commitment to sustainable solutions. Graduates will be equipped with the knowledge and skills to formulate strategies that harmonize organizational goals with societal well-being. They will emerge as responsible leaders, driven by a sense of purpose and a determination to create positive and lasting impacts on the communities they serve.
CO – PO & PSO Mapping

Name of the Course: Business Analytics for Decision Making (all rows)

Course Outcome | PO1 | PO2 | PO3 | PO4 | PO5 | PSO1 | PSO2 | Average
CO1: Demonstrate the ability to organize, compare, transform, and summarize business data to derive meaningful insights | 3 | 2 | 4 | 3 | 4 | 4 | 2 | 3.14
CO2: Apply analytical tools and techniques to effectively analyze and interpret business data, leading to the development of data-driven strategies | 4 | 3 | 5 | 5 | 5 | 5 | 3 | 4.29
CO3: Clearly and effectively communicate data insights to stakeholders using visualizations, reports, and presentations | 4 | 5 | 4 | 3 | 3 | 4 | 5 | 4.29
CO4: Cultivate a commitment to ethical decision-making, ensuring choices that positively impact organizations and stakeholders | 5 | 4 | 3 | 4 | 3 | 4 | 5 | 4.14
Average | 4.00 | 3.50 | 4.00 | 3.75 | 4.25 | 4.25 | 3.75 | 4.22
EVALUATION CRITERIA

The students will be evaluated through Continuous Evaluation (CE) components (CE1, CE2, and CE3), the Mid-term examination, and the End-term examination. Continuous Evaluation components include active participation in class, presentations, assignments, case study discussions, etc.

As this is a 3-credit course, 15 marks are allocated to Continuous Evaluation. The student's performance will be evaluated based on timely completion of the continuous evaluation components. The marks are allocated as follows:

1. Continuous Evaluation: 15 Marks


 Continuous Evaluation includes both quantitative and qualitative evaluation.
 Quantitative evaluation includes online tests/quizzes. Three tests will be conducted during the trimester, and the average of the best two scores will be taken as the student's final score, as illustrated below. This approach ensures a fair assessment of the student's performance and provides an opportunity for improvement.
 Qualitative evaluation includes project work and case analysis.
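For instance (an illustrative example, not drawn from the regulations): if a student scores 6, 8, and 9 out of 10 on the three online tests, the quantitative score is the average of the best two, i.e., (8 + 9) / 2 = 8.5.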

The three components (CE1, CE2, and CE3) carry 5 marks each:

 CE 1: 5 Marks (Online Course)
 CE 2: 5 Marks (Project Work)
 CE 3: 5 Marks (Case Analysis)

2. Mid-term Examination: 15 Marks

3. End-term Examination: 30 Marks

Total: 60 Marks
Rubric Matrix for Continuous Assessments
Assessment | Criteria | Weightage (out of 5) | Description
CE 1: Online Tests/Quizzes | Correctness of Answers | 2 | Correctness of answers to questions (multiple choice, short answer, etc.).
CE 1: Online Tests/Quizzes | Conceptual Understanding | 1 | Demonstrates understanding of key concepts tested throughout the course.
CE 1: Online Tests/Quizzes | Timeliness of Submission | 1 | Submission of assignments within the deadlines.
CE 1: Online Tests/Quizzes | Clarity and Precision of Responses | 1 | Responses are clear, concise, and directly address the question.
CE 2: Project Work | Research and Analysis | 2 | Depth and quality of the research and analysis conducted in the project.
CE 2: Project Work | Practical Application | 1 | Effective application of theoretical knowledge to real-world scenarios.
CE 2: Project Work | Presentation and Structure | 1 | Clear, logical organization of the project report and presentation.
CE 2: Project Work | Creativity and Innovation | 1 | Originality and innovative approach in presenting ideas.
CE 3: Case Analysis | Identification of Key Issues | 2 | Ability to identify and articulate the core issues of the case effectively.
CE 3: Case Analysis | Critical Thinking and Solutions | 2 | Depth of analysis and quality of proposed solutions to the issues identified.
CE 3: Case Analysis | Clarity of Presentation | 1 | Logical structure, clarity, and organization in presenting the analysis.

The detailed rubric below provides a framework for assessing student performance in the three continuous assessments.
CE 1: Online Tests/Quizzes

Knowledge & Comprehension
Excellent (5): Demonstrates a thorough understanding of course concepts with high accuracy. Answers all questions correctly and comprehensively.
Very Good (4): Demonstrates good understanding of course concepts with minor errors. Answers most questions correctly and with some depth.
Good (3): Demonstrates basic understanding of course concepts with some inaccuracies. Answers most questions correctly but lacks depth.
Fair (2): Limited understanding with several inaccuracies; answers some questions incorrectly.
Needs Improvement (1): Lacks understanding; answers many questions incorrectly or not at all.

Application & Analysis
Excellent (5): Applies concepts effectively to solve problems, demonstrating critical thinking and problem-solving skills.
Very Good (4): Applies concepts with some accuracy and basic critical thinking and problem-solving skills.
Good (3): Applies concepts with limited accuracy; shows difficulty applying them to solve problems.
Fair (2): Unable to apply concepts effectively; lacks understanding of how to use them.
Needs Improvement (1): Lacks application and critical thinking skills.

CE 2: Project Work

Research & Analysis
Excellent (5): Thorough and in-depth research; insightful conclusions based on strong analysis.
Very Good (4): Good research and analysis with minor limitations; relevant conclusions drawn.
Good (3): Basic research and analysis; limited conclusions; some data interpretation limitations.
Fair (2): Limited research and analysis; inaccurate or irrelevant conclusions drawn.
Needs Improvement (1): No research or analysis; fails to draw meaningful conclusions.

Project Execution
Excellent (5): Project is well-organized, executed efficiently, and delivered on time. Demonstrates excellent project management skills.
Very Good (4): Project is well-organized and executed effectively with minor delays. Demonstrates good project management skills.
Good (3): Project is adequately organized and executed with some delays. Demonstrates basic project management skills.
Fair (2): Project is poorly organized and executed with significant delays. Demonstrates poor project management skills.
Needs Improvement (1): Project is poorly organized and executed, with significant delays and missed deadlines.

Presentation & Communication
Excellent (5): Project report and presentation are well-written, clear, concise, and engaging. Demonstrates excellent communication and presentation skills.
Very Good (4): Project report and presentation are well-written and clear. Demonstrates good communication and presentation skills.
Good (3): Project report and presentation are somewhat well-written with minor errors. Demonstrates adequate communication skills.
Fair (2): Project report and presentation are poorly written and difficult to understand. Demonstrates poor communication skills.
Needs Improvement (1): Project report and presentation are poorly written and difficult to understand. Demonstrates very poor communication skills.

CE 3: Case Analysis

Case Understanding
Excellent (5): Demonstrates a deep understanding of the case study, identifying key issues and challenges accurately.
Very Good (4): Demonstrates good understanding of the case study, identifying key issues with some minor inaccuracies.
Good (3): Demonstrates basic understanding of the case study, missing some key issues.
Fair (2): Limited understanding of the case study, with significant inaccuracies in identifying key issues.
Needs Improvement (1): Lacks understanding of the case study and fails to identify key issues and challenges.

Problem Identification & Analysis
Excellent (5): Accurately identifies key problems and challenges in the case study. Provides insightful analysis and supports conclusions with strong evidence.
Very Good (4): Identifies key problems and challenges in the case study with some minor inaccuracies. Provides good analysis with some supporting evidence.
Good (3): Identifies some key problems but may miss important ones. Provides basic analysis with limited supporting evidence.
Fair (2): Fails to accurately identify key problems and challenges. Provides limited or no analysis.
Needs Improvement (1): Unable to identify key problems or challenges in the case study.

Solution Development & Recommendations
Excellent (5): Develops creative and insightful solutions to the identified problems. Recommendations are well-supported, realistic, and actionable.
Very Good (4): Develops feasible solutions to the identified problems. Recommendations are supported with reasonable evidence.
Good (3): Develops basic solutions to the identified problems with some limitations. Recommendations may lack clarity or depth.
Fair (2): Develops limited or unrealistic solutions to the identified problems. Recommendations are not well-supported.
Needs Improvement (1): Fails to develop any meaningful solutions or recommendations to the identified problems.
GLOSSARY OF TERMS (Unit-wise)

Unit I – Introduction to Business Analytics


1. Analytics Landscape: The ecosystem and environment that includes all tools, technologies, and
methodologies involved in analytics.
2. Business Analytics: The process of using data analysis tools to interpret data and make business
decisions.
3. Data-Driven Decision Making: Making decisions based on data analysis, rather than intuition or
assumptions.
4. Descriptive Analytics: Analysis that focuses on understanding historical data and what has happened.
5. Predictive Analytics: Techniques that predict future outcomes based on historical data.
6. Prescriptive Analytics: Analytics that suggest courses of action based on predictions and trends.
7. Decision Support Systems (DSS): Tools that help in making business decisions based on data analysis.
8. Data Visualization: The graphical representation of data to identify patterns, trends, and insights.
9. Data Mining: The process of discovering patterns and knowledge from large amounts of data.
10. Data Warehousing: The storage and management of large datasets used for analysis and reporting.
11. Data Science: A multidisciplinary field that combines computer science, statistics, and domain expertise
to extract knowledge from data.
12. Machine Learning (ML): A branch of artificial intelligence focused on creating algorithms that enable
computers to learn from data.
13. Big Data: Large, complex datasets that traditional data processing software can't handle.
14. Business Intelligence (BI): Technologies and strategies used to analyze business data and support
decision-making.
15. Challenges in Data-Driven Decision Making: Problems such as data quality, integration, accessibility,
and resistance to change.
16. Ethics in Analytics: Addressing moral issues related to data privacy, security, fairness, and transparency.
17. Data Governance: Policies and procedures to ensure data quality, consistency, and privacy.
18. Predictive Modeling: Using statistical techniques and machine learning to predict future trends.
19. Statistical Inference: The process of using data to make generalizations or predictions about a
population.
20. Scalability: The ability of an analytical tool or process to handle increasing volumes of data efficiently.
21. Variable Measurement: Methods to quantify variables in a dataset (nominal, ordinal, interval, ratio).
22. Data Types: Classification of data (e.g., numerical, categorical).
23. Nominal Data: Categorical data without any inherent order (e.g., gender, color).
24. Ordinal Data: Categorical data with a specific order (e.g., ratings on a scale of 1-5).
25. Interval Data: Data with meaningful differences but no true zero point (e.g., temperature).
26. Ratio Data: Data with meaningful differences and a true zero point (e.g., height, weight).
27. Feature Engineering: The process of selecting, modifying, or creating features to improve model performance (illustrated, together with data types and scaling, in the brief Python sketch at the end of this unit's glossary).
28. Feature Scaling: Standardizing the range of independent variables (e.g., normalizing data).
29. Feature Selection: Identifying and selecting the most important features for model building.
30. Widely Used Analytical Tools: Tools like Excel, Tableau, SAS, R, Python used for data analysis and
visualization.
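
To make the entries on data types, scales of measurement, and feature engineering (items 21-29) concrete, the following is a brief, illustrative pandas sketch; the DataFrame and its column names are hypothetical, not taken from the course material.

```python
# Illustrative sketch for glossary items 21-29: variable scales, feature engineering, and scaling.
# The DataFrame and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "city":         ["Hyderabad", "Mumbai", "Delhi"],   # nominal: categories with no inherent order
    "satisfaction": ["Low", "High", "Medium"],           # ordinal: ordered categories
    "temperature":  [31.5, 29.0, 24.5],                  # interval: meaningful differences, no true zero
    "income":       [52000, 75000, 61000],               # ratio: true zero point exists
})

# Feature engineering: one-hot encode the nominal column and
# map the ordinal column to ordered integers.
df = pd.get_dummies(df, columns=["city"])
df["satisfaction"] = df["satisfaction"].map({"Low": 1, "Medium": 2, "High": 3})

# Feature scaling: min-max normalisation of the ratio-scale column to the range [0, 1].
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)
```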

Unit II – Introduction to Python and Data Processing


31. Python: A versatile, high-level programming language widely used for data analysis, machine learning,
and automation.
32. Jupyter Notebooks: A web application used for writing and sharing code and data analysis results.
33. Variables: Containers used to store data values in Python programs.
34. Data Structures: Methods of organizing data in Python, such as lists, tuples, sets, and dictionaries.
35. Functions: Reusable blocks of code that perform specific tasks.
36. Loops: Control flow statements (for, while) that allow repeated execution of code.
37. Conditionals: Statements (if-else) used to control the flow of execution based on conditions.
38. Pandas: A Python library used for data manipulation and analysis, particularly for structured data like
CSV files.
39. NumPy: A Python library used for numerical computing, particularly for handling arrays and matrices.
40. Matplotlib: A Python library for creating static, animated, and interactive visualizations.
41. Seaborn: A data visualization library based on Matplotlib that provides a high-level interface for
drawing attractive statistical graphics.
42. SciPy: A Python library used for scientific and technical computing.
43. Map Function: A built-in Python function used to apply a given function to all items in an iterable (e.g., a list or tuple).
44. Filter Function: A built-in Python function used to select elements from an iterable based on a condition (both functions are illustrated in the brief sketch at the end of this unit's glossary).
45. Data Wrangling: The process of cleaning and transforming raw data into a usable format for analysis.
46. Data Collection: The process of gathering data from various sources, such as surveys, sensors, or
databases.
47. Data Cleaning: Identifying and correcting errors or inconsistencies in a dataset.
48. Data Transformation: Changing the format, structure, or values of data to make it suitable for analysis.
49. Feature Extraction: Extracting meaningful variables or features from raw data (e.g., from images, text,
or audio).
50. Feature Selection: Choosing a subset of relevant features for building a model.
51. Text Data: Unstructured data, often in the form of natural language (e.g., product reviews, emails).
52. Categorical Data: Data that represents categories or groups (e.g., customer types, product categories).
53. Numeric Data: Data that represents numerical values (e.g., age, sales amount).
54. Image Data: Data that represents images, often in pixel matrix form.
55. Regular Expressions: A sequence of characters used to search and manipulate strings in Python.
56. Data Visualization Libraries: Tools used to create visualizations of data (e.g., Matplotlib, Seaborn,
Plotly).
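
A brief, illustrative sketch of the map and filter entries (items 43-44) and a typical wrangling step (items 45-48); the product list and DataFrame contents are hypothetical.

```python
# Illustrative sketch for glossary items 43-48: map, filter, and basic data wrangling.
# The product list and DataFrame contents are hypothetical.
import pandas as pd

# map: apply a function to every item in an iterable
products = ["laptop", "mouse", "keyboard"]
upper_names = list(map(str.upper, products))            # ['LAPTOP', 'MOUSE', 'KEYBOARD']

# filter: keep only the items that satisfy a condition
sales = [1200, 300, 4500, 90]
large_sales = list(filter(lambda x: x > 1000, sales))   # [1200, 4500]

# Data wrangling with pandas: impute a missing value, drop an unusable row, fix a data type
df = pd.DataFrame({"units": ["10", "12", None], "price": [99.0, None, 150.0]})
df["price"] = df["price"].fillna(df["price"].mean())    # impute missing prices with the mean
df = df.dropna(subset=["units"])                        # drop rows with missing units
df["units"] = df["units"].astype(int)                   # convert text to integers

print(upper_names, large_sales)
print(df)
```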

Unit III – Building, Tuning, and Deploying Models


57. Machine Learning Models: Algorithms used to analyze patterns in data and make predictions or
classifications.
58. Model Evaluation: The process of assessing the performance of a machine learning model using various
metrics.
59. Model Tuning: The process of optimizing a machine learning model by adjusting its hyper-parameters.
60. Hyper-parameters: Parameters that are set before training a machine learning model, such as the learning rate or the number of trees in a random forest.
61. Cross-Validation: A technique used to assess the generalization ability of a model by splitting data into
training and testing sets multiple times.
62. Regression Analysis: A statistical technique for modeling the relationship between a dependent variable
and one or more independent variables.
63. Simple Linear Regression: A regression model with one independent variable.
64. Multiple Linear Regression: A regression model that uses multiple independent variables to predict a
dependent variable.
65. Logistic Regression: A classification algorithm used for binary outcomes (0 or 1).
66. Binary Classification: A type of classification problem where the output has two classes (e.g., spam vs.
not spam).
67. Model Diagnostics: The process of evaluating model assumptions and residuals to assess the model's
fitness.
68. Residuals: The differences between observed and predicted values in regression models.
69. Receiver Operating Characteristic (ROC): A graphical representation of a classifier’s performance
across various thresholds.
70. Area Under the Curve (AUC): A metric used to evaluate the performance of a classification model,
particularly for imbalanced datasets.
71. Classification Cut-Off: The threshold used to decide the classification boundary in binary classification.
72. Gain Chart: A graphical tool used to evaluate the performance of classification models, particularly with
imbalanced datasets.
73. Lift Chart: A graphical tool used to evaluate the improvement a model provides over a random guess in
classification problems.
74. Regularization: A technique used to prevent overfitting by adding a penalty term to the model's cost function (see the brief Python sketch at the end of this unit's glossary).
75. L1 Regularization (Lasso): A form of regularization that can shrink some model coefficients to zero,
effectively performing feature selection.
76. L2 Regularization (Ridge): A form of regularization that penalizes large coefficients but does not shrink
them to zero.
77. Overfitting: When a model learns the noise in the training data, resulting in poor generalization to new
data.
78. Under-fitting: When a model is too simple to capture the underlying patterns in the data, leading to poor
performance.
79. Training Data: The dataset used to train a machine learning model.
80. Test Data: The dataset used to evaluate the performance of a machine learning model after it has been
trained.
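
The sketch below ties together several of the Unit III entries: a binary logistic regression is fitted, evaluated with ROC and AUC, a simple classification cut-off is chosen, and the model is refitted with L1 and L2 penalties. It is an illustrative scikit-learn example on synthetic data, not code from the prescribed texts.

```python
# Illustrative sketch for glossary items 57-76: logistic regression, ROC/AUC, cut-off, regularization.
# Synthetic data stands in for a real business dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Training and test data (items 79-80)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Binary logistic regression (item 65) evaluated with ROC and AUC (items 69-70)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))

# A simple optimal classification cut-off (item 71): the threshold maximising TPR - FPR
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("Optimal cut-off:", thresholds[(tpr - fpr).argmax()])

# Regularization (items 74-76): L1 (Lasso-style) vs. L2 (Ridge-style) penalties
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_train, y_train)
l2_model = LogisticRegression(penalty="l2", C=0.5).fit(X_train, y_train)
print("Non-zero coefficients with L1 penalty:", (l1_model.coef_ != 0).sum())
print("Non-zero coefficients with L2 penalty:", (l2_model.coef_ != 0).sum())
```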
QUESTION BANK (BTL 1-6)

BTL 1 – Remembering (Recall of Facts, Terms, or Concepts)
BTL 2 – Understanding (Explain Concepts or Interpret Information)
BTL 3 – Applying (Use Knowledge in Practical Situations)
BTL 4 – Analyzing (Break Down Information into Parts and Identify Relationships)
BTL 5 – Evaluating (Make Judgments Based on Criteria and Standards)
BTL 6 – Creating (Generate New Ideas, Propose Solutions, or Construct New Models)

Each question below is numbered and tagged with its BTL level.
Unit I
1. Define 'Feature Engineering' and give one example. (BTL 1)
2. Explain the term 'Ethics in Business Analytics'. (BTL 1)
3. List the different types of data scales used in business analytics. (BTL 1)
4. Name three widely used analytical tools in business analytics. (BTL 1)
5. What are the challenges faced in data-driven decision making? (BTL 1)
6. What are the fundamental types of data used in analytics? (BTL 1)
7. What are the key components of the analytics landscape? (BTL 1)
8. What is the primary goal of building analytics capabilities in an organization? (BTL 1)
9. Describe how business analytics can be applied to solve real-world management problems. Provide an example. (BTL 2)
10. Discuss the role of widely used analytical tools in supporting decision-making in business. (BTL 2)
11. Explain the challenges in building an analytics capability within an organization. (BTL 2)
12. Explain the framework for data-driven decision making and its significance in business analytics. (BTL 2)
13. How can business analytics be ethically applied in decision-making? Discuss potential ethical issues that may arise. (BTL 2)
14. How does feature engineering enhance the predictive power of a data model? Provide an example. (BTL 2)
15. Interpret the concept of 'Roadmap for Analytics Capability Building' and how it impacts an organization's ability to leverage data analytics. (BTL 2)
16. What is the relationship between data types and scales of measurement? How do they influence data analysis? (BTL 2)
17. Apply ethics in business analytics by analysing a case where data could be misused for business advantage. How would you propose a solution? (BTL 3)
18. Demonstrate how feature engineering optimizes machine learning outcomes. (BTL 3)
19. Describe a data collection framework incorporating various data types and scales for a specific business problem. (BTL 3)
20. Given a dataset with customer purchase behaviour, apply the appropriate feature engineering techniques to improve its quality for predictive modelling. (BTL 3)
21. Given a management decision problem (e.g., marketing strategy, operational efficiency), choose the appropriate data types and scales to collect, analyse, and present findings to support decision-making. (BTL 3)
22. How can you apply business analytics to improve decision-making in human resource management? (BTL 3)
23. How can you apply ethical principles to ensure unbiased outcomes in predictive analytics? (BTL 3)
24. How can you apply the knowledge of variable measurement scales in designing surveys for market research? (BTL 3)
25. How would you apply feature engineering techniques to improve the performance of a predictive model? (BTL 3)
26. How would you apply tools like Python to conduct predictive analytics in finance? (BTL 3)
27. In a business scenario where data-driven decisions could impact stakeholders, describe how ethical considerations should guide the analytics process and its outcomes. (BTL 3)
28. Using a set of business metrics, identify the challenges a company might face in using data-driven decision making to improve customer satisfaction. (BTL 3)
29. Using one widely used analytical tool, propose how you would analyse sales data to forecast demand for the next quarter in a manufacturing firm. (BTL 3)
30. You are tasked with improving decision-making in a retail company. Use the framework for data-driven decision making to propose a roadmap for integrating analytics into business processes. (BTL 3)
31. You have been given the task of introducing business analytics into a traditional organization. Apply your knowledge of analytics capabilities and outline steps to create an analytics roadmap for the organization. (BTL 3)
32. Analyze a scenario where a company is struggling to integrate analytics into its decision-making processes. What data types and scales would be most relevant to their analytics efforts? (BTL 4)
33. Analyze the differences between commonly used analytical tools in terms of functionality and ease of use. (BTL 4)
34. Analyze the differences between nominal, ordinal, interval, and ratio scales in the context of business applications. (BTL 4)
35. Analyze the ethical dilemmas businesses face when using customer data for analytics purposes. (BTL 4)
36. Analyze the impact of business analytics on marketing strategies and outcomes. (BTL 4)
37. Analyze the impact of data quality issues on the effectiveness of data-driven decision making in a company. How would you address these issues? (BTL 4)
38. Analyze the role of feature engineering in handling imbalanced datasets in customer segmentation. (BTL 4)
39. Compare and contrast the challenges faced by companies in building analytics capability in different industries (e.g., manufacturing vs. retail). (BTL 4)
40. Examine the ethical implications of using customer data for predictive analytics in a retail company. How would you assess the risks of misuse? (BTL 4)
41. Given a case study where a company has started implementing business analytics, analyse the strengths and weaknesses in their data-driven decision-making process. (BTL 4)
42. Given a dataset with multiple variables, analyse how feature engineering can be applied to improve the predictive model's accuracy. (BTL 4)
43. In a case where an organization has adopted an analytical tool but faces resistance, analyse the reasons for the resistance and suggest steps to overcome it. (BTL 4)
44. Assess the ethical considerations in applying machine learning algorithms to customer data. What guidelines would you propose to ensure responsible use of data? (BTL 5)
45. Assess the impact of inaccurate or biased data on business analytics outcomes, and provide recommendations for mitigating these issues in decision-making processes. (BTL 5)
46. Based on an analysis of a company's data analytics maturity, evaluate the next steps they should take to further enhance their data-driven decision-making capabilities. (BTL 5)
47. Critically evaluate the choice of analytical tools in a project. How do you decide which tool is the best fit for a specific business problem? (BTL 5)
48. Evaluate the effectiveness of an analytics capability-building roadmap for a global company versus a local startup. Which would be more challenging and why? (BTL 5)
49. Evaluate the role of feature engineering in a business analytics project. What criteria would you use to determine whether feature engineering has added value to the model? (BTL 5)
50. Given a management scenario where data-driven decisions are crucial, evaluate whether the company's approach to data ethics aligns with industry standards. What improvements would you suggest? (BTL 5)
51. Create a business analytics framework that a new company can adopt to guide data-driven decisions in a competitive market. What key components would you include? (BTL 6)
52. Create a detailed proposal for integrating widely used analytical tools into the decision-making processes of an organization. How would you ensure these tools align with the company's goals and strategies? (BTL 6)
53. Design a comprehensive analytics roadmap for a company looking to move from traditional decision-making methods to a fully data-driven approach. What steps should be included in the roadmap? (BTL 6)
54. Design an analytics-based solution for supply chain optimization in a manufacturing company. (BTL 6)
55. Design an ethical guidelines framework for the use of business analytics in customer segmentation. How would you ensure the framework is adhered to in the company's data practices? (BTL 6)
56. Develop a business case for the use of business analytics in improving operational efficiency in a supply chain. What key performance indicators would you track, and how would you measure success? (BTL 6)
57. Develop an innovative solution using business analytics to improve customer retention for a retail company. How would you collect, analyze, and act on the relevant data? (BTL 6)
58. Propose a framework for ethical governance in business analytics to ensure compliance and trust. (BTL 6)
59. Propose a new feature engineering strategy for improving the predictive accuracy of a model that forecasts product demand. Justify your choice of methods and techniques. (BTL 6)
Unit II
60. Define the concept of feature scaling and why it is important in data analysis. (BTL 1)
61. Explain the basic data types in R. (BTL 1)
62. List three core libraries in Python that are commonly used for data analysis and manipulation. (BTL 1)
63. Name two types of data that can be handled during feature engineering. (BTL 1)
64. What are the basic components of a Python program? (BTL 1)
65. What are the different data types in Python used for data processing? (BTL 1)
66. What are the main assumptions of logistic regression? (BTL 1)
67. What is data wrangling, and why is it a crucial step in the data processing pipeline? (BTL 1)
68. What is Jupyter Notebook, and why is it commonly used in Python for data analysis? (BTL 1)
69. What is the difference between map and filter functions in Python? (BTL 1)
70. What is the primary function of the pandas library in Python? (BTL 1)
71. What is the purpose of feature selection in machine learning? (BTL 1)
72. Describe the importance of power analysis in experimental design. (BTL 2)
73. Describe the process of feature extraction and engineering. Why is it essential to perform these tasks when working with machine learning models? (BTL 2)
74. Describe the purpose of feature scaling and how it affects the performance of machine learning algorithms like k-NN and linear regression. (BTL 2)
75. Discuss how Python libraries like pandas and numpy contribute to data wrangling. Provide an example of a common data wrangling task. (BTL 2)
76. Discuss the key functions used to create a dataset in R. (BTL 2)
77. Discuss the role of data collection in the context of business analytics. How does data collection influence the quality of the insights derived from the data? (BTL 2)
78. Explain how Python's filter function differs from map, and provide an example where filter would be more appropriate in a data processing task. (BTL 2)
79. Explain the concept of data visualization and its importance in the data analysis process. Provide an example of a visualization you might use for a sales dataset. (BTL 2)
80. Explain the difference between data.frame() and tibble() in R. (BTL 2)
81. Explain the difference between feature engineering for numeric data, categorical data, and text data. (BTL 2)
82. Explain the difference between simple, multiple, and polynomial regression models. (BTL 2)
83. Explain the impact of missing values on data analysis results. (BTL 2)
84. Explain the role of Jupyter Notebooks in Python-based data analysis projects and their advantages over traditional coding environments. (BTL 2)
85. Explain the significance of Python in data analysis and visualization. (BTL 2)
86. How does feature selection help improve the performance of a model? Describe a scenario where feature selection would be particularly useful. (BTL 2)
87. How does the map function in Python work? Provide an example where it would be useful in a data analysis pipeline. (BTL 2)
88. What are the key objectives of exploratory data analysis? (BTL 2)
89. Apply feature selection techniques to a dataset with multiple features. Justify which features should be selected and which should be discarded based on their relevance to the model. (BTL 3)
90. Demonstrate how to use Python's filter function to extract records from a dataset where the sales value exceeds a specific threshold. (BTL 3)
91. Given a dataset with customer purchase data, perform feature extraction and engineering on both categorical and numeric features to prepare the data for modelling. Explain your approach. (BTL 3)
92. Given a dataset with numeric and categorical features, apply feature scaling techniques (e.g., standardization or normalization) to pre-process the data. (BTL 3)
93. Given a dataset with sales and customer information, perform data wrangling steps such as handling missing data, removing duplicates, and converting data types using Python. (BTL 3)
94. How would you create a dataset in Python for analysing sales data? (BTL 3)
95. How would you create a scatter plot in Python with customized labels and colours? (BTL 3)
96. How would you create summary statistics and visualizations for EDA in Python? (BTL 3)
97. How would you fit and validate a multiple regression model in Python using lm()? (BTL 3)
98. How would you handle missing values using imputation techniques in Python? (BTL 3)
99. How would you install and load a package in Python to begin your analysis? (BTL 3)
100. How would you use Python to diagnose and validate a logistic regression model? (BTL 3)
101. How would you use Python to identify and remove duplicate records in a dataset? (BTL 3)
102. In a dataset with multiple continuous variables, apply feature scaling (e.g., Min-Max Scaling) and describe how the scaling process influences the model's ability to learn patterns. (BTL 3)
103. Use Python's matplotlib or seaborn library to visualize the relationship between product sales and advertising spend in a dataset. Provide insights based on the visualization. (BTL 3)
104. Using Jupyter Notebooks, demonstrate how you would clean a dataset with missing values and outliers, and provide a summary of the cleaned data. (BTL 3)
105. Using Python, implement feature engineering for a text-based dataset (e.g., customer reviews) by extracting useful features such as word count and sentiment. (BTL 3)
106. What are the key steps in cleaning raw data before analysis? (BTL 3)
107. What are the steps to calculate measures of central tendency (mean, median, mode) in Python? (BTL 3)
108. Write a Python function using map to convert a list of strings (e.g., product names) to uppercase. (BTL 3)
109. Analyze the effectiveness of advanced graphs like boxplots and heatmaps in identifying trends. (BTL 4)
110. Analyze the effectiveness of logistic regression in predicting binary outcomes. (BTL 4)
111. Analyze the effects of multicollinearity in a regression model and propose methods to address it. (BTL 4)
112. Analyze the impact of outliers on regression analysis. (BTL 4)
113. Analyze the impact of using different feature scaling techniques (e.g., normalization vs. standardization) on a machine learning model's performance. How would the choice of scaling method influence the results? (BTL 4)
114. Analyze the role of feature selection in improving the performance of a machine learning model. Given a dataset, how would you determine which features are irrelevant or redundant? (BTL 4)
115. Compare and contrast the use of variance and standard deviation in analyzing data. (BTL 4)
116. Examine the process of feature engineering for text data. How would you extract useful features from raw text (e.g., customer reviews) to enhance a machine learning model's ability to predict customer satisfaction? (BTL 4)
117. Given a dataset with different types of features (e.g., numeric, categorical, text), analyze how you would combine these features to improve model accuracy. What techniques would you use to pre-process these features for model training? (BTL 4)
118. Given a dataset with multiple missing values and categorical data, analyze the steps you would take to clean and prepare the data for analysis using Python. How would you handle missing data for categorical vs. numerical variables? (BTL 4)
119. Identify and describe the key differences between basic Python plotting functions and ggplot2. (BTL 4)
120. You are tasked with performing data wrangling on a large dataset with inconsistent formatting, missing values, and outliers. Analyze how you would approach each of these issues using Python and provide specific examples of functions or libraries you would use. (BTL 4)
121. After performing feature selection on a dataset, evaluate how the reduced feature set affects model performance. How would you judge whether the feature selection has improved or degraded the model's predictive power? (BTL 5)
122. Based on your analysis of a data wrangling process, evaluate the potential drawbacks and risks of handling missing values by imputation versus removing rows with missing data. Which method would be more suitable for a dataset with a high proportion of missing values? (BTL 5)
123. Critically evaluate the use of the map and filter functions in Python for data pre-processing. What are the advantages and limitations of these functions when handling large datasets? (BTL 5)
124. Evaluate the effectiveness of various feature engineering methods for categorical data (e.g., one-hot encoding vs. label encoding). In what scenarios would one method be preferred over the other? (BTL 5)
125. Evaluate the performance of a data visualization approach for exploring relationships between multiple variables in a business analytics project. How would you assess whether the visualizations provide actionable insights? (BTL 5)
126. Given a machine learning model's performance on a dataset, evaluate how feature scaling might improve its accuracy. Under what conditions would scaling not improve the performance of a model? (BTL 5)
127. Create a Python program that uses the map and filter functions to clean a large dataset. Describe how you would use these functions to transform categorical variables and filter out irrelevant data before applying machine learning models. (BTL 6)
128. Create an interactive data visualization dashboard using Python that can be used by business analysts to explore sales data, identify trends, and make informed decisions. What libraries would you use, and how would you structure the dashboard to maximize usability and insight generation? (BTL 6)
129. Design a complete data processing pipeline in Python, from data collection to feature engineering and selection. Provide a step-by-step approach for handling a dataset that includes missing values, outliers, and mixed data types (numeric and categorical). (BTL 6)
130. Design a feature selection process that evaluates the relevance of features based on correlation with the target variable. How would you use Python to automate the process of selecting the most significant features for a predictive model? (BTL 6)
131. Develop a machine learning model using Python, incorporating feature scaling and selection as part of the pre-processing pipeline. Explain how you would optimize this model by evaluating different feature engineering and scaling techniques. (BTL 6)
132. Propose a feature engineering strategy for a dataset containing both numerical and textual data. How would you handle the pre-processing and feature extraction for both types of data, and what methods would you use to integrate them into a unified feature set for model training? (BTL 6)
Unit III
133. Explain the importance of exploratory data analysis (EDA) in the process of building machine learning models. (BTL 2)
134. Define binary logistic regression and its application in business analytics. (BTL 1)
135. Explain what the ROC (Receiver Operating Characteristic) curve represents in model evaluation. (BTL 1)
136. List the steps involved in building a regression model. (BTL 1)
137. What are the key diagnostic tools used to assess the performance of a regression model? (BTL 1)
138. What are the key differences between L1 and L2 regularization techniques? (BTL 1)
139. What are the main features of Jupyter Notebooks that make them ideal for data analysis? (BTL 1)
140. What are the primary purposes of the NumPy and Pandas libraries in Python? (BTL 1)
141. What does AUC (Area Under the Curve) represent in a binary classification model? (BTL 1)
142. What is a Gain and Lift Chart, and how is it useful in evaluating classification models? (BTL 1)
143. What is regularization in machine learning, and what are the two types of regularization techniques? (BTL 1)
144. What is the difference between simple and multiple regression models? (BTL 1)
145. What is the purpose of model evaluation in the context of business analytics? (BTL 1)
146. Describe the role of visualization in exploratory data analysis and how it can aid in identifying patterns or trends in the data. (BTL 2)
147. Explain the concept of "optimal classification cut-off" in binary logistic regression and why it is important in model evaluation. (BTL 2)
148. How can hyper-parameter tuning enhance model performance? (BTL 2)
149. How does diagnostic analysis help in understanding the performance of a model? (BTL 2)
150. How does regularization (L1 and L2) help in reducing overfitting in machine learning models? (BTL 2)
151. How does the ROC curve assist in evaluating a binary classification model? What does it mean when the curve is closer to the top-left corner? (BTL 2)
152. Interpret the concept of a Gain and Lift Chart and explain how it helps in evaluating the performance of a classification model. (BTL 2)
153. What are the differences between mutable and immutable data types in Python? (BTL 2)
154. What is the difference between L1 (Lasso) and L2 (Ridge) regularization techniques? (BTL 2)
155. What is the difference between simple regression and multiple regression? How do you decide which one to use? (BTL 2)
156. What is the process of building a regression model? Explain the steps involved from data collection to model deployment. (BTL 2)
157. What is the significance of model tuning in improving model accuracy? (BTL 2)
158. After building a binary logistic regression model, evaluate its performance using ROC and AUC. Find the optimal classification threshold and assess the impact of changing the threshold on the classification results. (BTL 3)
159. Build a multiple regression model to predict sales based on advertising spend, price, and competition. Apply model diagnostics to check for multicollinearity and heteroscedasticity. (BTL 3)
160. Design a Python script to visualize and perform exploratory data analysis (EDA) on a dataset. Use different types of plots (e.g., histograms, scatter plots, box plots) to uncover trends and patterns in the data. (BTL 3)
161. Given a dataset with sales data, perform exploratory data analysis (EDA) using Python. Visualize key relationships between features and describe your findings. (BTL 3)
162. Given a dataset with several predictors, apply regularization (L1 and L2) to prevent overfitting in a multiple regression model. Compare the performance of both regularization techniques and provide recommendations. (BTL 3)
163. How can the map() function be applied to transform a list of strings into integers? (BTL 3)
164. How can you use descriptive statistics to summarize a dataset before applying machine learning models? (BTL 3)
165. How would you create a histogram to visualize the distribution of a numerical variable using seaborn? (BTL 3)
166. How would you handle missing values and duplicates in a dataset using Python? (BTL 3)
167. How would you use a gain chart to evaluate the effectiveness of a marketing campaign? (BTL 3)
168. How would you use Markdown in Jupyter Notebooks to document your data analysis process? (BTL 3)
169. How would you use the confusion matrix to evaluate the performance of a binary logistic regression model? (BTL 3)
170. How would you use the matplotlib library to create a line chart in Python? (BTL 3)
171. How would you write a Python function to calculate the factorial of a number? (BTL 3)
172. Using a dataset with binary outcomes (e.g., success/failure), implement a Gain and Lift Chart. Analyze the chart to determine how well your classification model performs in predicting the outcomes. (BTL 3)
173. Using a dataset with customer data, build and evaluate a binary logistic regression model to predict customer churn. Include ROC and AUC analysis, and find the optimal classification cut-off. (BTL 3)
174. What are the key steps involved in building and validating a regression model? (BTL 3)
175. What are the key steps involved in pre-processing a dataset for machine learning? (BTL 3)
176. You are given a dataset for a loan approval model. Build a multiple regression model to predict loan approval status and use diagnostics to check for issues such as multicollinearity and residuals. (BTL 3)
177. You are tasked with building a simple linear regression model to predict house prices based on square footage. Apply the steps in building the model, evaluate its performance, and interpret the results. (BTL 3)
178. You have built a regression model to predict employee performance. Conduct model tuning and hyper-parameter optimization to improve the accuracy. Describe the steps you took to achieve this. (BTL 3)
179. After performing exploratory data analysis (EDA) on a dataset, analyze how outliers and missing values can influence the results of your regression and classification models. (BTL 4)
180. Analyze how feature selection and feature engineering can improve model performance in a logistic regression problem. What methods would you apply for both categorical and continuous features? (BTL 4)
181. Analyze the advantages of using Jupyter Notebooks over traditional IDEs for Python programming. (BTL 4)
182. Analyze the effect of multicollinearity in a multiple regression model. How would you detect multicollinearity, and what steps would you take to address it? (BTL 4)
183. Analyze the effectiveness of combining statistical summaries with visualizations during exploratory data analysis. (BTL 4)
184. Analyze the impact of feature scaling on the performance of regression models. (BTL 4)
185. Analyze the impact of regularization (L1 vs. L2) on a regression model's coefficients. How does regularization affect the model's bias and variance? (BTL 4)
186. Analyze the residual plots of a regression model to assess its goodness of fit. (BTL 4)
187. Analyze the role of loops and conditionals in solving iterative programming problems. (BTL 4)
188. Analyze the trade-offs between using a simple regression model and a multiple regression model. How would the addition of more features affect the model's performance and interpretability? (BTL 4)
189. Compare the effectiveness of scatter plots and box plots in identifying patterns in data. (BTL 4)
190. Compare the functionalities of Pandas and NumPy when working with tabular data. (BTL 4)
191. Compare the usage of map() and filter() functions for handling large datasets. (BTL 4)
192. Compare the use of gain charts and lift charts in determining the efficiency of a predictive model. (BTL 4)
193. Given a binary logistic regression model, analyze the relationship between the ROC curve and AUC. What do these metrics tell you about the model's performance? (BTL 4)
194. Given a model that shows signs of overfitting, analyze the impact of tuning hyper-parameters such as the learning rate or regularization strength on improving the model's generalization ability. (BTL 4)
195. How would you interpret the p-value and R-squared value in a multiple regression model summary? (BTL 4)
196. Based on the model's performance, evaluate the use of feature scaling before applying a regression or classification model. How does scaling impact model accuracy and convergence speed? (BTL 5)
197. Critically evaluate the use of Gain and Lift charts for assessing classification models. How would you determine whether your model is performing well based on these charts? (BTL 5)
198. Evaluate a model's performance after applying hyper-parameter tuning. How would you assess whether the tuned model has improved over the original model in terms of accuracy, precision, and recall? (BTL 5)
199. Evaluate a regression model's diagnostics (e.g., residual plots, R-squared value). What metrics would you use to determine if the model fits the data well, and what actions would you take if the diagnostics indicate problems? (BTL 5)
200. Evaluate the effect of different regularization methods (L1 vs. L2) on model performance. In which scenarios would L1 regularization be preferred over L2, and why? (BTL 5)
201. Evaluate the importance of diagnostic analysis in model building. How can you use diagnostics to improve a model's generalizability and avoid overfitting or under-fitting? (BTL 5)
202. Evaluate the performance of a binary logistic regression model using the ROC curve and AUC. What thresholds would you set for classification, and how do these choices impact the model's precision and recall? (BTL 5)
203. Create a comprehensive model-building pipeline that includes the following stages: data pre-processing, model selection, hyper-parameter tuning, and model evaluation. Justify your choices of methods and tools at each stage. (BTL 6)
204. Create a Python script that builds, tunes, and evaluates a regression or classification model. Include steps for model diagnostics and feature selection, and explain how the script will automate the model evaluation process. (BTL 6)
205. Create a regression model to predict employee performance based on several predictors. Apply diagnostic analysis to validate the model's assumptions and improve its accuracy. (BTL 6)
206. Create an automated process for model monitoring and retraining after deployment. How would you design this system to account for model drift and ensure that the model continues to deliver accurate predictions over time? (BTL 6)
207. Design a logistic regression model to predict customer churn, and propose strategies for model tuning and hyper-parameter optimization. Include steps for evaluating the model using AUC, ROC, and Gain/Lift charts. (BTL 6)
208. Design an approach for creating a classification model using a dataset with imbalanced classes. How would you balance the data, select appropriate evaluation metrics, and ensure the model generalizes well to new data? (BTL 6)
209. Develop a business case for deploying a machine learning model that predicts product demand based on historical sales data. Describe how you would integrate the model into an organization's decision-making process. (BTL 6)
210. Develop a strategy to improve a multiple regression model's performance using feature engineering and regularization. Include the methods you would use to handle categorical variables, missing data, and scaling. (BTL 6)
211. Propose a new framework for building and deploying machine learning models that incorporates data collection, pre-processing, model training, evaluation, and monitoring. How would you ensure the model's long-term performance and scalability? (BTL 6)
212. Propose a solution for addressing overfitting in a machine learning model that uses many predictors. What steps would you take to reduce complexity while maintaining model accuracy (e.g., regularization, feature selection)? (BTL 6)
INSTITUTE OF PUBLIC ENTERPRISE
Shameerpet Campus
Hyderabad – 500101
POST GRADUATE DIPLOMA IN MANAGEMENT
Mid Trimester Examinations: February 2025

Programme : PGDM Trimester : III

Subject : Business Analytics for Decision Making Time : 1 hour

Code : 24PC304 Max Marks : 15

Note: Answer all questions

Section – I (4 X 1 = 4 Marks)
Q.NO QUESTION BTL CO
Q. 1
Q. 2
Q. 3
Q. 4

Section - II (Answer the following) (2 X 3 = 6 Marks)


Q.NO QUESTION BTL CO
Q. 5a
OR
Q. 5b

Q. 6a
OR
Q. 6b

Section - III (Answer the following) (1 X 5 = 5 Marks)


Q.NO QUESTION BTL CO
Q.7a
OR
Q. 7b
INSTITUTE OF PUBLIC ENTERPRISE
Shamirpet Campus: Hyderabad-500101
POST GRADUATE PROGRAMMES
End Trimester Examinations: APRIL 2025

Programme : PGDM Trimester : III

Subject : Business Analytics for Decision Making Time : 2 Hours

Code : 24PC304 Max Marks : 30


Note: Answer all questions

Section – I (6X1 = 6 Marks)

Q.NO QUESTION BTL CO


Q. 1
Q. 2
Q. 3
Q. 4
Q. 5
Q. 6

Section - II (4X 3 = 12 Marks)

Q.NO QUESTION BTL CO


Q. 7a

OR
Q. 7b

Q. 8a
OR
Q. 8b
Q. 9a
OR
Q. 9b
Q. 10a
OR
Q. 10b
Section – III (2 X 6 = 12 Marks)

Q.NO QUESTION BTL CO


Q. 11a
OR
Q. 11b
Q. 12a
OR
Q. 12b
UNIT I

Understanding the Dimensions of Analytics


Dimension | Meaning
Analytics | The systematic computational analysis of data or statistics to identify meaningful patterns and insights.
Data Analytics | The process of examining, organizing, and interpreting raw data to uncover actionable insights.
Business Analytics | The application of data analytics methods specifically for business decision-making and strategy formulation.
Data Science | A multidisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data.
Artificial Intelligence | The simulation of human intelligence processes by machines, particularly computer systems.
Generative Artificial Intelligence | A subset of AI that focuses on creating new content (text, images, audio, video) based on patterns and data it has been trained on.

Summary of Dimensions of Analytics

Term | Scope | Focus | Examples
Analytics | General | Understanding data trends | Trend analysis, dashboards
Data Analytics | Specific to data examination | Patterns and insights in datasets | Predicting customer churn
Business Analytics | Business-focused | Decision-making and strategy | Revenue optimization
Data Science | Multidisciplinary | Complex data modelling and prediction | Fraud detection, NLP
Artificial Intelligence | Intelligent automation | Simulating human cognitive tasks | Chatbots, self-driving cars
Generative AI | AI subset for creation | Creating original text, visuals, etc. | DALL·E, ChatGPT

Analytics Landscape:
The analytics landscape provides a comprehensive view of how analytics is used to support data-driven decision-making across various domains. While the specific terminology and structure may vary slightly, the typical framework includes the following key components:
1. Descriptive Analytics

 Focus: Understanding what has happened in the past.


 Tools/Methods: Reporting, dashboards, data visualization, and summary statistics.
 Purpose: Identifying patterns, summarizing historical data, and providing insights for operational
reporting.

2. Diagnostic Analytics

 Focus: Understanding why something happened.


 Tools/Methods: Data mining, hypothesis testing, and statistical analysis.
 Purpose: Analysing root causes and uncovering relationships in data.

3. Predictive Analytics

 Focus: Anticipating future outcomes or trends.


 Tools/Methods: Machine learning, regression analysis, and time-series forecasting.
 Purpose: Using historical data to predict future events, aiding proactive decision-making.

4. Prescriptive Analytics

 Focus: Recommending actions for achieving desired outcomes.


 Tools/Methods: Optimization, simulation, and decision models.
 Purpose: Providing actionable recommendations and strategies to achieve goals.

5. Cognitive/AI-Driven Analytics

 Focus: Enabling systems to learn, reason, and interact autonomously.


 Tools/Methods: Artificial Intelligence, Natural Language Processing, and Generative AI.
 Purpose: Automating complex decision-making, improving user experiences, and generating insights
or content.

6. Operational Analytics

 Focus: Improving operational efficiency and real-time decision-making.


 Tools/Methods: IoT analytics, process mining, and real-time dashboards.
 Purpose: Enhancing day-to-day operations and ensuring agility in response to changes.

7. Strategic Analytics

 Focus: Supporting high-level decision-making aligned with organizational goals.


 Tools/Methods: Scenario analysis, business simulations, and performance metrics.
 Purpose: Guiding long-term strategies and competitive advantage.

This landscape illustrates how analytics is a multi-faceted domain that integrates data, tools, and
methodologies to drive decisions at various levels, from operational to strategic, across diverse industries.

Framework for Data-Driven Decision Making:

A structured framework guides the data-driven decision-making process; it outlines five key steps:

1. Business Question: Clearly define the specific business problem or question that needs to be addressed.
2. Analysis Plan: Develop a detailed plan outlining the analytical approach, including hypotheses to test and
methodologies to employ.
3. Data Collection: Gather relevant data from appropriate sources, ensuring its quality and relevance to the
analysis.
4. Insights Derivation: Analyse the collected data to extract meaningful insights, identify patterns, and
validate hypotheses.
5. Recommendations: Formulate actionable recommendations based on the derived insights to inform
decision-making.

Roadmap for Analytics capability building:

This roadmap emphasizes a structured approach to integrating analytics into business processes to enhance
data-driven decision-making. The key steps include:
1. Define Objectives and Goals:
o Clearly articulate the organization's strategic objectives and how analytics can support these
goals.
o Identify specific areas where analytics can provide value, such as improving customer
insights, optimizing operations, or enhancing product development.
2. Assess Current Capabilities:
o Evaluate the existing analytics infrastructure, including technology, data quality, and human
resources.
o Determine the organization's maturity level in terms of data management and analytical skills.
3. Develop a Strategic Analytics Plan:
o Create a comprehensive plan that outlines the steps needed to build or enhance analytics
capabilities.
o Set measurable targets and timelines for achieving analytics objectives.
4. Invest in Technology and Tools:
o Acquire the necessary analytics tools and platforms that align with the organization's needs.
o Ensure scalability and integration capabilities with existing systems.
5. Build a Skilled Analytics Team:
o Recruit and train personnel with expertise in data analysis, statistics, and domain-specific
knowledge.
o Foster a culture of continuous learning and development in analytics.
6. Establish Data Governance and Management:
o Implement policies and procedures to ensure data quality, security, and privacy.
o Define data ownership and stewardship roles within the organization.
7. Promote a Data-Driven Culture:
o Encourage decision-making based on data insights across all levels of the organization.
o Provide training and resources to help employees understand and utilize analytics in their
roles.
8. Implement Analytics Solutions:
o Deploy analytics projects that address identified business needs.
o Use pilot projects to demonstrate value and refine approaches before broader implementation.
9. Monitor and Evaluate Performance:
o Regularly assess the effectiveness of analytics initiatives against predefined metrics.
o Gather feedback to identify areas for improvement and to inform future analytics strategies.
10. Scale and Innovate:
o Expand successful analytics initiatives to other areas of the organization.
o Stay abreast of emerging analytics trends and technologies to maintain a competitive edge.

By following this roadmap, organizations can systematically develop their analytics capabilities, leading to
more informed decision-making and improved business outcomes.

Challenges in Data-Driven Decision-making and Future:

Organizations face several challenges in implementing data-driven decision-making. This section outlines those challenges and offers insights into future trends in business analytics.

Challenges in Data-Driven Decision-Making:

1. Data Quality and Integration:


o Ensuring the accuracy, completeness, and consistency of data from diverse sources is crucial.
o Integrating data from various departments and systems can be complex and time-consuming.
2. Lack of Skilled Personnel:
o There's a shortage of professionals proficient in analytics, data science, and related fields.
o Organizations often struggle to build teams with the necessary expertise to analyze data
effectively.
3. Cultural Resistance:
o Employees and management may resist adopting data-driven approaches due to a preference
for traditional decision-making methods.
o Overcoming skepticism and fostering a culture that values data is essential.
4. Data Privacy and Security Concerns:
o Handling sensitive information requires strict adherence to privacy laws and regulations.
o Protecting data from breaches and unauthorized access is a continuous challenge.
5. Rapid Technological Changes:
o Keeping up with the fast-paced advancements in analytics tools and technologies can be
daunting.
o Continuous learning and adaptation are necessary to stay competitive.

Future Trends in Business Analytics:

1. Integration of Artificial Intelligence (AI):


o AI is increasingly being used to enhance decision-making processes.
o For instance, AI can assist in optimizing operations, as discussed in the article "The case for
appointing AI as your next COO."
2. Advanced Predictive and Prescriptive Analytics:
o Organizations are moving beyond descriptive analytics to predictive and prescriptive models.
o These models help forecast future trends and recommend actionable strategies.
3. Real-Time Data Processing:
o The demand for real-time analytics is growing, enabling organizations to make immediate,
informed decisions.
o This is particularly important in dynamic industries where timely insights are critical.
4. Enhanced Data Visualization:
o Improved visualization tools are making it easier to interpret complex data sets.
o Effective visualizations aid in communicating insights clearly to stakeholders.
5. Ethical and Responsible AI Use:
o As AI becomes more prevalent, there's a focus on ensuring its ethical application.
o Discussions around responsible AI use are highlighted in articles like "How we can use AI to
create a better society."

By addressing these challenges and staying abreast of emerging trends, organizations can effectively
leverage data-driven decision-making to achieve their strategic objectives.

Foundations of Data Science

Types of Data:
Categorization of data based on structure, source, and use case.
1. Based on data format:

Type | Definition | Examples
Structured data | Organized into predefined formats like tables, making it easy to store and analyze | Spreadsheets, relational databases
Unstructured data | Does not follow a specific format, making it more complex to process | Images, videos, social media posts
Semi-structured data | Falls between structured and unstructured; has some organizational properties but no strict schema | JSON, XML files, NoSQL databases

2. Based on Data Source


Type | Definition | Examples
Primary Data | Collected directly from original sources for specific purposes | Surveys, experiments, interviews
Secondary Data | Derived from existing datasets or resources | Research papers, government databases
Machine-Generated Data | Created by machines without direct human intervention | Sensor data, logs, IoT data

3. Based on nature:

Type | Definition | Examples
Qualitative Data | Descriptive and non-numerical | Customer reviews, interview transcripts
Quantitative Data | Numerical and measurable | Number of employees, units sold, height, temperature

4. Based on Processing State:

Type | Definition | Examples
Raw Data | Unprocessed and in its original form | Sensor readings before cleaning
Processed Data | Cleaned, transformed, and ready for analysis | Aggregated sales reports

5. Based on Use:

Type | Definition | Examples
Operational Data | Used in daily business operations | Transaction records, inventory levels
Analytical Data | Utilized for insights and decision-making | Historical sales trends, predictive analytics data
Training Data | Used for training machine learning models | Image datasets for object recognition, text datasets for sentiment analysis

6. Based on Content:

Type | Definition | Examples
Text Data | Includes any data in textual format | Articles, emails, tweets
Image Data | Captures visual information | Photos, medical scans
Audio Data | Contains sound | Voice recordings, podcasts
Video Data | Combines visual and audio information | Recorded lectures, movies

7. Specialized Types of Data:

Type | Definition | Examples
Metadata | Data about data, providing descriptive details | File size, creation date
Big Data | Large, complex datasets that require advanced tools to process | Social media analytics, e-commerce clickstreams
Spatial Data | Represents geographic or location-based information | Maps, satellite imagery
Open Data | Freely available data for public use | Government census data

Scales of Variable Measurement:


Scales of variable measurement are fundamental in selecting appropriate statistical analyses and interpreting data accurately. The primary scales of measurement include:
1. Nominal Scale:
o Description: Categorizes data without any intrinsic ordering. Each category is distinct, and
there's no implied hierarchy.
o Examples: Gender (male, female), blood type (A, B, AB, O), or types of industries
(manufacturing, service, retail).
2. Ordinal Scale:
o Description: Categorizes data with a meaningful order or ranking among categories, but the
intervals between ranks are not necessarily equal.
o Examples: Customer satisfaction ratings (satisfied, neutral, dissatisfied), education levels
(high school, bachelor's, master's, doctorate).
3. Interval Scale:
o Description: Measures data with equal intervals between values, but lacks a true zero point,
meaning ratios are not meaningful.
o Examples: Temperature in Celsius or Fahrenheit, where the difference between degrees is
consistent, but zero does not indicate the absence of temperature.
4. Ratio Scale:
o Description: Similar to the interval scale, but includes a true zero point, allowing for
meaningful ratios between measurements.
o Examples: Height, weight, age, or income, where zero signifies the absence of the measured
attribute, and comparisons like "twice as much" are meaningful.
Understanding these measurement scales is crucial for selecting appropriate statistical methods and
accurately interpreting data in business analytics. Each scale dictates the types of analyses that are valid and
the conclusions that can be drawn from the data.

Feature Engineering:
Feature engineering involves creating new variables or modifying existing ones to enhance the performance
of predictive models. This process is crucial for improving model accuracy and uncovering hidden patterns
within the data.
Key Aspects of Feature Engineering:
1. Data Transformation:
o Applying mathematical functions to variables, such as logarithmic or square root
transformations, to stabilize variance or normalize distributions.
2. Interaction Features:
o Creating new features by combining two or more variables to capture interactions that may
influence the target outcome.
3. Binning:
o Grouping continuous variables into discrete bins or categories to reduce noise and handle
non-linear relationships.
4. Encoding Categorical Variables:
o Converting categorical data into numerical format using techniques like one-hot encoding or
label encoding to make them suitable for machine learning algorithms.
5. Handling Missing Values:
o Imputing missing data with appropriate values or creating indicator variables to flag missing
entries.
6. Scaling and Normalization:
o Adjusting the range of variables to ensure they contribute equally to the analysis, especially
important for distance-based algorithms.
7. Date and Time Feature Extraction:
o Deriving new features from date and time variables, such as day of the week, month, or time
of day, to capture temporal patterns.
By meticulously engineering features, analysts can significantly enhance the predictive power of their
models, leading to more accurate and reliable data-driven decisions.
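As a brief illustration, the following Python sketch (assuming the pandas library is installed; the dataset and column names are purely hypothetical) shows three of the steps described above: encoding a categorical variable, extracting date features, and scaling a numeric variable.

import pandas as pd

# Illustrative customer data (hypothetical column names)
df = pd.DataFrame({
    'city': ['Hyderabad', 'Mumbai', 'Hyderabad'],
    'purchase_date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-03-15']),
    'amount': [1200.0, 860.0, 2050.0]
})

# Encoding a categorical variable (one-hot encoding)
df = pd.get_dummies(df, columns=['city'])

# Date and time feature extraction
df['purchase_month'] = df['purchase_date'].dt.month
df['purchase_dayofweek'] = df['purchase_date'].dt.dayofweek

# Scaling a numeric variable to the 0-1 range (min-max scaling)
df['amount_scaled'] = (df['amount'] - df['amount'].min()) / (df['amount'].max() - df['amount'].min())

print(df)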
Functional Applications of Business Analytics in Management:
The functional applications of business analytics in management span across various departments and
domains within an organization. These applications enable better decision-making, improve efficiency, and
drive strategic objectives. Below is an overview of key areas where business analytics plays a pivotal role:
1. Marketing Analytics
 Customer Segmentation: Identifying and grouping customers based on behavior, preferences, and
demographics.
 Campaign Performance: Measuring the effectiveness of marketing campaigns through KPIs like ROI
and conversion rates.
 Personalization: Leveraging data to tailor marketing messages and product recommendations.
 Churn Analysis: Predicting customer attrition and implementing retention strategies.

2. Financial Analytics
 Budgeting and Forecasting: Using historical data and predictive models to create accurate financial
projections.
 Risk Management: Identifying and mitigating financial risks through stress testing and scenario
analysis.
 Profitability Analysis: Evaluating profitability at product, customer, or segment levels.
 Fraud Detection: Employing machine learning algorithms to detect anomalies and fraudulent
activities.

3. Supply Chain and Operations Analytics


 Inventory Optimization: Ensuring optimal stock levels using demand forecasting and reorder point
analysis.
 Logistics and Transportation: Enhancing route planning, delivery times, and cost efficiency.
 Process Improvement: Identifying bottlenecks and inefficiencies in operations to enhance
productivity.
 Demand Planning: Using predictive analytics to match supply with customer demand.

4. Human Resource Analytics


 Talent Acquisition: Analyzing recruitment data to improve hiring strategies and reduce time-to-hire.
 Employee Retention: Identifying factors influencing turnover and designing retention programs.
 Performance Management: Tracking employee performance metrics and aligning them with
organizational goals.
 Workforce Planning: Forecasting future staffing needs based on business growth and market trends.

5. Strategic Management
 Market Trends Analysis: Monitoring market dynamics and competitive landscapes for informed
strategy formulation.
 Scenario Planning: Evaluating potential outcomes of strategic decisions through simulation models.
 Mergers and Acquisitions: Conducting due diligence and valuing target companies based on financial
and operational data.
 KPI Monitoring: Developing dashboards to track organizational performance against strategic
objectives.

6. Customer Relationship Management (CRM)


 Lifetime Value Prediction: Estimating the long-term value of customers to prioritize resources.

 Customer Feedback Analysis: Extracting insights from surveys, reviews, and social media for service
improvement.
 Loyalty Programs: Designing and optimizing loyalty initiatives to enhance customer retention.

7. Product and Service Development


 Innovation Analytics: Using customer insights and market trends to guide product innovation.

 Quality Assurance: Analyzing production data to maintain high product quality.


 Pricing Optimization: Determining optimal price points based on market conditions and consumer
behavior.

8. Risk and Compliance Analytics


 Regulatory Compliance: Ensuring adherence to industry regulations using monitoring tools.

 Operational Risk: Identifying vulnerabilities in processes and implementing safeguards.


 Crisis Management: Leveraging analytics to predict and prepare for potential crises.

9. IT and Cybersecurity Analytics


 Threat Detection: Identifying and mitigating cyber threats using pattern recognition algorithms.

 System Performance: Monitoring IT systems for performance optimization and downtime reduction.
 Data Management: Enhancing data governance and ensuring data quality for better decision-making.

By leveraging business analytics across these functional areas, organizations can achieve greater operational
efficiency, improve decision-making accuracy, and maintain a competitive edge in the market.
Widely used Analytical Tools:
1. Marketing Analytics Tools
 Google Analytics: Web analytics tool to track and analyze website traffic and user behavior.

 HubSpot: Inbound marketing platform for campaign performance, lead tracking, and customer
insights.
 Tableau: Visualization tool to analyze customer segmentation, campaign effectiveness, and churn
patterns.
 Adobe Analytics: Advanced analytics for understanding customer journeys and optimizing marketing
strategies.

2. Financial Analytics Tools


 SAP Analytics Cloud: Integrated platform for financial planning, forecasting, and risk analysis.

 Microsoft Power BI: Business intelligence tool to track financial KPIs and generate real-time
insights.
 QuickBooks: Accounting software for small and medium businesses to manage financial data and
budgeting.
 Alteryx: Analytics platform for financial data preparation, risk modeling, and profitability analysis.

3. Supply Chain and Operations Analytics Tools


 IBM Sterling: Supply chain management tool for inventory optimization and demand planning.

 JDA (now Blue Yonder): Advanced supply chain solutions for logistics and transportation
management.
 Qlik Sense: Business intelligence tool for process improvement and operational analytics.
 SAP Integrated Business Planning: Comprehensive platform for demand planning and supply chain
optimization.

4. Human Resource Analytics Tools


 Workday: Workforce planning, talent acquisition, and performance management tool.

 Tableau HR Dashboards: Pre-built templates for analyzing workforce data.


 Visier: HR analytics platform for employee retention, performance metrics, and workforce trends.
 LinkedIn Talent Insights: Platform for analyzing hiring trends and talent availability.
5. Strategic Management Tools
 StrategyMapper: Tool for aligning analytics insights with strategic goals.

 Balanced Scorecard Software: Framework for monitoring KPI performance across strategic
objectives.
 Microsoft Excel: Widely used for scenario planning and simulation modeling.
 Domo: Analytics and BI platform for tracking organizational performance.

6. Customer Relationship Management (CRM) Analytics Tools


 Salesforce CRM Analytics: Advanced analytics capabilities within the Salesforce platform for
customer insights.
 Zoho CRM: Analytics-driven CRM for customer feedback, lifetime value predictions, and loyalty
management.
 Klipfolio: Dashboard tool to monitor CRM metrics in real time.
 Mixpanel: Specialized for analyzing customer interactions and product usage patterns.

7. Product and Service Development Tools


 Airtable: Collaborative platform for managing product development pipelines and innovation
projects.
 IdeaScale: Tool for gathering and analyzing customer ideas for product innovation.
 Qualtrics: Experience management platform to understand customer feedback and guide quality
assurance.
 PricingHUB: Advanced pricing optimization software using machine learning.

8. Risk and Compliance Analytics Tools


 ACL Analytics: Tool for audit, risk management, and compliance analytics.

 MetricStream: GRC (Governance, Risk, and Compliance) platform to manage regulatory adherence
and operational risks.
 SAS Risk Management: Comprehensive tool for identifying, measuring, and mitigating financial
risks.
 Splunk: Tool for monitoring and managing IT and operational risks.

9. IT and Cybersecurity Analytics Tools


 Splunk Security Analytics: Real-time threat detection and response platform.

 IBM QRadar: Security information and event management (SIEM) tool for threat analysis.
 Snowflake: Cloud-based data platform for managing large datasets and ensuring data quality.
 Tableau Data Management: Enhances governance, quality, and scalability of IT systems.

By utilizing these tools, organizations can streamline their analytics processes, extract actionable insights,
and make informed decisions to achieve their strategic objectives.
Ethics in Business Analytics:
Ethics in Business Analytics is a crucial aspect of using data to drive decisions in any organization. As the
reliance on data-driven insights and algorithms increases, ensuring ethical practices in business analytics
becomes vital to avoid harmful consequences. Ethical concerns can arise at various stages of data analysis,
including data collection, analysis, and decision-making. Here are the key areas where ethics play a
significant role in business analytics:
1. Data Privacy and Protection
 Informed Consent: Businesses must obtain explicit consent from individuals before collecting or
using their data, particularly in sensitive areas like healthcare or personal information.
 Data Minimization: Collect only the data that is necessary for the intended purpose to reduce
exposure and minimize risks.
 Compliance with Regulations: Adhering to laws and regulations such as GDPR (General Data
Protection Regulation), HIPAA (Health Insurance Portability and Accountability Act), and CCPA
(California Consumer Privacy Act) is crucial to protect consumer privacy.
 Data Anonymization: Personal data should be anonymized or de-identified to reduce the risk of
misuse.

2. Transparency in Algorithms
 Explainability: Algorithms and models used in decision-making should be transparent and
interpretable. Business stakeholders should be able to understand how decisions are made by the
system.
 Fairness: Avoid using algorithms that unintentionally favor certain groups or individuals, leading to
biased outcomes. Fairness checks should be applied to ensure equitable results across different
demographics.
 Bias Detection: Regularly audit algorithms for biases (e.g., racial, gender, or socio-economic biases)
that may distort outcomes. Model developers should ensure that their models do not perpetuate
societal inequalities.

3. Accountability and Responsibility


 Human Oversight: While data analytics can inform decision-making, final decisions, particularly
those impacting individuals or communities, should be made by humans, ensuring accountability for
any negative consequences.
 Accountability for Data Misuse: Organizations should establish protocols to hold data stewards
accountable for any misuse or improper access to data.
 Auditability: Analytics processes should be auditable, meaning decisions and models should be
traceable to understand how results were derived and whether ethical standards were followed.

4. Ensuring the Integrity of Data


 Data Accuracy: Ensuring the quality and accuracy of data used in analytics is essential. Incorrect data
can lead to false conclusions, damaging reputations or causing harm.
 Data Integrity: Businesses should safeguard data from tampering, manipulation, or corruption.
Ensuring that data remains unchanged and accurate throughout its lifecycle is crucial for making
ethical decisions.
 Authenticity of Sources: Data should be sourced ethically and validated for credibility to avoid
propagating misinformation or using unverified data.

5. Ethical Use of Predictive Models


 Predictions and Privacy: Predictive models, such as those for customer behavior or credit scoring,
should not infringe on an individual’s privacy or autonomy. Businesses should avoid using models
that predict sensitive characteristics without clear consent.
 Transparency in Predictive Decisions: Customers and employees should have visibility into how their
data is being used to predict outcomes (e.g., creditworthiness, hiring decisions, or insurance pricing).
 Impact on Vulnerable Groups: Analytics should avoid practices that disproportionately harm
vulnerable groups, such as targeted marketing for exploitative products or discriminatory lending
practices.

6. Ethical Implications in Marketing Analytics


 Targeted Advertising: While targeted advertising can be effective, it should not exploit consumers’
vulnerabilities (e.g., advertising products like payday loans to financially vulnerable individuals).
 Manipulative Practices: Marketing strategies should not manipulate consumer behavior unethically
(e.g., using psychological tricks to coerce people into purchasing products they don’t need).

7. AI and Automation Ethics


 Automation and Job Displacement: The increasing automation of tasks using AI and analytics tools
should consider the social impact, such as job displacement. Ethical AI development includes
creating systems that complement human workers rather than replace them entirely.
 Bias in AI Models: AI models used in business analytics should be monitored and adjusted regularly
to avoid any unintentional reinforcement of historical biases.
 Fair Access to AI: Businesses should ensure that AI technologies are accessible to all stakeholders,
particularly marginalized groups that might otherwise be excluded from the benefits of innovation.

8. Social and Environmental Responsibility


 Sustainability: Business analytics should support sustainable practices, helping businesses reduce
waste, conserve resources, and improve environmental impact.
 Social Impact: Ethical business analytics should prioritize projects and initiatives that benefit society,
such as healthcare improvements, education, and equitable access to resources.

9. Ethical Decision-Making Framework


 Ethical Review Boards: Organizations can establish internal ethics boards or committees to review
significant data analytics projects to ensure that ethical standards are met.
 Training and Education: Continuous ethics training should be provided to those involved in analytics
to raise awareness of potential ethical pitfalls and encourage responsible use of data.

Key Principles of Ethics in Business Analytics:


 Respect for privacy

 Fairness and impartiality


 Transparency and explainability
 Accountability and responsibility
 Sustainability and social responsibility

In conclusion, ethics in business analytics is about balancing the potential benefits of data-driven decision-
making with the responsibility to protect privacy, avoid bias, and ensure transparency and fairness.
Organizations must adopt ethical frameworks and practices to ensure that their analytics initiatives create
positive value without causing harm to individuals, communities, or society at large.

UNIT II

Introduction to Python
Python was developed by Guido van Rossum in the year 1991.
Python is a high-level programming language that combines features of procedural programming languages like C and object-oriented programming languages like Java.

FEATURES OF PYTHON
Simple
Python is a simple programming language because it uses English like sentences in its programs.
Easy to learn
Python uses very few keywords. Its programs use very simple structure.
Open source
Python can be freely downloaded from www.python.org website. Its source code can be read, modified and
can be used in programs as desired by the programmers.
High level language
High level languages use English words to develop programs. These are easy to learn and use. Like COBOL,
PHP or Java, Python also uses English words in its programs and hence it is called high level programming
language.
Dynamically typed
In Python, we need not declare the variables. Depending on the value stored in the variable, Python
interpreter internally assumes the datatype.
Platform independent
Python programs are not dependent on any particular computer or operating system. We can use Python on Unix, Linux, Windows, Macintosh, Solaris, OS/2, Amiga, AROS, AS/400, etc., i.e. on almost all operating systems. This makes Python an ideal programming language for any network or the Internet.
Portable
When a program yields same result on any computer in the world, then it is called a portable program.
Python programs will give same result since they are platform independent.

Procedure and Object oriented


Python is a procedure oriented as well as object oriented programming language. In procedure oriented
programming languages (e.g. C and Pascal), the programs are built using functions and procedures. But in
object oriented languages (e.g. C++ and Java), the programs use classes and objects.
An object is anything that exists physically in the real world. An object contains behavior. This behavior is
represented by its properties (or attributes) and actions. Properties are represented by variables and actions
are performed by methods. So, an object contains variables and methods.
A class represents common behavior of a group of objects. It also contains variables and methods. But a
class does not exist physically.
A class can be imagined as a model for creating objects. An object is an instance (physical form) of a class.
Interpreted
First, Python compiler translates the Python program into an intermediate code called byte code. This byte
code is then executed by PVM. Inside the PVM, an interpreter converts the byte code instructions into
machine code so that the processor will understand and run that machine code.
Extensible
There are other flavors of Python where programs from other languages can be integrated into Python. For
example, Jython is useful to integrate Java code into Python programs and run on JVM (Java Virtual
Machine). Similarly IronPython is useful to integrate .NET programs and libraries into Python programs and
run on CLR (Common Language Runtime).

Embeddable
Several applications are already developed in Python which can be integrated into other programming
languages like C, C++, Delphi, PHP, Java and .NET. It means programmers can use these applications for
their advantage in various software projects.
Huge library
Python has a big library that contains modules which can be used on any Operating system.
Scripting language
A scripting language uses an interpreter to translate the source code into machine code on the fly (while running). Generally, scripting languages perform supporting tasks for a bigger application or software. Python is considered a scripting language as it is interpreted and is widely used on the Internet to support other software.

Database connectivity
A database represents software that stores and manipulates data. Python provides interfaces to connect its
programs to all major databases like Oracle, Sybase, SQL Server or MySql.
Scalable
A program would be scalable if it could be moved to another Operating system or hardware and take full
advantage of the new environment in terms of performance.
Core Libraries in Python
The huge library of Python contains several small applications (or small packages) which are already
developed and immediately available to programmers. These libraries are called ‘batteries included’. Some
interesting batteries or packages are given here:
 argparse is a package that represents command-line parsing library.
 boto is Amazon web services library.
 CherryPy is an object-oriented HTTP framework.
 cryptography offers cryptographic techniques for the programmers
 Fiona reads and writes big data files
 jellyfish is a library for doing approximate and phonetic matching of strings.
 matplotlib is a library for creating plots, charts, and other data visualizations.
 mysql-connector-python is a driver written in Python to connect to MySQL database.
 numpy is a package for processing arrays of single or multidimensional type.
 pandas is a package for powerful data structures for data analysis, time series and statistics.
 Pillow is a Python imaging library.
 pyquery represents jquery-like library for Python.
 scipy is the scientific library to do scientific and engineering calculations.
 Sphinx is the Python documentation generator.
 sympy is a package for Computer algebra system (CAS) in Python.
 w3lib is a library of web related functions.
 whoosh contains fast and pure Python full text indexing, search and spell checking library.

To know the entire list of packages included in Python, one can visit:
https://www.pythonanywhere.com/batteries_included/
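As a quick sketch (assuming numpy and pandas are installed, as they are by default with Anaconda), two of these packages can be used as follows; the expected outputs are shown in comments.

import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])                     # a numpy array
print(arr.mean())                                # 20.0

df = pd.DataFrame({'sales': [100, 250, 175]})    # a pandas DataFrame
print(df['sales'].sum())                         # 525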

Python Virtual Machine: PVM


A Python program contains source code (first.py) that is first compiled by Python compiler to produce byte
code (first.pyc). This byte code is given to Python Virtual Machine (PVM) which converts the byte code to
machine code. This machine code is run by the processor and finally the results are produced.

Python Virtual Machine (PVM) is a software that contains an interpreter that converts the byte code into
machine code.
PVM is most often called Python interpreter. The PVM of PyPy contains a compiler in addition to the
interpreter. This compiler is called Just In Time (JIT) compiler which is useful to speed up execution of the
Python program.

Memory management by PVM


Memory allocation and deallocation are done by PVM during runtime. Entire memory is allocated on heap.
We know that the actual memory (RAM) for any program is allocated by the underlying Operating system.
On top of the operating system, a raw memory allocator oversees whether enough memory is available for storing objects (e.g. integers, strings, functions, lists, modules, etc.). On top of the raw memory allocator, several object-specific allocators operate on the same heap. These memory allocators implement different memory management policies depending on the type of the objects. For example, an integer number should be stored in memory in one way and a string should be stored in a different way. Similarly, tuples and dictionaries should be stored differently. These issues are taken care of by the object-specific memory allocators.
Garbage collection
A module represents Python code that performs specific task. Garbage collector is a module in Python that is
useful to delete objects from memory which are not used in the program. The module that represents the
garbage collector is named as gc. Garbage collector in the simplest way maintains a count for each object
regarding how many times that object is referenced (or used). When an object is referenced twice, its
reference count will be 2. When an object has some count, it is being used in the program and hence garbage
collector will not remove it from memory. When an object is found with a reference count 0, garbage
collector will understand that the object is not used by the program and hence it can be deleted from
memory. Hence, the memory allocated for that object is deallocated or freed.
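The following small sketch illustrates reference counting and the gc module. It is only an illustration; the exact counts printed can vary slightly because sys.getrefcount() itself adds a temporary reference.

import sys
import gc

lst = [1, 2, 3]
alias = lst                     # a second reference to the same list object
print(sys.getrefcount(lst))     # count includes lst, alias and the temporary argument reference

del alias                       # removing one reference decreases the count
print(sys.getrefcount(lst))

print(gc.isenabled())           # automatic garbage collection is enabled by default -> True
gc.collect()                    # explicitly collect unreachable objects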
Frozen Binaries
When a software is developed in Python, there are two ways to provide the software to the end user. The first
way is to provide the .pyc files to the user. The user will install PVM in his computer and run the byte code
instructions of the .pyc files.
The other way is to provide the .pyc files, PVM along with necessary Python library. In this method, all
the .pyc files, related Python library and PVM will be converted into a single executable file (generally
with .exe extension) so that the user can directly execute that file by double clicking on it. Converting Python programs into true executables in this way is called creating frozen binaries. However, frozen binaries are larger than simple .pyc files since they also contain the PVM and library files.
For creating frozen binaries, we need to use third-party software. For example, py2exe is a tool that produces frozen binaries for the Windows operating system. We can use pyinstaller for UNIX or Linux. Freeze is another program from the Python organization that generates frozen binaries for UNIX.
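As an illustration, assuming pyinstaller has been installed with pip, a typical command sequence to produce a single-file executable from a program named first.py (the file name is only an example) is shown below.

pip install pyinstaller
pyinstaller --onefile first.py      # the resulting executable is placed in the dist folder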
Jupyter Notebook
Jupyter is an IDE that is popular among Python developers and Data Scientists. It is a part of Anaconda
platform which is a collection of tools and IDEs. By installing Anaconda, we can have Jupyter IDE available
to us. When we install Anaconda, it comes with a copy of Python software along with important packages
like numpy, scikit-learn, scipy, pandas, matplotlib, etc. It also includes two popular IDEs: Spyder and Jupyter.
Anaconda is liked by Data Scientists because of its capability of handling huge volumes of data quickly and efficiently. In this section, we will first install the Anaconda platform and then see how to work with Jupyter Notebook.

HOW TO INSTALL AND USE JUPYTER NOTEBOOK


Step 1) Open the official Anaconda website (www.anaconda.com). At the top right corner, click on 'Free Download', and below it click 'Skip registration' to download without providing an email id.

Step 2) It downloads a file like “Anaconda3-2024.10-1-Windows-x86_64.exe”. Double click on it. Then


Anaconda Setup will start execution. Click on “Next” button.
Step 3) Then click on “Next” button to continue. When it displays “License Agreement”, click on “I Agree”
button.

Step 4) Then click on ‘Just Me’ radio button for installing your individual copy.
Step 5) It will show a default directory to install. Click on ‘Next’.

Step 6) In the next screen, select the checkbox ‘Create start menu shortcuts’. Also, unselect other
checkboxes.
Step 7) The installation starts in the next screen. We should wait for the installation to complete.

Step 8) When the installation completes, click on “Next” .


Step 9) In the next screen, click on “Next”.

Step 10) In the final screen, do not check the checkboxes and then click on “Finish”.
Note: Once the installation is completed, we can find a new folder by the name “Anaconda3(64-bit)” created
in Window 10 applications which can be seen by pressing Windows “Start” button. When we click on this
folder, we can find several icons including “Jupyter Notebook” and “Spyder”.

USING JUPYTER NOTEBOOK


Step 1) Click on the "Start" button on the Windows task bar and select the "Anaconda3" folder. In that, click on the "Jupyter Notebook" link. First, a black window opens where the Jupyter server runs. Minimize this window but do not close it. After that, Jupyter opens in the browser and displays the initial screen (Home Page).
Step 2) On the Home Page, click on the "New" button and select "Python 3" to create a new notebook.
Step 3) It opens a new page. Click on "Untitled" at the top of the page and enter a new name for your program. Then click on the "Rename" button.
Step 4) Type the program code in cell and click on “Run” to run the code of the current cell. The current cell
being edited is shown in green box.

Step 5) We can enter code in the next cell and so on. In this manner, we can run the program as blocks of
code, one block at a time. When input is required, it will wait for your input to enter, as shown in the
following screen. The blue box around the cell indicates command mode.

Step 6) Type the program in the cells and run each cell to see the results produced by each cell.
Note: To save the program, click on Floppy symbol below the “File” menu. Click on “Insert” to insert a new
cell either above or below the current cell. The programs in Jupyter are saved with the extension “.ipynb”
which indicates Interactive Python Notebook file. This file stores the program and other contents in the form
of JSON (JavaScript Object Notation). Click on ‘Logout’ to terminate Jupyter. Then close the server window
also.

Step 7) To reopen the program, first enter into Jupyter Notebook Home Page. In the “Files” tab, find out the
program named “first.ipynb” and click on it to open it in another page.

Step 8) Similarly, to delete the file, first select it and then click on the Delete Bin symbol.
RUNNING A PYTHON PROGRAM
Running a Python program can be done from 3 environments: 1. Command line window 2. IDLE graphics
window 3. System prompt

To get help, type help().
Inside the help utility, type topics, FUNCTIONS, or modules to browse the documentation.
Press <ENTER> on an empty line to quit.

In IDLE window, click on help -> ‘Python Docs’ or F1 button to get documentation help.
Save a Python program in IDLE and reopen it and run it.
COMMENTS (2 types)
# single line comments
“”” or ‘’’ multi line comments

Docstrings
If we write strings inside “”” or ‘’’ and if these strings are written as first statements in a module, function,
class or a method, then these strings are called documentation strings or docstrings. These docstrings are
useful to create an API documentation file from a Python program. An API (Application Programming
Interface) documentation file is a text file or html file that contains description of all the features of a
software, language or a product.
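A small sketch of a docstring and how it can be accessed (the function shown here is only an example):

def area(radius):
    """Return the area of a circle with the given radius."""
    return 3.14159 * radius * radius

print(area.__doc__)     # prints the docstring text
help(area)              # displays the documentation built from the docstring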

DATATYPES
A datatype represents the type of data stored into a variable (or memory).
Built-in datatypes
The built-in datatypes are of 5 types:

 None Type
 Numeric types
 Sequences
 Sets
 Mappings

None type: an object that does not contain any value.


Numeric types: int, float, complex.
Boolean type: bool.
Sequences: str, bytes, bytearray, list, tuple, range.

int type: represents integers like 12, 100, -55.


float type: represents float numbers like 55.3, 25e3.
complex type: represents complex numbers like 3+5j or 3-10.5J. Complex numbers will be in the form of
a+bj or a+bJ. Here ‘a’ is called real part and ‘b’ is called ‘imaginary part’ and ‘j’ or ‘J’ indicates √-1.

NOTE:
Binary numbers are represented by a prefix 0b or 0B. Ex: 0b10011001
Hexadecimal numbers are represented by a prefix 0x or 0X. Ex: 0X11f9c
Octal numbers are represented by a prefix 0o or 0O. Ex: 0o145.
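A short runnable illustration of these numeric types and literal prefixes (expected values shown in comments):

a = 10                   # int
b = 25e3                 # float -> 25000.0
c = 3 + 5j               # complex
print(c.real, c.imag)    # 3.0 5.0

x = 0b1010               # binary literal -> 10
y = 0x1f                 # hexadecimal literal -> 31
z = 0o17                 # octal literal -> 15
print(x, y, z)           # 10 31 15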

bool type: represents any of the two boolean values, True or False.
Ex: a = 10>5 # here a is treated as bool type variable.
print(a) #displays True
NOTE:
1. To convert a float number into integer, we can use int() function. Ex: int(num)
2. To convert an integer into float, we can use float() function.
3. bin() converts a number into binary. Ex: bin(num)
4. oct() converts a number into octal.
5. hex() converts a number into hexadecimal.
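For example (expected outputs shown in comments):

num = 9.7
print(int(num))     # 9   (the decimal part is truncated)
print(float(5))     # 5.0
print(bin(25))      # 0b11001
print(oct(25))      # 0o31
print(hex(25))      # 0x19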
STRINGS
str datatype: represents string datatype. A string is enclosed in single quotes or double quotes.
Ex: s1 = “Welcome”
s2 = ‘Welcome’
A string occupying multiple lines can be inserted into triple single quotes or triple double quotes.
Ex: s1 = ‘’’ This is a special training on
Python programming that
gives insights into Python language.
‘’’
To display a string with single quotes.
Ex: s2 = “””This is a book ‘on Core Python’ programming”””
To find length of a string, use len() function.
Ex: s3 = ‘Core Python’
n = len(s3)
print(n) -> 11
We can do indexing, slicing and repetition of strings.
Ex: s = “Welcome to Core Python”
print(s) -> Welcome to Core Python
print(s[0]) -> W
print(s[0:7]) -> Welcome
print(s[:7]) -> Welcome
print(s[1:7:2]) -> ecm
print(s[-1]) -> n
print(s[-3:-1]) -> ho
print(s[1]*3) -> eee
print(s*2) -> Welcome to Core PythonWelcome to Core Python
Remove spaces using rstrip(), lstrip(), strip() methods.
Ex: name = “ Vijay Kumar “
print(name.strip())
We can find substring position in a string using find() method. It returns -1 if not found.

Ex: n = str.find(sub, 0, len(str))


We can count number of substrings in a string using count() method. Returns 0 if not found.
Ex: n = str.count(sub)
We can replace a string s1 with another string s2 in a main string using replace() method.
Ex: str.replace(s1, s2)
We can change the case of a string using upper(), lower(), title() methods.
Ex: str.upper()
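A short runnable illustration of these string methods (expected outputs shown in comments):

s = "  Core Python Programming  "
s = s.strip()                           # removes leading and trailing spaces
print(s.find("Python"))                 # 5
print(s.count("o"))                     # 3
print(s.replace("Core", "Advanced"))    # Advanced Python Programming
print(s.upper())                        # CORE PYTHON PROGRAMMING
print(s.title())                        # Core Python Programming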

CHARACTERS
There is no datatype to represent a single character in Python. Characters are part of str datatype.
Ex:
str = "Hello"
print(str[0])
H
for i in str: print(i)
H
e
l
l
o

BYTES AND BYTEARRAY


bytes datatype: represents a group of positive integers in the range of 0 to 255 just like an array. The
elements of bytes type cannot be modified.
Ex: arr = [10, 20, 55, 100, 99]
x = bytes(arr)
for i in x:
    print(i)
10
20
55
100
99

bytearray datatype: same as bytes type but its elements can be modified.
arr = [10,20,55,100,99]
x=bytearray(arr)
x[0]=11
x[1]=21
for i in x: print(i)

11
21
55
100
99
NOTE:
We can do only indexing in case of bytes or bytearray datatypes. We cannot do slicing or repetitions.

LISTS
A list is similar to an array that can store a group of elements. A list can store different types of elements and
can grow dynamically in memory. A list is represented by square braces [ ]. List elements can be modified.
Ex:
lst = [10, 20, 'Ajay', -99.5]
print(lst[2])
Ajay
To create an empty list.
lst = [] # then we can append elements to this list as lst.append(‘Vinay’)
NOTE:
Indexing, slicing and repetition are possible on lists.
print(lst[1])
20
print(lst[-3:-1])
[20, 'Ajay']
lst = lst*2
print(lst)
[10, 20, 'Ajay', -99.5, 10, 20, 'Ajay', -99.5]
We can use len() function to find the no. of elements in the list.
n = len(lst) -> 4
The del statement deletes an element at a particular position.
del lst[1] -> deletes 20
remove() will remove a particular element. clear() will delete all elements from the list.
lst.remove(‘Ajay’)
lst.clear()
We can update the list elements by assignment.
lst[0] = ‘Vinod’
lst[1:3] = 10, 15
max() and min() functions return the biggest and smallest elements.
max(lst)
min(lst)

TUPLES
A tuple is similar to a list but its elements cannot be modified. A tuple is represented by parentheses ( ).
Indexing, slicing and repetition are possible on tuples also.
Ex:
tpl=( ) # creates an empty tuple
tpl=(10, ) # with only one element – comma needed after the element
tpl = (10, 20, -30, "Raju")
print(tpl)
(10, 20, -30, 'Raju')
tpl[0]=-11 # error
print(tpl[0:2])
(10, 20)
tpl = tpl*2
print(tpl)
(10, 20, -30, 'Raju', 10, 20, -30, 'Raju')
NOTE: len(), count(), index(), max(), min() functions are same in case of tuples also.
We cannot use append(), extend(), insert(), remove(), clear() methods on tuples.
To sort the elements of a tuple, we can use the built-in sorted() function, which returns a new sorted list.
sorted(tpl) # sorts all elements into ascending order
sorted(tpl, reverse=True) # sorts all elements into descending order
To convert a list into tuple, we can use tuple() method.
tpl = tuple(lst)

RANGE DATATYPE
range represents a sequence of numbers. The numbers in the range cannot be modified. Generally, range is
used to repeat a for loop for a specified number of times.
Ex: we can create a range object that stores from 0 to 4 as:
r = range(5)
print(r[0]) -> 0
for i in r: print(i)
0
1
2
3
4
Ex: we can also mention step value as:
r = range(0, 10, 2)
for i in r: print(i)
0
2
4
6
8
r1 = range(50, 40, -2)
for i in r1: print(i)
50
48
46
44
42
SETS
A set datatype represents unordered collection of elements. A set does not accept duplicate elements where
as a list accepts duplicate elements. A set is written using curly braces { }. Its elements can be modified.
s = {1, 2, 3, "Vijaya"}
print(s)
{1, 2, 3, 'Vijaya'}
NOTE: Indexing, slicing and repetition are not allowed in case of a set.
To add elements into a set, we should use update() method as:
s.update([4, 5])
print(s)
{1, 2, 3, 4, 5, 'Vijaya'}
To remove elements from a set, we can use remove() method as:
s.remove(5)
print(s)
{1, 2, 3, 4, 'Vijaya'}
A frozenset datatype is same as set type but its elements cannot be modified.
Ex:
s = {1, 2, -1, 'Akhil'} -> this is a set
s1 = frozenset(s) -> convert it into frozenset
for i in s1: print(i)
1
2
Akhil
-1
NOTE: update() or remove() methods will not work on frozenset.
MAPPING DATATYPES
A map indicates elements in the form of key – value pairs. When key is given, we can retrieve the associated
value. A dict datatype (dictionary) is an example for a ‘map’.
d = {10: 'kamal', 11:'Subbu', 12:'Sanjana'}
print(d)
{10: 'kamal', 11: 'Subbu', 12: 'Sanjana'}
keys() method gives keys and values() method returns values from a dictionary.
k = d.keys()
for i in k: print(i)
10
11
12
for i in d.values(): print(i)
kamal
Subbu
Sanjana
To display value upon giving key, we can use as:
Ex: d = {10: 'kamal', 11:'Subbu', 12:'Sanjana'}
d[10] gives ‘kamal’
To create an empty dictionary, we can use as:
d = {}
Later, we can store the key and values into d, as:
d[10] = ‘Kamal’
d[11] = ‘Pranav’
We can update the value of a key, as: d[key] = newvalue.
Ex: d[10] = ‘Subhash’
We can delete a key and corresponding value, using del function.
Ex: del d[11] will delete a key with 11 and its value also.
PYTHON AUTOMATICALLY KNOWS ABOUT THE DATATYPE
The datatype of the variable is decided depending on the value assigned. To know the datatype of the
variable, we can use type() function.
Ex:
x = 15 #int type
print(type(x))
<class 'int'>
x = 'A' #str type
print(type(x))
<class 'str'>
x = 1.5 #float tye
print(type(x))
<class 'float'>
x = "Hello" #str type
print(type(x))
<class 'str'>
x = [1,2,3,4]
print(type(x))
<class 'list'>
x = (1,2,3,4)
print(type(x))
<class 'tuple'>
x = {1,2,3,4}
print(type(x))
<class 'set'>

Literals in Python
A literal is a constant value that is stored into a variable in a program.
a = 15
Here, ‘a’ is the variable into which the constant value ‘15’ is stored. Hence, the value 15 is called ‘literal’.
Since 15 indicates integer value, it is called ‘integer literal’.
Ex: a = 'Srinu' → here 'Srinu' is called a string literal.
Ex: a = True → here, True is called a Boolean literal.
User-defined datatypes
The datatypes which are created by programmers are called 'user-defined' datatypes. For example, an array, a class, or a module is a user-defined datatype. We will discuss these datatypes in later chapters.
Constants in Python
A constant is similar to a variable, but its value should not be modified or changed during program execution. For example, the value of pi (approximately 22/7) is a constant. Python does not enforce constants; by convention, constant names are written in capital letters, as in PI.
Identifiers and Reserved words
An identifier is a name given to a variable, function, class, etc. Identifiers can include letters, digits, and the underscore character ( _ ), but they must not start with a digit. Special symbols such as ?, #, $, % and @ are not allowed in identifiers. Some examples of identifiers are salary, name11 and gross_income.
Reserved words (keywords) are words that already have a special meaning in the Python language, so they must not be used as identifiers. The reserved words in Python 3 are:
False      await      else       import     pass
None       break      except     in         raise
True       class      finally    is         return
and        continue   for        lambda     try
as         def        from       nonlocal   while
assert     del        global     not        with
async      elif       if         or         yield
(Note: print and exec were keywords in Python 2; in Python 3 they are ordinary built-in functions.)
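To see the exact list of reserved words for the Python version installed, the standard 'keyword' module can be used (a small sketch):
import keyword
print(keyword.kwlist)          # list of all reserved words
print(len(keyword.kwlist))     # number of reserved words in this Python version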
OPERATORS
An operator is a symbol that performs an operation. The variables or values that an operator acts on are called 'operands'.
Arithmetic operators
They perform basic arithmetic operations. With a = 13 and b = 5:
Operator   Meaning                                     Example   Result
+          Addition                                    a + b     18
-          Subtraction                                 a - b     8
*          Multiplication                              a * b     65
/          Division (always gives a float)             a / b     2.6
//         Floor division (integer part of division)   a // b    2
%          Modulus (remainder of division)             a % b     3
**         Exponent (a to the power of b)              a ** b    371293
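A quick check of the table above, using the same values a = 13 and b = 5:
a, b = 13, 5
print(a + b)    # 18
print(a - b)    # 8
print(a * b)    # 65
print(a / b)    # 2.6
print(a // b)   # 2  (floor division)
print(a % b)    # 3
print(a ** b)   # 371293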
Assignment operators
To assign right side value to a left side variable.
Operator   Example    Meaning
=          z = x+y    Assignment: the result of x+y is stored into z.
+=         z += x     Addition assignment, i.e. z = z + x.
-=         z -= x     Subtraction assignment, i.e. z = z - x.
*=         z *= x     Multiplication assignment, i.e. z = z * x.
/=         z /= x     Division assignment, i.e. z = z / x.
%=         z %= x     Modulus assignment, i.e. z = z % x.
**=        z **= y    Exponentiation assignment, i.e. z = z ** y.
//=        z //= y    Floor division assignment, i.e. z = z // y.
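A minimal sketch of the compound assignment operators in action (the starting value is hypothetical):
z = 10
z += 5     # z = z + 5  -> 15
z -= 3     # z = z - 3  -> 12
z *= 2     # z = z * 2  -> 24
z //= 5    # z = z // 5 -> 4
print(z)   # 4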
Ex:
a=b=c=5
print(a,b,c)
5 5 5
a,b,c=1,2,'Hello'
print(a,b,c)
1 2 Hello
x = [10,11,12]
a,b,c = 1.5, x, -1
print(a,b,c)
1.5 [10, 11, 12] -1
Unary minus operator
The unary minus operator converts a positive value into a negative one and vice versa. Ex: x = 5; print(-x) displays -5.
Relational operators
Relational operators are used to compare two quantities. They return either True or False (bool datatype).
Ex:
a, b = 1, 2
print(a>b)
False
Ex:
1 < 2 < 3 < 4 will give True
1 < 2 > 3 < 4 will give False
Logical operators
Logical operators are useful to construct compound conditions. A compound condition is a combination of
more than one simple condition. 0 is False, any other number is True.
With x = 1 and y = 2:
Operator   Example   Meaning                                                          Result
and        x and y   If x is false, it returns x; otherwise it returns y.             2
or         x or y    If x is false, it returns y; otherwise it returns x.             1
not        not x     If x is false, it returns True; if x is true, it returns False.  False
Ex:
x = 1; y = 2; z = 3
if (x < y or y > z):
    print('Yes')
else:
    print('No')
This displays Yes.
Boolean operators
Boolean operators act upon ‘bool’ type values and they provide ‘bool’ type result. So the result will be again
either True or False.
With x = True and y = False:
Operator   Example   Meaning                                                              Result
and        x and y   Boolean and: returns True only if both x and y are True.             False
or         x or y    Boolean or: returns True if either x or y is True, else False.       True
not        not x     Boolean not: returns False if x is True, and True if x is False.     False
INPUT AND OUTPUT
print() function for output
Example                                                      Output
print()                                                      (a blank line)
print("Hai")                                                 Hai
print("This is the \nfirst line")                            This is the
                                                             first line
print("This is the \\nfirst line")                           This is the \nfirst line
print('Hai'*3)                                               HaiHaiHai
print('City='+"Hyderabad")                                   City=Hyderabad
print(a, b)                                                  1 2
print(a, b, sep=",")                                         1,2
print(a, b, sep='-----')                                     1-----2
print("Hello")                                               Hello
print("Dear")                                                Dear
print("Hello", end='')                                       HelloDear
print("Dear", end='')
a = 2                                                        You typed 2 as input
print('You typed', a, 'as input')
%i, %f, %c and %s can be used as format specifiers. With name = 'Linda' and sal = 12000.50:
print('Hai', name, 'Your salary is', sal)                    Hai Linda Your salary is 12000.5
print('Hai %s, Your salary is %.2f' % (name, sal))           Hai Linda, Your salary is 12000.50
print('Hai {}, Your salary is {}'.format(name, sal))         Hai Linda, Your salary is 12000.5
print('Hai {0}, Your salary is {1}'.format(name, sal))       Hai Linda, Your salary is 12000.5
print('Hai {1}, Your salary is {0}'.format(name, sal))       Hai 12000.5, Your salary is Linda
input() function for accepting keyboard input
Examples:
str = input()
str = input('Enter your name: ')
a = int(input('Enter an int number: '))
a = float(input('Enter a float number: '))
a, b, c = [int(x) for x in input("Enter three numbers: ").split()]
a, b, c = [int(x) for x in input('Enter a,b,c: ').split(',')]
a, b, c = [x for x in input('Enter 3 strings: ').split(',')]
lst = [float(x) for x in input().split(',')]
lst = eval(input('Enter a list: '))
Formal and actual arguments
When a function is defined, it may have some parameters. These parameters are useful to receive values
from outside of the function. They are called ‘formal arguments’. When we call the function, we should pass
data or values to the function. These values are called ‘actual arguments’. In the following code, ‘a’ and ‘b’
are formal arguments and ‘x’ and ‘y’ are actual arguments.
def sum(a, b):      # a, b are formal arguments
    c = a + b
    print(c)

# call the function
x = 10; y = 15
sum(x, y)           # x, y are actual arguments
The actual arguments used in a function call are of 4 types:
□ Positional arguments
□ Keyword arguments
□ Default arguments
□ Variable length arguments
Positional arguments
These are the arguments passed to a function in correct positional order. Here, the number of arguments and
their positions in the function definition should match exactly with the number and position of the argument
in the function call
def attach(s1, s2):                  # function definition
    print(s1, s2)
attach('New', 'York')                # positional arguments
Keyword arguments
Keyword arguments are arguments that identify the parameters by their names.
def grocery(item, price):            # function definition
    print(item, price)
grocery(item='Sugar', price=50.75)   # keyword arguments
Default arguments
We can mention a default value for a function parameter in the definition.
def grocery(item, price=40.00):      # price has a default value
    print(item, price)
grocery(item='Sugar')                # the default value for price is used
Variable length arguments
A variable length argument is an argument that can accept any number of values. The variable length argument is written with a '*' symbol before it in the function definition, as:
def add(farg, *args):   # *args can take 0 or more values
add(5, 10)
add(5, 10, 20, 30)
Here, 'farg' is the formal argument and '*args' represents the variable length argument. We can pass 0 or more values to '*args' and Python collects them all into a tuple.
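A minimal runnable sketch of a variable length argument; the body of add() below is hypothetical:
def add(farg, *args):          # *args collects the extra values into a tuple
    total = farg
    for x in args:
        total = total + x
    print('Total:', total)

add(5, 10)                     # Total: 15
add(5, 10, 20, 30)             # Total: 65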
Function decorators
A decorator is a function that accepts a function as parameter and returns a function. A decorator takes the
result of a function, modifies the result and returns it. Thus decorators are useful to perform some additional
processing required by a function.
1. We should define a decorator function with another function name as parameter.
def decor(fun):
2. We should define a function inside the decorator function. This function actually modifies or decorates the
value of the function passed to the decorator function.
def decor(fun):
    def inner():
        value = fun()        # access value returned by fun()
        return value + 2     # increase the value by 2
    return inner             # return the inner function
3. Return the inner function that has processed or decorated the value. In our example, in the last statement,
we were returning inner() function using return statement. With this, the decorator is completed.
The next question is how to use the decorator. Once a decorator is created, it can be used for any function to
decorate or process its result. For example, let us take num() function that returns some value, e.g. 10.
def num():
    return 10
Now, we should call decor() function by passing num() function name as:
result_fun = decor(num)
So, ‘result_fun’ indicates the resultant function. Call this function and print the result, as:
print(result_fun())
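Putting the three steps together, a complete runnable sketch (decor and num are the same names used above):
def decor(fun):
    def inner():
        value = fun()          # call the original function
        return value + 2       # decorate: add 2 to its result
    return inner

def num():
    return 10

result_fun = decor(num)        # decorate num()
print(result_fun())            # 12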
ARRAYS
To work with arrays, we use numpy (numerical python) package.
For complete help on numpy: https://docs.scipy.org/doc/numpy/reference/
An array is an object that stores a group of elements (values) of the same datatype. Unlike lists, a NumPy array has a fixed size; operations that appear to resize an array actually create a new array.
NOTE: We can use for loops to display the individual elements of the array.
To work with numpy, we should import that module, as:
import numpy
import numpy as np
from numpy import *
Single dimensional (or 1D ) arrays
A 1D array contains one row or one column of elements. For example, the marks of a student in 5 subjects.
Creating single dimensional arrays
Creating arrays in numpy can be done in several ways. Some of the important ways are:
 Using array() function
 Using linspace() function
 Using logspace() function
 Using arange() function
 Using zeros() and ones() functions.
Creating 1D array using array()
To create a 1D array, we should use array() method that accepts list of elements.
Ex: arr = numpy.array([1,2,3,4,5])
Creating 1D array using linspace()
linspace() function is used to create an array with evenly spaced points between a starting point and ending
point. The form of the linspace() is:
linspace(start, stop, n)
'start' represents the starting value and 'stop' the ending value. 'n' is the number of evenly spaced points to generate; if 'n' is omitted, it is taken as 50. Let us take one example to understand this.
a = linspace(0, 10, 5)
In the above statement, we are creating an array 'a' with starting value 0 and ending value 10, divided into 5 evenly spaced points: 0, 2.5, 5, 7.5 and 10. These elements are stored into 'a'. Please remember that both the starting and ending values, 0 and 10, are included.
Creating arrays using logspace
logspace() function is similar to linspace(). The linspace() produces the evenly spaced points. Similarly,
logspace() produces evenly spaced points on a logarithmically spaced scale. logspace is used in the
following format:
logspace(start, stop, n)
The logspace() starts at a value which is 10 power of ‘start’ and ends at a value which is 10 power of ‘stop’.
If ‘n’ is not specified, then its value is taken as 50. For example, if we write:
a = logspace(1, 4, 5)
This creates values starting from 10**1 (i.e. 10) and ending at 10**4 (i.e. 10000). Five points that are evenly spaced on the logarithmic scale are generated and stored into the array 'a'.
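As a rough check (the printed values are approximate), the five points of logspace(1, 4, 5) correspond to the exponents 1, 1.75, 2.5, 3.25 and 4:
from numpy import logspace
a = logspace(1, 4, 5)
print(a)   # approximately [   10.     56.23   316.23  1778.28 10000.  ]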
Creating 1D arrays using arange() function
The arange() function in numpy is same as range() function in Python. The arange() function is used in the
following format:
arange(start, stop, stepsize)
This creates an array with a group of elements from ‘start’ to one element prior to ‘stop’ in steps of
‘stepsize’. If the ‘stepsize’ is omitted, then it is taken as 1. If the ‘start’ is omitted, then it is taken as 0. For
example,
arange(10)
will produce an array with elements 0 to 9.
arange(5, 10, 2)
will produce an array with elements: 5,7,9.
Creating arrays using zeros() and ones() functions
We can use zeros() function to create an array with all zeros. The ones() function is useful to create an array
with all 1s. They are written in the following format:
zeros(n, datatype)
ones(n, datatype)
where ‘n’ represents the number of elements. we can eliminate the ‘datatype’ argument. If we do not specify
the ‘datatype’, then the default datatype used by numpy is ‘float’. See the examples:
zeros(5)
This will create an array with 5 elements all are zeros, as: [ 0. 0. 0. 0. 0. ]. If we want this array in integer
format, we can use ‘int’ as datatype, as:
zeros(5, int)
this will create an array as: [ 0 0 0 0 0 ].
If we use the ones() function, it will create an array with all elements 1. For example,
ones(5, float)
will create an array with 5 float elements, all 1s: [ 1. 1. 1. 1. 1. ].
Arithmetic operations on arrays
Taking an array as an object, we can perform basic operations like +, -, *, /, // and % operations on each
element.
Ex:
import numpy
arr = numpy.array([10, 20, 30])
arr+5
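For example, with the array above, each arithmetic operation is applied element-wise:
import numpy
arr = numpy.array([10, 20, 30])
print(arr + 5)    # [15 25 35]
print(arr * 2)    # [20 40 60]
print(arr % 7)    # [3 6 2]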
Important Mathematical functions in numpy
Function              Meaning
concatenate([a, b])   Joins the arrays a and b and returns the resultant array.
sqrt(arr)             Square root of each element in the array 'arr'.
power(arr, n)         Each element of 'arr' raised to the power 'n'.
exp(arr)              Exponential value of each element in 'arr'.
sum(arr)              Sum of all the elements in 'arr'.
prod(arr)             Product of all the elements in 'arr'.
min(arr)              Smallest element in 'arr'.
max(arr)              Largest element in 'arr'.
mean(arr)             Mean (average) of all elements in 'arr'.
median(arr)           Median of all elements in 'arr'.
std(arr)              Standard deviation of the elements in 'arr'.
argmin(arr)           Index of the smallest element in 'arr' (counting starts from 0).
Ex:
numpy.sort(arr)
numpy.max(arr)
numpy.sqrt(arr)
Aliasing the arrays
If ‘a’ is an array, we can assign it to ‘b’, as:
b=a
This is a simple assignment that does not make any new copy of the array ‘a’. It means, ‘b’ is not a new
array and memory is not allocated to ‘b’. Also, elements from ‘a’ are not copied into ‘b’ since there is no
memory for ‘b’. Then how to understand this assignment statement? We should understand that we are
giving a new name ‘b’ to the same array referred by ‘a’. It means the names ‘a’ and ‘b’ are referencing same
array. This is called ‘aliasing’.
‘Aliasing’ is not ‘copying’. Aliasing means giving another name to the existing object. Hence, any
modifications to the alias object will reflect in the existing object and vice versa.
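A small sketch of aliasing (the array values are hypothetical): modifying through 'b' is visible through 'a' because both names refer to the same array object.
import numpy as np
a = np.array([1, 2, 3, 4])
b = a                # aliasing: b is just another name for a
b[0] = 99
print(a)             # [99  2  3  4]  -> the change is visible through 'a'
print(a is b)        # True: both names refer to the same object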
Viewing and Copying arrays
We can create another array that works on the same data as an existing array. This is done by the view() method. A view is a new array object, but it shares the same underlying elements as the original array; no separate copy of the data is made. If the view is modified, the original array is also modified (and vice versa), since both arrays refer to the same elements.
We can create a view of 'a' as:
b = a.view()
Viewing is a form of copying, but it is called 'shallow copying' because the elements in the view, when modified, will also modify the elements in the original array. So, both arrays act as one and the same.
Suppose we want both the arrays to be independent and modifying one array should not affect another array,
we should go for ‘deep copying’. This is done with the help of copy() method. This method makes a
complete copy of an existing array and its elements. When the newly created array is modified, it will not
affect the existing array or vice versa. There will not be any connection between the elements of the two
arrays.
We can create a copy of ’a’ as:
b = a.copy()
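A minimal sketch contrasting view() (shallow copy) with copy() (deep copy); the array values are hypothetical:
import numpy as np
a = np.array([1, 2, 3, 4])

v = a.view()         # shallow copy: shares the same data
v[0] = 99
print(a)             # [99  2  3  4]  -> original is affected

c = a.copy()         # deep copy: independent data
c[1] = -5
print(a)             # [99  2  3  4]  -> original is not affected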
Multi-dimensional arrays (2D, 3D, etc)
They represent more than one row and more than one column of elements. For example, marks obtained by a
group of students each in five subjects.
Creating multi-dimensional arrays
We can create multi dimensional arrays in the following ways:
 Using array() function
 Using ones() and zeroes() functions
 Using eye() function
 Using reshape() function discussed earlier
Using array() function
To create a 2D array, we can use array() method that contains a list with lists.
ones() and zeros() functions
The ones() function is useful to create a 2D array with several rows and columns where all the elements will
be taken as 1. The format of this function is:
ones((r, c), dtype)
Here, ‘r’ represents the number of rows and ‘c’ represents the number of columns. ‘dtype’ represents the
datatype of the elements in the array. For example,
a = ones((3, 4), float)
will create a 2D array with 3 rows and 4 columns and the datatype is taken as float. If ‘dtype’ is omitted,
then the default datatype taken will be ‘float’. Now, if we display ‘a’, we can see the array as:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
The decimal point after each element represents that the elements are float type.
Just like ones() function, we can also use zeros() function to create a 2D array with elements filled with
zeros. Suppose, we write:
b = zeros((3,4), int)
Then a 2D array with 3 rows and 4 columns will be created where all elements will be 0s, as shown below:
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]]
eye() function
The eye() function creates a 2D array and fills the elements in the diagonal with 1s. The general format of
using this function is:
eye(n, dtype=datatype)
This will create an array with ‘n’ rows and ‘n’ columns. The default datatype is ‘float’. For example, eye(3)
will create a 3x3 array and fills the diagonal elements with 1s as shown below:
a = eye(3)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
Indexing and slicing in 2D arrays
Ex:
import numpy as np
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr[0] gives 0th row -> [1,2,3]
arr[1] gives 1st row -> [4,5,6]
arr[0,1] gives 0th row, 1st column element -> 2
arr[2,1] gives 2nd row, 1st column element -> 8
a = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
a[0:2, 0:3] -> 0th row to 1st row, 0th column to 2nd column
array([[1, 2, 3],
[5, 6, 7]])
a[1:3, 2:] -> 1st row to 2nd row, 2nd column to last column
array([[ 7, 8],
[11, 12]])
INTRODUCTION TO OOPS
Procedure oriented approach
The main and sub tasks are represented by functions and procedures.
Ex: C, Pascal, FORTRAN.
Object oriented approach
The main and sub tasks are represented by classes.
Ex: C++, Java, Python.
Differences between POA and OOA
POA vs OOA:
1. POA: There is no code reusability; for every new task, the programmer needs to develop a new function. OOA: We can create sub classes from existing classes and reuse them.
2. POA: One function may call another function, so debugging becomes difficult as we have to check every function. OOA: Every class is independent, so it can be debugged without disturbing other classes.
3. POA: The approach is not modeled on the real world, so it is harder to learn and use. OOA: Modeled on real-world objects, so it is easier to understand and handle.
4. POA: Programmers tend to lose control once the code grows to roughly 10,000 to 1,00,000 lines, so it is not suitable for bigger and complex projects. OOA: Suitable for handling bigger and complex projects.
Features of OOPS
1. classes and objects
2. encapsulation
3. abstraction
4. inheritance
5. polymorphism
Classes and objects
An object is anything that really exists in the world. An object has attributes (represented by variables) and actions (represented by methods); together these form its behaviour.
A group of objects having the same behaviour belongs to the same class or category.
A class is a model for creating objects. An object exists physically, but a class does not exist physically. A class also contains variables and methods.
Def: A class is a specification of the behaviour of a group of objects.
Def: An object is an instance (physical form) of a class.
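A minimal sketch of a class and an object; the Student class, its variables and the values used are hypothetical:
class Student:
    def __init__(self, name, marks):
        self.name = name           # instance variables (attributes)
        self.marks = marks

    def display(self):             # instance method (action)
        print(self.name, self.marks)

s1 = Student('Vishnu', 900)        # s1 is an object (instance) of Student
s1.display()                       # Vishnu 900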
Self variable
‘self’ is a default variable that contains the memory address of the instance of the current class. So, we can
use ‘self’ to refer to all the instance variables and instance methods.
Constructor
A constructor is a special method that is used to initialize the instance variables of a class. In the constructor,
we create the instance variables and initialize them with some starting values. The first parameter of the
constructor will be ‘self’ variable that contains the memory address of the instance.
A constructor may or may not have parameters.
Ex:
def __init__(self):                # default constructor
    self.name = 'Vishnu'
    self.marks = 900
Ex:
def __init__(self, n='', m=0):     # parameterized constructor with 2 parameters
    self.name = n
    self.marks = m
Types of variables
The variables which are written inside a class are of 2 types:
 Instance variables
 Class variables or Static variables
Instance variables are the variables whose separate copy is created in every instance (or object). Instance
variables are defined and initialized using a constructor with ‘self’ parameter. Also, to access instance
variables, we need instance methods with ‘self’ as first parameter. Instance variables can be accessed as:
obj.var
Unlike instance variables, class variables are the variables whose single copy is available to all the instances
of the class. If we modify the copy of class variable in an instance, it will modify all the copies in the other
instances. A class method contains first parameter by default as ‘cls’ with which we can access the class
variables. For example, to refer to the class variable ‘x’, we can use ‘cls.x’.
NOTE: Class variables are also called 'static variables'. Class methods are marked with the decorator @classmethod.
NOTE: Class variables can be accessed either as obj.var or as classname.var; instance variables are accessed through the object, as obj.var.
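A small sketch contrasting the two kinds of variables; the Employee class and the values used are hypothetical:
class Employee:
    company = 'XYZ'                # class (static) variable: one copy shared by all objects

    def __init__(self, name):
        self.name = name           # instance variable: separate copy per object

    @classmethod
    def show_company(cls):
        print(cls.company)         # class variables are accessed through cls

e1 = Employee('Anil')
e2 = Employee('Hema')
print(e1.name, e2.name)              # Anil Hema
Employee.show_company()              # XYZ
print(e1.company, Employee.company)  # XYZ XYZ  (class variable reachable both ways)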
Namespaces
A namespace represents a memory block where names are mapped (or linked) to objects. A class maintains
its own namespace, called ‘class namespace’. In the class namespace, the names are mapped to class
variables. Similarly, every instance (object) will have its own name space, called ‘instance namespace’. In
the instance namespace, the names are mapped to instance variables.
When we modify a class variable in the class namespace, its modified value is available to all instances.
When we modify a class variable in the instance namespace, then it is confined to only that instance. Its
modified value will not be available to other instances.
Types of methods
By this time, we got some knowledge about the methods written in a class. The purpose of a method is to
process the variables provided in the class or in the method. We already know that the variables declared in
the class are called class variables (or static variables) and the variables declared in the constructor are called
instance variables. We can classify the methods in the following 3 types:
Instance methods
(a) Accessor methods
(b) Mutator methods
Class methods
Static methods
An instance method acts on instance variables. There are two types of methods.
1. Accessor methods: They read the instance vars. They do not modify them. They are also called getter()
methods.
2. Mutator methods: They not only read but also modify the instance vars. They are also called setter()
methods.
PROGRAMS
4. Create getter and setter methods for a Manager with name and salary instance variables.
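A possible sketch for program 4 (the Manager class below is one way of doing it, not the only solution):
class Manager:
    def __init__(self, name='', salary=0.0):
        self.name = name
        self.salary = salary

    # mutator (setter) methods
    def set_name(self, name):
        self.name = name
    def set_salary(self, salary):
        self.salary = salary

    # accessor (getter) methods
    def get_name(self):
        return self.name
    def get_salary(self):
        return self.salary

m = Manager()
m.set_name('Gaurav')
m.set_salary(50000.0)
print(m.get_name(), m.get_salary())   # Gaurav 50000.0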
Static methods
We need static methods when the processing is at class level but we need not involve the class or instances.
Static methods are used when some processing is related to the class but does not need the class or its
instances to perform any work. For example, setting environmental variables, counting the number of
instances of the class or changing an attribute in another class etc. are the tasks related to a class. Such tasks
are handled by static methods. Static methods are written with a decorator @staticmethod above them. Static
methods are called in the form of classname.method().
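A minimal sketch of a static method that counts the number of instances created; the class name and counter are hypothetical:
class Account:
    count = 0                      # class-level counter

    def __init__(self):
        Account.count += 1

    @staticmethod
    def how_many():                # no self or cls parameter
        print('Instances created:', Account.count)

a1 = Account()
a2 = Account()
Account.how_many()                 # Instances created: 2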
Inner classes
Writing a class within another class is called inner class or nested class. For example, if we write class B
inside class A, then B is called inner class or nested class. Inner classes are useful when we want to sub
group the data of a class.
Encapsulation
Bundling up of data and methods as a single unit is called ‘encapsulation’. A class is an example for
encapsulation.
Abstraction
Hiding unnecessary data from the user is called ‘abstraction’. By default all the members of a class are
‘public’ in Python. So they are available outside the class. To make a variable private, we use double
underscore before the variable. Then it cannot be accessed from outside of the class. To access it from
outside the class, we should use: obj._Classname__var. This is called name mangling.
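A small sketch of a private variable and name mangling; the class and variable names are hypothetical:
class Bank:
    def __init__(self):
        self.__balance = 1000      # double underscore makes the variable private

b = Bank()
# print(b.__balance)               # would raise AttributeError
print(b._Bank__balance)            # 1000, accessed via name mangling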
Inheritance
Creating new classes from existing classes in such a way that all the features of the existing classes are
available to the newly created classes – is called ‘inheritance’. The existing class is called ‘base class’ or
‘super class’. The newly created class is called ‘sub class’ or ‘derived class’.
Sub class object contains a copy of the super class object. The advantage of inheritance is ‘reusability’ of
code. This increases the overall performance of the organization.
Syntax: class Subclass(Baseclass):
Constructors in inheritance
In the previous programs, we have inherited the Student class from the Teacher class. All the methods and
the variables in those methods of the Teacher class (base class) are accessible to the Student class (sub
class). The constructors of the base class are also accessible to the sub class.
When the programmer writes a constructor in the sub class, then only the sub class constructor will get
executed. In this case, super class constructor is not executed. That means, the sub class constructor is
replacing the super class constructor. This is called constructor overriding.
super() method
super() is a built-in method which is useful to call the super class constructor or methods from the sub class.
super().__init__() # call super class constructor
super().__init__(arguments) # call super class constructor and pass arguments
super().method() # call super class method
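A minimal sketch of single inheritance where the sub class constructor calls the super class constructor through super(); the Teacher and Student classes echo the names used above, but their variables are hypothetical:
class Teacher:
    def __init__(self, name):
        self.name = name
    def display(self):
        print('Name:', self.name)

class Student(Teacher):
    def __init__(self, name, marks):
        super().__init__(name)     # call the super class constructor
        self.marks = marks
    def display(self):
        super().display()          # call the super class method
        print('Marks:', self.marks)

s = Student('Laxmi', 450)
s.display()                        # Name: Laxmi  /  Marks: 450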
Types of inheritance
There are two types:
1. Single inheritance: deriving sub class from a single super class.
Syntax: class Subclass(Baseclass):
2. Multiple inheritance: deriving sub class from more than one super class.
Syntax: class Subclass(Baseclass1, Baseclass2, … ):
NOTE: ‘object’ is the super class for all classes in Python.
Polymorphism
poly + morphos = many + forms
If something exists in various forms, it is called ‘Polymorphism’. If an operator or method performs various
tasks, it is called polymorphism.
Ex:
Duck typing: Calling a method on any object without knowing the type (class) of the object.
Operator overloading: same operator performing more than one task.
Method overloading: same method performing more than one task.
Method overriding: executing only sub class method in the place of super class method.
ABSTRACT CLASSES AND INTERFACES
An abstract method is a method whose action is redefined in the sub classes as per the requirement of the
objects. Generally abstract methods are written without body since their body will be defined in the sub
classes
anyhow. But it is possible to write an abstract method with body also. To mark a method as abstract, we
should use the decorator @abstractmethod. On the other hand, a concrete method is a method with body.
An abstract class is a class that generally contains some abstract methods. PVM cannot create objects to an
abstract class.
Once an abstract class is written, we should create sub classes and all the abstract methods should be
implemented (body should be written) in the sub classes. Then, it is possible to create objects to the sub
classes.
A meta class is a class that defines the behavior of other classes. Any abstract class should be derived from
the meta class ABC that belongs to ‘abc’ module. So import this module, as:
from abc import ABC, abstractmethod
(or) from abc import *
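A minimal sketch of an abstract class and a concrete sub class using the abc module described above; the Shape and Circle names and the area logic are hypothetical:
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):                # abstract method: no useful body here
        pass

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):                # implementation provided in the sub class
        return 3.14159 * self.r * self.r

c = Circle(2)                      # objects can be created only for the sub class
print(c.area())                    # 12.56636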
Interfaces in Python
We learned that an abstract class is a class which contains some abstract methods as well as concrete
methods also. Imagine there is a class that contains only abstract methods and there are no concrete methods.
It becomes an interface. This means an interface is an abstract class but it contains only abstract methods.
None of the methods in the interface will have body. Only method headers will be written in the interface.
So an interface can be defined as a specification of method headers. Since, we write only abstract methods in
the interface, there is possibility for providing different implementations (body) for those abstract methods
depending on the requirements of objects. In Python, we have to use abstract classes as interfaces.
Since an interface contains methods without body, it is not possible to create objects to an interface. In this
case, we can create sub classes where we can implement all the methods of the interface. Since the sub
classes will have all the methods with body, it is possible to create objects to the sub classes. The flexibility
lies in the fact that every sub class can provide its own implementation for the abstract methods of the
interface.
EXCEPTIONS
An exception is a runtime error which can be handled by the programmer. That means if the programmer can
guess an error in the program and he can do something to eliminate the harm caused by that error, then it is
called an ‘exception’. If the programmer cannot do anything in case of an error, then it is called an ‘error’
and not an exception.
All exceptions are represented as classes in Python. The exceptions which are already available in Python are called 'built-in' exceptions. The base class for all built-in exceptions is the 'BaseException' class, from which the sub class 'Exception' is derived. Almost all built-in errors are defined as sub classes of Exception, and all warnings are derived from the 'Warning' class (itself a sub class of Exception).
An error should compulsorily be handled, otherwise the program will not continue executing. A warning represents a caution and, even though it is not handled, the program will still execute. So, warnings can be neglected but errors cannot be neglected.
Just like the exceptions which are already available in Python language, a programmer can also create his
own exceptions, called ‘user-defined’ exceptions. When the programmer wants to create his own exception
class, he should derive his class from the 'Exception' class and not from the 'BaseException' class.
Exception handling
The purpose of handling the errors is to make the program robust. The word ‘robust’ means ‘strong’. A
robust program does not terminate in the middle. Also, when there is an error in the program, it will display
an appropriate message to the user and continue execution. Designing the programs in this way is needed in
any software development. To handle exceptions, the programmer should perform the following 3 tasks:
Step 1: The programmer should observe the statements in his program where there may be a possibility of
exceptions. Such statements should be written inside a ‘try’ block. A try block looks like as follows:
try:
statements
The greatness of try block is that even if some exception arises inside it, the program will not be terminated.
When PVM understands that there is an exception, it jumps into an ‘except’ block.
Step 2: The programmer should write the ‘except’ block where he should display the exception details to the
user. This helps the user to understand that there is some error in the program. The programmer should also
display a message regarding what can be done to avoid this error. Except block looks like as follows:
except exceptionname:
statements # these statements form handler
The statements written inside an except block are called ‘handlers’ since they handle the situation when the
exception occurs.
Step 3: Lastly, the programmer should perform clean up actions like closing the files and terminating any
other processes which are running. The programmer should write this code in the finally block. Finally block
looks like as follows:
finally:
statements
The specialty of finally block is that the statements inside the finally block are executed irrespective of
whether there is an exception or not. This ensures that all the opened files are properly closed and all the
running processes are properly terminated. So, the data in the files will not be corrupted and the user is at the
safe-side.

However, the complete exception handling syntax will be in the following format:
try:
statements
except Exception1:
handler1
except Exception2:
handler2
else:
statements
finally:
statements
‘try’ block contains the statements where there may be one or more exceptions. The subsequent ‘except’
blocks handle these exceptions. When ‘Exception1’ occurs, ‘handler1’ statements are executed. When
'Exception2' occurs, 'handler2' statements are executed, and so forth. If no exception is raised, the
statements inside the ‘else’ block are executed. Even if the exception occurs or does not occur, the code
inside ‘finally’ block is always executed. The following points are noteworthy:
- A single try block can be followed by several except blocks.
- Multiple except blocks can be used to handle multiple exceptions.
- We cannot write except blocks without a try block.
- We can write a try block without any except blocks.
- The else block and the finally block are not compulsory.
- When there is no exception, the else block is executed after the try block.
- The finally block is always executed.
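A complete runnable sketch of the try-except-else-finally structure (the division example is hypothetical):
try:
    a = 10
    b = 0
    c = a / b                      # raises ZeroDivisionError
except ZeroDivisionError:
    print('Cannot divide by zero') # handler for this exception
except ValueError:
    print('Invalid value')
else:
    print('Result is', c)          # runs only when no exception is raised
finally:
    print('Clean-up done')         # always executed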
FILES IN PYTHON
A file represents storage of data. A file stores data permanently so that it is available to all the programs.
Types of files in Python
In Python, there are 2 types of files. They are:
 Text files
 Binary files
Text files store the data in the form of characters. For example, if we store employee name “Ganesh”, it will
be stored as 6 characters and the employee salary 8900.75 is stored as 7 characters. Normally, text files are
used to store characters or strings.
Binary files store entire data in the form of bytes, i.e. a group of 8 bits each. For example, a character is
stored as a byte and an integer is stored in the form of 8 bytes (on a 64 bit machine). When the data is
retrieved from the binary file, the programmer can retrieve the data as bytes. Binary files can be used to store
text, images, audio and video.
Opening a file
We should use open() function to open a file. This function accepts ‘filename’ and ‘open mode’ in which to
open the file.
file handler = open(“file name”, “open mode”, “buffering”)
Ex: f = open(“myfile.txt”, “w”)
Here, the ‘file name’ represents a name on which the data is stored. We can use any name to reflect the
actual data. For example, we can use ‘empdata’ as file name to represent the employee data. The file ‘open
mode’ represents the purpose of opening the file. The following table specifies the file open modes and their
meanings.
File open mode   Description
w                To write data into a file. If any data is already present in the file, it is deleted and the new data is stored.
r                To read data from the file. The file pointer is positioned at the beginning of the file.
a                To append data to the file. Appending means adding at the end of the existing data. The file pointer is placed at the end of the file. If the file does not exist, a new file is created for writing data.
w+               To write and read data of a file. The previous data in the file is deleted.
r+               To read and write data into a file. The previous data in the file is not deleted. The file pointer is placed at the beginning of the file.
a+               To append and read data of a file. The file pointer is at the end of the file if the file exists. If the file does not exist, a new file is created for reading and writing.
x                To open the file in exclusive creation mode. The file creation fails if the file already exists.
The above Table represents file open modes for text files. If we attach ‘b’ for them, they represent modes for
binary files. For example, wb, rb, ab, w+b, r+b, a+b are the modes for binary files.
A buffer represents a temporary block of memory. ‘buffering’ is an optional integer used to set the size of
the buffer for the file. If we do not mention any buffering integer, then the default buffer size used is 4096 or
8192 bytes.
Closing a file
A file which is opened should be closed using close() method as:
f.close()
Files with characters
To write a group of characters (string), we use: f.write(str)
To read a group of characters (string), we use: str = f.read()
PROGRAMS
25. Create a file and store a group of chars.
26. Read the chars from the file.
Files with strings
To write a group of strings into a file, we need a loop that repeats: f.write(str+”\n”)
To read all strings from a file, we can use: str = f.read()
Knowing whether a file exists or not
The operating system (os) module has a sub module by the name ‘path’ that contains a method isfile(). This
method can be used to know whether a file that we are opening really exists or not. For example,
os.path.isfile(fname) gives True if the file exists, otherwise False. We can use it as:
import os, sys
if os.path.isfile(fname):     # if the file exists,
    f = open(fname, 'r')      # open it
else:
    print(fname + ' does not exist')
    sys.exit()                # terminate the program
with statement
‘with’ statement can be used while opening a file. The advantage of with statement is that it will take care of
closing a file which is opened by it. Hence, we need not close the file explicitly. In case of an exception also,
‘with’ statement will close the file before the exception is handled. The format of using ‘with’ is:
with open(“filename”, “openmode”) as fileobject:
Ex: writing into a file
# with statement to open a file
with open('sample.txt', 'w') as f:
    f.write('I am a learner\n')
    f.write('Python is attractive\n')
Ex: reading from a file
# using with statement to open a file
with open('sample.txt', 'r') as f:
    for line in f:
        print(line)
DATA ANALYSIS USING PANDAS
Data Science
To work with datascience, we need the following packages to be installed:
C:\> pip install pandas
C:\> pip install xlrd //to extract data from Excel sheets
C:\> pip install matplotlib
Data plays an important role in our lives. For example, a chain of hospitals contains data related to the medical reports and prescriptions of its patients. A bank holds the transaction details of thousands of customers. Share market data captures minute-to-minute changes in share values. In this way, the entire world revolves around huge amounts of data.
Every piece of data is precious as it may affect the business organization which is using that data. So, we
need some mechanism to store that data. Moreover, data may come from various sources. For example in a
business organization, we may get data from Sales department, Purchase department, Production department,
etc. Such data is stored in a system called ‘data warehouse’. We can imagine data warehouse as a central
repository of integrated data from different sources.
Once the data is stored, we should be able to retrieve it based on some pre-requisites. A business company
wants to know about how much amount they spent in the last 6 months on purchasing the raw material or
how many items found defective in their production unit. Such data cannot be easily retrieved from the huge
data available in the data warehouse. We have to retrieve the data as per the needs of the business
organization. This is called data analysis or data analytics where the data that is retrieved will be analyzed to
answer the questions raised by the management of the organization. A person who does data analysis is
called ‘data analyst’.
Once the data is analyzed, it is the duty of the IT professional to present the results in the form of pictures or
graphs so that the management will be able to understand it easily. Such graphs will also help them to
forecast the future of their company. This is called data visualization. The primary goal of data visualization
is to communicate information clearly and efficiently using statistical graphs, plots and diagrams.
Data science is a term used for techniques to extract information from the data warehouse, analyze them and
present the necessary data to the business organization in order to arrive at important conclusions and
decisions. A person who is involved in this work is called ‘data scientist’. We can find important differences
between the roles of data scientist and data analyst in following table:
Data Scientist vs Data Analyst:
- A data scientist formulates the questions that will help a business organization and then proceeds to solve them; a data analyst receives questions from the business team and provides answers to them.
- A data scientist has strong data visualization skills and the ability to convert data into a business story; a data analyst simply analyzes the data and provides the information requested by the team.
- A data scientist needs strength in mathematics, statistics and programming languages such as Python and R; a data analyst needs strength in data warehousing, big data concepts, SQL and business intelligence.
- A data scientist estimates unknown information from the known data; a data analyst looks at the known data from a new perspective.
Please see the following sample data in the excel file: empdata.xlsx.
CREATING DATA FRAMES
is possible from csv files, excel files, python dictionaries, tuples list, json data etc.
Creating data frame from .csv file
>>> import pandas as pd
>>> df = pd.read_csv("f:\\python\PANDAS\empdata.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
Creating data frame from .xlsx file
>>> df1 = pd.read_excel("f:\\python\PANDAS\empdata.xlsx", "Sheet1")
>>> df1
empid ename sal doj
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-20
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2000-09-10
4 1005 Laxmi Prasanna 12000.75 2000-10-08
5 1006 Anant Nag 9999.99 1999-09-09
Creating data frame from a dictionary
>>> empdata = {"empid": [1001, 1002, 1003, 1004, 1005, 1006],
"ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta", "Hema Chandra", "Laxmi Prasanna", "Anant
Nag"],
"sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
"doj": ["10-10-2000", "3-20-2002", "3-3-2002", "9-10-2000", "10-8-2000", "9-9-1999"]}
>>> df2 = pd.DataFrame(empdata)
>>> df2
doj empid ename sal
0 10-10-2000 1001 Ganesh Rao 10000.00
1 3-20-2002 1002 Anil Kumar 23000.50
2 3-3-2002 1003 Gaurav Gupta 18000.33
3 9-10-2000 1004 Hema Chandra 16500.50
4 10-8-2000 1005 Laxmi Prasanna 12000.75
5 9-9-1999 1006 Anant Nag 9999.99
Creating data frame from a list of tuples
>>> empdata = [(1001, 'Ganesh Rao', 10000.00, '10-10-2000'),
(1002, 'Anil Kumar', 23000.50, '3-20-2002'),
(1003, 'Gaurav Gupta', 18000.33, '03-03-2002'),
(1004, 'Hema Chandra', 16500.50, '10-09-2000'),
(1005, 'Laxmi Prasanna', 12000.75, '08-10-2000'),
(1006, 'Anant Nag', 9999.99, '09-09-1999')]
>>> df3 = pd.DataFrame(empdata, columns=["eno", "ename", "sal", "doj"])
>>> df3
eno ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-2000
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-2002
3 1004 Hema Chandra 16500.50 10-09-2000
4 1005 Laxmi Prasanna 12000.75 08-10-2000
5 1006 Anant Nag 9999.99 09-09-1999
BASIC OPERATIONS ON DATAFRAMES
(Data analysis)
For all operations please refer to:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
df = pd.read_csv("f:\\python\PANDAS\empdata.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
1. To know the no. of rows and cols – shape
>>> df.shape
(6, 4)
>>> r, c = df.shape
>>> r
6
2. To display the first or last 5 rows – head(), tail()
>>> df.head()
>>> df.tail()
To display the first 2 or last 2 rows
>>> df.head(2)
>>> df.tail(2)
3. Displaying range of rows – df[2:5]
To display 2nd row to 4th row:
>>> df[2:5]
To display all rows:
>>> df[:]
>>> df
4. To display column names – df.columns
>>> df.columns
Index(['empid', 'ename', 'sal', 'doj'], dtype='object')
5. To display column data – df.columname
>>> df.empid (or)
>>> df['empid']
>>> df.sal (or)
>>> df['sal']
6. To display multiple column data – df[[list of colnames]]
>>> df[['empid', 'ename']]
empid ename
0 1001 Ganesh Rao
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
5 1006 Anant Nag
7. Finding maximum and minimum – max() and min()
>>> df['sal'].max()
23000.5
>>> df['sal'].min()
9999.9899999999998
8. To display statistical information on numerical cols – describe()
>>> df.describe()
empid sal
count 6.000000 6.000000
mean 1003.500000 14917.011667
std 1.870829 5181.037711
min 1001.000000 9999.990000
25% 1002.250000 10500.187500
50% 1003.500000 14250.625000
75% 1004.750000 17625.372500
max 1006.000000 23000.500000
9. Show all rows with a condition
To display all rows where sal>10000
>>> df[df.sal>10000]
empid ename sal doj
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
To retrieve the row where salary is maximum
>>> df[df.sal == df.sal.max()]
empid ename sal doj
1 1002 Anil Kumar 23000.5 3-20-2002
10. To show only cols of rows based on condition
>>> df[['empid', 'ename']][df.sal>10000]
empid ename
1 1002 Anil Kumar
2 1003 Gaurav Gupta
3 1004 Hema Chandra
4 1005 Laxmi Prasanna
11. To know the index range - index
>>> df.index
RangeIndex(start=0, stop=6, step=1)
12. To change the index to a column – set_index()
>>> df1 = df.set_index('empid')
(or) to modify the same Data Frame:
>>> df.set_index('empid', inplace=True)
>>> df
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 3-20-2002
1003 Gaurav Gupta 18000.33 03-03-02
1004 Hema Chandra 16500.50 10-09-00
1005 Laxmi Prasanna 12000.75 08-10-00
1006 Anant Nag 9999.99 09-09-99
NOTE: Now it is possible to search on empid value using loc[].
>>> df.loc[1004]
ename Hema Chandra
sal 16500.5
doj 10-09-00
Name: 1004, dtype: object
13. To reset the index back – reset_index()
>>> df.reset_index(inplace=True)
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 3-20-2002
2 1003 Gaurav Gupta 18000.33 03-03-02
3 1004 Hema Chandra 16500.50 10-09-00
4 1005 Laxmi Prasanna 12000.75 08-10-00
5 1006 Anant Nag 9999.99 09-09-99
HANDLING MISSING DATA
Read .csv file data into Data Frame
>>> df = pd.read_csv("f:\\python\PANDAS\empdata1.csv")
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 10-10-00
1 1002 Anil Kumar 23000.50 03-03-02
2 1003 NaN 18000.33 03-03-02
3 1004 Hema Chandra NaN NaN
4 1005 Laxmi Prasanna 12000.75 10-08-00
5 1006 Anant Nag 9999.99 09-09-99
To set the empid as index – set_index()
>>> df.set_index('empid', inplace=True)
>>> df
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1003 NaN 18000.33 03-03-02
1004 Hema Chandra NaN NaN
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
To fill the NaN values by 0 – fillna(0)
>>> df1 = df.fillna(0)
>>> df1
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1003 0 18000.33 03-03-02
1004 Hema Chandra 0.00 0
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
To fill columns with different data – fillna(dictionary)
>>> df1 = df.fillna({'ename': 'Name missing', 'sal': 0.0, 'doj':'00-00-00'})
>>> df1
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1003 Name missing 18000.33 03-03-02
1004 Hema Chandra 0.00 00-00-00
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
To delete all rows with NaN values – dropna()
>>> df1 = df.dropna()
>>> df1
ename sal doj
empid
1001 Ganesh Rao 10000.00 10-10-00
1002 Anil Kumar 23000.50 03-03-02
1005 Laxmi Prasanna 12000.75 10-08-00
1006 Anant Nag 9999.99 09-09-99
SORTING THE DATA
Read .csv file data into Data Frame and indicate to consider ‘doj’ as date type field
>>> df = pd.read_csv("f:\\python\PANDAS\empdata2.csv", parse_dates=['doj'])
>>> df
empid ename sal doj
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09
To sort on a column – sort_values(colname)
>>> df1 = df.sort_values('doj')
>>> df1
empid ename sal doj
5 1006 Anant Nag 9999.99 1999-09-09
4 1005 Laxmi Prasanna 12000.75 2000-08-10
0 1001 Ganesh Rao 10000.00 2000-10-10
1 1002 Anil Kumar 23000.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
3 1004 Hema Chandra 16500.50 2002-03-03
NOTE: To sort in descending order:
>>> df1 = df.sort_values('doj', ascending=False)
To sort multiple columns differently – sort_values(by =[], ascending = [])
To sort on ‘doj’ in descending order and in that on ‘sal’ in ascending order:
>>> df1 = df.sort_values(by=['doj', 'sal'], ascending=[False, True])
>>> df1
empid ename sal doj
3 1004 Hema Chandra 16500.50 2002-03-03
2 1003 Gaurav Gupta 18000.33 2002-03-03
1 1002 Anil Kumar 23000.50 2002-03-03
0 1001 Ganesh Rao 10000.00 2000-10-10
4 1005 Laxmi Prasanna 12000.75 2000-08-10
5 1006 Anant Nag 9999.99 1999-09-09
Data Wrangling
Data wrangling, also known as data munging, is the process of cleaning, transforming, and organizing raw
data into a format that is suitable for analysis. Python is a popular language for data wrangling due to its
powerful libraries and tools. Below is an overview of the key steps and libraries used in data wrangling with
Python:
Key Steps in Data Wrangling
1. Data Collection: Gather data from various sources (e.g., CSV files, databases, APIs, web scraping).
2. Data Cleaning: Handle missing values, remove duplicates, correct inconsistencies, and fix errors.
3. Data Transformation: Reshape, aggregate, or filter data to make it suitable for analysis.
4. Data Integration: Combine data from multiple sources.
5. Data Validation: Ensure data quality and consistency.
6. Data Export: Save the cleaned and transformed data into a usable format (e.g., CSV, Excel,
database).
Python Libraries for Data Wrangling
1. Pandas: The most widely used library for data manipulation and analysis.
o Key features: DataFrames, handling missing data, merging datasets, reshaping data.
o Example: import pandas as pd
2. NumPy: Used for numerical computations and handling arrays.
o Example: import numpy as np
3. OpenPyXL: For working with Excel files.
o Example: from openpyxl import Workbook
4. SQLAlchemy: For interacting with databases.
o Example: from sqlalchemy import create_engine
5. BeautifulSoup and Requests: For web scraping and collecting data from websites.
o Example: from bs4 import BeautifulSoup, import requests
6. PySpark: For handling large-scale data wrangling tasks in distributed environments.
Common Data Wrangling Tasks in Python
Loading Data
import pandas as pd

# Load CSV file
df = pd.read_csv('data.csv')

# Load Excel file
df = pd.read_excel('data.xlsx')

# Load data from a database
from sqlalchemy import create_engine
engine = create_engine('sqlite:///database.db')
df = pd.read_sql('SELECT * FROM table_name', engine)
Inspecting Data
# View the first few rows
print(df.head())

# Get summary statistics
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Check data types
print(df.dtypes)
Handling Missing Data
# Drop rows with missing values
df_cleaned = df.dropna()

# Fill missing values with a specific value
df_filled = df.fillna(0)

# Fill missing values with the mean
df['column_name'].fillna(df['column_name'].mean(), inplace=True)
Data Transformation
# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Filter rows based on a condition
df_filtered = df[df['column_name'] > 10]

# Apply a function to a column
df['new_column'] = df['column_name'].apply(lambda x: x * 2)

# Group by and aggregate
df_grouped = df.groupby('category').agg({'value': 'sum'})
Merging Data
# Merge two DataFrames
df_merged = pd.merge(df1, df2, on='key_column', how='inner')

# Concatenate DataFrames
df_concat = pd.concat([df1, df2], axis=0)

Exporting Data
# Save the cleaned data to a CSV file
df.to_csv('cleaned_data.csv', index=False)

Visualizing Data is covered in the next section.
DATA VISUALIZATION USING MATPLOTLIB
Complete reference is available at:
https://matplotlib.org/api/pyplot_summary.html
CREATE DATAFRAME FROM DICTIONARY
>>> empdata = {"empid": [1001, 1002, 1003, 1004, 1005, 1006],
"ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta", "Hema Chandra", "Laxmi Prasanna", "Anant
Nag"],
"sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
"doj": ["10-10-2000", "3-20-2002", "3-3-2002", "9-10-2000", "10-8-2000", "9-9-1999"]}
>>> import pandas as pd
>>> df = pd.DataFrame(empdata)
TAKE ONLY THE COLUMNS TO PLOT
>>> x = df['empid']
>>> y = df['sal']
DRAW THE BAR GRAPH
Bar chart shows data in the form of bars. It is useful for comparing values.
>>> import matplotlib.pyplot as plt
>>> plt.bar(x,y)
<Container object of 6 artists>
>>> plt.xlabel('employee id nos')
Text(0.5,0,'employee id nos')
>>> plt.ylabel('employee salaries')
Text(0,0.5,'employee salaries')
>>> plt.title('XYZ COMPANY')
Text(0.5,1,'XYZ COMPANY')
>>> plt.legend()
>>> plt.show()
CREATING BAR GRAPHS FROM MORE THAN ONE DATA SET
For example, we can plot the empid and salaries from 2 departments: Sales team and Production team.
import matplotlib.pyplot as plt
x = [1001, 1002, 1003, 1004, 1005, 1006]
y = [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99]
x1 = [1010, 1011, 1012, 1013, 1014, 1015]
y1 = [5000, 6000, 4500.00, 12000, 9000, 10000]
plt.bar(x,y, label='Sales dept', color='red')
plt.bar(x1,y1, label='Production dept', color='green')
plt.xlabel('emp id')
plt.ylabel('salaries')
plt.title('XYZ COMPANY')
plt.legend()
plt.show()
CREATING HISTOGRAM
Histogram shows distributions of values. Histogram is similar to bar graph but it is useful to show values
grouped in bins or intervals.
NOTE: histtype : {‘bar’, ‘barstacked’, ‘step’, ‘stepfilled’},
import matplotlib.pyplot as plt
emp_ages = [22,45,30,60,60,56,60,45,43,43,50,40,34,33,25,19]
bins = [0,10,20,30,40,50,60]
plt.hist(emp_ages, bins, histtype='bar', rwidth=0.8, color='cyan')
plt.xlabel('employee ages')
plt.ylabel('No. of employees')
plt.title('XYZ COMPANY')
plt.legend()
plt.show()
CREATING A PIE CHART
A pie chart shows a circle that is divided into sectors that each represents a proportion of the whole.
To display no. of employees of different departments in a company.
import matplotlib.pyplot as plt
slices = [50, 20, 15, 15]
depts = ['Sales', 'Production', 'HR', 'Finance']
cols = ['cyan', 'magenta', 'blue', 'red']
plt.pie(slices, labels=depts, colors=cols, startangle=90, shadow=True,
explode= (0, 0.2, 0, 0), autopct='%.1f%%')
plt.title('XYZ COMPANY')
plt.show()
Feature Engineering and Selection
Feature engineering and feature selection are critical steps in the machine learning pipeline. They involve
creating new features from raw data and selecting the most relevant features to improve model performance.
Python provides powerful libraries and tools to perform these tasks effectively.
Feature Engineering
Feature engineering is the process of creating new features or transforming existing ones to better represent
the underlying problem and improve model performance.
Common Techniques for Feature Engineering
Handling Missing Values:
o Fill with mean, median, or mode.
o Use advanced techniques like KNN imputation.
df['column'].fillna(df['column'].mean(), inplace=True)
Encoding Categorical Variables:
 One-Hot Encoding: Convert categorical variables into binary columns.
 Label Encoding: Convert categories into numerical labels.
# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_column'] = le.fit_transform(df['category_column'])
Scaling and Normalization:
 Standardization: Scale features to have zero mean and unit variance.
 Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1).
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])

# Min-Max Scaling
minmax_scaler = MinMaxScaler()
df['scaled_column'] = minmax_scaler.fit_transform(df[['column']])
Creating Interaction Features:
 Combine two or more features to create new ones.
df['interaction_feature'] = df['feature1'] * df['feature2']

Binning:
 Convert continuous variables into discrete bins.
df['binned_column'] = pd.cut(df['continuous_column'], bins=5)

Date/Time Feature Extraction:


 Extract useful information from date/time columns (e.g., day, month, year).
df['year'] = pd.to_datetime(df['date_column']).dt.year
df['month'] = pd.to_datetime(df['date_column']).dt.month
Text Feature Engineering:
 Tokenization, TF-IDF, word embeddings, etc.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['text_column'])

Feature Selection
Feature selection involves identifying and selecting the most relevant features to improve model
performance and reduce overfitting.
Common Techniques for Feature Selection
1. Filter Methods:
o Use statistical measures to select features.
o Examples: Correlation, Chi-Square, Mutual Information.
# Correlation-based feature selection
correlation_matrix = df.corr()
relevant_features = correlation_matrix['target'].abs().sort_values(ascending=False)
Wrapper Methods:
 Use a machine learning model to evaluate feature subsets.
 Examples: Recursive Feature Elimination (RFE).
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)
selected_features = X.columns[rfe.support_]
Embedded Methods:
 Features are selected as part of the model training process.
 Examples: Lasso Regression, Decision Trees.
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]
Dimensionality Reduction:
 Reduce the number of features while preserving information.
 Examples: PCA, t-SNE.
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
Feature Engineering and Selection Workflow
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('data.csv')

# Feature Engineering
# Handle missing values
df['column'].fillna(df['column'].mean(), inplace=True)

# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])

# Scaling
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])

# Feature Selection
X = df.drop('target', axis=1)
y = df['target']

# Select top 10 features using ANOVA F-test


selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features)
Feature Extraction and Engineering
Feature extraction and feature engineering are essential steps in preparing data for machine learning models.
While they are closely related, they serve slightly different purposes:
 Feature Engineering: Creating new features or transforming existing ones to better represent the
underlying problem.
 Feature Extraction: Reducing the dimensionality of data by extracting the most important
information from raw data.
Both processes aim to improve model performance, reduce overfitting, and make the data more interpretable.
Feature Extraction
Feature extraction is often used when dealing with high-dimensional data (e.g., images, text, or signals) to
reduce the number of features while retaining the most important information.
Common Techniques for Feature Extraction
Principal Component Analysis (PCA):
o Reduces dimensionality by projecting data onto orthogonal axes (principal components).
Linear Discriminant Analysis (LDA):
 Reduces dimensionality while preserving class separability (useful for supervised learning).

t-SNE (t-Distributed Stochastic Neighbor Embedding):


 Reduces dimensionality for visualization (not suitable for feature extraction in models).
Autoencoders:
 Neural networks used for unsupervised dimensionality reduction.
Text Feature Extraction:
 Convert text into numerical features using techniques like Bag of Words, TF-IDF, or word
embeddings.
Image Feature Extraction:
 Use pre-trained models (e.g., VGG, ResNet) or techniques like Histogram of Oriented Gradients
(HOG) to extract features from images.

Feature Engineering
Feature engineering involves creating new features or transforming existing ones to improve model
performance.
Common Techniques for Feature Engineering
Handling Missing Values:
o Fill missing values with mean, median, mode, or use advanced techniques like KNN
imputation.
Encoding Categorical Variables:
 One-Hot Encoding: Convert categorical variables into binary columns.
 Label Encoding: Convert categories into numerical labels.
Scaling and Normalization:
 Standardization: Scale features to have zero mean and unit variance.
 Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1).
Creating Interaction Features:
 Combine two or more features to create new ones.
Binning:
 Convert continuous variables into discrete bins.
Date/Time Feature Extraction:
 Extract useful information from date/time columns (e.g., day, month, year).
Polynomial Features:
 Create polynomial combinations of features.
Example: Feature Extraction and Engineering Workflow
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load data
df = pd.read_csv('data.csv')

# Feature Engineering
# Handle missing values
df['column'].fillna(df['column'].mean(), inplace=True)

# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])

# Scaling
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column']])

# Feature Extraction
X = df.drop('target', axis=1)
y = df['target']

# Apply PCA for dimensionality reduction


pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Display explained variance


print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Feature Engineering on Numeric Data, Categorical Data, Text Data, & Image Data
Feature engineering is the process of transforming raw data into meaningful features that improve the
performance of machine learning models. The techniques used depend on the type of data (numeric,
categorical, text, or image). Below is a detailed guide on feature engineering for each type of data:

Feature Engineering for Numeric Data


Numeric data is the most common type of data in machine learning. Feature engineering for numeric data
involves scaling, transforming, and creating new features.
Common Techniques
Scaling and Normalization:
o Standardization: Scale features to have zero mean and unit variance.
Log Transformation:
 Reduce skewness in data.
Binning:
 Convert continuous variables into discrete bins.
Polynomial Features:
 Create polynomial combinations of features.
Interaction Features:
 Combine two or more numeric features.
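Below is a minimal, illustrative sketch of a few of these numeric transformations using pandas and scikit-learn; the DataFrame and the column names 'income' and 'age' are hypothetical.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical numeric data
df = pd.DataFrame({'income': [25000, 54000, 81000, 120000],
                   'age': [23, 35, 44, 58]})

# Log transformation to reduce skewness
df['log_income'] = np.log1p(df['income'])

# Binning a continuous variable into discrete intervals
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60],
                         labels=['young', 'middle', 'senior'])

# Interaction feature combining two numeric columns
df['income_per_age'] = df['income'] / df['age']

# Polynomial features (degree 2) of 'income' and 'age'
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['income', 'age']])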

Feature Engineering for Categorical Data


Categorical data represents discrete values (e.g., gender, color). Feature engineering for categorical data
involves encoding and creating new features.
Common Techniques
One-Hot Encoding:
o Convert categorical variables into binary columns.
Label Encoding:
 Convert categories into numerical labels.
Target Encoding:
 Encode categories based on the target variable (mean of the target for each category).
Frequency Encoding:
 Encode categories based on their frequency in the dataset.
Creating Interaction Features:
 Combine categorical and numeric features.
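The sketch below illustrates target encoding and frequency encoding with plain pandas; the 'city' and 'churn' columns are hypothetical, and in practice the target means would be computed on the training split only to avoid leakage.
import pandas as pd

df = pd.DataFrame({'city': ['Hyderabad', 'Mumbai', 'Hyderabad', 'Delhi', 'Mumbai'],
                   'churn': [1, 0, 0, 1, 1]})

# Target encoding: replace each category with the mean of the target for that category
target_means = df.groupby('city')['churn'].mean()
df['city_target_enc'] = df['city'].map(target_means)

# Frequency encoding: replace each category with its relative frequency in the data
freq = df['city'].value_counts(normalize=True)
df['city_freq_enc'] = df['city'].map(freq)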

Feature Engineering for Text Data


Text data requires converting unstructured text into structured numerical features.
Common Techniques
Bag of Words (BoW):
o Represent text as a vector of word frequencies.
TF-IDF (Term Frequency-Inverse Document Frequency):
 Weigh words based on their importance in the document and corpus.
Word Embeddings:
 Use pre-trained models like Word2Vec, GloVe, or FastText to represent words as dense vectors.
Text Cleaning:
 Remove stopwords, punctuation, and perform stemming/lemmatization.
N-Grams:
 Capture sequences of words (e.g., bigrams, trigrams).
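A short sketch of Bag of Words and TF-IDF with n-grams using scikit-learn is given below; the sample documents are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great product and fast delivery",
        "poor quality and slow delivery",
        "great quality and great price"]

# Bag of Words with unigrams and bigrams, English stopwords removed
bow = CountVectorizer(ngram_range=(1, 2), stop_words='english')
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())

# TF-IDF weighting of the same vocabulary
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf_matrix = tfidf.fit_transform(docs)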
Feature Engineering for Image Data
Image data requires extracting meaningful features from pixel values.
Common Techniques
Resizing and Normalization:
o Resize images to a fixed size and normalize pixel values.
Feature Extraction Using Pre-trained Models:
 Use models like VGG, ResNet, or Inception to extract features.
Edge Detection:
 Use techniques like Canny edge detection to highlight edges.
Histogram of Oriented Gradients (HOG):
 Extract features based on gradient orientations.
Color Histograms:
 Extract color distribution features.
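The following is a minimal sketch of basic image feature engineering, assuming the Pillow library is installed and a local RGB image file named 'sample.jpg' (a hypothetical path) is available.
import numpy as np
from PIL import Image

# Load the (hypothetical) image, resize to a fixed size and normalize pixel values to [0, 1]
img = Image.open('sample.jpg').resize((64, 64))
pixels = np.array(img) / 255.0

# Flatten the pixels into a raw feature vector
pixel_features = pixels.flatten()

# Simple colour histogram features: 16 bins for each RGB channel
hist_features = np.concatenate(
    [np.histogram(pixels[:, :, c], bins=16, range=(0, 1))[0] for c in range(3)])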

Feature Scaling and Feature Selection are two important preprocessing techniques in machine learning and
data analysis. They play a crucial role in improving model performance, reducing computational complexity,
and ensuring better interpretability of the data.

Feature Scaling
Feature scaling is the process of normalizing or standardizing the range of independent variables (features)
in the dataset. This is particularly important for algorithms that are sensitive to the magnitude of the data,
such as distance-based algorithms or gradient descent-based optimization.
Why is Feature Scaling Important?
 Ensures that all features contribute equally to the model.
 Improves convergence speed for optimization algorithms (e.g., gradient descent).
 Prevents features with larger magnitudes from dominating those with smaller magnitudes.
Common Techniques for Feature Scaling
1. Normalization (Min-Max Scaling):
o Scales features to a fixed range, usually [0, 1].
o Formula: X_scaled = (X − X_min) / (X_max − X_min)
o Suitable for algorithms like neural networks and k-nearest neighbors (KNN).
2. Standardization (Z-score Normalization):
o Scales features to have a mean of 0 and a standard deviation of 1.
o Formula: X_scaled = (X − μ) / σ, where μ is the mean and σ is the standard
deviation.
o Suitable for algorithms like linear regression, logistic regression, and support vector machines
(SVM).
3. Robust Scaling:
o Uses the median and interquartile range (IQR) to scale features, making it robust to outliers.
o Formula: X_scaled = (X − median) / IQR
4. Max Abs Scaling:
o Scales each feature by its maximum absolute value.
o Suitable for sparse data.
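A small sketch comparing the four scalers available in scikit-learn on a toy column that contains an outlier; the numbers are made up.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

X = np.array([[10.0], [20.0], [30.0], [1000.0]])   # note the outlier

print(MinMaxScaler().fit_transform(X))    # scaled to the range [0, 1]
print(StandardScaler().fit_transform(X))  # zero mean, unit variance
print(RobustScaler().fit_transform(X))    # uses median and IQR, less affected by the outlier
print(MaxAbsScaler().fit_transform(X))    # divides each value by the maximum absolute value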
Feature Selection
Feature selection is the process of selecting a subset of relevant features (variables) to use in model
construction. It helps reduce overfitting, improve model interpretability, and decrease computational costs.
Why is Feature Selection Important?
 Reduces the dimensionality of the dataset, which can improve model performance.
 Removes irrelevant or redundant features, reducing noise.
 Speeds up training and inference times.
Common Techniques for Feature Selection
1. Filter Methods:
o Select features based on statistical measures (e.g., correlation, mutual information, chi-
square).
o Examples:
 Correlation coefficient for linear relationships.
 Mutual information for non-linear relationships.
 Chi-square test for categorical features.
2. Wrapper Methods:
o Use a machine learning model to evaluate the performance of subsets of features.
o Examples:
 Forward Selection: Start with no features and add one at a time.
 Backward Elimination: Start with all features and remove one at a time.
 Recursive Feature Elimination (RFE): Recursively removes the least important
features.
3. Embedded Methods:
o Perform feature selection as part of the model training process.
o Examples:
 Lasso (L1 regularization): Penalizes less important features by shrinking their
coefficients to zero.
 Ridge (L2 regularization): Reduces the impact of less important features but does not
eliminate them.
 Tree-based methods: Feature importance scores from decision trees, random forests,
or gradient boosting.
4. Dimensionality Reduction:
o Transform features into a lower-dimensional space.
o Examples:
 Principal Component Analysis (PCA): Reduces dimensions while preserving variance.
 Linear Discriminant Analysis (LDA): Reduces dimensions while preserving class
separability.
 t-SNE and UMAP: Non-linear dimensionality reduction for visualization.

Key Differences Between Feature Scaling and Feature Selection


 Purpose: Feature Scaling normalizes/standardizes feature values, while Feature Selection selects the most relevant features.
 Impact: Feature Scaling improves algorithm performance and speed, while Feature Selection reduces dimensionality and overfitting.
 Techniques: Feature Scaling uses normalization, standardization, and robust scaling; Feature Selection uses filter, wrapper, and embedded methods, and PCA.
 When to Use: Feature Scaling is required for distance-based or gradient-based algorithms; Feature Selection is useful for high-dimensional datasets.
UNIT III
Building Machine Learning Models:
In today’s rapidly evolving business landscape, data has become one of the most valuable
assets. Machine learning (ML), a subset of artificial intelligence (AI), is revolutionizing how
businesses derive insights, optimize processes, and enhance decision-making. Understanding
how to build and apply machine learning models is not only a technical skill but also a
strategic advantage.
Machine learning is the process of developing algorithms that enable computers to learn from
data and improve their performance over time without being explicitly programmed. Unlike
traditional programming, where rules are hard-coded, ML models identify patterns and
relationships in data, enabling them to make predictions or decisions.
In a business context, machine learning applications range from predictive analytics and
customer segmentation to supply chain optimization and fraud detection. For Management
students, understanding ML is vital for leveraging data-driven strategies to create value.
Steps to Build Machine Learning Models
Building an ML model involves a structured process. Here’s an overview of the key steps:

1. Define the Problem


The first step in any machine learning project is identifying the business problem. Clearly
define what you want to achieve—whether it's predicting customer churn, recommending
products, or optimizing pricing strategies. This step is crucial for aligning ML efforts with
business objectives.
Example: A retail business might aim to predict which customers are likely to leave
their loyalty program in the next six months.

2. Collect and Prepare Data


Data is the foundation of machine learning. Begin by gathering relevant data from various
sources such as databases, CRM systems, or third-party providers. The data must be cleaned
and pre-processed to handle missing values, remove outliers, and normalize formats.
Example: For customer churn prediction, you might collect data on purchase history,
demographics, and engagement metrics.

3. Select the Right Algorithm


Choosing the right machine learning algorithm depends on the problem type:
 Supervised Learning: Used for labelled data (e.g., predicting sales revenue).
 Unsupervised Learning: Used for unlabelled data (e.g., customer segmentation).
 Reinforcement Learning: Used for decision-making in dynamic environments (e.g.,
inventory management).
Example: Logistic regression or random forests can be used for binary classification
problems like churn prediction.

4. Train the Model


Split the dataset into training and testing sets. The training set is used to teach the model how
to identify patterns, while the testing set evaluates its performance. Fine-tuning parameters
and hyperparameters ensures the model performs optimally.
Example: Train a random forest model on customer data to predict churn probability.

5. Evaluate the Model


Key metrics such as accuracy, precision, recall, F1-score, and area under the curve (AUC) are
used to assess a model’s performance. Cross-validation is often employed to ensure the
model generalizes well to unseen data.
Example: A churn prediction model with high precision ensures most flagged customers are
likely to churn, reducing unnecessary retention costs.
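A minimal, illustrative sketch of steps 4 and 5 using scikit-learn is given below; it uses a synthetic dataset as a stand-in for real customer churn data, so the numbers themselves are not meaningful.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Synthetic stand-in for customer churn data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))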

6. Deploy and Monitor


Once the model is validated, it can be deployed into production systems. Continuous
monitoring ensures it remains accurate as data changes over time. Periodic retraining may be
required to adapt to new patterns.
Example: Deploy the churn model in a CRM system to trigger retention campaigns for at-risk
customers.

Applications of Machine Learning in Business:


1. Marketing and Sales
 Customer segmentation: Tailoring campaigns for different customer groups.
 Predictive analytics: Forecasting sales trends and customer behaviour.
 Recommendation systems: Suggesting products based on customer preferences.

2. Operations and Supply Chain


 Demand forecasting: Predicting inventory needs to optimize stock levels.
 Predictive maintenance: Identifying potential equipment failures before they occur.
 Route optimization: Enhancing logistics efficiency.

3. Finance
 Credit scoring: Evaluating creditworthiness of borrowers.
 Fraud detection: Identifying unusual patterns in transactions.
 Portfolio management: Automating investment strategies.
4. Human Resources
 Talent acquisition: Screening candidates using resume parsing and scoring algorithms.
 Employee retention: Predicting attrition risks and devising retention strategies.
 Performance evaluation: Analysing employee productivity metrics.

Challenges in Building ML Models


Despite its potential, machine learning is not without challenges:
1. Data Quality: Poor-quality data can lead to inaccurate models.
2. Interpretability: Complex models like deep learning can be difficult to explain to
stakeholders.
3. Ethical Concerns: Bias in data can lead to unfair outcomes.
4. Integration: Aligning ML models with existing business processes requires
coordination across teams.

Evaluating Machine Learning Models:


Machine learning (ML) has become a pivotal tool in modern business decision-making. It
empowers organizations to predict trends, automate processes, and gain competitive
advantages. For Management students aspiring to work in data-driven organizations,
understanding how to evaluate machine learning models is critical.
Importance of Evaluating Machine Learning Models
Evaluating machine learning models ensures that they are accurate, reliable, and suitable for
specific business objectives. Poorly evaluated models can lead to flawed predictions,
inefficient resource allocation, and misguided strategic decisions. In a business environment,
evaluating models is not only about technical performance but also about alignment with
organizational goals and ethical considerations.

Key Aspects of Model Evaluation


1. Understanding the Problem
Before diving into evaluation metrics, it is crucial to define the business problem clearly. For
example, a model predicting customer churn will have different evaluation criteria than one
used for fraud detection. Clarity on the problem ensures that the evaluation focuses on
relevant metrics and outcomes.

2. Choosing Appropriate Metrics


Metrics are essential for quantifying model performance. The choice of metrics depends on
the type of problem—classification, regression, clustering, etc. Below are some commonly
used metrics:

 Classification Metrics:
o Accuracy: Measures the percentage of correctly predicted instances but may
be misleading for imbalanced datasets.
o Precision and Recall: Precision focuses on the correctness of positive
predictions, while recall measures the model’s ability to capture all actual
positives. These are often combined into the F1-score for a balanced
assessment.
o ROC-AUC: Evaluates the trade-off between true positive and false positive
rates across different thresholds.

 Regression Metrics:
o Mean Absolute Error (MAE): Represents the average absolute difference
between predicted and actual values.
o Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): Highlight
larger errors more significantly, providing insights into model performance.
o R-squared: Indicates the proportion of variance in the target variable explained
by the model.
 Clustering Metrics:
o Silhouette Score: Measures the quality of clustering based on intra-cluster and
inter-cluster distances.
o Dunn Index: Evaluates cluster compactness and separation.

3. Cross-Validation
Cross-validation ensures that the model’s performance is consistent across different subsets
of the data. Techniques like k-fold cross-validation divide the dataset into k parts, training the
model on k-1 parts and testing it on the remaining part. This approach minimizes overfitting
and provides a more robust performance estimate.
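As a brief sketch, k-fold cross-validation can be carried out in scikit-learn with cross_val_score; the synthetic data below is only for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 5-fold cross-validation using ROC-AUC as the scoring metric
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='roc_auc')
print("Fold scores :", scores)
print("Mean ROC-AUC:", scores.mean())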

4. Bias-Variance Trade-off
Understanding the bias-variance trade-off is vital for model evaluation. High bias indicates
underfitting, where the model is too simple to capture the complexity of the data. High
variance suggests overfitting, where the model performs well on training data but poorly on
unseen data. Balancing these factors is key to building reliable models.
5. Explainability and Interpretability
In a business setting, it is crucial to understand how a model makes decisions. Techniques
like feature importance, SHAP (Shapley Additive Explanations), and LIME (Local
Interpretable Model-agnostic Explanations) help interpret model outputs. Explainability
builds trust with stakeholders and ensures compliance with ethical and regulatory standards.

6. Ethical Considerations
Management students must recognize the ethical implications of deploying machine learning
models. Models can perpetuate biases present in training data, leading to unfair or
discriminatory outcomes. Regular audits and fairness metrics, such as demographic parity or
equalized odds, help ensure ethical deployment.
Business Context in Model Evaluation
Evaluating machine learning models in a business context goes beyond technical metrics.
Management professionals should consider the following factors:
 Alignment with Business Goals: The chosen model should directly support
organizational objectives, such as improving customer retention, optimizing supply
chains, or increasing profitability.
 Cost-Benefit Analysis: Evaluate the financial implications of deploying the model.
For instance, in fraud detection, the cost of false negatives (missed fraud cases) may
outweigh the cost of false positives.
 Scalability and Deployment: Ensure that the model can handle increasing data
volumes and integrate seamlessly with existing systems.
 Stakeholder Buy-In: Clear communication of the model’s benefits and limitations to
non-technical stakeholders is essential for successful implementation.

Strategies for Continuous Improvement


Model evaluation is not a one-time process. Continuous monitoring and improvement are
necessary to maintain performance in dynamic business environments. Key strategies
include:
 Monitoring Model Drift: Regularly check for changes in data distribution that could
degrade model performance.
 A/B Testing: Compare different models or versions of a model to determine which
performs better in real-world scenarios.
 Feedback Loops: Incorporate user feedback to refine the model over time.
Evaluating machine learning models is a multifaceted process that combines technical rigor
with business acumen. For Decision Makers, understanding evaluation metrics, aligning
models with business goals, and considering ethical implications are critical skills. By
mastering these principles, future business leaders can harness the power of machine learning
to drive informed decision-making and sustainable growth.

Understanding Model Tuning:


Model tuning refers to the process of optimizing a machine learning model to improve its
performance on unseen data. This involves adjusting hyperparameters—the settings of the
model that are not learned during training but significantly affect the outcome. Proper tuning
ensures that the model generalizes well, avoiding both underfitting (poor learning) and
overfitting (memorizing instead of learning).
Why Model Tuning Matters for Businesses
In business contexts, the accuracy and reliability of ML models directly impact decision-
making. For example:
 Customer Segmentation: A poorly tuned clustering model might group customers
incorrectly, leading to ineffective marketing campaigns.
 Demand Forecasting: Inaccurate predictions can result in inventory issues, lost sales,
or wasted resources.
 Fraud Detection: An underperforming classification model might fail to flag
fraudulent transactions, exposing businesses to financial risk.
By tuning models effectively, businesses can mitigate these risks and enhance operational
efficiency.

Key Techniques for Model Tuning


Management Professionals do not need to master the technical implementation of these
techniques but should understand their strategic significance:

1. Grid Search:
o A systematic approach where all possible combinations of hyperparameters
are tested.
o Example: For a decision tree, hyperparameters like maximum depth and
minimum samples per leaf can be varied systematically.

2. Random Search:
o Instead of testing all combinations, a random subset of hyperparameter values
is selected.
o This method is often faster and can yield comparable results to grid search.

3. Bayesian Optimization:
o Uses probabilistic models to predict the best hyperparameters.

o Efficient for high-dimensional problems where traditional methods might be


computationally expensive.
4. Cross-Validation:
o A technique to evaluate model performance by splitting the dataset into
training and validation subsets multiple times.
o Ensures that the model’s performance is not dependent on a single data split.

5. Automated Tuning Tools:


o Tools like AutoML (e.g., Google AutoML, H2O.ai) automatically test and
tune models, making it easier for non-technical users to achieve optimal
performance.
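A compact sketch of grid search for the decision-tree example mentioned above, using scikit-learn's GridSearchCV on synthetic data; the candidate hyperparameter values are arbitrary choices for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Candidate hyperparameter values for the decision tree
param_grid = {'max_depth': [3, 5, 10, None],
              'min_samples_leaf': [1, 5, 10]}

# Grid search with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy    :", search.best_score_)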

Business Implications of Model Tuning:


1. Resource Allocation:
o Model tuning can be resource intensive. Businesses must balance the
computational cost of tuning with the expected benefits in model performance.
2. Interpretability vs. Performance:
o Highly tuned models like deep learning networks may lack interpretability.
Management professionals should consider whether stakeholders prioritize
understanding the model over its accuracy.

3. Ethical Considerations:
o Bias in data can be amplified during tuning. Business leaders must ensure that
tuning processes do not inadvertently propagate discrimination.

4. Scalability:
o Models tuned on small datasets may not perform well when scaled. Businesses
should simulate real-world scenarios to validate tuning outcomes.

Real-World Examples:
 Netflix: Netflix’s recommendation engine uses tuned ML models to suggest content,
directly driving user engagement and retention.
 Amazon: Tuning ML models for demand forecasting allows Amazon to optimize its
supply chain, reducing costs and improving delivery times.
 Healthcare: Hospitals use tuned ML models for patient outcome predictions,
enabling better resource planning and treatment strategies.

The Role of Management Professionals in Model Tuning


While Management students may not directly tune models, they play a critical role in:
 Communicating business goals to data science teams.
 Evaluating model performance against business KPIs.
 Making strategic decisions based on model outputs.

Interpreting Machine Learning Models:


As Management students, understanding how to interpret machine learning (ML) models can
help one make informed business decisions, optimize strategies, and gain insights into
complex data patterns. This note aims to provide a clear, high-level overview of ML model
interpretation and its relevance in a business context.
Interpreting machine learning models refers to understanding how a model makes its
predictions or decisions. It involves examining the relationship between input features (e.g.,
customer demographics, sales data) and the output (e.g., customer churn, sales forecast). This
understanding ensures that models align with business goals, regulatory requirements, and
ethical standards.
Key objectives of model interpretation:
 Explainability: Making the model's behaviour understandable to stakeholders.
 Transparency: Ensuring the decision-making process is clear and traceable.
 Trustworthiness: Building confidence in the model's predictions.

Why Is Model Interpretation Important for Business?


1. Enhanced Decision-Making: Businesses rely on ML models to make decisions such
as pricing, customer segmentation, and inventory management. Understanding these
models ensures decisions are logical and data driven.
2. Compliance with Regulations: Many industries (e.g., finance and healthcare) require
transparency in decision-making to comply with regulations like GDPR or CCPA.
3. Ethical Considerations: Uninterpretable models may lead to biases or unintended
discrimination. For instance, a loan approval model must ensure fairness across
demographics.
4. Stakeholder Communication: Non-technical stakeholders need simplified
explanations of complex models to make strategic decisions.

Techniques for Interpreting ML Models


ML models range from simple (e.g., linear regression) to complex (e.g., deep learning).
Interpretation methods depend on the model type.

A. Interpreting Simple Models


 Linear Regression: Coefficients indicate the impact of each feature on the target
variable. For example, a coefficient of 0.5 for "advertising spend" suggests a 0.5-unit
increase in sales for every unit increase in spend.
 Decision Trees: Visualizing decision paths helps understand how decisions are made
based on feature splits.

B. Interpreting Complex Models


Complex models (e.g., Random Forest, XGBoost, Neural Networks) require advanced
interpretation techniques:

1. Feature Importance:
o Identifies which features influence the model's predictions the most.

o Example: In a customer churn model, features like "subscription duration" or


"monthly spending" might rank high.

2. Partial Dependence Plots (PDPs):


o Show how a feature impacts the target variable while keeping others constant.

o Example: analysing how a price increase affects sales.

3. SHAP (SHapley Additive ExPlanations):


o A popular method to explain individual predictions by attributing the
contribution of each feature.
o Example: For a rejected loan application, SHAP might highlight "low income"
as the primary reason.

4. LIME (Local Interpretable Model-agnostic Explanations):


o Focuses on explaining individual predictions by creating a simple model
around the specific instance.
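A short sketch showing global feature importance from a tree ensemble, with a commented-out outline of how the optional shap package is typically used; the data and feature names are synthetic.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(6)])

model = RandomForestClassifier(random_state=0).fit(X, y)

# Global feature importance from the trained ensemble
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)

# SHAP values for individual predictions (requires installing the optional 'shap' package)
# import shap
# explainer = shap.TreeExplainer(model)
# shap_values = explainer.shap_values(X)
# shap.summary_plot(shap_values, X)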

Business Use Cases:


1. Marketing Campaigns:
o Understanding which factors (e.g., age, region, previous purchases) influence
customer response rates to design targeted campaigns.
2. Risk Management:
o In finance, models can predict credit risk, but interpretation ensures that
decisions are fair and aligned with regulatory standards.

3. Operations and Supply Chain:


o ML models forecasting demand can guide inventory planning, but
interpretation helps identify key drivers like seasonal trends or external
factors.
4. Customer Retention:
o Identifying factors contributing to customer churn helps businesses take
preventive actions.

Challenges in Model Interpretation


1. Trade-Off Between Accuracy and Interpretability:
o Simple models are easier to interpret but may lack predictive power compared
to complex models like neural networks.
2. Black-Box Models:
o Complex models often operate as "black boxes," making it harder to explain
their predictions without advanced tools.

3. Overfitting:
o Misinterpreted models may show patterns that are only relevant to training
data and not real-world scenarios.

4. Bias and Fairness:


o Poorly interpreted models may reinforce existing biases, leading to unethical
or unfair decisions.
Best Practices for Management Students:
1. Learn Core Concepts:
o Understand foundational ML concepts like regression, classification, and
clustering.

2. Focus on Business Context:


o Always relate model outputs to business goals and challenges.

3. Leverage Visualization Tools:


o Tools like Tableau, Power BI, and Python libraries (e.g., Matplotlib, Seaborn)
help make ML results more interpretable.

4. Collaborate with Data Teams:


o Work closely with data scientists to bridge the gap between technical insights
and business strategies.

5. Stay Updated on Tools:


o Familiarize yourself with interpretation tools like SHAP, LIME, and industry-
specific platforms.
Model Deployment:
Model deployment refers to the process of integrating a machine learning model into a
production environment where it can interact with other applications, make predictions, and
generate value. It’s the step where theoretical models move beyond experimentation and
begin solving real-world problems.

Importance of Deployment for Business:


 Operational Efficiency: Automates repetitive and complex decision-making tasks.
 Scalability: Provides real-time insights, allowing businesses to make timely
decisions.
 Personalization: Powers customer-centric solutions like recommendation systems,
dynamic pricing, and targeted marketing.
 Competitive Advantage: Enhances business innovation through predictive analytics
and AI-driven processes.

Steps in ML Model Deployment


A. Pre-deployment Steps
1. Understanding the Problem: Clearly define the business challenge the model will
address.
2. Data Preparation: Ensure the data used for training reflects real-world conditions.
3. Model Development: Build and train the ML model using frameworks like
TensorFlow, PyTorch, or Scikit-learn.
4. Validation: Test the model with unseen data to evaluate its performance.

B. Deployment Process
1. Selecting a Deployment Strategy:
o Batch Processing: Predictions are made on a schedule (e.g., generating daily
sales forecasts).
o Real-Time Processing: Predictions occur instantly (e.g., fraud detection in
online payments).

2. Choosing an Infrastructure:
o Cloud Platforms: AWS, Google Cloud Platform (GCP), and Microsoft Azure
provide scalable deployment environments.
o On-Premises: Useful for businesses with specific security or compliance
needs.
3. Packaging the Model: Convert the model into a deployable format using tools like
Docker for containerization.
4. API Development: Expose the model as a REST API or gRPC service so other
applications can interact with it.
5. Monitoring and Maintenance: Track model performance in production and retrain it
when necessary to prevent degradation.
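A minimal sketch of exposing a trained model as a REST API, assuming the Flask and joblib packages are installed and a trained model has already been saved to the hypothetical file 'model.pkl'.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load('model.pkl')    # load the trained model once at start-up

@app.route('/predict', methods=['POST'])
def predict():
    # Expects JSON of the form {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()['features']
    prediction = model.predict(features).tolist()
    return jsonify({'prediction': prediction})

if __name__ == '__main__':
    app.run(port=5000)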

Key Deployment Tools and Frameworks


 Model Management: MLflow, TensorFlow Serving
 Orchestration: Kubernetes, Docker
 Monitoring: Prometheus, Grafana
 Version Control: GitHub, DVC

Challenges in ML Model Deployment


 Scalability: Ensuring the model performs efficiently with increasing data and
requests.
 Data Drift: Adapting to changes in data patterns over time.
 Latency: Balancing prediction speed with accuracy.
 Regulation and Compliance: Ensuring the model adheres to data privacy laws like
GDPR.

Examples of Successful ML Deployments


1. Netflix: Uses ML to recommend personalized content, increasing user retention.
2. Amazon: Employs ML for dynamic pricing, supply chain optimization, and product
recommendations.
3. Uber: Leverages real-time ML models for route optimization and pricing strategies.

Role of Management Students:


 Cross-functional Collaboration: Deployment involves IT, data science, and business
teams working together.
 ROI-Focused Approach: Always evaluate the business value the deployed model
delivers.
 Continuous Improvement: Deployment is not the end; monitor and optimize the
model regularly.
 Ethical Considerations: Ensure the model doesn’t perpetuate bias or violate ethical
norms.

Exploratory Data Analysis (EDA) in Business Decision-Making:


Exploratory Data Analysis (EDA) is a critical step in data analysis that involves
understanding and summarizing datasets to uncover patterns, trends, and insights that inform
decision-making. In the context of business decisions, EDA acts as a bridge between raw data
and actionable insights, helping stakeholders understand the data at a granular level before
making key decisions.
Objectives of EDA in Business
1. Understand the Dataset:
o Identify the structure, dimensions, and types of variables in the dataset.

o Assess data quality by checking for missing values, outliers, or


inconsistencies.
2. Gain Insights into Patterns:
o Detect trends, relationships, and correlations between variables.

o Uncover hidden patterns that may reveal customer behaviour, market trends,
or operational inefficiencies.
3. Assess Assumptions and Hypotheses:
o Validate existing business hypotheses with preliminary insights.

o Challenge biases by letting the data tell its own story.

4. Prepare Data for Advanced Analysis:


o Transform data for predictive modelling or machine learning by normalizing
or encoding it.
o Identify and remove redundant or irrelevant variables to focus on meaningful
factors.
Steps in EDA for Business Applications
1. Data Collection and Cleaning:
o Collect data from internal systems, surveys, third-party vendors, or APIs.

o Clean the data by handling missing values, resolving inconsistencies, and


normalizing formats.

2. Descriptive Statistics:
o Calculate measures like mean, median, mode, variance, and standard deviation
to understand central tendencies and variability.
o Use frequency distributions to assess categorical data.

3. Visualization:
o Use histograms, bar charts, and boxplots to visualize distributions and detect
outliers.
o Create scatterplots and heatmaps to identify correlations or relationships
between variables.
o Employ time series plots for trend analysis in metrics like revenue or customer
acquisition.

4. Segment Analysis:
o Divide the data into meaningful segments, such as customer demographics or
product categories, to identify key differentiators.
o Analyse performance across these segments to find growth opportunities.
5. Correlation Analysis:
o Use correlation matrices to evaluate how variables are related, identifying key
drivers of success (e.g., customer satisfaction linked to sales).

6. Hypothesis Testing:
o Perform statistical tests like t-tests, chi-square tests, or ANOVA to validate
relationships or patterns.
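The sketch below strings a few of these steps together with pandas, Matplotlib and Seaborn; the file name 'sales_data.csv' and the 'revenue' column are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load a (hypothetical) sales dataset
df = pd.read_csv('sales_data.csv')

# Structure, data types and missing values
print(df.info())
print(df.isnull().sum())

# Descriptive statistics for numeric columns
print(df.describe())

# Distribution and outliers of a numeric column
sns.histplot(df['revenue'], bins=30)
plt.show()
sns.boxplot(x=df['revenue'])
plt.show()

# Correlation heatmap for numeric variables
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()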

Role of EDA in Business Decision-Making


1. Identifying Opportunities:
o Detect underserved customer segments or untapped markets.

o Identify high-performing products or services to scale.

2. Mitigating Risks:
o Spot potential issues, such as customer churn, through anomaly detection.

o Understand factors influencing declining revenue or rising operational costs.

3. Strategic Planning:
o Use insights to set measurable goals and benchmarks.

o Design targeted marketing campaigns based on data-driven customer


segmentation.

4. Improving Efficiency:
o Streamline operations by identifying bottlenecks or inefficiencies.

o Optimize pricing strategies through demand analysis.

5. Driving Innovation:
o Identify patterns that suggest emerging trends, enabling businesses to innovate
ahead of competitors.

Challenges in EDA for Business


1. Data Quality: Poor data quality, such as incomplete or biased data, can mislead
decision-making.
2. Complexity of Data: Handling large, complex datasets can be challenging without
proper tools or expertise.
3. Misinterpretation: Correlation does not imply causation; misinterpreting
relationships can lead to faulty strategies.
4. Time Constraints: Businesses often face pressure to make decisions quickly, limiting
the depth of analysis.
Tools and Techniques for EDA
 Software: Python, R, Excel, Tableau, Power BI, SAS.
 Libraries: Pandas, NumPy, Matplotlib, Seaborn, Plotly, Dplyr, ggplot2.
 Techniques: Data profiling, clustering, dimensionality reduction (e.g., PCA), and
feature selection.

Exploration of Data Using Visualization:


Data has become the cornerstone of modern business strategy. Organizations, irrespective of
their size and industry, rely on data-driven insights to gain a competitive edge. However, raw
data is often too complex or voluminous to analyse effectively. This is where data
visualization becomes a critical tool, transforming abstract data into visual formats like
charts, graphs, and dashboards. These visuals enable stakeholders to explore data
meaningfully, identify trends, uncover insights, and make informed decisions.
The Importance of Data Visualization in Business
Data visualization serves as a bridge between complex datasets and human understanding. In
today's fast-paced environment, businesses must process and analyse data rapidly to seize
opportunities and mitigate risks. Visualization simplifies this process by:
1. Condensing Complex Information: Raw data, particularly big data, is often
incomprehensible in its native form. Visualizations like scatter plots or heat maps
reduce complexity, presenting critical insights briefly.
2. Enhancing Pattern Recognition: Humans are naturally inclined to recognize
patterns and outliers in visual representations. For instance, a sales trend line can
reveal seasonal peaks, while a bar graph might highlight underperforming product
categories.
3. Facilitating Communication: Visuals are universal. They allow stakeholders from
diverse backgrounds to understand and interpret data insights. For instance, a CEO, a
data scientist, and a marketing manager can all comprehend the same visual
dashboard tailored to their needs.

Applications of Data Visualization in Business Decisions


Visualization tools play a pivotal role in various business functions, such as:
1. Strategic Planning: Heat maps and cluster analyses enable businesses to identify
profitable markets, untapped customer segments, or competitive threats. For example,
a company can use geo-mapped visuals to assess regional sales performance.
2. Operational Efficiency: Data dashboards can monitor real-time operations,
highlighting bottlenecks in production or supply chain disruptions. A manufacturing
firm might use this to optimize resource allocation.
3. Marketing Analytics: Marketing teams rely on visualizations to track campaign
performance, customer demographics, and behavioural trends. Funnel charts, for
instance, can visualize the customer journey from awareness to conversion.
4. Financial Decision-Making: Visual tools like pie charts and trend lines help in
budgeting, forecasting, and risk assessment. CFOs can use such tools to identify
overspending areas or predict future revenue streams.
5. Human Resource Management: HR departments use data visualization to analyse
workforce metrics such as employee turnover, performance, and engagement levels.

Benefits of Data Visualization in Decision-Making


1. Improved Decision Accuracy: By providing a clearer understanding of data,
visualization reduces the chances of errors or misinterpretations, leading to better-
informed decisions.
2. Faster Insights: Visual tools make it easier to identify actionable insights quickly,
which is essential in dynamic industries where time is of the essence.
3. Enhanced Collaboration: Data visualization promotes teamwork, enabling cross-
functional collaboration and ensuring alignment on business strategies.
4. Engagement and Understanding: Visualization captures attention and enhances
comprehension, making it easier for decision-makers to engage with data.

Challenges and Limitations


Despite its advantages, data visualization is not without challenges:
1. Misleading Visuals: Poorly designed visualizations—such as distorted scales or
selective data representation—can lead to incorrect conclusions.
2. Over-Reliance on Tools: While visualization tools are powerful, they are only as
effective as the data and analysis behind them. Businesses must ensure data quality
and integrity.
3. Interpretation Bias: Stakeholders may interpret the same visualization differently,
especially if there is no accompanying context or explanation.

Best Practices for Data Visualization in Business


To maximize the effectiveness of data visualization, businesses should adhere to the
following best practices:
1. Understand the Audience: Tailor visualizations to the needs of the audience. A
technical team might prefer detailed scatter plots, while executives might benefit from
high-level dashboards.
2. Simplify the Design: Avoid cluttered visuals. Use minimalistic designs that focus on
key data points.
3. Provide Context: Supplement visuals with brief explanations to ensure clarity and
reduce interpretation errors.
4. Leverage Interactive Dashboards: Interactive tools, such as Tableau or Power BI,
allow users to drill down into data and explore insights at various levels.
5. Iterate and Improve: Continuously update visualizations based on feedback and
evolving business needs.

Key Methods of Data Exploration Using Visualization


1. Univariate Analysis
Univariate analysis focuses on exploring a single variable to understand its
distribution and properties. Visualization methods for univariate analysis include:
o Histograms: Display the frequency distribution of a variable, revealing
skewness, modality, and outliers.
o Box Plots: Highlight the range, interquartile range, and potential outliers,
enabling a quick assessment of data variability.
o Bar Charts: Useful for categorical data, showing the frequency or proportion
of each category.
Example: A retailer analysing sales by product category can use bar charts to identify which
products contribute most to revenue.

2. Bivariate Analysis
Bivariate analysis examines the relationship between two variables, uncovering
correlations and dependencies. Common visualization methods include:
o Scatter Plots: Show relationships between two continuous variables, helping
to identify linear or non-linear correlations.
o Line Graphs: Ideal for exploring trends over time, such as monthly sales
figures.
o Heatmaps: Represent correlations or interactions between variables, often
used in market segmentation studies.
Example: A finance team can use scatter plots to explore the correlation between marketing
spend and revenue growth.

3. Multivariate Analysis
Multivariate analysis involves exploring relationships among three or more variables
simultaneously. Visualization methods include:
o Bubble Charts: Add a third dimension to scatter plots using bubble size to
represent another variable.
o Parallel Coordinate Plots: Visualize multi-dimensional data by plotting
variables as parallel axes, highlighting patterns across different dimensions.
o Clustered Heatmaps: Combine clustering algorithms with heatmaps to
segment data into meaningful groups.
Example: A telecom company analysing customer churn can use clustered heatmaps to group
customers based on demographics and usage patterns.

4. Geospatial Visualization
Geospatial visualization is critical for businesses that deal with location-based data.
Common methods include:
o Choropleth Maps: Display data intensity across geographic regions, such as
sales density by state.
o Point Maps: Plot individual data points on a map, useful for tracking store
locations or delivery routes.
o Density Maps: Highlight areas with high concentrations of activity, such as
customer demand in urban regions.
Example: A logistics company can use density maps to optimize delivery routes in high-
demand areas.
5. Time-Series Analysis
Time-series analysis focuses on exploring temporal data to identify trends,
seasonality, and anomalies. Key visualization techniques include:
o Line Charts: Depict trends over time, ideal for tracking performance metrics
like revenue or customer growth.
o Seasonal Decomposition Plots: Break down time-series data into trend,
seasonal, and residual components.
o Area Charts: Emphasize cumulative trends over time, useful for visualizing
market share growth.
Example: An e-commerce platform can analyse seasonal variations in sales using line charts
to plan inventory and marketing strategies.
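A brief Matplotlib sketch of a scatter plot (bivariate), a bubble chart (multivariate) and a time-series line chart, using made-up monthly figures for illustration.
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly figures
months = np.arange(1, 13)
marketing_spend = np.array([10, 12, 15, 14, 18, 20, 22, 21, 25, 27, 30, 32])
revenue = marketing_spend * 4 + np.random.normal(0, 3, 12)
customers = np.array([120, 130, 150, 145, 170, 185, 200, 195, 220, 230, 250, 260])

# Bivariate: scatter plot of marketing spend vs revenue
plt.scatter(marketing_spend, revenue)
plt.xlabel('Marketing spend')
plt.ylabel('Revenue')
plt.title('Spend vs Revenue')
plt.show()

# Multivariate: bubble chart, bubble size proportional to number of customers
plt.scatter(marketing_spend, revenue, s=customers)
plt.title('Spend vs Revenue (bubble size = customers)')
plt.show()

# Time series: revenue trend over the year
plt.plot(months, revenue, marker='o')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Monthly revenue trend')
plt.show()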

6. Interactive Dashboards
Interactive dashboards integrate multiple visualization types into a single interface,
allowing users to explore data dynamically. Features such as drill-downs, filters,
and real-time updates enable deeper analysis.
o Tools like Tableau, Power BI, and Google Data Studio are widely used for
creating interactive dashboards.
o Dashboards allow users to customize views based on specific business
questions, fostering collaborative decision-making.
Example: A sales manager can use an interactive dashboard to monitor regional performance
and adjust strategies accordingly.
1. Learning Outcomes:

After completing this module, the students will be able to:

 Determine whether a regression experiment would be useful in a given instance


 Develop a simple linear regression model
 Understand the assumption of simple linear regression model
 Fit regression equations
 Compute a prediction interval for the dependent variable
 Check significance of regression coefficients using t-test
 Test the goodness of fit our model

2. Introduction to Simple Linear Regression


After having established the fact that two variables are strongly correlated with each other,
one may be interested in predicting the value of one variable with the help of the given value
of another variable. For example, if we know that yield of wheat and amount of rainfall are
closely related to each other, we can estimate the amount of rainfall to achieve a particular
wheat production level. This estimation becomes possible because of regression analysis that
reveals average relationship between the variables.
The term "Regression" was first used by Sir Francis Galton in 1877 while studying the
relationship between the heights of fathers and sons. The dictionary meaning of regression is
the act of returning to the average. According to Morris Hamburg, regression analysis refers
to the methods by which estimates are made of the values of a variable from a knowledge of
the values of one or more other variables, and to the measurement of the errors involved in
this estimation process. Ya Lun Chou elaborates further, adding that regression analysis
basically attempts to establish the nature of the relationship between the variables and
thereby provides a mechanism for prediction/estimation.

In regression analysis, we basically attempt to predict the value of one variable from known
values of another variable. The variable that is used to estimate the variable of interest is
known as “independent variable” or “explanatory variable” and the variable which we are
trying to predict is termed as “dependent variable” or “explained variable”. Usually,
dependent variable is denoted by Y and independent variable as X.

It may be noted here that the terms 'dependent' and 'independent' refer to the mathematical or
functional meaning of dependence; i.e., they do not imply that there is necessarily any cause
and effect relationship between the variables. It simply means that estimates of values of the
dependent variables Y may be obtained for given values of independent variable X from a
mathematical function involving X and Y. In that sense, values of Y are dependent upon the
values of X. The variable X may or may not cause the variation in the variable Y. For
instance, while estimating the demand for an FMCG product from figures on sales promotion
expenditure, demand is generally considered the dependent variable. However, there may or
may not be a causal relationship between these two variables in the sense that changes in
sales promotion cause changes in demand. In fact, in a few cases, the cause-effect relationship
may be just the opposite of what appears obvious.

The model is termed 'simple' because there is only one independent variable, and 'linear'
because it assumes a linear relationship between the dependent and independent variables.
This means that the average relationship may be expressed by the straight-line equation
Y = a + bX, which is also called the 'regression line'. In this regression line, Y is the dependent
variable, X the independent variable, and a & b are constants. It may be viewed in the following figure.

a = Intercept of regression line on Y-axis

b = Slope of regression line or regression coefficient of Y on X

(Figure: the regression line Y = a + bX)

It is easy to understand from the above regression equation that for every unit increase in the
independent variable X, the value of the dependent variable Y increases by 'b', the slope of
the equation.

For instance, let regression equation be:

Y = 2 + 3X
When X = 0; Y = 2
When X = 1; Y = 5

When X = 2; Y = 8

When X = 3; Y = 11

Thus, it is apparent from the above that as the value of X increases by 1, the value of Y
increases by 3, which is equal to the slope of the given equation.

Applications of Regression analysis

Regression analysis is a specialized branch of statistical theory and is of immense use in
almost all scientific disciplines. It is particularly useful in economics, where it is
predominantly used to estimate the relationships among the various economic variables that
shape economic life. Its applications extend to almost all the natural, physical and social
sciences. It attempts to accomplish mainly the following:

(1) With the help of regression line that signifies an average relationship between two
variables, it may predict the unknown value of dependent variable for the given values of
independent variable;
(2) The regression line or prediction line used for estimation never gives one hundred percent
correct estimates for the given values of the independent variable. There is always some
difference between the actual value and the estimated value of the dependent variable,
which is known as the 'estimation error'. Regression analysis also computes this error as
the standard error of estimation and thereby reveals the accuracy of prediction. The
amount of error depends upon the spread of the scatter diagramme, which is prepared by
plotting the actual observations of the Y and X variables on a graph. In the following
figures, one can easily note that the estimation error is larger when the actual observations
are more spread out and smaller when they are less spread out.
(Figures: more spread of observations, more error of prediction; less spread of observations, less error of prediction)

(3) Regression analysis also depicts the relationship or association between the variables (i.e.
r, Pearson's coefficient of correlation). The square of r is called the Coefficient of
Determination (r²), which assesses the proportion of variance in the dependent variable that
has been accounted for by the regression equation. In general, the greater the value of r², the
better the fit and the more useful the regression equation as a predictive instrument.

Coefficient of Determination (r²) = Explained Variation / Total Variation


For example, if the value of r² = 0.70, i.e. r² = 70/100, it may be interpreted as follows: of the
total variation (100 percent), 70 percent of the variation in Y is explained by X in the
suggested regression equation/model.
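As an illustrative sketch, a simple linear regression and its coefficient of determination can be computed with scikit-learn on a small made-up data set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data: X = sales promotion expenditure, Y = demand
X = np.array([[1], [2], [3], [4], [5], [6]])
Y = np.array([4.8, 8.1, 10.9, 14.2, 16.8, 20.1])

model = LinearRegression().fit(X, Y)
print("Intercept a:", model.intercept_)
print("Slope b    :", model.coef_[0])

# Coefficient of determination (r-squared)
Y_hat = model.predict(X)
print("r-squared  :", r2_score(Y, Y_hat))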

3. Developing a Simple Linear Regression Model

A statistical model is a set of mathematical formulas and assumptions related to a real-world
situation. We wish to develop our simple regression model in such a way that it explains the
process underlying our data as far as possible. Since it is almost impossible for any model to
explain everything, owing to the inherent uncertainty in the real world, there will always be
some remaining errors, which arise from the many unknown outside factors affecting the
process generating our data.

A good statistical model is parsimonious: it uses a few mathematical terms that explain the
real situation as well as possible. The model attempts to capture the systematic behaviour of
our data set and leaves out the factors that are nonsystematic and cannot be predicted/
estimated. The following figure illustrates a well-defined statistical model.

Data = Systematic component (the statistical model) + Random errors
The errors (ε); also termed as residuals are that part of our data generation that cannot be
estimated by the model because it is not systematic. Hence, errors (ε) constitute a random
component in the model. It is easy to understand that a good statistical model splits our data
process into two components namely; systematic component which is well explained by a
mathematical term contained in the model, and a pure random component, the source of
which is absolutely unknown and therefore, it cannot be estimated by the model. The
effectiveness of a statistical model depends upon the amount of error associated with it. There
is an inverse relationship between the effectiveness of model and the amount of error that
means less the error, more effective the model is and more the error, less effective the model
would be.

4. Assumptions of Simple Linear Regression Model:


For our model to be effective, we need to ensure that the random component of our model,
i.e. the errors, is as small as possible. Since this random component arises from factors that
are absolutely undetectable, it is very difficult to deal with these errors directly. Here, we
work with two important assumptions. Firstly, the random errors are normally distributed,
i.e. the errors have zero mean and constant variance.

ε N(0, σ2)
In the figure below, one may observe that all errors are identically distributed, all centred on
regression line. Therefore, the mean of error distribution and variance are always equal to
zero and constant respectively.

Normal distributions of errors,


all centred on the regression
line.

Mean = 0
Variance = constant
Secondly, errors/residuals are independent/ uncorrelated of one another. It can be ensured
by making a scatter diagramme for these errors. If errors depict any pattern, it means they are
correlated with one another otherwise, independent of each other. It goes without saying that
in the following figure, errors are absolutely randomly distributed i.e. they are uncorrelated
with each other.

Now, under above two assumptions, we may attempt to develop a model to illustrate our data
or actual situation. We may propose a simple linear regression model for explaining the
relationship between two variables and estimate the parameters of model from our sample
data. After fitting the model to our data set, we consider the errors/residuals that are not
explained by the model. Having obtained the random component (errors), we analyze it to
determine whether it contains any systematic component or not. If it does, we may re-
evaluate our proposed model and, if feasible, adjust it to include the systematic component
found in the errors/residuals.

Otherwise, we may have to reject the model and try a new one. Here, it is important to note
that the random component (errors/residuals) must not contain any systematic component of
our data set; it should be purely random. Only then can we use the model for explaining the
relationship between the variables, for prediction, and for controlling a variable.

Proposed Simple Linear regression Model


It is easy to understand from the above discussion that in simple linear regression, we model
the relationship between two variables X and Y as a straight line. The equation of a straight
line is Y = a + b X, where a = intercept of the line on the Y-axis and b = slope of the line.
Therefore, our model comprises two parameters (constants). Since it is not possible for any
model to explain a given data set completely, due to the uncertainty prevailing in the real
world, we insert an error term ε. The proposed simple linear regression model for the
population is as below:

Y = α + β X + ε ………………….(i) (probabilistic model)

In the above proposed model, Y is the dependent variable which we wish to explain/predict
for given values of X, the independent variable/predictor. α and β are the population model
parameters, where α is the population equivalent of the sample intercept ‘a’. Similarly, β is
the parameter analogous to ‘b’, the slope of the sample regression line. In our model, ε is the
error term that curtails the predictive strength of our proposed model.

A careful insight reveals that above model contains two components. First, systematic
(nonrandom) component which is line (Y = α + β X) itself and secondly, pure random
component that is error term ε. The systematic component is the equation for the mean of Y,
given X. We may represent the conditional mean of Y, given X, by E(Y) as below:

E(Y) = α + β X...................................(ii)

By comparing equations (i) and (ii), we can notice that each value of Y comprises the average
Y for the given value of X (this is a straight line), plus a random error. As X increases, the
average value of Y also increases, assuming a positive slope of the line (or decreases, if the
slope is negative). The actual value of Y is equal to the average Y conditional on X, plus a
random error ε. Thus, we have, for a given value of X,

Y = Average Y for given X + Error


Note that the values of Xi are considered to be fixed (not random); the only randomness in
the values of Y is caused by the error term ε.

This model may be used only if the true straight line relationship exists between variables X
and Y. If it is not straight line, then we need to use some other models.

Until now, we have described the population model which is assumed true based on the
relationship between X and Y. Now we wish to have an idea about unknown relationship in
population and estimate it from the sample information. For this, we take a random sample of
observations on the two variables X and Y, and then compute the parameters a and b of the
sample regression line, which are analogous to the population parameters α and β. This is
done with the help of the method of least squares, discussed in the next section.

5. Estimation: Finding Out Parameters of Proposed Model

In the above section we have described a simple linear regression model. Now we wish to
compute estimates of the parameters (α and β) of the proposed model so as to estimate the
value of Y for given values of X. For our model to be effective, we wish to keep the random
component of the model at a minimum. For this, we use the ‘method of least squares’, which
fits the line so that the sum of squared differences between the actual and estimated values of
Y is minimized. Under the standard assumptions, the least squares estimators are the best
linear unbiased estimators (BLUE) of the model parameters.

Now we will use Ŷ to show the individual values of the estimate points which lie on
estimation line for a given X. The best fitted estimation line will be as follows:

Ŷi = a + b X i

Where, i= 1,2,3,…………..n; are actual observations. Here, Ŷ1 is the first fitted (estimated)

value corresponding to X1 which is the value of Y1 without error ε1, and so on for all i=
1,2,3…….n. If we do not know the actual value of Y, this is the fitted value which we will
predict from the estimated regression line for given X. Thus, ε 1 will be the first residual, the
distance from the first observation to the fitted regression line; ε 2 is second one and so on to

εn the nth error. The total error εi is taken as estimates of the true population error.

Thus,
Error (Ʃ εi) = Ʃ (Y – Ŷ)

It is noteworthy that summing the individual differences to compute the total error is not a
reliable way to judge the goodness of fit of an estimation line, because positive and negative
values cancel each other out. Similarly, adding absolute values does not give a clear
impression of the goodness of model fit, as the sum of absolute values does not stress the
magnitude of the errors. Therefore, we prefer to take the sum of squares of the individual
errors.

Error Sum of square (SSE) = Ʃ (Y – Ŷ)2.....................(Unexplained variations)

Method of least squares minimizes the SSE (random component). This suggests two linear
equations to compute two parameters of the model as followings:

Ʃ Y = n·a + b·Ʃ X ……………(1)

Ʃ XY = a·Ʃ X + b·Ʃ X² ……………(2)
By solving both above equations, we can find out the values of a & b and it is done in such a
way that SSE between the actual value and estimated value of Y is always minimum.

a = Ȳ − b X̄

b = (Ʃ XY − n X̄ Ȳ) / (Ʃ X² − n X̄²)

Where
a = Y-intercept
b = Slope of the best fitting estimation line
Ȳ = Mean of the values of Y
X̄ = Mean of the values of X
n = Number of pairs of values for the variables X and Y

The values of ‘a’ and ‘b’, calculated as above, serve as the sample estimates of the population
parameters α and β respectively.
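As a small illustration of these formulas in Python (using NumPy, and previewing the hostility-test data from the worked case later in this material), the parameters a and b can be computed directly:

    import numpy as np

    def fit_simple_regression(x, y):
        # Least-squares estimates using the formulas above:
        # b = (ΣXY - n·X̄·Ȳ) / (ΣX² - n·X̄²),  a = Ȳ - b·X̄
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        n = len(x)
        b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
        a = y.mean() - b * x.mean()
        return a, b

    # Hostility-test data from the case discussed later in this material
    a, b = fit_simple_regression([5, 10, 10, 15, 15, 20, 20, 25],
                                 [58, 41, 45, 27, 26, 12, 16, 3])
    print(f"Estimation line: Y-hat = {a:.1f} + ({b:.1f}) X")   # Y-hat = 70.5 + (-2.8) X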

Significance of ‘a’ and ‘b’

Here, we also need to measure the significance of the parameters a & b so that we can take a
decision whether they should be retained in proposed model or not. For that, we may consider
the null and alternative hypothesis as following:

H0: a = 0
Ha: a ≠ 0

This means if value of ‘a’ is significantly different (higher or lower) from zero, we may reject
the null hypothesis. This implies that value of ‘a’ is significant and it must be included in
estimation line. This is done using t-test basically.

If the value of ‘a’ is found to be significant, then at the 5% level of significance we can say
that the value of the equivalent population parameter α would lie in the range a ± t × S.E.(a).

Similarly, for ‘b’ we may take the hypothesis as:


H0: b = 0
Ha: b ≠ 0

This means if the value of ‘b’ is significantly different (higher or lower) from zero, we may
reject the null hypothesis. This implies that the value of ‘b’ is significant and that there is a
linear relationship between X and Y. Therefore, it must be included in the estimation line.

If the value of ‘b’ is found to be significant, then at the 5% level of significance we can say
that the value of the equivalent population parameter β would lie in the range b ± t × S.E.(b).

6. Checking Developed Estimating Equation

After having calculated the values of parameters a & b from the aforementioned method, we
may claim that we have obtained our estimation equation that is supposed to be the best fit
line. We can check it in a very simple way. As per the following figure, we may plot the
actual sample observations and the fitted estimation line (Ŷi = a + b Xi) on a graph.

According to one of the mathematical properties of a line fitted by method of least squares;
the sum of individual positive and negative errors must equal to zero.

Ʃ Error = Ʃ d = Ʃ (Y - Ŷ) = zero

Using above information, we may check the sum of individual errors in our case and if it is
equal to zero, it implies that we have not committed any serious mathematical mistake in
finding out the estimation line and this is the best fit line.

The Standard Error of Estimation:


The next stage in building estimation line (model) is to assess its reliability. By the discussion
made so far we are able to develop an understanding that a regression line is more accurate as
an estimator when actual observations lie close to this line. Here, ‘standard error of estimate’
is a tool that is usually used to check the reliability of model. The term standard error is
actually similar to standard deviation that is used to measure the variability or dispersion of
given observations about the mean of given data set.

By analogy, the standard error of estimate (Se) measures the dispersion/variation of the
actual observations around the regression line. This can be computed as below:

Mean Square Error (MSE) = Error Sum of Squares (SSE) / (n − 2)

It is interesting to note in the above formula that the sum of the squared deviations is divided
by (n − 2) instead of n. As the values of the parameters a & b are computed from the actual
data points, and the same points are then used in estimating the line, we actually lose 2
degrees of freedom.

Standard Error of Estimate (Se) = √MSE

Se = √[ Ʃ(Y − Ŷ)² / (n − 2) ]

Where

Y = Value of the dependent variable Y

Ŷ = Estimated value of Y from the estimating line

n = Number of actual data points (observations)

We may also calculate Se using one more formula mentioned as below:

Se = √[ (ƩY² − aƩY − bƩXY) / (n − 2) ]

Where

X = actual given values of X

Y = actual given values of Y

n= Number of observations/data points
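A short Python sketch of this calculation (assuming NumPy is available); it simply implements the first Se formula above:

    import numpy as np

    def standard_error_of_estimate(y, y_hat):
        # Se = sqrt( Σ(Y - Ŷ)² / (n - 2) ) for a simple linear regression
        y = np.asarray(y, dtype=float)
        y_hat = np.asarray(y_hat, dtype=float)
        sse = np.sum((y - y_hat) ** 2)       # unexplained (error) sum of squares
        return np.sqrt(sse / (len(y) - 2))   # two degrees of freedom lost for a and b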


It is obvious from the above discussion that the larger the standard error of estimate, the
greater the dispersion of data points around the estimation line. On the other hand, if the error
is zero, the estimation line is a perfect estimator of Y and all data points will essentially lie on
the line.

If we assume that all observations are normally distributed around the line, the observations
will lie in the following pattern:

± 1 Se = about 68% of data points
± 2 Se = about 95% of data points
± 3 Se = about 99.7% of data points

Here it is important to note again two important assumptions:


1. The observed values for Y are normally distributed around each estimated value of Y
2. The dispersion of distribution around each estimated value is the same.

It is easy to understand that if second assumption is not true, then the standard error at one
point on line could differ from the standard error at another point on line.

Finding Out Approximate Estimation Interval:

The standard error calculated as above is a good instrument for finding a prediction interval
around an estimated value Ŷ, within which the actual value of Y is likely to lie. For example,
we may be about 68% confident that the actual value of Y lies within ± 1 Se of the estimated
line. Since the prediction interval is based on a normal distribution of data points, a larger
sample size (n ≥ 30) is required; for small samples we cannot get accurate intervals. One may
keep in mind that these prediction intervals are only approximate.
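As a rough sketch of this idea in Python (the multiplier of 2 corresponds to the approximate 95% band described above; the numbers plugged in preview the worked case that follows):

    def approximate_interval(y_hat, se, multiplier=2):
        # Approximate prediction interval: Y-hat ± multiplier × Se
        return y_hat - multiplier * se, y_hat + multiplier * se

    # Example: estimated score 20.1 with Se ≈ 2.38 (values from the case below)
    low, high = approximate_interval(20.1, 2.38)
    print(f"Approximate 95% interval: {low:.1f} to {high:.1f}")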

7. Summary:

Regression analysis is an important technique for developing a model that estimates the value
of a dependent variable from at least one independent variable. In simple linear regression
analysis, there is one dependent and one independent variable. The variable whose value is to
be estimated is termed the dependent variable, and the variable which is used for prediction is
called the independent variable. The analysis expresses an average relationship between the
two variables; it assumes a linear relationship and is expressed as a straight line equation. To
test the goodness of fit of our proposed model, we may look at the following:

(a) Make a scatter plot of the given observations and fit the regression line. If the observed
data points are highly scattered around the line, the model may not fit well.
(b) Coefficient of Determination (r²): if it is greater than 60%, the model may be considered a
good fit.
(c) Check the significance of the parameters a & b using t-tests. If a parameter is found to be
not significant (that is, it does not differ from zero), it should not be included in the model.
(d) We may plot the residuals/ errors against the independent variable to check the
assumptions of independence of error and constant variation around the regression line.
(e) While running our regression model on computers, computer printouts usually contain an
analysis of variance (ANOVA) table with an F test of the regression model. This table
contains information about:

Error Sum of Squares (SSE) = Ʃ (Y − Ŷ)² ……………… unexplained variation

Mean Square Error (MSE) = SSE / (n − 2)        (n − 2 = degrees of freedom)

SSE is the component of the total variation that cannot be explained by our regression model;
it is therefore usually termed the unexplained variation.

Regression Sum of Squares (SSR) = Ʃ (Ŷ − Ȳ)² ……………… explained variation

Mean Square due to Regression (MSR) = SSR / 1        (1 = degree of freedom)

SSR is the component of the total variation that can be explained by our regression model;
it is therefore usually termed the explained variation.

Information pertaining to above errors is used to calculate F statistics.

F = MSR / MSE = (Mean square of explained variation) / (Mean square of unexplained variation)

If the value of F is found to be significant, it means variables X and Y are linearly related and
our model is good fit.

Thus, by examining our regression model on above parameters, we may assess the goodness
of fit for our model.

*******

Case: Mr. Atal Sharma, a psychologist for Hero group, has designed a test to show the
danger of over supervising the workers by the superiors. He selects eight workers from the
assembly line who are given a series of complicated tasks to perform. During their
performance, they are continuously interrupted by their superiors assisting them to complete
the work. Upon completion of work, all workers are given a test designed to measure the
worker’s hostility towards the superior (a high score corresponds to low hostility). Their corresponding
scores on the hostility are given as below:

Worker’s score on hostility test (Y) 58 41 45 27 26 12 16 3


No of times he was interrupted (X) 5 10 10 15 15 20 20 25

(a) Plot this data set


(b) Propose a regression model that best describes the relationship between the number of
times interrupted and the test score
(c) Predict the expected test score if the worker is interrupted 18 times.
Solution:

In the problem, worker’s score is dependent variable Y whose value is to be estimated and
number of times he was interrupted is independent variable X. Now to propose an estimation
line, first we plot the given data points using MS Excel as below:
[Figure: scatter plot of test score (Y) against number of times interrupted (X), with a fitted
linear trend line]

From the scatter diagram above, we can easily observe that there is a linear relationship
between the two variables, so we may propose an estimation line of the form:

Ŷ = a + b X

Now we calculate the parameters of the sample regression line as below:

b = (Ʃ XY − n X̄ Ȳ) / (Ʃ X² − n X̄²)        a = Ȳ − b X̄

  Y     X    (Y−Ȳ)    (Y−Ȳ)²     Ŷ     (Ŷ−Ȳ)   (Ŷ−Ȳ)²    XY     X²
 58     5    29.5     870.25   56.5    28      784      290     25
 41    10    12.5     156.25   42.5    14      196      410    100
 45    10    16.5     272.25   42.5    14      196      450    100
 27    15    −1.5       2.25   28.5     0        0      405    225
 26    15    −2.5       6.25   28.5     0        0      390    225
 12    20   −16.5     272.25   14.5   −14      196      240    400
 16    20   −12.5     156.25   14.5   −14      196      320    400
  3    25   −25.5     650.25    0.5   −28      784       75    625
             Totals:  SST = 2386      SSR = 2352   ƩXY = 2580   ƩX² = 2100

X̄ = 15; Ȳ = 28.5; ƩXY = 2580; ƩX² = 2100

SST = Sum of squared deviations (total variations) = 2386

SSR = Sum of Regression variations (explained variations) = 2352


b = (2580 − 8 × 15 × 28.5) / (2100 − 8 × 15²)
b = −840 / 300 = −2.8        (the sample estimate of β)

a = 28.5 − (−2.8) × 15 = 70.5        (the sample estimate of α)

Therefore, our estimation line would be:

Ŷ = 70.5 – 2.8 X

Now, we calculate the Coefficient of Determination (r2) to test the goodness of fit estimation
line:

r2 = Explained variations (SSR) / total variations (SST)

= 2352/ 2386 = 98.57%

The value of r2 is very high. Therefore, our estimation line seems good fit

Now, we see the significance of the parameters of our estimation line. For this, we can
observe in the MS Excel output shown as hereunder:

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.9928
R Square 0.9858
Adjusted R
Square 0.9834
Standard Error 2.3805
Observations 8

ANOVA
Df SS MS F Significance F
Regression 1 2352 2352.0000 415.0588 0.0000
Residual 6 34 5.6667
Total 7 2386

            Coefficients   Standard Error   t Stat      P-value
Intercept   70.5           2.2267           31.66075    0.0000
X           -2.8           0.1374           -20.373     0.0000

We may see in the above summary output

Value of a (intercept) = 70.5 (and for this p ˂ 0.05)


Value of b (slope) = - 2.8 (and for this p ˂ 0.05)

Thus, both values are significant and this shows that our model is good fit.

Next, in ANOVA table, value of F = 415.0588 (p ˂ 0.05) is also found very significant. This
shows that both the variables are strongly linearly related. This also shows that our model is
good fit.

Standard Error of Estimation equals to 2.3805.

Now we can predict the expected test score if the worker is interrupted 18 times, using our
estimation line as follows:
Test Score = 70.5 – 2.8 * (number of times interrupted)
Test score = 70.5 – 2.8* (18)

= 70.5 – 50.4 = 20.1 (Answer)
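The same analysis can be reproduced in Python with the statsmodels package (a sketch, assuming statsmodels is installed); its OLS summary reports the same R², ANOVA F statistic, coefficients, t statistics, and p-values as the Excel output above:

    import numpy as np
    import statsmodels.api as sm

    # Case data: number of interruptions (X) and hostility-test score (Y)
    X = np.array([5, 10, 10, 15, 15, 20, 20, 25], dtype=float)
    Y = np.array([58, 41, 45, 27, 26, 12, 16, 3], dtype=float)

    # Fit the simple linear regression Y = a + b·X by ordinary least squares
    model = sm.OLS(Y, sm.add_constant(X)).fit()
    print(model.summary())              # R², F, coefficients, standard errors, p-values

    # Predicted test score for a worker interrupted 18 times (constant term first)
    print(model.predict([[1, 18]]))     # about 20.1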


CHAPTER 29

Multiple Regression

WHO: 250 male subjects    WHAT: Body fat and waist size    UNITS: %Body fat and inches
WHEN: 1990s    WHERE: United States    WHY: Scientific research

In Chapter 27 we tried to predict the percent body fat of male subjects from their waist size,
and we did pretty well. The R² of 67.8% says that we accounted for almost 68% of the
variability in %body fat by knowing only the waist size. We completed the analysis by
performing hypothesis tests on the coefficients and looking at the residuals.
But that remaining 32% of the variance has been bugging us. Couldn't we do a better job of
accounting for %body fat if we weren't limited to a single predictor? In the full data set there
were 15 other measurements on the 250 men. We might be able to use other predictor
variables to help us account for that leftover variation that wasn't accounted for by waist size.
What about height? Does height help to predict %body fat? Men with the same waist size can
vary from short and corpulent to tall and emaciated. Knowing a man has a 50-inch waist tells
us that he's likely to carry a lot of body fat. If we found out that he was 7 feet tall, that might
change our impression of his body type. Knowing his height as well as his waist size might
help us to make a more accurate prediction.

Just Do It

Does a regression with two predictors even make sense? It does—and that's fortunate because
the world is too complex a place for simple linear regression alone to model it. A regression
with two or more predictor variables is called a multiple regression. (When we need to note
the difference, a regression on a single predictor is called a simple regression.) We'd never try
to find a regression by hand, and even calculators aren't really up to the task. This is a job for
a statistics program on a computer. If you know how to find the regression of %body fat on
waist size with a statistics package, you can usually just add height to the list of predictors
without having to think hard about how to do it.

A Note on Terminology: When we have two or more predictors and fit a linear model by
least squares, we are formally said to fit a least squares linear multiple regression. Most folks
just call it "multiple regression." You may also see the abbreviation OLS used with this kind
of analysis. It stands for Ordinary Least Squares.

For simple regression we found the Least Squares solution, the one whose coefficients made
the sum of the squared residuals as small as possible. For multiple regression, we'll do the
same thing but this time with more coefficients. Remarkably enough, we can still solve this
problem. Even better, a statistics package can find the coefficients of the least squares model
easily. Here's a typical example of a multiple regression table:

Dependent variable is: Pct BF
R-squared = 71.3%    R-squared (adjusted) = 71.1%
s = 4.460 with 250 − 3 = 247 degrees of freedom

Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    −3.10088      7.686       −0.403      0.6870
Waist         1.77309      0.0716      24.8       ≤0.0001
Height       −0.60154      0.1099      −5.47      ≤0.0001

You should recognize most of the numbers in this table. Most of them mean what you expect
them to.
R² gives the fraction of the variability of %body fat accounted for by the multiple regression
model. (With waist alone predicting %body fat, the R² was 67.8%.) The multiple regression
model accounts for 71.3% of the variability in %body fat. We shouldn't be surprised that R²
has gone up. It was the hope of accounting for some of that leftover variability that led us to
try a second predictor.
The standard deviation of the residuals is still denoted s (or sometimes se to distinguish it
from the standard deviation of y).
The degrees of freedom calculation follows our rule of thumb: the degrees of freedom is the
number of observations (250) minus one for each coefficient estimated—for this model, 3.
For each predictor we have a coefficient, its standard error, a t-ratio, and the corresponding
P-value. As with simple regression, the t-ratio measures how many standard errors the
coefficient is away from 0. So, using a Student's t-model, we can use its P-value to test the
null hypothesis that the true value of the coefficient is 0.
Using the coefficients from this table, we can write the regression model:

Predicted %body fat = −3.10 + 1.77 waist − 0.60 height.

As before, we define the residuals as

residuals = %body fat − predicted %body fat.

We've fit this model with the same least squares principle: The sum of the squared residuals
is as small as possible for any choice of coefficients.
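In Python, a comparable table can be produced with statsmodels' formula interface. This is only a sketch: the body fat data set is not included in this material, so the file name and column names (bodyfat.csv, pct_bf, waist, height) are placeholders for whatever data you actually have:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Placeholder file and column names; substitute your own data set
    body = pd.read_csv("bodyfat.csv")

    # Multiple regression of %body fat on waist size and height
    fit = smf.ols("pct_bf ~ waist + height", data=body).fit()
    print(fit.summary())   # R², adjusted R², residual std. error, coefficients, SEs, t-ratios, P-values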

So, What's New?

So what's different? With so much of the multiple regression looking just like simple
regression, why devote an entire chapter (or two) to the subject?
There are several answers to this question. First—and most important—the meaning of the
coefficients in the regression model has changed in a subtle but important way. Because that
change is not obvious, multiple regression coefficients are often misinterpreted. We'll show
some examples to help make the meaning clear.
Second, multiple regression is an extraordinarily versatile calculation, underlying many
widely used Statistics methods. A sound understanding of the multiple regression model will
help you to understand these other applications.
Third, multiple regression offers our first glimpse into statistical models that use more than
two quantitative variables. The real world is complex. Simple models of the kind we've seen
so far are a great start, but often they're just not detailed enough to be useful for
understanding, predicting, and decision making. Models that use several variables can be a
big step toward realistic and useful modeling of complex phenomena and relationships.

What Multiple Regression Coefficients Mean

We said that height might be important in predicting body fat in men. What's the relationship
between %body fat and height in men? We know how to approach this question; we follow
the three rules. Here's the scatterplot:

[Figure 29.1: The scatterplot of %body fat against height (in.) seems to say that there is little
relationship between these variables.]

It doesn't look like height tells us much about %body fat. You just can't tell much about a
man's %body fat from his height. Or can you? Remember, in the multiple regression model,
the coefficient of height was −0.60, had a t-ratio of −5.47, and had a very small P-value. So it
did contribute to the multiple regression model. How could that be?
The answer is that the multiple regression coefficient of height takes account of the other
predictor, waist size, in the regression model.
To understand the difference, let's think about all men whose waist size is about 37
inches—right in the middle of our sample. If we think only about these men, what do we
expect the relationship between height and %body fat to be? Now a negative association
makes sense because taller men probably have less body fat than shorter men who have the
same waist size. Let's look at the plot:

[Figure 29.2: When we restrict our attention to men with waist sizes between 36 and 38
inches (points in blue), we can see a relationship between %body fat and height.]

Here we’ve highlighted the men with waist sizes between 36 and 38
inches. Overall, there’s little relationship between %body fat and
height, as we can see from the full set of points. But when we focus on
particular waist sizes, there is a relationship between body fat and
height. This relationship is conditional because we’ve restricted our
set to only those men within a certain range of waist sizes. For men
with that waist size, an extra inch of height is associated with a
decrease of about 0.60% in body fat. If that relationship is consistent
for each waist size, then the multiple regression coefficient will
estimate it. The simple regression coefficient simply couldn't see it.
We’ve picked one particular waist size to highlight. How could we
look at the relationship between %body fat and height conditioned on
all waist sizes at the same time? Once again, residuals come to the
rescue.
We plot the residuals of %body fat after a regression on waist size against the residuals of
height after regressing it on waist size. This display is called a partial regression plot. It
shows us just what we asked for: the relationship of %body fat to height after removing the
linear effects of waist size.
(As their name reminds us, residuals are what's left over after we fit a model. That lets us
remove the effects of some variables. The residuals are what's left.)

[Figure 29.3: A partial regression plot of %body fat residuals against height residuals (in.);
the plot has a slope equal to the coefficient of height in the multiple regression model.]

A partial regression plot for a particular predictor has a slope that is the same as the multiple
regression coefficient for that predictor. Here, it's −0.60. It also has the same residuals as the
full multiple regression, so you can spot any outliers or influential points and tell whether
they've affected the estimation of this particular coefficient.
Many modern statistics packages offer partial regression plots as an
option for any coefficient of a multiple regression. For the same
reasons that we always look at a scatterplot before interpreting a
simple regression coefficient, it’s a good idea to make a partial
regression plot for any multiple regression coefficient that you hope to
understand or interpret.
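A partial regression plot can be built by hand from two auxiliary regressions, as described above. The sketch below assumes the same placeholder DataFrame used earlier (bodyfat.csv with pct_bf, waist, and height columns) and uses matplotlib for the plot:

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    body = pd.read_csv("bodyfat.csv")   # placeholder data set

    # Residuals of the response and of the predictor of interest, each after regressing on waist
    resid_y = smf.ols("pct_bf ~ waist", data=body).fit().resid
    resid_x = smf.ols("height ~ waist", data=body).fit().resid

    # The least-squares slope of this scatterplot equals the multiple regression coefficient of height
    plt.scatter(resid_x, resid_y)
    plt.xlabel("Height residuals (after waist)")
    plt.ylabel("%Body fat residuals (after waist)")
    plt.show()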

The Multiple Regression Model

We can write a multiple regression model like this, numbering the predictors arbitrarily (we
don't care which one is x1), writing β's for the model coefficients (which we will estimate
from the data), and including the errors in the model:

y = β0 + β1x1 + β2x2 + ε.

Of course, the multiple regression model is not limited to two predictor variables, and
regression model equations are often written to indicate summing any number (a typical
letter to use is k) of predictors. That doesn't really change anything, so we'll often stick with
the two-predictor version just for simplicity. But don't forget that we can have many
predictors.
The assumptions and conditions for the multiple regression model sound nearly the same as
for simple regression, but with more variables in the model, we'll have to make a few
changes.

Assumptions and Conditions

Linearity Assumption
We are fitting a linear model.¹ For that to be the right kind of model, we need an underlying
linear relationship. But now we're thinking about several predictors. To see whether the
assumption is reasonable, we'll check the Straight Enough Condition for each of the
predictors.
Straight Enough Condition: Scatterplots of y against each of the predictors are reasonably
straight. As we have seen with height in the body fat example, the scatterplots need not show
a strong (or any!) slope; we just check that there isn't a bend or other nonlinearity. For the
%body fat data, the scatterplot is beautifully linear in waist as we saw in Chapter 27. For
height, we saw no relationship at all, but at least there was no bend.
As we did in simple regression, it's a good idea to check the residuals for linearity after we fit
the model. It's good practice to plot the residuals against the predicted values and check for
patterns, especially for bends or other nonlinearities. (We'll watch for other things in this plot
as well.)
[Check the Residual Plot, Part 1: The residuals should appear to have no pattern with respect
to the predicted values.]
If we're willing to assume that the multiple regression model is reasonable, we can fit the
regression model by least squares. But we must check the other assumptions and conditions
before we can interpret the model or test any hypotheses.

Independence Assumption
As with simple regression, the errors in the true underlying regression model must be
independent of each other. As usual, there's no way to be sure that the Independence
Assumption is true. Fortunately, even though there can be many predictor variables, there is
only one response variable and only one set of errors. The Independence Assumption
concerns the errors, so we check the corresponding conditions on the residuals.
[Check the Residual Plot, Part 2: The residuals should appear to be randomly scattered and
show no patterns or clumps when plotted against the predicted values.]
Randomization Condition: The data should arise from a random sample or randomized
experiment. Randomization assures us that the data are representative of some identifiable
population. If you can't identify the population, you can't interpret the regression model or
any hypothesis tests because they are about a regression model for that population.
Regression methods are often applied to data that were not collected with randomization.
Regression models fit to such data may still do a good job of modeling the data at hand, but
without some reason to believe that the data are representative of a particular population, you
should be reluctant to believe that the model generalizes to other situations.
We also check displays of the regression residuals for evidence of patterns, trends, or
clumping, any of which would suggest a failure of independence. In the special case when
one of the x-variables is related to time, be sure that the residuals do not have a pattern when
plotted against that variable.
The %body fat data were collected on a sample of men. The men were not related in any
way, so we can be pretty sure that their measurements are independent.

¹ By linear we mean that each x appears simply multiplied by its coefficient and added to the
model. No x appears in an exponent or some other more complicated function. That means
that as we move along any x-variable, our prediction for y will change at a constant rate
(given by the coefficient) if nothing else changes.

Equal Variance Assumption

The variability of the errors should be about the same for all values of each predictor. To see
if this is reasonable, we look at scatterplots.
Does the Plot Thicken? Condition: Scatterplots of the regression residuals against each x or
against the predicted values, ŷ, offer a visual check. The spread around the line should be
nearly constant. Be alert for a "fan" shape or other tendency for the variability to grow or
shrink in one part of the scatterplot.
[Check the Residual Plot, Part 3: The spread of the residuals should be uniform when plotted
against any of the x's or against the predicted values.]
Here are the residuals plotted against waist and height. Neither plot shows patterns that might
indicate a problem.

[Figure 29.4: Residuals plotted against each predictor (Waist, in. and Height, in.) show no
pattern. That's a good indication that the Straight Enough Condition and the "Does the Plot
Thicken?" Condition are satisfied.]
If residual plots show no pattern, if the data are plausibly independent, and if the plots don't
thicken, we can feel good about interpreting the regression model. Before we test hypotheses,
however, we must check one final assumption.

Normality Assumption
We assume that the errors around the idealized regression model at any specified values of
the x-variables follow a Normal model. We need this assumption so that we can use a
Student's t-model for inference. As with other times when we've used Student's t, we'll settle
for the residuals satisfying the Nearly Normal Condition.
Nearly Normal Condition: Because we have only one set of residuals, this is the same set of
conditions we had for simple regression. Look at a histogram or Normal probability plot of
the residuals. The histogram of residuals in the %body fat regression certainly looks nearly
Normal, and the Normal probability plot is fairly straight. And, as we have said before, the
Normality Assumption becomes less important as the sample size grows.
[Figure 29.5: Check a histogram of the residuals. The distribution of the residuals should be
unimodal and symmetric. Or check a Normal probability plot to see whether it is straight.]
Let's summarize all the checks of conditions that we've made and the order that we've made
them:

1. Check the Straight Enough Condition with scatterplots of the y-variable against each
   x-variable.
2. If the scatterplots are straight enough (that is, if it looks like the regression model is
   plausible), fit a multiple regression model to the data. (Otherwise, either stop or consider
   re-expressing an x- or the y-variable.)
3. Find the residuals and predicted values.
4. Make a scatterplot of the residuals against the predicted values.² This plot should look
   patternless. Check in particular for any bend (which would suggest that the data weren't
   all that straight after all) and for any thickening. If there's a bend and especially if the
   plot thickens, consider re-expressing the y-variable and starting over.
5. Think about how the data were collected. Was suitable randomization used? Are the
   data representative of some identifiable population? If the data are measured over time,
   check for evidence of patterns that might suggest they're not independent by plotting the
   residuals against time to look for patterns.
6. If the conditions check out this far, feel free to interpret the regression model and use it
   for prediction. If you want to investigate a particular coefficient, make a partial
   regression plot for that coefficient.
7. If you wish to test hypotheses about the coefficients or about the overall regression, then
   make a histogram and Normal probability plot of the residuals to check the Nearly
   Normal Condition.

² In Chapter 27 we noted that a scatterplot of residuals against the predicted values looked
just like the plot of residuals against x. But for a multiple regression, there are several x's.
Now the predicted values, ŷ, are a combination of the x's—in fact, they're the combination
given by the regression equation we have computed. So they combine the effects of all the
x's in a way that makes sense for our particular regression model. That makes them a good
choice to plot against.

Multiple Regression

Let's try finding and interpreting a multiple regression model for the body fat data.

Plan: Name the variables, report the W's, and specify the questions of interest.
I have body measurements on 250 adult males from the BYU Human Performance Research
Center. I want to understand the relationship between %body fat, height, and waist size.

Model: Check the appropriate conditions.
✔ Straight Enough Condition: There is no obvious bend in the scatterplots of %body fat
against either x-variable. The scatterplot of residuals against predicted values below shows
no patterns that would suggest nonlinearity.
✔ Independence Assumption: These data are not collected over time, and there's no reason to
think that the %body fat of one man influences that of another. I don't know whether the men
measured were sampled randomly, but the data are presented as being representative of the
male population of the United States.
✔ Does the Plot Thicken? Condition: The scatterplot of residuals against predicted values
shows no obvious changes in the spread about the line.
[Figure: residuals (%body fat) plotted against predicted values (%body fat); no obvious
change in spread.]
✔ Nearly Normal Condition: A histogram of the residuals is unimodal and symmetric.
[Figure: histogram of the residuals (%body fat).]
(Now you can find the regression and examine the residuals. You need the Nearly Normal
Condition only if you want to do inference.)
The Normal probability plot of the residuals is reasonably straight:

[Figure: Normal probability plot of the residuals (%body fat) against Normal scores.]

Choose your method. Under these conditions a full multiple regression analysis is
appropriate.

Mechanics: Here is the computer output for the regression:

Dependent variable is: %BF
R-squared = 71.3%    R-squared (adjusted) = 71.1%
s = 4.460 with 250 − 3 = 247 degrees of freedom

Source       Sum of Squares   DF    Mean Square   F-ratio   P-value
Regression   12216.6          2     6108.28       307       <0.0001
Residual     4912.26          247   19.8877

Variable     Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    −3.10088      7.686       −0.403     0.6870
Waist         1.77309      0.0716      24.8      <0.0001
Height       −0.60154      0.1099      −5.47     <0.0001

The estimated regression equation is

Predicted %body fat = −3.10 + 1.77 waist − 0.60 height.

Conclusion: Interpret the regression in the proper context.
The R² for the regression is 71.3%. Waist size and height together account for about 71% of
the variation in %body fat among men. The regression equation indicates that each inch in
waist size is associated with about a 1.77 increase in %body fat among men who are of a
particular height. Each inch of height is associated with a decrease in %body fat of about
0.60 among men with a particular waist size.
The standard errors for the slopes of 0.07 (waist) and 0.11 (height) are both small compared
with the slopes themselves, so it looks like the coefficient estimates are fairly precise. The
residuals have a standard deviation of 4.46%, which gives an indication of how precisely we
can predict %body fat with this model.
Multiple Regression Inference I: I Thought I Saw an ANOVA Table . . .

There are several hypothesis tests in the multiple regression output, but all of them talk about
the same thing. Each is concerned with whether the underlying model parameters are actually
zero.
The first of these hypotheses is one we skipped over for simple regression (for reasons that
will be clear in a minute). Now that we've looked at ANOVA (in Chapter 28),³ we can
recognize the ANOVA table sitting in the middle of the regression output. Where'd that come
from?
The answer is that now that we have more than one predictor, there's an overall test we
should consider before we do more inference on the coefficients. We ask the global question
"Is this multiple regression model any good at all?" That is, would we do as well using just ȳ
to model y? What would that mean in terms of the regression? Well, if all the coefficients
(except the intercept) were zero, we'd have

ŷ = b0 + 0x1 + … + 0xk

and we'd just set b0 = ȳ.
To address the overall question, we'll test

H0: β1 = β2 = … = βk = 0.

(That null hypothesis looks very much like the null hypothesis we tested in the Analysis of
Variance in Chapter 28.)
We can test this hypothesis with a statistic that is labeled with the letter F (in honor of Sir
Ronald Fisher, the developer of Analysis of Variance). In our example, the F-value is 307 on
2 and 247 degrees of freedom. The alternative hypothesis is just that the slope coefficients
aren't all equal to zero, and the test is one-sided—bigger F-values mean smaller P-values. If
the null hypothesis were true, the F-statistic would be near 1. The F-statistic here is quite
large, so we can easily reject the null hypothesis and conclude that the multiple regression
model is better than just using the mean.⁴
Why didn't we do this for simple regression? Because the null hypothesis would have just
been that the lone model slope coefficient was zero, and we were already testing that with the
t-statistic for the slope. In fact, the square of that t-statistic is equal to the F-statistic for the
simple regression, so it really was the identical test.

³ If you skipped over Chapter 28, you can just take our word for this and read on.
⁴ There are F tables on the CD, and they work pretty much as you'd expect. Most regression
tables include a P-value for the F-statistic, but there's almost never a need to perform this
particular test in a multiple regression. Usually we just glance at the F-statistic to see that it's
reasonably far from 1.0, the value it would have if the true coefficients were really all zero.
Multiple Regression Inference II: Testing the Coefficients

Once we check the F-test and reject the null hypothesis—and, if we are being careful, only if
we reject that hypothesis—we can move on to checking the test statistics for the individual
coefficients. Those tests look like what we did for the slope of a simple regression. For each
coefficient we test

H0: βj = 0

against the (two-sided) alternative that it isn't zero. The regression table gives a standard
error for each coefficient and the ratio of the estimated coefficient to its standard error. If the
assumptions and conditions are met (and now we need the Nearly Normal Condition), these
ratios follow a Student's t-distribution:

t(n−k−1) = (bj − 0) / SE(bj)

How many degrees of freedom? We have a rule of thumb and it works here. The degrees of
freedom is the number of data values minus the number of predictors (in this case, counting
the intercept term). For our regression on two predictors, that's n − 3. You shouldn't have to
look up the t-values. Almost every regression report includes the corresponding P-values.
We can build a confidence interval in the usual way, as an estimate ± a margin of error. As
always, the margin of error is just the product of the standard error and a critical value. Here
the critical value comes from the t-distribution on n − k − 1 degrees of freedom. So a
confidence interval for βj is

bj ± t*(n−k−1) × SE(bj).

The tricky parts of these tests are that the standard errors of the coefficients now require
harder calculations (so we leave it to the technology) and the meaning of a coefficient, as we
have seen, depends on all the other predictors in the multiple regression model.
That last bit is important. If we fail to reject the null hypothesis for a multiple regression
coefficient, it does not mean that the corresponding predictor variable has no linear
relationship to y. It means that the corresponding predictor contributes nothing to modeling y
after allowing for all the other predictors.

How's That, Again?

This last point bears repeating. The multiple regression model looks so simple and
straightforward:

y = β0 + β1x1 + … + βkxk + ε.

It looks like each βj tells us the effect of its associated predictor, xj, on the response variable,
y. But that is not so. This is, without a doubt, the most common error that people make with
multiple regression:
• It is possible for there to be no simple relationship between y and xj, and yet βj in a multiple
regression can be significantly different from 0. We saw this happen for the coefficient of
height in our example.
• It is also possible for there to be a strong two-variable relationship between y and xj, and
yet βj in a multiple regression can be almost 0 with a large P-value so that we must retain the
null hypothesis that the true coefficient is zero. If we're trying to model the horsepower of a
car, using both its weight and its engine size, it may turn out that the coefficient for engine
size is nearly 0. That doesn't mean that engine size isn't important for understanding
horsepower. It simply means that after allowing for the weight of the car, the engine size
doesn't give much additional information.
• It is even possible for there to be a significant linear relationship between y and xj in one
direction, and yet βj can be of the opposite sign and strongly significant in a multiple
regression. More expensive cars tend to be bigger, and since bigger cars have worse fuel
efficiency, the price of a car has a slightly negative association with fuel efficiency. But in a
multiple regression of fuel efficiency on weight and price, the coefficient of price may be
positive. If so, it means that among cars of the same weight, more expensive cars have better
fuel efficiency. The simple regression on price, though, has the opposite direction because,
overall, more expensive cars are bigger. This switch in sign may seem a little strange at first,
but it's not really a contradiction at all. It's due to the change in the meaning of the
coefficient of price when it is in a multiple regression rather than a simple regression.
(When the regression model grows by including a new predictor, all the coefficients are
likely to change. That can help us understand what those coefficients mean.)
So we'll say it once more: The coefficient of xj in a multiple regression depends as much on
the other predictors as it does on xj. Remember that when you interpret a multiple regression
model.

Another Example: Modeling Infant Mortality

WHO: U.S. states    WHAT: Various measures relating to children and teens    WHEN: 1999
WHY: Research and policy

Infant mortality is often used as a general measure of the quality of healthcare for children
and mothers. It is reported as the rate of deaths of newborns per 1000 live births. Data
recorded for each of the 50 states of the United States may allow us to build regression
models to help understand or predict infant mortality. The variables available for our model
are child death rate (deaths per 100,000 children aged 1–14), percent of teens who are high
school dropouts (ages 16–19), percent of low-birth-weight babies (lbw), teen birth rate (births
per 100,000 females ages 15–17), and teen deaths by accident, homicide, and suicide (deaths
per 100,000 teens ages 15–19).⁵
All of these variables were displayed and found to have no outliers and nearly Normal
distributions.⁶ One useful way to check many of our conditions is with a scatterplot matrix.
This is an array of scatterplots set up so that the plots in each row have the same variable on
their y-axis and those in each column have the same variable on their x-axis. This way every
pair of variables is graphed. On the diagonal, rather than plotting a variable against itself,
you'll usually find either a Normal probability plot or a histogram of the variable to help us
assess the Nearly Normal Condition.

⁵ The data are available from the Kids Count section of the Annie E. Casey Foundation, and
are all for 1999.
⁶ In the interest of complete honesty, we should point out that the original data include the
District of Columbia, but it proved to be an outlier on several of the variables, so we've
restricted attention to the 50 states here.
[Figure 29.6: A scatterplot matrix shows a scatterplot of each pair of variables arrayed so that
the vertical and horizontal axes are consistent across rows and down columns. The diagonal
cells may hold Normal probability plots (as they do here), histograms, or just the names of
the variables. These are a great way to check the Straight Enough Condition and to check for
simple outliers. Variables shown: Infant Mortality, Child Death Rate, HS Dropout Rate, Low
Birth Weight, Teen Births, Teen Deaths.]
The individual scatterplots show at a glance that each of the
relationships is straight enough for regression. There are no obvious
bends, clumping, or outliers. And the plots don’t thicken. So it looks
like we can examine some multiple regression models with inference.
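A scatterplot matrix is easy to produce with pandas. This is only a sketch: the file and column names below are placeholders for the state-level data described above.

    import pandas as pd
    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    # Placeholder file and column names for the state-level data
    kids = pd.read_csv("kids_count_1999.csv")
    cols = ["infant_mort", "child_death_rate", "hs_dropout",
            "low_bw", "teen_births", "teen_deaths"]

    # Scatterplot matrix: every pair of variables, histograms on the diagonal
    scatter_matrix(kids[cols], diagonal="hist", figsize=(10, 10))
    plt.show()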

Inference for Multiple Regression

Let's try to model infant mortality with all of the available predictors.

Plan: State what you want to know.
I wonder whether all or some of these predictors contribute to a useful model for infant
mortality.

Hypotheses: Specify your hypotheses. (Hypotheses on the intercept are not particularly
interesting for these data.)
First, there is an overall null hypothesis that asks whether the entire model is better than just
modeling y with its mean:
H0: The model itself contributes nothing useful, and all the slope coefficients
β1 = β2 = … = βk = 0.
HA: At least one of the βj is not 0.
If I reject this hypothesis, then I'll test a null hypothesis for each of the coefficients of the
form:
H0: The j-th variable contributes nothing useful, after allowing for the other predictors in the
model: βj = 0.
HA: The j-th variable makes a useful contribution to the model: βj ≠ 0.

Model: Check the appropriate assumptions and conditions.
✔ Straight Enough Condition: The scatterplot matrix shows no bends, clumping, or outliers.
✔ Independence Assumption: These data are based on random samples and can be
considered independent.
These conditions allow me to compute the regression model and find residuals.
✔ Does the Plot Thicken? Condition: The residual plot shows no obvious trends in the
spread.
[Figure: residuals plotted against predicted values; no obvious trends in spread.]
✔ Nearly Normal Condition: A histogram of the residuals is unimodal and symmetric.
[Figure: histogram of the residuals.]
The one possible outlier is South Dakota. I may repeat the analysis after removing South
Dakota to see whether it changes substantially.

Choose your method. Under these conditions I can continue with a multiple regression
analysis.

Mechanics: Multiple regressions are always found from a computer program. Computer
output for this regression looks like this:

Dependent variable is: Infant mort
R-squared = 71.3%    R-squared (adjusted) = 68.0%
s = 0.7520 with 50 − 6 = 44 degrees of freedom

Source       Sum of Squares   DF   Mean Square   F-ratio
Regression   61.7319          5    12.3464       21.8
Residual     24.8843          44   0.565553

Variable      Coefficient   SE(Coeff)   t-ratio   P-value
Intercept     1.63168       0.9124      1.79      0.0806
CDR           0.03123       0.0139      2.25      0.0292
HS drop      −0.09971       0.0610     −1.63      0.1096
Low BW        0.66103       0.1189      5.56     <0.0001
Teen births   0.01357       0.0238      0.57      0.5713
Teen deaths   0.00556       0.0113      0.49      0.6245

(The P-values given in the regression output table are from the Student's t-distribution on
n − 6 = 44 degrees of freedom. They are appropriate for two-sided alternatives.)

Consider the hypothesis tests. Under the assumptions we're willing to accept, and
considering the conditions we've checked, the individual coefficients follow Student's
t-distributions on 44 degrees of freedom.
The F-ratio of 21.8 on 5 and 44 degrees of freedom is certainly large enough to reject the
default null hypothesis that the regression model is no better than using the mean infant
mortality rate. So I'll go on to examine the individual coefficients.

Conclusion: Interpret your results in the proper context.
Most of these coefficients have relatively small t-ratios, so I can't be sure that their
underlying values are not zero. Two of the coefficients, child death rate (cdr) and low birth
weight (lbw), have P-values less than 5%. So I am confident that in this model both of these
variables are unlikely to really have zero coefficients.
Overall the R² indicates that more than 71% of the variability in infant mortality can be
accounted for with this regression model.
After allowing for the linear effects of the other variables in the model, an increase in the
child death rate of 1 death per 100,000 is associated with an increase of 0.03 deaths per 1000
live births in the infant mortality rate. And an increase of 1% in the percentage of live births
that are low birth weight is associated with an increase of 0.66 deaths per 1000 live births.
Comparing Multiple Regression Models

We have more variables available to us than we used when we modeled infant mortality.
Moreover, several of those we tried don't seem to contribute to the model. How do we know
that some other choice of predictors might not provide a better model? What exactly would
make an alternative model better?
These are not easy questions. There is no simple measure of the success of a multiple
regression model. Many people look at the R² value, and certainly we are not likely to be
happy with a model that accounts for only a small fraction of the variability of y. But that's
not enough. You can always drive the R² up by piling on more and more predictors, but
models with many predictors are hard to understand. Keep in mind that the meaning of a
regression coefficient depends on all the other predictors in the model, so it is best to keep
the number of predictors as small as possible.
Regression models should make sense. Predictors that are easy to understand are usually
better choices than obscure variables. Similarly, if there is a known mechanism by which a
predictor has an effect on the response variable, that predictor is usually a good choice for the
regression model.
How can we know whether we have the best possible model? The simple answer is that we
can't. There's always the chance that some other predictors might bring an improvement (in
higher R² or fewer predictors or simpler interpretation).
Adjusted R²

You may have noticed that the full regression tables shown in this chapter include another statistic we haven't discussed. It is called adjusted R² and sometimes appears in computer output as R²(adjusted). The adjusted R² statistic is a rough attempt to adjust for the simple fact that when we add another predictor to a multiple regression, the R² can't go down and will most likely get larger. Only if we were to add a predictor whose coefficient turned out to be exactly zero would the R² remain the same. This fact makes it difficult to compare alternative regression models that have different numbers of predictors.

We can write a formula for R² using the sums of squares in the ANOVA table portion of the regression output table:

R² = SSRegression / (SSRegression + SSResidual) = SSRegression / SSTotal = 1 − SSResidual / SSTotal.

Adjusted R² substitutes the corresponding mean squares (sums of squares divided by their degrees of freedom) for the SS's in the last form:

R²adj = 1 − MSResidual / MSTotal.

Because the mean squares are adjusted for the number of predictors in the model, the adjusted R² value won't necessarily increase when a new predictor is added to the multiple regression model. That's fine. But adjusted R² no longer tells the fraction of variability accounted for by the model and it isn't even bounded by 0 and 100%, so it can be awkward to interpret.
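As a quick aside (not part of the textbook excerpt), here is a minimal Python sketch showing how both statistics follow from the ANOVA sums of squares; the numbers are taken from the infant mortality output shown earlier, and the adjusted R² formula used is the one given above.

# A minimal sketch: recompute R-squared and adjusted R-squared from the
# ANOVA sums of squares reported in the infant mortality output above.
ss_regression = 61.7319        # SSRegression from the output
ss_residual = 24.8843          # SSResidual from the output
n, k = 50, 5                   # 50 states, 5 predictors

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total                                    # about 0.713 (71.3%)
adj_r_squared = 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))  # about 0.680 (68.0%)

print(round(r_squared, 3), round(adj_r_squared, 3))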
Comparing alternative regression models is a challenge, especially when they have different numbers of predictors. The search for a summary statistic to help us choose among models is the subject of much contemporary research in Statistics. Adjusted R² is one common, but not necessarily the best, choice often found in computer regression output tables. Don't use it as the sole decision criterion when you compare different regression models.
What Can Go Wrong? Interpreting Coefficients

● Don't claim to "hold everything else constant" for a single individual. It's often meaningless to say that a regression coefficient says what we expect to happen if all variables but one were held constant for an individual and the predictor in question changed. While it's mathematically correct, it often just doesn't make any sense. We can't gain a year of experience or have another child without getting a year older. Instead, we can think about all those who fit given criteria on some predictors and ask about the conditional relationship between y and one x for those individuals. The coefficient −0.60 of height for predicting %body fat says that among men of the same waist size, those who are one inch taller in height tend to be, on average, 0.60% lower in %body fat. The multiple regression coefficient measures that average conditional relationship.

● Don't interpret regression causally. Regressions are usually applied to observational data. Without deliberately assigned treatments, randomization, and control, we can't draw conclusions about causes and effects. We can never be certain that there are no variables lurking in the background, causing everything we've seen. Don't interpret b1, the coefficient of x1 in the multiple regression, by saying, "If we were to change an individual's x1 by 1 unit (holding the other x's constant) it would change his y by b1 units." We have no way of knowing what applying a change to an individual would do.

● Be cautious about interpreting a regression model as predictive. Yes, we do call the x's predictors, and you can certainly plug in values for each of the x's and find a corresponding predicted value, ŷ. But the term "prediction" suggests extrapolation into the future or beyond the data, and we know that we can get into trouble when we use models to estimate ŷ values for x's not in the range of the data. Be careful not to extrapolate very far from the span of your data. In simple regression it was easy to tell when you extrapolated. With many predictor variables, it's often harder to know when you are outside the bounds of your original data.⁷ We usually think of fitting models to the data more as modeling than as prediction, so that's often a more appropriate term.

● Don't think that the sign of a coefficient is special. Sometimes our primary interest in a predictor is whether it has a positive or negative association with y. As we have seen, though, the sign of the coefficient also depends on the other predictors in the model. Don't look at the sign in isolation and conclude that "the direction of the relationship is positive (or negative)." Just like the value of the coefficient, the sign is about the relationship after allowing for the linear effects of the other predictors. The sign of a variable can change depending on which other predictors are in or out of the model. For example, in the regression model for infant mortality, the coefficient of high school dropout rate was negative and its P-value was fairly small, but the simple association between dropout rate and infant mortality is positive. (Check the plot matrix.)

● If a coefficient's t-statistic is not significant, don't interpret it at all. You can't be sure that the value of the corresponding parameter in the underlying regression model isn't really zero.

⁷ With several predictors we can wander beyond the data because of the combination of values even when individual values are not extraordinary. For example, both 28-inch waists and 76-inch heights can be found in men in the body fat study, but a single individual with both these measurements would not be at all typical. The model we fit is probably not appropriate for predicting the %body fat for such a tall and skinny individual.
What Else Can Go Wrong?

● Don't fit a linear regression to data that aren't straight. This is the most fundamental regression assumption. If the relationship between the x's and y isn't approximately linear, there's no sense in fitting a linear model to it. What we mean by "linear" is a model of the form we have been writing for the regression. When we have two predictors, this is the equation of a plane, which is linear in the sense of being flat in all directions. With more predictors, the geometry is harder to visualize, but the simple structure of the model is consistent; the predicted values change consistently with equal size changes in any predictor. Usually we're satisfied when plots of y against each of the x's are straight enough. We'll also check a scatterplot of the residuals against the predicted values for signs of nonlinearity.

● Watch out for the plot thickening. The estimate of the error standard deviation shows up in all the inference formulas. If se changes with x, these estimates won't make sense. The most common check is a plot of the residuals against the predicted values. If plots of residuals against several of the predictors all show a thickening, and especially if they also show a bend, then consider re-expressing y. If the scatterplot against only one predictor shows thickening, consider re-expressing that predictor.

● Make sure the errors are nearly Normal. All of our inferences require that the true errors be modeled well by a Normal model. Check the histogram and Normal probability plot of the residuals to see whether this assumption looks reasonable.

● Watch out for high-influence points and outliers. We always have to be on the lookout for a few points that have undue influence on our model, and regression is certainly no exception. Partial regression plots are a good place to look for influential points and to understand how they affect each of the coefficients.
CONNECTIONS

We would never consider a regression analysis without first making scatterplots. The aspects of scatterplots that we always look for (their direction, shape, and scatter) relate directly to regression.

Regression inference is connected to just about every inference method we have seen for measured data. The assumption that the spread of data about the line is constant is essentially the same as the assumption of equal variances required for the pooled-t methods. Our use of all the residuals together to estimate their standard deviation is a form of pooling.
Of course, the ANOVA table in the regression output connects to our consideration
of ANOVA in Chapter 28. This, too, is not coincidental. Multiple Regression, ANOVA,
pooled t-tests, and inference for means are all part of a more general statistical model
known as the General Linear Model (GLM).
What have we learned?
We first met regression in Chapter 8 and its inference in Chapter 27. Now we add more predictors to our equation.

We've learned that there are many similarities between simple and multiple regression:

● We fit the model by least squares.
● The assumptions and conditions are essentially the same. For multiple regression:
  1. The relationship of y with each x must be straight (check the scatterplots).
  2. The data values must be independent (think about how they were collected).
  3. The spread about the line must be the same across the x-axis for each predictor variable (make a scatterplot or check the plot of residuals against predicted values).
  4. The errors must follow a Normal model (check a histogram or Normal probability plot of the residuals).
● R² still gives us the fraction of the total variation in y accounted for by the model.
● We perform inference on the coefficients by looking at the t-values, created from the ratio of the coefficients to their standard errors.

But we've also learned that there are some profound differences in interpretation when adding more predictors:

● The coefficient of each x indicates the average change in y we'd expect to see for a unit change in that x for particular values of all the other x-variables.
● The coefficient of a predictor variable can change sign when another variable is entered or dropped from the model.
● Finding a suitable model from among the possibly hundreds of potential models is not straightforward.
TERMS

Multiple regression: A linear regression with two or more predictors whose coefficients are found to minimize the sum of the squared residuals is a least squares linear multiple regression. But it is usually just called a multiple regression. When the distinction is needed, a least squares linear regression with a single predictor is called a simple regression. The multiple regression model is
y = β0 + β1x1 + … + βkxk + ε.

Least squares: We still fit multiple regression models by choosing the coefficients that make the sum of the squared residuals as small as possible. This is called the method of least squares.

Partial regression plot: The partial regression plot for a specified coefficient is a display that helps in understanding the meaning of that coefficient in a multiple regression. It has a slope equal to the coefficient value and shows the influences of each case on that value. A partial regression plot for a specified x displays the residuals when y is regressed on the other predictors against the residuals when the specified x is regressed on the other predictors.

Assumptions for inference in regression (and conditions to check for some of them):
● Linearity. Check that the scatterplots of y against each x are straight enough and that the scatterplot of residuals against predicted values has no obvious pattern. (If we find the relationships straight enough, we may fit the regression model to find residuals for further checking.)
● Independent errors. Think about the nature of the data. Check a residual plot. Any evident pattern in the residuals can call the assumption of independence into question.
● Constant variance. Check that the scatterplots show consistent spread across the ranges of the x-variables and that the residual plot has constant variance too. A common problem is increasing spread with increasing predicted values: the plot thickens!
● Normality of the residuals. Check a histogram or a Normal probability plot of the residuals.

ANOVA: The Analysis of Variance table that is ordinarily part of the multiple regression results offers an F-test to test the null hypothesis that the overall regression is no improvement over just modeling y with its mean:
H0: β1 = β2 = … = βk = 0.
If this null hypothesis is not rejected, then you should not proceed to test the individual coefficients.

t-ratios for the coefficients: The t-ratios for the coefficients can be used to test the null hypotheses that the true value of each coefficient is zero against the alternative that it is not.

Scatterplot matrix: A scatterplot matrix displays scatterplots for all pairs of a collection of variables, arranged so that all the plots in a row have the same variable displayed on their y-axis and all plots in a column have the same variable on their x-axis. Usually, the diagonal holds a display of a single variable such as a histogram or Normal probability plot, and identifies the variable in its row and column.

Adjusted R²: An adjustment to the R² statistic that attempts to allow for the number of predictors in the model. It is sometimes used when comparing regression models with different numbers of predictors.
SKILLS

When you complete this lesson you should:

• Understand that the "true" regression model is an idealized summary of the data.
• Know how to examine scatterplots of y vs. each x for violations of assumptions that would make inference for regression unwise or invalid.
• Know how to examine displays of the residuals from a multiple regression to check that the conditions have been satisfied. In particular, know how to judge linearity and constant variance from a scatterplot of residuals against predicted values. Know how to judge Normality from a histogram and Normal probability plot.
• Remember to be especially careful to check for failures of the independence assumption when working with data recorded over time. Examine scatterplots of the residuals against time and look for patterns.
• Be able to use a statistics package to perform the calculations and make the displays for multiple regression, including a scatterplot matrix of the variables, a scatterplot of residuals vs. predicted values, and partial regression plots for each coefficient.
• Know how to use the ANOVA F-test to check that the overall regression model is better than just using the mean of y.
• Know how to test the standard hypotheses that each regression coefficient is really zero. Be able to state the null and alternative hypotheses. Know where to find the relevant numbers in standard computer regression output.
• Be able to summarize a regression in words. In particular, be able to state the meaning of the regression coefficients, taking full account of the effects of the other predictors in the model.
• Be able to interpret the F-statistic for the overall regression.
• Be able to interpret the P-value of the t-statistics for the coefficients to test the standard null hypotheses.
Regression Analysis on the Computer

All statistics packages make a table of results for a regression. If you can read a package's regression output table for simple regression, then you can read its table for a multiple regression. You'll want to look at the ANOVA table, and you'll see information for each of the coefficients, not just for a single slope.

Most packages offer to plot residuals against predicted values. Some will also plot residuals against the x's. With some packages you must request plots of the residuals when you request the regression. Others let you find the regression first and then analyze the residuals afterward. Either way, your analysis is not complete if you don't check the residuals with a histogram or Normal probability plot and a scatterplot of the residuals against the x's or the predicted values.

One good way to check assumptions before embarking on a multiple regression analysis is with a scatterplot matrix. This is sometimes abbreviated SPLOM in commands.

Multiple regressions are always found with a computer or programmable calculator. Before computers were available, a full multiple regression analysis could take months or even years of work.
DATA DESK
• Select Y- and X-variable icons.
• From the Calc menu, choose Regression.
• Data Desk displays the regression table.
• Select plots of residuals from the Regression table's HyperView menu.
Comments: You can change the regression by dragging the icon of another variable over either the Y- or an X-variable name in the table and dropping it there. You can add a predictor by dragging its icon into that part of the table. The regression will recompute automatically.
EXCEL
• From the Tools menu, select Data Analysis.
• Select Regression from the Analysis Tools list.
• Click the OK button.
• Enter the data range holding the Y-variable in the box labeled "Y-range."
• Enter the range of cells holding the X-variables in the box labeled "X-range."
• Select the New Worksheet Ply option.
• Select Residuals options. Click the OK button.
Comments: The Y and X ranges do not need to be in the same rows of the spreadsheet, although they must cover the same number of cells. But it is a good idea to arrange your data in parallel columns as in a data table. The X-variables must be in adjacent columns. No cells in the data range may hold non-numeric values. Although the dialog offers a Normal probability plot of the residuals, the data analysis add-in does not make a correct probability plot, so don't use this option.
JMP
• From the Analyze menu select Fit Model.
• Specify the response, Y. Assign the predictors, X, in the Construct Model Effects dialog box.
• Click on Run Model.
Comments: JMP chooses a regression analysis when the response variable is "Continuous." The predictors can be any combination of quantitative or categorical. If you get a different analysis, check the variable types.
MINITAB
• Choose Regression from the Stat menu.
• Choose Regression. . . from the Regression submenu.
• In the Regression dialog, assign the Y-variable to the Response box and assign the X-variables to the Predictors box.
• Click the Graphs button.
• In the Regression-Graphs dialog, select Standardized residuals, and check Normal plot of residuals and Residuals versus fits.
• Click the OK button to return to the Regression dialog.
• To specify displays, click Graphs, and check the displays you want.
• Click the OK button to return to the Regression dialog.
• Click the OK button to compute the regression.
SPSS
• Choose Regression from the Analyze menu.
• Choose Linear from the Regression submenu.
• When the Linear Regression dialog appears, select the Y-variable and move it to the dependent target. Then move the X-variables to the independent target.
• Click the Plots button.
• In the Linear Regression Plots dialog, choose to plot the *SRESIDs against the *ZPRED values.
• Click the Continue button to return to the Linear Regression dialog.
• Click the OK button to compute the regression.
TI-83/84 Plus
Comments: You need a special program to compute a multiple regression on the TI-83.

TI-89
• Under STAT Tests choose B:MultREg Tests.
• Specify the number of predictor variables, and which lists contain the response variable and predictor variables.
• Press ENTER to perform the calculations.
Comments:
• The first portion of the output gives the F-statistic and its P-value as well as the values of R², AdjR², the standard deviation of the residuals (s), and the Durbin-Watson statistic, which measures correlation among the residuals.
• The rest of the main output gives the components of the F-test, as well as values of the coefficients, their standard errors, and associated t-statistics along with P-values. You can use the right arrow to scroll through these lists (if desired).
• The calculator creates several new lists that can be used for assessing the model and its conditions: Yhatlist, resid, sresid (standardized residuals), leverage, and cookd, as well as lists of the coefficients, standard errors, t's, and P-values.
EXERCISES

1. Interpretations. A regression performed to predict selling price of houses found the equation
   price = 169328 + 35.3 area + 0.718 lotsize − 6543 age
   where price is in dollars, area is in square feet, lotsize is in square feet, and age is in years. The R² is 92%. One of the interpretations below is correct. Which is it? Explain what's wrong with the others.
   a) Each year a house ages it is worth $6543 less.
   b) Every extra square foot of area is associated with an additional $35.30 in average price, for houses with a given lotsize and age.
   c) Every dollar in price means lotsize increases 0.718 square feet.
   d) This model fits 92% of the data points exactly.

2. More interpretations. A household appliance manufacturer wants to analyze the relationship between total sales and the company's three primary means of advertising (television, magazines, and radio). All values were in millions of dollars. They found the regression equation
   sales = 250 + 6.75 TV + 3.5 radio + 2.3 magazines.
   One of the interpretations below is correct. Which is it? Explain what's wrong with the others.
   a) If they did no advertising, their income would be $250 million.
   b) Every million dollars spent on radio makes sales increase $3.5 million, all other things being equal.
   c) Every million dollars spent on magazines increases TV spending $2.3 million.
   d) Sales increase on average about $6.75 million for each million spent on TV, after allowing for the effects of the other kinds of advertising.

3. Predicting final exams. How well do exams given during the semester predict performance on the final? One class had three tests during the semester. Computer output of the regression gives

   Dependent variable is Final
   s = 13.46    R-Sq = 77.7%    R-Sq(adj) = 74.1%

   Predictor    Coeff      SE(Coeff)    t        P-value
   Intercept    −6.72      14.00        −0.48    0.636
   Test1         0.2560     0.2274       1.13    0.274
   Test2         0.3912     0.2198       1.78    0.091
   Test3         0.9015     0.2086       4.32    <0.0001

   Analysis of Variance
   Source        DF    SS         MS        F        P-value
   Regression     3    11961.8    3987.3    22.02    <0.0001
   Error         19     3440.8     181.1
   Total         22    15402.6
a) Write the equation of the regression model. Analysis of Variance


b) How much of the variation in final exam
scores is ac- counted for by the regression Source DF SS MS P- F
model? value
Regression 2 99303550067 49651775033 11.06 0.004
c) Explain in context what the coefficient of
Test3 scores means. Residual 9 40416679100 4490742122
d) A student argues that clearly the first exam Total 11 1.39720E111
doesn’t help to predict final performance. She
suggests that this exam not be given at all. a) Write the regression equation.
Does Test 1 have no ef- fect on the final exam b) How much of the variation in home asking
score? Can you tell from this model? (Hint: prices is accounted for by the model?
Do you think test scores are related to each c) Explain in context what the coefficient of
other?) square footage means.
d) The owner of a construction firm, upon seeing
T 4. Scottish hill races. Hill running—races up and this model, objects because the model says
down hills—has a written history in Scotland that the num- ber of bathrooms has no effect
dating back to the year 1040. Races are held on the price of the home. He says that when
throughout the year at dif- ferent locations he adds another bathroom, it increases the
around Scotland. A recent compilation of value. Is it true that the number of bathrooms
information for 71 races (for which full is unrelated to house price? (Hint: Do you
information was available and omitting two think bigger houses have more bathrooms?)
unusual races) includes the distance (miles), the T 6. More hill races. Here is the regression for the
climb (ft), and the record time (sec- onds). A women’s records for the same Scottish hill races
regression to predict the men’s records as of we considered in Exercise 4:
2000 looks like this:
Dependent variable is: Women’s record
Dependent variable is: Men’s record R-squared 5 97.7% R-squared (adjusted) 5 97.6%
R-squared 5 98.0% R-squared (adjusted) 5 98.0% s 5 479.5 with 71 2 3 5 68 degrees of freedom
s 5 369.7 with 71 2 3 5 68 degrees of freedom
Sum of Mean
Sum Mean Source Squares df Square F-ratio
Source of df Squar F-ratio Regression 658112727 2 329056364 1431
Square e Residual 15634430 68 229918
s
Variable Coefficient SE(Coeff t-ratio P-value
Regression 458947098 2 229473549 1679
)
Residual 9293383 68 136667 Intercept 2554.015 101.7 25.45 ,0.0001
Distance 418.632 15.89 26.4 ,0.0001
Variable Coefficient SE(Coeff) t-ratio P-value
Intercept 2521.995 78.39 26.66 ,0.0001 Climb 0.780568 0.0531 14.7 ,0.0001
Distance 351.879 12.25 28.7 ,0.0001 a) Compare the regression model for the
women’s records with that found for the
Climb 0.643396 0.0409 15.7 ,0.0001
men’s records in Ex- ercise 4.
a) Write the regression equation. Give a brief
report on what it says about men’s record Here’s a scatterplot of the residuals for this regression:
times in hill races.
b) Interpret the value of R2 in this regression. 1500
c) What does the coefficient of climb mean in
this re- gression? 750
Residuals

5. Home prices. Many variables have an impact


on deter- mining the price of a house. A few of
(min)

0
these are size of the house (square feet), lot size,
and number of bathrooms. Information for a
random sample of homes for sale in the –750
Statesboro, GA, area was obtained from the
Internet. Regression output modeling the asking
price with square footage and number of
bathrooms gave the following result:
Dependent Variable is: Price
s 5 67013 R-Sq 5 71.1% R-Sq (adj) 5 64.6%
3000 6000
Predictor Coeff SE(Coeff) T P-value 900 12000
0
Intercept 2152037 85619 21.78 0.110
Predicted (min)
Baths 9530 40826 0.23 0.821
Sq ft 139.87 46.67 3.00 0.015 b) Discuss the residuals and what they say
about the assumptions and conditions for this
regression.
7. Predicting finals II. Here are some diagnostic


8. Secretary performance. The AFL-CIO has
plots for the final exam data from Exercise 3.
undertaken a study of 30 secretaries’ yearly
These were gener- ated by a computer package
salaries (in thousands of dollars). The
and may look different from the plots generated
organization wants to predict salaries from
by the packages you use. (In particu- lar, note
several other variables.
that the axes of the Normal probability plot are
swapped relative to the plots we’ve made in the The variables considered to be potential
text. We only care about the pattern of this plot, predictors of salary are:
so it shouldn’t af- fect your interpretation.) X1 5 months of
Examine these plots and dis- cuss whether the service X2 5 years
assumptions and conditions for the multiple of education
regression seem reasonable.
X3 5 score on standardized test
Residuals vs. the Fitted X4 5 words per minute (wpm) typing speed
Values (Response is X5 5 ability to take dictation in words per minute
Final) A multiple regression model with all five variables was
run on a computer package, resulting in the following
20 output:

10 Variable Coefficient Std. Error t-value


Constant 9.788 0.377 25.960
X1 0.110 0.019 5.178
Residual

0
(points)

X2 0.053 0.038 1.369


–10 X3 0.071 0.064 1.119
X4 0.004 0.307 0.013
–20 X5 0.065 0.038 1.734

50 60 70 80 90 100 110 120 130 140


150 s 5 0.430 R2 5 0.863
Fitted Value Assume that the residual plots show no
violations of the conditions for using a linear
regression model.
Normal Probability Plot of the Residuals a) What is the regression equation?
(Response is Final) b) From this model, what is the predicted salary
(in thou- sands of dollars) of a secretary with
2 10 years (120 months) of experience, 9th
grade education (9 years of education), a 50
on the standardized test, 60 wpm typ- ing
1 speed, and the ability to take 30 wpm
dictation?
0 c) Test whether the coefficient for words per
Normal

minute of typing speed (X4) is significantly


Score

different from zero at a 5 0.05.


–1 d) How might this model be improved?
e) A correlation of age with salary finds r 5
–2
0.682, and the scatterplot shows a
moderately strong positive linear
–20 –10 0 10 20 association. However, if X6 5 age is added
Residual (points) to the multiple regression, the estimated
coefficient of age turns out to be b6 5
Histogram of the Residuals 20.154. Explain some possi- ble causes for
(Response is Final) this apparent change of direction in the
relationship between age and salary.
4
9. Home prices II. Here are some diagnostic
3 plots for the home prices data from Exercise 5.
Frequenc

These were generated by a computer package


2
and may look different from the plots generated
by the packages you use. (In particular, note
y

1
that the axes of the Normal probability plot are
0 swapped relative to the plots we’ve made in the
–20 –15 –10 –5 0 5 10 15 20 text. We only care about the pattern of this plot,
Residuals (points) so it shouldn’t af- fect your interpretation.)
Examine these plots and dis- cuss whether the
assumptions and conditions for the multiple
regression seem reasonable.
Residuals vs. the Fitted a) What is the regression equation?


Values (Response is b) From this model, what is the predicted GPA
Price) of a stu- dent with an SAT Verbal score of 500
150000
and an SAT Math score of 550?
c) What else would you want to know about this
100000
re- gression before writing a report about the
relation- ship between SAT scores and grade
50000
Residual

point averages? Why would these be


important to know?
0
($)

T 11. Body fat revisited. The data set on body fat


–50000 contains 15 body measurements on 250 men
from 22 to 81 years old. Is average %body fat
–100000 related to weight? Here’s a scatterplot:

100000 200000 300000


400000
Fitted Value

Normal Probability Plot of the


Residuals (Response is 40
Price)
30
2

% Body
20

Fat
1
10
Normal

0
Score

–1 120 160 200 240


Weight (lb)
–2
–100000 –50000 0 50000 100000 150000 And here’s the simple regression:
Residual ($)
Dependent variable is: Pct BF
R-squared 5 38.1% R-squared (adjusted) 5 37.9%
Histogram of the Residuals s 5 6.538 with 250 2 2 5 248 degrees of freedom
(Response is Price) Variable Coefficient SE(Coeff) t-ratio P-value
5 Intercept 214.6931 2.760 25.32 ,0.0001
4 Weight 0.18937 0.0153 12.4 ,0.0001
Frequenc

3 a) Is the coefficient of %body fat on weight


2 statistically distinguishable from 0? (Perform
a hypothesis test.)
y

1 b) What does the slope coefficient mean in this


0 regres- sion?
–50000 0 50000 100000 150000 We saw before that the slopes of both waist size
Residual ($) and height are statistically significant when
entered into a multiple regression equation.
What happens if we add weight to that
10. GPA and SATs. A large section of Stat 101 was regression? Recall that we’ve already checked
asked to fill out a survey on grade point average the assumptions and conditions for regression
and SAT scores. A regression was run to find out on waist size and height in the chapter. Here is
how well Math and Ver- bal SAT scores could the output from a regression on all three
predict academic performance as measured by variables:
GPA. The regression was run on a com- puter
package with the following output: Dependent variable is: Pct BF
Response: GPA R-squared 5 72.5% R-squared (adjusted) 5 72.2%
s 5 4.376 with 250 2 4 5 246 degrees of freedom
Coefficient Std Error t-ratio Prob . u t
u Sum of Mean
Constant 0.574968 0.253874 2.26 0.0249 Source Squares df Square F-ratio
SAT Verbal 0.001394 0.000519 2.69 0.0080 Regression 12418.7 3 4139.57 216
SAT Math 0.001978 0.000526 3.76 0.0002 Residual 4710.11 246 19.1468
Variable Coefficient SE(Coeff) t-ratio P-value A regression of %body fat on chest size gives the
Intercept 231.4830 11.54 22.73 0.0068 following equation:
Waist 2.31848 0.1820 12.7 ,0.0001 Dependent variable is: Pct BF
Height 20.224932 0.1583 21.42 0.1567 R-squared 5 49.1% R-squared (adjusted) 5 48.9%
Weight 20.100572 0.0310 23.25 0.0013 s 5 5.930 with 250 2 2 5 248 degrees of freedom
c) Interpret the slope for weight. How can the
Variable Coefficient SE(Coeff) t-ratio P-value
coefficient for weight in this model be
negative when its coeffi- cient was positive in Intercept 252.7122 4.654 211.3 ,0.0001
the simple regression model? Chest 0.712720 0.0461 15.5 ,0.0001
d) What does the P-value for height mean in this a) Is the slope of %body fat on chest size
regres- sion? (Perform the hypothesis test.) statistically dis- tinguishable from 0?
T 12. Breakfast cereals. We saw in Chapter 8 that (Perform a hypothesis test.)
the calorie content of a breakfast cereal is b) What does the answer in part a mean about
linearly associated with its sugar content. Is that the rela- tionship between %body fat and
the whole story? Here’s the out- put of a chest size?
regression model that regresses calories for We saw before that the slopes of both waist size
each serving on its protein(g), fat(g), fiber(g), and height are statistically significant when
carbohydrate(g), and sugars(g) content. entered into a multiple regression equation.
Dependent variable is: calories What happens if we add chest size to that
R-squared 5 84.5% R-squared (adjusted) 5 83.4% regression? Here is the output from a re-
s 5 7.947 with 77 2 6 5 71 degrees of freedom gression on all three variables:
Sum of Mean Dependent variable is: Pct BF
Source Squares df Square F-ratio R-squared 5 72.2% R-squared (adjusted) 5 71.9%
Regression 24367.5 5 4873.50 77.2 s 5 4.399 with 250 2 4 5 246 degrees of freedom
Residual 4484.45 71 63.1613
Sum of Mean
Variable Coefficient SE(Coeff) t-ratio P-value Source Squares df Square F-ratio P
Intercept 20.2454 5.984 3.38 0.0012 Regression 12368.9 3 4122.98 213 ,0.0001
Protein 5.69540 1.072 5.32 ,0.0001 Residual 4759.87 246 19.3491

Fat 8.35958 1.033 8.09 ,0.0001 Variable Coefficient SE(Coeff) t-ratio P-value
Fiber 21.02018 0.4835 22.11 0.0384 Intercept 2.07220 7.802 0.266 0.7908
Carbo 2.93570 0.2601 11.3 ,0.0001 Waist 2.19939 0.1675 13.1 ,0.0001
Sugars 3.31849 0.2501 13.3 ,0.0001 Height 20.561058 0.1094 25.13 ,0.0001
Assuming that the conditions for multiple
Chest 20.233531 0.0832 22.81 0.0054
regression are met,
a) What is the regression equation? c) Interpret the coefficient for chest.
b) Do you think this model would do a d) Would you consider removing any of the variables
reasonably good job at predicting calories? from this regression model? Why or why not?
Explain.
c) To check the conditions, what plots of the T 14. Grades. The table below shows the five scores
data might you want to examine? from an introductory Statistics course. Find a
model for predicting final exam score by trying
d) What does the coefficient of fat mean in this all possible models with two predictor variables.
model? Which model would you choose? Be sure to
T 13. Body fat again. Chest size might be a good check the conditions for multiple regression.
predictor of body fat. Here’s a scatterplot of
%body fat vs. chest size. Midterm Midterm Home
Name Final 1 2 Project work
Timothy F. 117 82 30 10.5 61
40
Karen E. 183 96 68 11.3 72
Verena Z. 124 57 82 11.3 69
30
Jonathan A. 177 89 92 10.5 84
%Body

Elizabeth L. 169 88 86 10.6 84


20
Patrick M. 164 93 81 10.0 71
Julia E. 134 90 83 11.3 79
10
Thomas A. 98 83 21 11.2 51
Marshall K. 136 59 62 9.1 58
0 Justin E. 183 89 57 10.7 79
87.5 100.0 112.5 125.0
Chest .) cntinued
(in o
T 15. Fifty states. Here is a data set on various measures of the


50 Name
United States. TheMidter Midter
murder rate is perProject
100,000,HomeHS graduation rate is in %, income is per capita income in
Final
dol- lars, illiteracy rate is m 1 1000,mand
per 2 life expectancy
work is in years. Find a regression model for life expectancy
with three pre-
Alexandra E. dictor
171 variables
83 by trying
86 all11.5
four of 78
the possible models.
Christopher 173 95 75 8.0 77 a) Which model appears to do the best?
B. b) Would you leave all three predictors in this model?
Justin C. 164 81 66 10.7 66 c) Does this model mean that by changing the
Miguel A. 150 86 63 8.0 74 levels of the predictors in this equation, we
Brian J. 153 81 86 9.2 76
could affect life expectancy in that state?
Explain.
Gregory J. 149 81 87 9.2 75
d) Be sure to check the conditions for multiple
Kristina G. 178 98 96 9.3 84 regres- sion. What do you conclude?
Timothy B. 75 50 27 10.0 20
Jason C. 159 91 83 10.6 71
HS Life
Whitney E. 157 87 89 10.5 85
Irena R. State name Murder grad Income Illiteracy
Alexis P. 165
158 93
90 81
91 9.3
11.3 82
68
Yvon T. exp
Nicholas 168 88 66 10.5 82
Sara M. T. 171 95 82 10.5 68
Amandeep S. 186
173 99
91 90
37 7.5
10.6 77
54 Alabama 15.1 41.3 3624 2.1 69.05
Annie P.
157 89 92 10.3 68 Alaska 11.3 66.7 6315 1.5 69.31
Benjamin
177 87 62 10.0 72 Arizona 7.8 58.1 4530 1.8 70.55
S. David
170 92 66 11.5 78 Arkansas 10.1 39.9 3378 1.9 70.66
W. Josef
78 62 43 9.1 56 California 10.3 62.6 5114 1.1 71.71
H.
191 93 87 11.2 80 Colorado 6.8 63.9 4884 0.7 72.06
Rebecca
169 95 93 9.1 87 Connecticut 3.1 56.0 5348 1.1 72.48
S. Joshua
D. Ian M. 170 93 65 9.5 66 Delaware 6.2 54.6 4809 0.9 70.06
Katharine 172 92 98 10.0 77 Florida 10.7 52.6 4815 1.3 70.66
A. Emily R. 168 91 95 10.7 83 Georgia 13.9 40.6 4091 2.0 68.54
Brian M. 179 92 80 11.5 82 Hawaii 6.2 61.9 4963 1.9 73.60
Shad M. 148 61 58 10.5 65 Idaho 5.3 59.5 4119 0.6 71.87
Michael R. 103 55 65 10.3 51 Illinois 10.3 52.6 5107 0.9 70.14
Israel M. 144 76 88 9.2 67 Indiana 7.1 52.9 4458 0.7 70.88
Iris J. 155 63 62 7.5 67 Iowa 2.3 59.0 4628 0.5 72.56
Mark 141 89 66 8.0 72 Kansas 4.5 59.9 4669 0.6 72.58
G. 138 91 42 11.5 66 Kentucky 10.6 38.5 3712 1.6 70.10
Peter 180 90 85 11.2 78 Louisiana 13.2 42.2 3545 2.8 68.76
H. 120 75 62 9.1 72 Maine 2.7 54.7 3694 0.7 70.39
Catherine 86 75 46 10.3 72 Maryland 8.5 52.3 5299 0.9 70.22
R.M. 151 91 65 9.3 77 Massachusetts 3.3 58.5 4755 1.1 71.83
Christina M. 149 84 70 8.0 70 Michigan 11.1 52.8 4751 0.9 70.63
Enrique J. 163 94 92 10.5 81 Minnesota 2.3 57.6 4675 0.6 72.96
Sarah K. 153 93 78 10.3 72 Mississippi 12.5 41.0 3098 2.4 68.09
Thomas J. 172 91 58 10.5 66 Missouri 9.3 48.8 4254 0.8 70.69
Sonya P. 165 91 61 10.5 79 Montana 5.0 59.2 4347 0.6 70.56
Michael B. 155 89 86 9.1 62 Nebraska 2.9 59.3 4508 0.6 72.60
Wesley M. 181 98 92 11.2 83 Nevada 11.5 65.2 5149 0.5 69.03
Mark R. 172 96 51 9.1 83 New Hampshire 3.3 57.6 4281 0.7 71.23
Adam J. 177 95 95 10.0 87 New Jersey 5.2 52.5 5237 1.1 70.93
Jared A. 189 98 89 7.5 77 New Mexico 9.7 55.2 3601 2.2 70.32
Michael 161 89 79 9.5 44 New York 10.9 52.7 4903 1.4 70.55
T. 146 93 89 10.7 73 North Carolina 11.1 38.5 3875 1.8 69.21
Kathryn 147 74 64 9.1 72 North Dakota 1.4 50.3 5087 0.8 72.78
D. Nicole 160 97 96 9.1 80 Ohio 7.4 53.2 4561 0.8 70.82
M. 159 94 90 10.6 88 Oklahoma 6.4 51.6 3983 1.1 71.42
Wayne E. 101 81 89 9.5 62 Oregon 4.2 60.0 4660 0.6 72.13
Elizabeth 154 94 85 10.5 76
S. John 183 92 90 9.5 86
R.
Valentin
A. David
T. O.
Marc I.
Samuel E.
Brooke S.
T 17. Burger King revisited. Recall the Burger King


HS Life menu data from Chapter 8. BK’s nutrition sheet
State Murder grad Income Illiteracy lists many variables. Here’s a multiple
name exp regression to predict calories for Burger King
foods from protein content (g), total fat (g),
Pennsylvani 6.1 50.2 4449 1.0 carbohydrate (g), and sodium (mg) per serving:
a Rhode
South Carolina 11.6 70.43
37.8 3635 2.3 67.96
South Dakota 1.7 53.3 4167 0.5 72.08 Dependent variable is: Calories
Tennessee 11.0 41.8 3821 1.7 70.11 R-squared 5 100.0% R-squared (adjusted) 5 100.0%
Texas 12.2 47.4 4188 2.2 70.90 s 5 3.140 with 31 2 5 5 26 degrees of freedom

Utah 4. 67.3 0.6 Sum of Mean


Vermont 5 Source Squares df Square F-ratio
Virginia 5. 4022 72.90 Regression 1419311 4 354828 35994
Washington 5 57.1 0.6 Residual 256.307 26 9.85796
West 9. Variable Coefficient SE(Coeff) t-ratio P-value
Virginia 5 3907 71.64
Wisconsin Intercept 6.53412 2.425 2.69 0.0122
4. 47.8 1.4
Wyoming 3 Protein 3.83855 0.0859 44.7 ,0.0001
Total fat 9.14121 0.0779 117 ,0.0001
Carbs 3.94033 0.0336 117 ,0.0001
T 16. Breakfast cereals again. We saw in Chapter 8 Na/S 20.69155 0.2970 22.33 0.0279
that the calorie count of a breakfast cereal is a) Do you think this model would do a good job
linearly associated with its sugar content. Can of pre- dicting calories for a new BK menu
we predict the calories of a serving from its item? Why or why not?
vitamins and mineral content? Here’s a multiple b) The mean of calories is 455.5 with a standard
regression model of calories per serving on its deviation of 217.5. Discuss what the value of
sodium (mg), potassium (mg), and sugars (g): s in the regression means about how well the
model fits the data.
Dependent variable is: Calories c) Does the R2 value of 100.0% mean that the
R-squared 5 38.9% R-squared (adjusted) 5 36.4% residuals are all actually equal to zero?
s 5 15.74 with 75 2 4 5 71 degrees of freedom

Sum of Mean
Source Squares df Square F-ratio P-value
Regression 11211.1 3 3737.05 15.1 ,0.0001
Residual 17583.5 71 247.655

Variable Coefficie SE(Coeff) t-ratio P-value


nt
Intercept 81.9436 5.456 15.0 ,0.0001
Sodium 0.05922 0.0218 2.72 0.0082
Potassium 20.01684 0.0260 20.648 0.5193
Sugars 2.44750 0.4164 5.88 ,0.0001
Assuming that the conditions for multiple
regression are met,
a) What is the regression equation?
b) Do you think this model would do a
reasonably good job at predicting calories?
Explain.
c) Would you consider removing any of these
predictor variables from the model? Why or
why not?
d) To check the conditions, what plots of the
data might you want to examine?
Finding the Optimal Classification Cut-Off: A Detailed Exploration
Classification tasks, common in machine learning and statistics, involve predicting discrete
labels for data points based on input features. The performance of a classification model is
typically evaluated by comparing its predicted labels with actual labels from the data. A
critical aspect of evaluating a classifier is determining the threshold or cut-off value for
classifying predictions. In binary classification, this cut-off determines the point at which a
probability score is converted into a final class label. The process of selecting an optimal cut-
off is crucial for achieving the desired balance between false positives and false negatives,
and, ultimately, the overall effectiveness of the classifier.
Understanding the Cut-Off Point
In binary classification, models generally output a probability score indicating the likelihood
of an instance belonging to a positive class (often labelled as 1). The cut-off (threshold) is the
value at which this probability score is mapped to a final class label. For example, if the
model predicts a probability score above a certain threshold (say, 0.5), the instance is
classified as positive (1); otherwise, it is classified as negative (0).
The default cut-off is typically set to 0.5, meaning that if the model’s predicted probability
for the positive class is greater than 50%, the instance is classified as positive, and otherwise,
it is classified as negative. However, this default threshold is not always optimal, and
adjusting the cut-off can have significant impacts on the model’s performance.
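As an illustration, here is a minimal Python sketch (using NumPy and a hypothetical array of predicted probabilities, such as the positive-class column returned by predict_proba() of a fitted scikit-learn classifier) of how a cut-off converts probability scores into class labels, and how lowering the cut-off flags more instances as positive.

import numpy as np

# Hypothetical predicted probabilities for the positive class.
probs = np.array([0.12, 0.47, 0.55, 0.81, 0.33, 0.92])

labels_default = (probs >= 0.5).astype(int)   # default cut-off of 0.5
labels_lower = (probs >= 0.3).astype(int)     # a lower cut-off classifies more cases as positive

print(labels_default)   # [0 0 1 1 0 1]
print(labels_lower)     # [0 1 1 1 1 1]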

Performance Metrics Involving Cut-Offs


To assess the performance of a classifier at various thresholds, a variety of performance
metrics can be considered, such as:
 Accuracy: The proportion of correct predictions (both true positives and true
negatives) out of the total number of predictions. While accuracy can provide a high-
level view of model performance, it may not be the best metric for imbalanced
datasets.
 Precision: The proportion of true positive predictions among all instances predicted
as positive. Precision is crucial when the cost of false positives is high.
 Recall (Sensitivity): The proportion of true positive predictions among all actual
positives. Recall is important when false negatives need to be minimized.
 F1 Score: The harmonic mean of precision and recall. The F1 score is a balanced
metric that accounts for both false positives and false negatives.
 ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve plots the
true positive rate (recall) against the false positive rate. The Area Under the Curve
(AUC) provides a single metric that evaluates the classifier's ability to distinguish
between the positive and negative classes. The ROC curve is threshold-agnostic but
can still provide valuable insight into how different cut-off values impact
performance.
 Precision-Recall Curve: Similar to the ROC curve, but it focuses on the precision
and recall trade-offs across different thresholds. This is particularly useful for
imbalanced classes.
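To make the list above concrete, the following sketch (assuming scikit-learn is available, with small hypothetical y_true and probs arrays) computes several of these metrics at a chosen cut-off; the ROC AUC is computed from the raw scores because it is threshold-agnostic.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # hypothetical actual labels
probs = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.55, 0.7, 0.9])    # hypothetical scores
y_pred = (probs >= 0.5).astype(int)                            # labels at a 0.5 cut-off

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, probs))              # uses the raw scores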

Methods for Determining the Optimal Cut-Off


There are several strategies for selecting the optimal classification cut-off, each with its
advantages and drawbacks. Below are some common approaches:

a. Maximizing the F1 Score


One approach is to adjust the threshold to maximize the F1 score, which balances precision
and recall. The F1 score is especially useful when there is an uneven class distribution or
when false positives and false negatives have different costs. By systematically testing
different cut-offs and selecting the one that maximizes the F1 score, you can achieve a
balance between precision and recall.
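One simple way to implement this is a brute-force sweep over candidate thresholds; the sketch below (hypothetical data, scikit-learn's f1_score) picks the cut-off with the highest F1.

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # hypothetical labels
probs = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.55, 0.7, 0.9])    # hypothetical scores

thresholds = np.linspace(0.05, 0.95, 19)
f1_at_t = [f1_score(y_true, (probs >= t).astype(int)) for t in thresholds]
best_cutoff = thresholds[int(np.argmax(f1_at_t))]

print("cut-off that maximizes F1:", round(float(best_cutoff), 2))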
b. Maximizing Accuracy
Sometimes, maximizing overall accuracy may be the goal, particularly when the class
distribution is relatively balanced. This involves testing various cut-off values and selecting
the threshold that results in the highest overall accuracy.
c. Maximizing the Youden's J Statistic
Youden's J statistic is a metric that combines sensitivity and specificity into a single number.
It is defined as:

J = Sensitivity+Specificity−1
Maximizing this statistic involves finding the threshold where the sum of sensitivity and
specificity is maximized, which balances the trade-off between false positives and false
negatives.
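One possible implementation uses scikit-learn's roc_curve, which returns the true positive rate and false positive rate at each candidate threshold, so Youden's J is simply TPR − FPR (hypothetical data below).

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # hypothetical labels
probs = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.55, 0.7, 0.9])    # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, probs)
youden_j = tpr - fpr                         # J = sensitivity + specificity - 1
best_cutoff = thresholds[int(np.argmax(youden_j))]

print("cut-off that maximizes Youden's J:", best_cutoff)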

d. Cost-Based Optimization
In real-world applications, the costs of false positives and false negatives are rarely equal. For
example, in medical diagnosis, a false negative may have far more serious consequences than
a false positive. To account for this, a cost-sensitive approach can be used, where the
threshold is chosen to minimize the expected cost of misclassifications. This can be done by
assigning different weights to false positives and false negatives, and adjusting the threshold
to minimize the weighted sum of these errors.
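A sketch of this idea follows, with hypothetical unit costs in which a false negative is assumed to be five times as costly as a false positive; the threshold with the lowest total cost is selected.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # hypothetical labels
probs = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.55, 0.7, 0.9])    # hypothetical scores
cost_fp, cost_fn = 1.0, 5.0                  # assumed costs: a miss is 5x a false alarm

def total_cost(threshold):
    y_pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return cost_fp * fp + cost_fn * fn

thresholds = np.linspace(0.05, 0.95, 19)
best_cutoff = min(thresholds, key=total_cost)
print("cost-minimizing cut-off:", round(float(best_cutoff), 2))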
e. ROC Curve and AUC
By analysing the ROC curve, you can visually identify the threshold that best balances true
positives and false positives. A common approach is to choose the threshold at which the sum
of sensitivity and specificity is maximized, or where the point on the ROC curve is closest to
the top-left corner, representing the ideal trade-off between true positives and false positives.

f. Precision-Recall Trade-Off
For imbalanced datasets, where one class is much more frequent than the other, precision-
recall curves often provide more meaningful insights than ROC curves. By adjusting the cut-
off to achieve the desired balance of precision and recall, one can optimize model
performance for rare events, such as fraud detection or disease diagnosis.
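The sketch below uses scikit-learn's precision_recall_curve on hypothetical data and applies one possible rule: pick the highest cut-off that still keeps recall at or above a target level.

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                    # hypothetical labels
probs = np.array([0.2, 0.4, 0.6, 0.8, 0.3, 0.55, 0.7, 0.9])    # hypothetical scores

precision, recall, thresholds = precision_recall_curve(y_true, probs)
# precision/recall have one more entry than thresholds; drop the final point.
keep = recall[:-1] >= 0.75                   # e.g. require recall of at least 75%
best_cutoff = thresholds[keep].max()         # highest cut-off meeting the target

print("highest cut-off with recall >= 0.75:", best_cutoff)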
Evaluating the Impact of Cut-Off Selection
The selection of the optimal classification cut-off must be carefully evaluated, as it directly
affects the model’s performance, and the costs associated with misclassifications. A poor
choice of cut-off can lead to overfitting or underfitting, depending on whether the threshold is
too stringent or too lenient.
In practice, the following factors should be considered when evaluating the impact of cut-off
selection:
 Class Distribution: If the dataset is imbalanced, focusing on the minority class with a
higher recall may be more important than optimizing for accuracy.
 Application Requirements: Different applications may have different priorities. For
example, in fraud detection, catching as many fraudulent transactions as possible
(high recall) may be more important than minimizing false positives.
 Evaluation Metrics: Choose evaluation metrics that align with the problem's
objectives. The importance of precision, recall, or F1 score should be considered in
the context of the problem's real-world costs.
 Cross-Validation: Always test the cut-off using cross-validation to ensure that the
chosen threshold generalizes well to unseen data.
Gain and Lift Charts

The gain chart and the lift chart are two measures of the benefit of using a model, applied in business contexts such as target marketing. They are not restricted to marketing analysis; they can also be used in other domains such as risk modelling and supply chain analytics. In other words, gain and lift charts are two approaches often used when solving classification problems with imbalanced data sets.

Example: In target marketing campaigns, customer response rates are usually very low (in many cases, fewer than 1% of contacted customers respond). The organization incurs a cost for each customer contact and therefore wants to minimize the cost of the campaign while still achieving the desired response level from customers.

Applied to a logistic regression model, the gain and lift charts help an organization understand the benefit of using that model, so that targeting can be carried out more effectively and efficiently.

The gain and lift charts are obtained using the following steps (a worked Python sketch follows the list):
1. Predict the probability P(Y = 1) (positive) using the logistic regression model and arrange the observations in decreasing order of predicted probability.
2. Divide the data set into deciles. Calculate the number of positives (Y = 1) in each decile and the cumulative number of positives up to each decile.
3. Gain is the ratio of the cumulative number of positive observations up to a decile to the total number of positive observations in the data. The gain chart plots the gain on the vertical axis against the decile on the horizontal axis.
4. Lift is the ratio of the number of positive observations up to decile i using the model to the expected number of positives up to decile i under a random model. The lift chart plots the lift on the vertical axis against the corresponding decile on the horizontal axis.
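As a rough illustration of these steps (not a prescribed implementation), the following pandas sketch builds the decile-level cumulative gains and lift for a hypothetical scored data set; the scores and outcomes are simulated so that responses are more likely at higher predicted probabilities.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
scores = rng.random(2000)                      # hypothetical P(Y = 1) from a model
actual = rng.binomial(1, 0.3 * scores)         # hypothetical outcomes, likelier at high scores
df = pd.DataFrame({"prob": scores, "actual": actual})

# Steps 1-2: sort by predicted probability and split into 10 deciles.
df = df.sort_values("prob", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(df.index, 10, labels=range(1, 11))

cum_pos = df.groupby("decile", observed=True)["actual"].sum().cumsum()
total_pos = df["actual"].sum()

# Step 3: gain; Step 4: lift relative to a random model (decile i captures i/10 of positives).
gain = cum_pos / total_pos
lift = gain / (np.arange(1, 11) / 10)
print(pd.DataFrame({"cum_positives": cum_pos, "gain": gain, "lift": lift}))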
Gain chart calculation: the ratio of the cumulative number of positive responses up to a decile to the total number of positive responses in the data.
[Figure: gain chart]

Lift chart calculation: the ratio of the number of positive responses up to decile i using the model to the expected number of positives up to that decile i under a random model.
[Figure: lift chart]
 Cumulative gains and lift charts are visual aids for measuring model performance.
 Both charts consist of Lift Curve (In Lift Chart) / Gain Chart (In Gain Chart) and
Baseline (Blue Line for Lift, Orange Line for Gain).
 The Greater the area between the Lift / Gain and Baseline, the Better the model.
L1 and L2 Regularization in Regression:
Regularization techniques are essential tools in machine learning and regression analysis,
aiming to prevent overfitting and improve model generalization. Two of the most widely
used regularization methods are L1 regularization (also known as Lasso) and L2
regularization (also known as Ridge). These techniques involve adding a penalty term to the
regression model’s cost function, encouraging simpler models by penalizing the size of the
model's coefficients.
Understanding Regularization in Regression
In regression analysis, the objective is to fit a model that best predicts the target variable. A
simple linear regression model can be expressed as:
Y = β0 + β1x1 + β2x2 + ... + βpxp + ϵ
Where:
 y is the dependent variable (target),
 x1, x2, ..., xp are the independent variables (features),
 β0, β1, β2, ..., βp are the coefficients (β0 is the intercept),
 ϵ is the error term.
In ordinary least squares (OLS) regression, the model minimizes the sum of squared errors
(SSE) or residual sum of squares:
SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

However, this approach may lead to overfitting, where the model becomes too complex and
captures noise in the data, leading to poor generalization on unseen data. Regularization aims
to mitigate this issue by introducing a penalty term that discourages overly complex models.

L1 Regularization (Lasso)
L1 regularization introduces a penalty term proportional to the sum of the absolute values of
the model coefficients.
The effect of the L1 penalty is that it drives some coefficients to exactly zero. This is because
the L1 penalty causes the optimization process to shrink some coefficients entirely,
effectively excluding them from the model. This characteristic makes L1 regularization
useful for feature selection, as it automatically selects a subset of relevant features by
eliminating irrelevant ones.
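A minimal sketch with scikit-learn's Lasso on synthetic data (make_regression with only a few informative features, and an assumed penalty strength alpha=1.0) illustrates this sparsity: many coefficients are driven exactly to zero.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which actually drive the response.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0)       # alpha controls the strength of the L1 penalty
lasso.fit(X, y)

print("coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
print("coefficients kept in the model  :", int(np.sum(lasso.coef_ != 0)))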
Key Features of L1 Regularization:
 Sparsity: The L1 penalty leads to sparse models, where many coefficients become
zero. This is beneficial when dealing with high-dimensional datasets with many
irrelevant features.
 Feature Selection: Because L1 regularization can set coefficients exactly to zero, it
can be interpreted as performing automatic feature selection.
 Interpretability: The sparsity introduced by L1 regularization can make models
easier to interpret, especially when the number of features is large.

Disadvantages of L1 Regularization:
 Instability: Lasso (L1 regularization) can be unstable if the number of observations is
much smaller than the number of features, leading to high variance in the model
coefficients.
 Non-differentiability: The L1 penalty function is not differentiable at zero, which
can complicate the optimization process, though this issue is usually addressed by
using optimization algorithms such as coordinate descent.

L2 Regularization (Ridge)
L2 regularization, on the other hand, penalizes the sum of the squared values of the model
coefficients.
The L2 penalty encourages the model to keep the coefficients small, but unlike L1
regularization, it does not set coefficients to zero. Instead, it shrinks the coefficients towards
zero but keeps them non-zero, which typically results in a model where all features are
retained but their influence is reduced.
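For comparison, a minimal sketch with scikit-learn's Ridge on the same kind of synthetic data (assumed penalty strength alpha=10.0) shows shrinkage of the coefficients without setting any of them exactly to zero.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # alpha controls the strength of the L2 penalty

print("largest |coefficient|, OLS  :", round(float(np.max(np.abs(ols.coef_))), 2))
print("largest |coefficient|, Ridge:", round(float(np.max(np.abs(ridge.coef_))), 2))
print("Ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))   # typically 0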
Key Features of L2 Regularization:
 Shrinkage: L2 regularization reduces the magnitude of coefficients without
eliminating them entirely. This leads to a more stable model, especially when the
features are highly correlated.
 Stability: L2 regularization is less likely to lead to extreme values in the model
coefficients, which contributes to its stability, particularly in the presence of
multicollinearity (correlation between features).
 No Feature Selection: Unlike L1 regularization, L2 regularization does not set
coefficients to zero, meaning that all features are retained in the model, albeit with
reduced influence.
Disadvantages of L2 Regularization:
 No Sparsity: L2 regularization does not perform feature selection, so it may not be as
effective in high-dimensional settings where some features are irrelevant.
 Limited Interpretability: Since all features are retained, L2 regularization can lead
to more complex models that are harder to interpret compared to L1 regularized
models.
Choosing Between L1 and L2 Regularization
The choice between L1 and L2 regularization depends on the specific goals and
characteristics of the dataset:
 When to Use L1 Regularization (Lasso):
o When performing feature selection is important, especially in high-
dimensional datasets.
o When the dataset contains many irrelevant features, and the goal is to identify
the most influential variables.
o In situations where sparsity in the model (i.e., setting many coefficients to
zero) is desired.
 When to Use L2 Regularization (Ridge):
o When the dataset contains multicollinearity, or when there is a high
correlation among predictors. L2 regularization is better at handling correlated
features than L1.
o When stability is more important than feature selection, as L2 regularization shrinks
coefficients but does not eliminate them.
o When the number of features is relatively smaller than the number of observations, as
Ridge can handle a larger number of predictors without introducing instability.
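In practice the choice can also be guided empirically. The sketch below (synthetic data, scikit-learn's LassoCV and RidgeCV with cross-validation to pick the penalty strength) compares the two penalties by cross-validated R²; which one performs better depends entirely on the data at hand, as discussed above.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=15.0, random_state=1)

models = {
    "Lasso (L1)": LassoCV(cv=5, random_state=1),
    "Ridge (L2)": RidgeCV(alphas=np.logspace(-3, 3, 25)),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")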