
BUSINESS

ANALYTICS
COMPENDIUM

2024

Created By
P. Madhav Charan
Tanmay Malhotra
Sarthak Singh
Prep Team
TABLE OF CONTENTS

1 BUSINESS ANALYSIS
What is Business Analysis?
Scope and Roles & Responsibilities of a Business Analyst
Essential Skills for Business Analysts
The Business Analysis Process
Business Analysis Techniques and Tools (by Phase)
Business Analysis Lifecycle Models
Identifying Opportunities for Process Improvement
Requirements Gathering and Management
Real-World Applications of Business Analysis
Marketing
Case Study 1: Spotify Uses Data Analytics to Dominate Music Streaming
Case Study 2: Target Discovers Teen Pregnancy with Analytics
Case Study 3: How Netflix Used Business Analytics to Understand Their Users
Sales
Case Study 1: Amazon Recommends Products Based on Analytics
Case Study 2: Netflix Optimizes Content Delivery with Business Analytics
Operations
Case Study 1: Walmart Optimizes Inventory Management with Analytics: A Deeper Dive
Case Study 2: Amazon Fine-Tunes Delivery Operations with Machine Learning: Efficiency at Scale
Finance
Case Study 1: JPMorgan Chase Uses Analytics for Fraud Detection
Case Study 2: Wells Fargo Leverages Business Analysis to Improve Loan Default Prediction
Human Resources (HR)
Case Study 1: Google Analyzes Employee Data to Reduce Turnover
Case Study 2: Walmart Optimizes Scheduling with Workforce Analytics
Business Analysis Tools and Technologies
Building a Business Analytics Mindset
Career Path for Business Analysts
The Future of Business Analysis

2 DATA ANALYSIS
What is Data Analytics?
Data Analysis Project Lifecycle
Data Collection
Types of Data
Data Collection Methods
Data Cleaning and Wrangling
Measures of Central Tendency
AUC Curve

3 DATA VISUALIZATION
Introduction
Memory in Data Visualization
What is Data Visualization?
Why Visualize Data?
Data Visualization Process
Choosing the Right Chart
Data Visualization Principles
"The Grammar of Graphics" by Leland Wilkinson
Preattentive Attributes
Time to Insight
Data-Ink Ratio (Tufte: "The larger the share of a graphic's ink devoted to data, the better")
Data Visualization Glossary: Key Terms

4 ONLINE SOURCES AND IMPORTANT INTERVIEW QUESTIONS
Online Sources for Further Exploration
Important Interview Questions
PART 1

BUSINESS
ANALYSIS
What is Business Analysis?
Business analysis is a discipline focused on understanding an organization's business
needs and recommending solutions to improve efficiency, effectiveness, and overall
success. It involves a systematic approach to analyzing problems, identifying
opportunities, and designing solutions that bridge the gap between business needs and
technology.
Scope and Roles & Responsibilities of a Business Analyst
Business analysts are the bridge between the business world and the technical world.
While roles and responsibilities may vary depending on industry, company size, and
project methodology, at heart, they are pretty similar. Here is a breakdown of the core
functions of this position and what value a BA brings to the table:

Understanding the Business:

Needs Assessment: The BA identifies and analyzes business needs, challenges, and goals through means such as interviews, workshops, and data analysis. They ask the "why" questions to get to the root of problems and find solutions that address actual needs.

Process Mapping: Proper documentation and analysis of the existing business process is essential. BAs create visual representations of workflows that highlight inefficiencies and opportunities for process improvement, streamlining operations by eliminating unnecessary steps.

Stakeholder Management: Successful projects require collaboration. BAs work with stakeholders across departments (e.g., sales, marketing, IT) to understand their needs and concerns and to ensure everyone is aligned with the project goals.

Requirements Gathering and Management:

Elicitation: BAs skillfully elicit requirements from stakeholders using various techniques. This might involve conducting interviews, surveys, or facilitating workshops to capture user stories that detail desired functionalities.


Documentation: Clear and concise documentation is essential. BAs document both
functional and non-functional requirements, ensuring they are complete,
consistent, and traceable. This avoids confusion and ensures everyone is on the
same page.

Prioritization and Validation: Not all requirements are equal. BAs work with stakeholders to prioritize requirements based on importance, feasibility, and alignment with business objectives. They also validate that requirements are achievable and genuinely meet business needs.

Solution Design and Development:

Feasibility Analysis: BAs don't just identify problems; they also help find solutions.
They evaluate potential solutions considering factors like cost, time, technical
feasibility, and potential risks. This ensures chosen solutions are practical and
deliver value.

Solution Design: Translating requirements into actionable plans is key. BAs collaborate with multiple teams to design solutions that meet the documented needs. This may involve creating user interface (UI) mockups, system flowcharts, or detailed specifications.

Communication: The ability to communicate effectively is paramount. BAs act as interpreters, translating complex technical concepts into clear, understandable language for non-technical stakeholders. They also explain business needs and goals to technical teams, ensuring everyone has a shared understanding.

Implementation and Validation:

Working with Development Teams: BAs collaborate closely with developers and IT
professionals throughout the development process. They ensure the solution
aligns with the design and requirements, answer questions, and provide
clarifications.



Testing and Validation: BAs are involved in testing activities to guarantee the
solution meets user needs and delivers the expected results. They work with the
testing team to create test plans, identify and report bugs, and ensure the final
product functions as intended.

Key Responsibilities of a Business Analyst:

Requirements Gathering and Analysis: To understand the data requirements, it is essential to comprehend the needs of various stakeholders. A business analyst's responsibility is to ensure these requirements are clear, complete, and achievable.

Data Analysis and Interpretation: Data is king in today's world. The BA's responsibility is to make sense of the data by uncovering trends, patterns, and insights for informed decision-making.
Process Mapping and Improvement: Business Analysts (BAs) go beyond merely
identifying problems; they actively seek and implement solutions. By meticulously
mapping out processes, they can streamline operations, remove bottlenecks, and
significantly boost overall efficiency.

Facilitating Communication and Collaboration: Acting as bridges between business and IT to ensure shared understanding of goals and requirements.

Solution Design and Evaluation: Business Analysts (BAs) collaborate closely with
teams to craft solutions that align with business objectives. They carefully evaluate
various options, ensuring that the selected solutions are practical, cost-effective,
and deliver substantial value.

Change Management: Implementing changes is often challenging. BAs play a crucial role in developing comprehensive change management plans, effectively communicating with stakeholders, and addressing any concerns to facilitate a seamless transition.

Quality Assurance and Testing: BAs are involved in the quality assurance process.
They collaborate with the testing team to identify and rectify issues before the final
solution is implemented.
Continuous Monitoring and Feedback: A BA's job doesn't end with implementation. They monitor system performance, gather user feedback, and identify opportunities for improvement.

Essential Skills for Business Analysts

Data Analysis: A strong foundation in data analysis is crucial. You'll need to utilize tools like spreadsheets and SQL, along with more advanced data analysis software such as Python and R (a minimal pandas sketch follows this list).

Basic Understanding of Programming: While extensive coding skills may not always
be required, a foundational understanding of programming logic can be beneficial.
This allows you to better understand technical limitations and collaborate
effectively with developers.
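
As a quick illustration of the kind of scripted analysis these tools enable, here is a minimal, hypothetical pandas sketch; the data and column names are invented for illustration:

import pandas as pd

# A small illustrative dataset; names and values are assumptions.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [1200, 950, 1100, 1025],
})

# Quick descriptive statistics, the spreadsheet staple.
print(df["revenue"].describe())

# Average revenue per region, akin to a pivot table.
print(df.groupby("region")["revenue"].mean())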

The Business Analysis Process

The following are the various phases, techniques, tools and lifecycle models of Business
Process Analysis (BPA):

Business Analysis Techniques and Tools (by Phase):



Techniques for Process Mapping and Analysis:

Flowcharts: These visual tools outline the steps involved in a process, clearly
marking decision points and the flow of information.

Swimlane Diagrams: These diagrams highlight the roles and responsibilities of various participants within a process, ensuring clarity and accountability.


Data Flow Diagrams (DFDs): These diagrams track the flow of data through a
system, identifying data sources, processing steps, and destinations, providing a
clear overview of data handling.

Value Stream Mapping: This technique focuses on identifying and eliminating activities that do not add value to the process, streamlining operations for greater efficiency.


Gap Analysis: This method compares the current state of a process with its desired
future state, pinpointing areas that need improvement to achieve optimal
performance.

Business Analysis Lifecycle Models:

Waterfall Model: A sequential approach where each phase is completed before moving on to the next (e.g., planning, requirements gathering, design, development, testing, deployment).

Agile Model: An iterative and incremental approach where requirements are delivered in short cycles (sprints) with continuous feedback and adaptation.


Identifying Opportunities for Process Improvement

Identifying Bottlenecks: Analyze steps that cause delays or slow down the process.
Redundancy: Look for steps that can be eliminated or combined for efficiency.
Lack of Automation: Explore opportunities to automate repetitive tasks.
Ineffective Communication: Improve communication flow between process
participants.
Data Accuracy: Ensure data used in the process is accurate and reliable.

Requirements Gathering and Management

Eliciting Requirements from Stakeholders:

This is where you, as the BA, act as a bridge between stakeholders with varying needs
and the development team. Here are some effective techniques:

Interviews: One-on-one sessions are ideal for in-depth exploration of individual stakeholder perspectives. Prepare a mix of open-ended and closed-ended questions to gather specific details and uncover underlying concerns. Active listening is crucial here – pay attention to both verbal and nonverbal cues.

Workshops: Facilitate group discussions to brainstorm ideas, discuss functionalities, and prioritize requirements. Utilize techniques like whiteboarding or user story mapping to encourage participation and ensure everyone has a voice.

Use Cases: Develop scenarios outlining user interactions with the project
deliverable. This helps visualize user journeys, identify specific functionalities, and
ensure the solution addresses actual needs.

Identifying Different Stakeholder Perspectives:

End Users: Focus on their needs and pain points. What tasks should the project
solve for them? How will it improve their work experience?



Management: Understand business goals, budget constraints, and return on
investment (ROI) expectations.

Development Team: Consider their technical expertise and limitations. What can be
realistically developed within the project timeframe?

Documenting and Managing Requirements Effectively:

Clear and Concise Requirements: Each requirement should be a single, atomic statement written in plain language. Avoid ambiguity and ensure everyone interprets it the same way.

Traceability: Link requirements to specific project deliverables for easy reference. This helps track how each requirement translates into the final product.

Requirements Management Tools: Utilize software like Jira, Asana, or specialized requirements management platforms. These tools help organize, track changes, and maintain version control of requirements throughout the project lifecycle.

Prioritizing and Validating Requirements:

Prioritization Techniques: Techniques like MoSCoW (Must-Have, Should-Have, Could-Have, Won't-Have) help prioritize requirements based on importance and urgency (see the sketch after this list).

Feasibility Analysis: Evaluate the feasibility of each requirement considering technical limitations, budget constraints, and project timelines. Not everything can be done, so identify what's essential and what can be revisited later.

Verification and Validation: Ensure requirements are:

Complete: All necessary details are captured.
Consistent: No contradictions exist between different requirements.
Achievable: Realistic within project constraints.
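
To make MoSCoW concrete, here is a minimal, hypothetical Python sketch that buckets requirements by category; the requirement texts and labels are invented for illustration:

from collections import defaultdict

# Hypothetical (requirement, MoSCoW category) pairs.
requirements = [
    ("User login with SSO", "Must"),
    ("Export reports to PDF", "Should"),
    ("Dark mode", "Could"),
    ("Legacy API support", "Won't"),
]

# Group requirements by category so Must-Haves are reviewed first.
buckets = defaultdict(list)
for req, category in requirements:
    buckets[category].append(req)

for category in ["Must", "Should", "Could", "Won't"]:
    print(f"{category}-Have: {buckets[category]}")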



Feasibility Analysis for Proposed Solutions:

Once you have identified potential information system solutions, the next step is to
assess their viability. Feasibility analysis involves evaluating a proposed solution based
on three key factors:

Cost Feasibility: This involves estimating the costs associated with developing,
implementing, and maintaining the system. Does the organization have the budget
to support this solution?

Time Feasibility: Can the system be developed and implemented within the required
timeframe? This analysis considers the complexity of the system and the available
resources.

Technical Feasibility: Does the organization have the necessary technology infrastructure and expertise to implement and maintain the chosen solution? This includes hardware, software, and technical skills.

A thorough feasibility analysis helps avoid pursuing unrealistic solutions and ensures
the chosen system aligns with the organization's practical limitations.

Real-World Applications of Business Analysis

Here are some examples of how business analytics has benefited different core
business functions:

Marketing
Sales
Operations
Finance
Human Resources (HR)

Note: It is highly recommended that students conduct their own research to understand how these large corporations use analytics to improve their businesses.



Marketing
Case Study 1: Spotify Uses Data Analytics to Dominate Music Streaming
Spotify, the world's leading music streaming platform, thrives on its ability to leverage
vast amounts of user data for superior marketing.

Challenge:

Stay relevant in a crowded music streaming market.

Target individual users with music they'll love to increase engagement and retention.

Solution:

Data Collection: Spotify gathers a wealth of data on user behavior, including listening habits, playlists created, skipped songs, and social sharing.

Customer Segmentation: This data allows them to segment users into distinct
groups based on genre preferences, listening times, and device usage.

Personalized Recommendations: Spotify's powerful recommendation engine analyzes user data to suggest new music, playlists, and podcasts. These are much more likely to resonate with each user than generic recommendations.

Real-time Marketing: By understanding user behaviour in real time, Spotify can tailor marketing campaigns. For example, if a new album drops by an artist a user frequently listens to, they might receive a push notification or an email about it.

Results:

Increased User Engagement: Personalized recommendations lead to users discovering new music they enjoy, keeping them engaged for longer.

Improved Retention Rates: By understanding user preferences and keeping them happy with relevant content, Spotify reduces churn and keeps users subscribed.


Targeted Advertising: Data allows Spotify to offer targeted advertising space to music
labels and artists. These ads are more likely to resonate with listeners, generating
valuable revenue for Spotify.

Key Takeaways:

The power of user data: When leveraged strategically, user data can provide a deep
understanding of customer preferences and behavior.

Personalization is key: By tailoring the user experience through recommendations and marketing, businesses can create a more engaging experience.

Data-driven decisions: Business analytics empowers data-driven decision making, leading to more effective marketing strategies.

Monetization through insights: Understanding user behavior allows for targeted advertising, generating additional revenue streams.

Case Study 2: Target Discovers Teen Pregnancy with Analytics

This example might surprise you, but it demonstrates the unexpected yet impactful results
business analytics can bring.

Challenge:

Develop targeted marketing campaigns for baby products.

Solution:

Target worked with a data analytics firm to analyze purchasing data. This included
factors like browsing history, past purchases, and demographics.



Unexpected Discovery:

The analysis revealed a peculiar pattern: specific groceries and household items
were being purchased together more frequently by individuals who hadn't
previously purchased baby products.

Deeper Dive:

Target delved deeper to understand this connection. They discovered these seemingly unrelated items were often precursors to teen pregnancy.

Ethical Dilemma:

Target faced a major ethical dilemma. Should they leverage this knowledge for
targeted marketing to pregnant teens?

Solution:

Target chose not to exploit this information for marketing purposes. However, the
insights did prove valuable.

Results:

Target used the analytics to develop a generic marketing campaign promoting healthy lifestyle choices for young women. This campaign resonated with their target audience and avoided ethical concerns.

Key Takeaways:

Business analytics can uncover hidden patterns and connections that traditional
marketing methods might miss.

Businesses must weigh the ethical implications of leveraging such insights, especially when dealing with sensitive demographics.



Case Study 3: How Netflix Used Business Analytics to Understand Their Users

Netflix, the streaming giant, wasn't always the leader it is today. In its early days, it relied on traditional methods like surveys and focus groups to understand customer preferences. While these provided some insights, they lacked the depth and granularity needed to truly thrive.

The Challenge:

Difficulty predicting customer preferences and content demand accurately.

High churn rate (subscribers canceling their subscriptions).

Limited understanding of how content recommendations influenced viewing habits.

The Solution:

Netflix embraced business analytics and built a data-driven culture. Here's how they did
it:

Data Gathering: Netflix collects massive amounts of data on user behavior, including:

What shows and movies are watched
Viewing times and completion rates
Pausing, rewinding, and fast-forwarding behavior
Search queries within the platform

Data Analysis: Advanced analytics tools analyze this data to identify patterns and
trends. Here are some examples of what they uncover:

Micro-genres: Highly specific subcategories within genres (e.g., "British historical dramas set in the 18th century").
Binge-watching patterns: How long viewers typically watch before taking a break.


Content recommendation effectiveness: Whether recommendations are leading
viewers to shows they enjoy.

Actionable Insights: The data provides valuable insights that fuel strategic decisions:

Content acquisition: They prioritize acquiring shows and movies that cater to
specific micro-genres with high engagement.

Personalization: The recommendation engine tailors content suggestions to each user's unique viewing habits.

Original content creation: Data helps identify themes and storylines likely to
resonate with their audience, guiding their original content production.

The Results:

Netflix boasts over 220 million subscribers worldwide.

Churn rates are significantly lower than the industry average.

Original titles like "Stranger Things" and "Squid Game" have become global phenomena.

Key Takeaways:

Netflix's success story exemplifies the power of business analytics in sales and marketing.
By leveraging data, they were able to:

Gain a deeper understanding of their customers.
Predict content demand more accurately.
Personalize the user experience.
Create highly engaging original content.



Sales

Case Study 1: Amazon Recommends Products Based on Analytics

Challenge: In a highly competitive e-commerce landscape, Amazon sought to:

Increase customer engagement by offering relevant product suggestions.

Boost conversion rates by driving users towards products they're likely to purchase.

Solution: Data-Driven Product Recommendations

Amazon leverages sophisticated recommendation algorithms powered by vast amounts of


user data.

This data includes:

Past purchase history: Analyzing what a customer has bought previously helps
predict their future interests.

Browsing behaviour: Tracking what products users view and for how long
provides insights into their current interests.

Demographic data: While age, location, and similar data can be less personal, it
can still suggest broader product categories relevant to user segments.

Results: Increased Sales and Customer Satisfaction

By recommending products tailored to individual preferences, Amazon achieves:

Increased sales: Customers are more likely to buy products recommended to them based on their past behavior and interests.


Higher conversion rates: By suggesting relevant products, users are more likely to complete a purchase.

Improved customer satisfaction: A personalized shopping experience leads to a more satisfying customer journey.

Key Takeaways:

The Power of Data Analytics: By analyzing user data, businesses can gain deep
insights into customer behaviour, enabling them to personalize product
recommendations effectively.

The Importance of Personalization: Customizing product suggestions to align with individual preferences is essential for enhancing sales and conversion rates in the e-commerce sector.

Data-Driven Decision Making: Utilizing data analytics allows businesses to make well-
informed decisions regarding product placement and marketing strategies, ultimately
driving better outcomes.

Case Study 2: Netflix Optimizes Content Delivery with Business Analytics

Challenge: Netflix needed to ensure a smooth streaming experience for all users,
especially during peak hours. This was critical to avoid user frustration and potential
churn.

Solution: They implemented a data-driven approach to content delivery. This involved:

Analyzing User Data: Netflix gathers data on user viewing habits, including location,
time of day, and preferred device.

Predicting Demand: Based on historical data and current trends, they predict when
and where demand for specific content will be high.



Content Prepositioning: Popular content is strategically cached on servers closest to
users most likely to watch it during peak times. This reduces latency and buffering
issues.

Network Optimization: Data analysis helps identify network bottlenecks and allows
for targeted infrastructure improvements.

Results: Improved Streaming Experience, Increased Sales, and Customer Satisfaction

Reduced Buffering: Prepositioning content close to users minimizes the distance data
needs to travel, resulting in faster loading times and less buffering.

Scalability: By anticipating demand, Netflix can efficiently manage traffic spikes and
prevent service disruptions.

Increased Sales and Customer Retention: A smooth and reliable streaming experience
reduces frustration, keeps users subscribed, and even encourages them to watch
more content, ultimately leading to increased sales.

Key Takeaways:

Proactive Data Analysis: By analyzing user data, businesses can take proactive steps
to ensure a seamless user experience.

Data-Driven Decision Making: Business analytics enable informed decisions regarding content delivery infrastructure and resource allocation.

Enhanced Customer Satisfaction: Delivering a reliable and high-quality streaming experience boosts customer satisfaction and loyalty, ultimately driving increased sales.

Operations

Case Study 1: Walmart Optimizes Inventory Management with Analytics: A Deeper Dive

By analyzing historical sales data, Walmart can forecast demand for seasonal items, for example anticipating the surge in beach gear ahead of summer. Walmart can use this insight to:

Order the right amount of inventory: They won't be caught understocked when the
beach season hits, frustrating customers.

Optimize storage space: They can allocate space based on predicted demand,
avoiding unnecessary storage costs for off-season items.

The Results: A Win-Win for Walmart and Customers

By leveraging data analytics, Walmart achieved significant improvements in their inventory management system:

Minimized Stockouts: By predicting demand more precisely, they ensured shelves are
stocked with the right products at the right time.

Reduced Overstocking Costs: They optimized inventory levels, freeing up capital and
storage space for other uses.

Improved Overall Supply Chain Efficiency: Data-driven insights allowed them to streamline ordering, transportation, and warehouse operations.

The Ripple Effect:

The benefits extended beyond Walmart's bottom line. Customers enjoyed a smoother
shopping experience with fewer stockouts. Additionally, suppliers benefited from clearer
demand forecasts, allowing them to optimize their own production and deliveries. This
case study highlights how business analytics can transform traditional business practices.
By embracing data-driven decision making, companies like Walmart can achieve greater
efficiency, improve customer satisfaction, and gain a competitive edge.

Case Study 2: Amazon Fine-Tunes Delivery Operations with Machine Learning: Efficiency at Scale


Amazon, the e-commerce giant, handles a massive volume of deliveries daily. Optimizing
their delivery network is crucial for maintaining efficiency and customer satisfaction.
Here's how they leverage business analytics, specifically machine learning, to achieve
this:

Challenge: Streamlining Deliveries in a Dynamic Landscape

Complex Routing: With millions of packages and geographically dispersed warehouses, finding the most efficient delivery routes becomes a logistical nightmare.

Unforeseen Delays: Traffic congestion, weather events, and other factors can disrupt
delivery schedules.

Last-Mile Delivery Challenges: The final leg of the delivery process, getting packages
to customers' doorsteps, presents unique logistical hurdles.

These challenges can lead to late deliveries, frustrated customers, and increased
operational costs.

Solution: Machine Learning to the Rescue

Amazon utilizes machine learning algorithms to analyze vast amounts of data related to
deliveries, including:

Historical Delivery Data: Past delivery times, routes taken, and encountered obstacles
are used to identify patterns and predict future delivery times.

Real-Time Traffic Information: Traffic congestion data is integrated to dynamically adjust delivery routes for optimal efficiency.

Weather Forecasts: Anticipated weather conditions are factored in to avoid delays caused by rain, snow, or other disruptions.


Here's how this translates into real-world applications:

Route Optimization: Algorithms suggest the most efficient delivery routes for drivers,
considering factors like traffic patterns, package sizes, and delivery locations.

Predictive Delivery Times: Machine learning helps predict accurate delivery windows,
setting realistic expectations for customers.

Dynamic Route Adjustments: In case of unforeseen delays, the system can reroute
deliveries in real-time to minimize disruptions.

Results: Faster Deliveries, Happier Customers

By leveraging machine learning in their operations, Amazon has achieved significant improvements:

Reduced Delivery Times: Optimized routes and predictive models lead to faster
deliveries, exceeding customer expectations.

Lower Operational Costs: Efficient route planning minimizes fuel consumption and
driver time, leading to cost savings.

Enhanced Customer Satisfaction: Predictable delivery times and fewer delays improve
customer experience and loyalty.

Beyond Delivery:

Machine learning finds applications beyond just delivery optimization at Amazon. It's used
in areas like:

Demand Forecasting: Predicting customer demand for products to optimize inventory management.

Fraud Detection: Identifying and preventing fraudulent transactions on the platform.

Product Recommendations: Recommending products to customers based on their past purchases and browsing behavior.
Finance

Case Study 1: JPMorgan Chase Uses Analytics for Fraud Detection

Challenge: JPMorgan Chase, like many financial institutions, faces the constant threat of
fraudulent transactions. Their primary objective was to:

Prevent financial losses caused by fraudulent activity.

Protect customer accounts and maintain trust.

Solution: Implementing Analytics for Fraud Detection

JPMorgan Chase utilizes data analytics to analyze vast amounts of customer transaction data.

This analysis focuses on identifying anomalies in spending habits that deviate from a customer's typical spending patterns.

These anomalies could be suspicious transactions based on factors such as location, amount, or type of purchase (a simple illustration of this idea follows below).
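
As a toy illustration of the anomaly idea described above (not JPMorgan Chase's actual system), here is a minimal Python sketch that flags transactions far from a customer's typical amount using a z-score rule; the history and threshold are assumptions:

import statistics

# Hypothetical recent transaction amounts for one customer.
history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.9]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_suspicious(amount, threshold=3.0):
    """Flag a transaction whose z-score exceeds the threshold."""
    z = abs(amount - mean) / stdev
    return z > threshold

print(is_suspicious(51.0))   # False: close to typical spending
print(is_suspicious(900.0))  # True: far outside the usual range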

Result: Real-time Fraud Detection

By employing real-time analytics, JPMorgan Chase can identify suspicious patterns as they occur.

This allows for immediate action, such as blocking transactions or contacting customers for verification.

As a result, they are able to prevent significant financial losses and protect customer accounts from fraudulent activity.

Key Takeaways:

The Power of Data Analytics: Analyzing customer data allows for proactive
identification of fraudulent behavior.



Real-time Monitoring is Crucial: Monitoring transactions in real time enables immediate intervention against fraud attempts.

Enhanced Customer Protection: Utilizing data analytics strengthens customer account security and builds trust.

Case Study 2: Wells Fargo Leverages Business Analysis to Improve Loan Default
Prediction

Challenge: Wells Fargo, a major financial institution, faced challenges in accurately predicting loan defaults. This resulted in:

Increased financial risk due to bad loans.

Inefficient loan approval processes impacting customer experience.

Solution: Business Analysis Approach

Data Gathering and Analysis: Wells Fargo's business analysts collaborated with data
scientists to gather and analyze a comprehensive dataset. This included customer
demographics, financial history, loan details, and historical default data.

Identifying Key Factors: By analyzing the data, they identified key factors that significantly influence loan default rates. These could include credit score, debt-to-income ratio, employment history, and loan purpose.

Developing a Predictive Model: Using the identified factors, business analysts collaborated with data scientists to develop a robust predictive model for loan defaults.

Process Improvement: The model was then integrated into the loan approval process,
enabling a more data-driven approach.

Results: Improved Loan Default Prediction and Risk Management



Reduced Loan Defaults: The new predictive model allowed Wells Fargo to more
accurately assess loan risk, leading to a significant decrease in loan defaults.

Streamlined Loan Approval Process: By focusing on the most relevant factors, the
process became more efficient, improving customer experience.

Data-Driven Decision Making: Business analysis facilitated a shift towards data-driven decision making, leading to better risk management strategies.

Key Takeaways:

Collaboration is Key: Effective business analysis involves collaboration between business analysts, data scientists, and other stakeholders.

Data-Driven Approach: Utilizing data analysis can significantly improve financial decision-making within the banking sector.

Improved Risk Management: Business analysis can contribute to more robust risk
management practices, safeguarding financial institutions and their customers.

Human Resources (HR)

Case Study 1: Google Analyzes Employee Data to Reduce Turnover

Challenge: A high employee turnover rate can be detrimental to any organization. Google, known for its innovative culture, sought to:

Understand the root causes of employee departures.

Identify areas for improvement to retain top talent.

Solution: Data-Driven Approach to Employee Retention

Google's HR department leveraged people analytics to examine a comprehensive dataset. This included:


Employee data: Demographics, job titles, tenure, etc.
Performance reviews: Feedback from managers provided valuable insights into
employee performance and potential areas of dissatisfaction.
Company culture surveys: Employee engagement surveys provided data on
overall satisfaction with company culture, work-life balance, and leadership.

Results: Targeted Retention Strategies and Reduced Turnover

By analyzing the data, Google was able to identify patterns and trends associated with
employee dissatisfaction. This led to:

Targeted retention programs: They developed programs to address specific factors contributing to employee departures, such as career development opportunities or mentorship initiatives.

Improved work environment: Areas for improvement in company culture, workload, or work-life balance were addressed based on the data.

Decreased turnover: By proactively addressing employee concerns, Google reduced its employee churn rate.

Key Takeaways:

People Analytics are Powerful: Analyzing employee data provides valuable insights
into employee sentiment and factors impacting retention.

Proactive Retention Strategies: By understanding employee needs, businesses can develop targeted programs to improve employee satisfaction and reduce turnover.

Data-Driven HR Decisions: Utilizing data empowers HR departments to make informed decisions for a more positive and engaging work environment.



Case Study 2: Walmart Optimizes Scheduling with Workforce Analytics

Challenge: Walmart, the world's largest retailer, faced challenges with inefficient
scheduling that led to:

Employee dissatisfaction: Inconsistent schedules and long hours could lead to fatigue
and low morale.

Increased costs: Overstaffing during peak hours and understaffing during off-peak
hours impacted profitability.

Solution: Workforce Analytics for Optimized Scheduling

Data Collection: Walmart implemented a system to collect data on customer traffic patterns, sales trends, and employee availability.

Predictive Analytics: By analyzing historical data and sales forecasts, they predicted
customer traffic and staffing needs for various times and departments.

Optimized Scheduling: Using the insights from analytics, they created data-driven
schedules that:

Matched staffing levels to predicted customer demand.
Took employee preferences and availability into account.
Ensured a balanced workload to reduce employee burnout.

Results: Improved Employee Satisfaction, Reduced Costs, and Increased Efficiency

Reduced Employee Scheduling Issues: Data-driven schedules minimized schedule conflicts and ensured fairer workload distribution.

Increased Employee Satisfaction: Predictable schedules and reduced workload improved employee morale and engagement.


Key Takeaways:

Workforce Analytics Drive Efficiency: Analyzing data empowers businesses to optimize employee schedules for better efficiency and cost-effectiveness.

Employee Satisfaction Matters: Data-driven scheduling can address employee concerns and contribute to a more satisfied and engaged workforce.

Improved Operational Performance: Effective workforce management leads to optimized operations and a more positive customer experience.

Business Analysis Tools and Technologies

A list of commonly used software categories and specific tools within each, along with a
brief explanation of their purpose:

Process Modeling and Diagramming:

Microsoft Visio: The go-to tool for creating professional flowcharts, process maps,
and UML diagrams. BAs use it to visualize workflows, identify bottlenecks, and
communicate processes clearly to stakeholders.



Lucidchart: A cloud-based alternative to Visio, offering collaborative features and
real-time editing. This is ideal for distributed teams working on process models
together.

Draw.io: A free, open-source option with a user-friendly interface and various templates. It's a good choice for quick process visualizations or for budget-conscious projects.

Requirements Management:

Jira: A popular agile project management tool with robust features for user story
management, requirements traceability, and issue tracking. BAs utilize Jira to capture,
track, and prioritize requirements throughout the development lifecycle.



Jama Connect: A dedicated requirements management solution offering advanced
features like version control, baselines, and integrations with development tools. This
is ideal for complex projects with intricate requirement sets.

Microsoft Word: While not a dedicated tool, Word can be used effectively, especially
for smaller projects. BAs can document requirements in a structured format,
facilitating clear communication with stakeholders.



Data Analysis and Manipulation:

Microsoft Excel: A versatile tool for data manipulation and analysis. BAs use Power Query, pivot tables, and charts to clean data, identify trends, and extract insights.

SQL: Essential for interacting with relational databases. BAs use SQL to retrieve and
integrate data for further analysis.

Python: A powerful programming language with libraries like pandas and NumPy for advanced data manipulation, analysis, and visualization (see the sketch after this list).



R: A statistical programming language for deep data analysis. BAs use R to uncover
patterns and build data-driven models for better business decisions.
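
As a minimal sketch of how these tools combine in practice, assuming an illustrative in-memory SQLite database rather than any specific company's systems, a BA might pull data with SQL and summarize it with pandas:

import sqlite3
import pandas as pd

# Build a tiny illustrative in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 40.0)],
)

# Retrieve the data with SQL, then analyze it with pandas.
orders = pd.read_sql("SELECT customer, amount FROM orders", conn)
print(orders.groupby("customer")["amount"].sum())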

Cloud-Based Collaboration Tools:

Microsoft Teams: A versatile platform for chat, video conferencing, file sharing, and
task management. Ideal for BAs to collaborate with team members and stakeholders,
particularly in remote settings

Slack: A real-time messaging tool that supports file sharing and project-specific
channels. BAs use Slack for quick updates, brainstorming, and maintaining project
communication.



Confluence: An Atlassian knowledge management platform for creating wikis, sharing
documents, and centralizing project information. BAs leverage Confluence to store
meeting notes, requirements, and other project artifacts.

Data Visualization Platforms:

Microsoft Power BI: A business intelligence (BI) tool for creating interactive
dashboards and reports. BAs can leverage Power BI to transform data into visually
appealing insights for stakeholders, enabling better data-driven decision-making.

Tableau: Another leading data visualization platform known for its ease of use and
rich visual capabilities. BAs can create clear and compelling dashboards, charts, and
maps to communicate complex data insights to both technical and non-technical
audiences.

Google Data Studio (now Looker Studio): A free data visualization tool from Google, offering integrations
with various Google products and a user-friendly interface. BAs can use Data Studio
to create basic to complex data visualizations, depending on project needs.



Additional BA Tools:

Mind Mapping Tools (e.g., MindMeister, XMind): For brainstorming ideas, capturing
requirements, and visually organizing information.

User Interface (UI) Prototyping Tools (e.g., Figma, InVision): To create mockups and
prototypes of user interfaces, helping stakeholders visualize potential solutions.

Project Management Tools (e.g., Asana, Trello): For managing tasks, setting
deadlines, and tracking project progress, especially for smaller projects or for
personal task organization.

Building a Business Analytics Mindset

The heart of a Business Analyst (BA) lies not just in the technical tools, but in fostering a
unique way of thinking. A strong Business Analytics Mindset equips you to be an
insightful translator, transforming raw data into actionable strategies.

1. Asking the Right Questions

Challenge Assumptions: Don't be a passive data receiver. Question existing processes, data sources, and stakeholder perspectives. Dig deeper with "why" to uncover root causes of problems and ensure solutions target the core issues.

Think Strategically: Move beyond the tactical. Consider how data insights can inform
long-term business goals and strategic decision-making. Ask questions that bridge
the gap between data and business objectives, like "How can customer data
analytics improve our competitive advantage?"

Identify the "So What?": Don't get lost in the data jungle. Every analysis should lead
to a clear conclusion or actionable recommendation. Ask yourself "So what does this
data tell us? How can we leverage these insights to make a positive impact?"



Embrace Curiosity: Be an inquisitive BA. Constantly seek to understand the "why"
behind the data. Curiosity fuels deeper analysis and unearths hidden patterns that
might otherwise be overlooked.

Quantify Whenever Possible: Not everything can be a number, but strive to quantify
aspects whenever possible. This strengthens the foundation of your analysis and
adds objectivity to your recommendations.

2. Communicating Data-Driven Insights Effectively: From Analyst to Storyteller

Know Your Audience: Tailor your communication style and level of technical detail
to resonate with your audience. Speak in clear, concise language for non-technical
stakeholders, and provide more technical details when presenting to data-savvy
audiences.

Focus on Storytelling: Data visualizations and compelling narratives can breathe life
into insights. Use charts, graphs, and real-world examples to make your message
impactful and memorable. Think of yourself as a translator, transforming complex
data sets into a story that everyone can understand.

Actionable Recommendations: Don't just present findings; translate them into actionable steps. Recommend specific courses of action based on your data analysis, making it easy for stakeholders to understand how to leverage the insights.

Practice Active Listening: Communication is a two-way street. Actively listen to stakeholder concerns and feedback. This fosters collaboration and ensures your recommendations address their practical needs.

Embrace Visual Communication: A picture is worth a thousand words. Utilize data visualization tools to create clear and informative charts, graphs, and dashboards that effectively communicate complex data insights.



3. Embracing Ethical Considerations: The Responsible BA

Data Privacy: Be aware of data privacy regulations (e.g., GDPR, CCPA) and ensure
data collection and analysis comply with all relevant laws and ethical codes. Respect
user privacy and prioritize data security.

Data Bias: Data can reflect real-world biases. Be mindful of potential
biases in data sets and how they might influence your analysis. Present findings with
transparency and acknowledge any limitations.

Transparency and Fairness: Be transparent about data sources, methodologies, and limitations. Ensure your recommendations are fair and don't unfairly disadvantage any group.

Consider the Long-Term Impact: Think beyond immediate solutions. Consider the
long-term implications of your recommendations. How might they impact
stakeholders, business processes, and even society as a whole?

Stay Updated on Ethical Issues: The data landscape is constantly evolving. Stay
updated on emerging ethical considerations in data collection, analysis, and AI to
ensure your practices remain responsible.

Career Path for Business Analysts


Different Prominent BA Roles:

IT Business Analyst

An IT Business Analyst focuses on aligning information technology solutions with the strategic goals of the organization. They work closely with IT teams to ensure that technology projects meet business requirements. This includes analyzing system capabilities, designing IT solutions, and facilitating communication between technical and non-technical stakeholders.

Commercial Business Analyst

Commercial Business Analysts concentrate on the business aspects of an organization, often involved in market analysis, pricing strategies, and revenue optimization. They play a crucial role in helping businesses make informed decisions to maximize profitability and market competitiveness.

Process Business Analyst

Process Business Analysts specialize in optimizing business processes. They analyze current workflows, identify inefficiencies, and recommend improvements to enhance operational efficiency. Process Business Analysts are instrumental in driving organizational change by streamlining processes and ensuring they align with business objectives.

Business Analyst as a Proxy Product Owner

In agile development methodologies, Business Analysts often take on the role of a Proxy Product Owner. In this capacity, they act as a liaison between business stakeholders and development teams, representing the interests of end-users. They play a vital role in defining user stories, prioritizing features, and ensuring that the final product meets the needs of the business and its customers.



Go-To-Market Business Analyst

A Go-To-Market (GTM) Business Analyst specializes in launching new products or services into the market. They conduct market research, analyze consumer trends, and develop strategies for introducing products successfully. GTM Business Analysts collaborate with marketing, sales, and product development teams to create effective launch plans and maximize market penetration.

The Future of Business Analysis

Artificial Intelligence (AI) and Machine Learning (ML): AI and ML are transforming
numerous industries. BAs will need to understand how these technologies can be
leveraged to automate tasks, generate insights from data, and support better decision-
making. This might involve working with data scientists to identify opportunities for AI
integration or ensuring ethical considerations are addressed in AI-powered solutions.

Big Data and Data Analytics: The ever-increasing volume of data presents both
challenges and opportunities. BAs will need to be familiar with data analysis techniques
and tools to extract meaningful insights from data sets. This might involve learning to
work with tools like SQL for data querying or Python for data manipulation.

Agile and DevOps Methodologies: Agile and DevOps approaches are gaining traction
across organizations. BAs will need to adapt to these faster-paced development cycles,
focusing on iterative requirements gathering, continuous feedback loops, and
collaboration with development teams.

Cloud Computing: The shift towards cloud-based solutions requires BAs to understand
cloud platforms and their capabilities. They might be involved in evaluating cloud-based
solutions, ensuring data security in the cloud, and adapting existing business processes
for cloud environments.



Cybersecurity: With the growing threat of cyberattacks, BAs must be aware
of cybersecurity risks and incorporate security considerations throughout the
development lifecycle. This might involve understanding data privacy
regulations or working with security teams to identify and mitigate potential
vulnerabilities.



PART 2

DATA
ANALYSIS
What is Data Analytics?

Data analytics is the process of examining raw data to uncover patterns, draw
conclusions, and support decision-making. It involves various techniques and
tools to transform, organize, and model data in meaningful ways.

It's the science of examining raw data with the goal of making informed
conclusions about the information it contains.

From a business perspective, data analytics is the art and science of uncovering hidden gems within a company's data to make data-driven decisions that improve performance. It's like having a crystal ball that helps you understand your customers, operations, and market better.

Why is it Important?

Unveiling Hidden Insights: We generate massive amounts of data every day. Data analytics acts like a translator, transforming this data into understandable insights. Businesses can use these insights to identify trends, customer preferences, and hidden patterns that would be difficult or impossible to see with the naked eye.

Informed Decision-Making: Data analytics empowers businesses to make data-driven decisions based on factual evidence. This leads to more effective strategies, reduced risks, and overall better decision-making across all levels of an organization.


Optimizing Performance: Data analytics helps businesses identify areas for
improvement. By analyzing operational data, companies can streamline
processes, reduce costs, and optimize resource allocation. This leads to
increased efficiency and overall better performance.

Personalization and Customer Experience: In today's competitive landscape, customer experience is king. Data analytics lets businesses personalize their offerings and marketing strategies to each customer. This can be anything from recommending products based on past purchases to tailoring content on a website.

Predictive Analytics: Data analytics isn't just about looking at the past; it's
about predicting the future. Businesses can use advanced analytics
techniques to forecast future trends, customer behavior, and market
demands. This allows them to be proactive and make strategic decisions
that position them for success.

Innovation and Competitive Advantage: By uncovering hidden insights, businesses can identify new opportunities, develop new products and services, and stay ahead of the competition. In today's data-driven world, companies that can effectively leverage data analytics gain a significant competitive edge.


Real-world examples of the application of Data Analytics

Netflix uses data analytics to recommend content to users based on their viewing history and preferences. By analyzing vast amounts of data on user behavior, Netflix creates personalized viewing experiences that keep subscribers engaged. Moreover, data analytics informs content production decisions, helping Netflix invest in shows and movies that are likely to be successful.

Siemens uses data analytics to optimize energy consumption in smart grids. By analyzing data from sensors and smart meters, they can predict energy demand, reduce waste, and ensure a stable supply, leading to more efficient energy use and cost savings for consumers.

Uber relies on data analytics to match riders with drivers efficiently. By analyzing real-time data on traffic conditions, rider demand, and driver availability, Uber optimizes routes and reduces wait times, improving the overall user experience. Furthermore, data analytics helps in dynamic pricing (surge pricing) to balance supply and demand during peak times.

Amazon uses data analytics to personalize recommendations for customers. By analyzing browsing and purchase history, Amazon’s recommendation engine suggests products that customers are likely to buy, increasing sales and customer satisfaction. Additionally, data analytics helps manage inventory by predicting demand for products, ensuring optimal stock levels and reducing excess inventory costs.

Data Analysis Project Lifecycle

Let's walk through the data analysis project lifecycle with an example project:

“A retail company wants to improve its customer retention by predicting which customers are likely to churn.”

1. Problem Definition
Objective: Identify customers who are likely to stop purchasing from the company and understand the factors influencing customer churn.
Activities: Meet with stakeholders to define "churn," gather requirements, and set the project goal to reduce churn rate by 10% over the next six months.

2. Data Collection
Objective: Gather data relevant to customer churn.
Activities: Collect data from various sources,
such as transaction records, customer service interactions,
loyalty program data, and demographic information.
This data could be stored in the company’s CRM system, databases, and
external data sources.

3. Data Preparation
Objective: Prepare the collected data for analysis.
Activities:
Data Cleaning: Remove duplicates, correct errors, handle missing values
(e.g., filling in missing values with averages or medians).
Data Transformation: Normalize data (e.g., scale numerical features),
encode categorical variables (e.g., one-hot encoding for categorical data
like "membership level").
Feature Engineering: Create new features such as "average purchase
value," "days since last purchase," and "number of customer service
calls."

4. Data Exploration and Analysis
Objective: Explore the data to uncover patterns, trends, and relationships.
Activities:
Use statistical techniques to describe data distributions and central
tendencies.
Create visualizations (e.g., histograms, scatter plots, box plots) to
identify trends, correlations, and anomalies.
Formulate hypotheses, such as "Customers with lower average purchase
values are more likely to churn."

5. Modeling and Algorithm Development


Objective: Develop a model to predict customer churn.
Activities:
Select appropriate algorithms (e.g., logistic regression, decision trees,
random forest, or gradient boosting).
Split the data into training and test sets.
Train models on the training set and tune hyperparameters using cross-
validation.
Compare model performance using metrics such as accuracy, precision,
recall, and the area under the ROC curve (AUC-ROC).
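
A minimal sketch of this step in Python, assuming a pandas DataFrame df with engineered features and a binary churn column (both hypothetical names):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X = df.drop(columns=['churn'])  # engineered features
y = df['churn']                 # 1 = churned, 0 = retained
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]  # predicted churn probability
print('AUC-ROC:', roc_auc_score(y_test, probs))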

6. Evaluation and Interpretation
Objective: Evaluate the model and interpret the results.
Activities:
Assess the model’s performance on the test set.
Choose the best-performing model based on evaluation metrics.
Interpret the model to understand which features are most influential in
predicting churn (e.g., feature importance scores in a random forest model).

7. Deployment and Implementation


Objective: Deploy the predictive model to production.
Activities:
Integrate the model into the company's CRM system.
Set up automated processes for real-time churn prediction (e.g., flagging at-
risk customers).
Ensure data pipelines are in place for continuous data updates and model
predictions.

8. Monitoring and Maintenance


Objective: Ensure the deployed model continues to perform well.
Activities:
Monitor the model’s performance over time to detect any degradation.
Retrain the model periodically with new data to maintain accuracy.
Adjust the model and features as necessary to adapt to changing customer
behaviors and market conditions.

9. Reporting and Communication
Objective: Communicate the findings and actionable
insights to stakeholders.
Activities:
Create comprehensive reports and dashboards showing key metrics and
predictions.
Use visualizations to highlight the model’s predictions and areas for
intervention (e.g., customers at high risk of churn).
Present findings to management and suggest strategies to improve
customer retention (e.g., targeted marketing campaigns, personalized
offers).

10. Review and Feedback


Objective: Review the project outcomes and
gather feedback for future improvements.
Activities:
Conduct a project post-mortem to evaluate what worked well and what
didn’t.
Gather feedback from stakeholders on the usefulness and accuracy of the
churn predictions.
Document lessons learned and best practices for future data analysis
projects.

Data Collection
Types of Data
Data can be classified in several ways based on its characteristics, format, and
source. Here are the main types of data:

1. By Nature:
Qualitative Data: Descriptive data that cannot be measured numerically. It is
often used to categorize or classify objects. Examples include colors,
names, labels, and opinions.
Quantitative Data: Numerical data that can be measured and quantified. It is
used to describe quantities and includes both discrete and continuous data.
Examples include age, height, weight, and temperature.

2. By Format:
Structured Data: Organized in a predefined manner, often in rows and
columns (e.g., databases, spreadsheets). It is easily searchable and
analyzable. Examples include SQL databases and Excel files.
Unstructured Data: Not organized in a predefined structure, making it more
challenging to analyze. Examples include text documents, emails, videos,
social media posts, and images.
Semi-structured Data: Does not fit into a rigid structure like structured
data but contains tags or markers to separate data elements. Examples
include JSON, XML, and HTML files.
3. By Source:
Primary Data: Collected directly from the source or original data that has
not been altered or manipulated. Examples include survey responses,
experimental results, and sensor readings.
Secondary Data: Collected from existing sources that have been previously
gathered, processed, and published by others. Examples include research
papers, reports, and datasets from government agencies.
4. By Measurement Scale:
Nominal Data: Categorical data without a specific order. Examples include
gender, race, and types of cuisine.
Ordinal Data: Categorical data with a specific order but no fixed interval
between categories. Examples include rankings (e.g., first, second, third)
and satisfaction levels (e.g., satisfied, neutral, dissatisfied).
Interval Data: Numerical data with ordered categories and a fixed interval
between values but no true zero point. Examples include temperature in
Celsius and calendar dates.
Ratio Data: Numerical data with ordered categories, a fixed interval, and a
true zero point. Examples include height, weight, age, and income.

5. By Temporal Characteristics:
Cross-sectional Data: Collected at a single point in time, representing a
snapshot. Examples include census data collected on a specific date.
Time Series Data: Collected over different time periods, showing how data
points change over time. Examples include stock prices, monthly sales
figures, and daily temperatures.
Longitudinal Data: Similar to time series data but often involves repeated observations of the same subjects over time. Examples include panel studies and cohort studies.

6. By Sensitivity:
Public Data: Openly available and not sensitive. Examples include open
government data and public datasets.
Private Data: Sensitive and restricted data requiring authorization for
access. Examples include personal identifiable information (PII), financial
records, and medical records.

Data Collection method

Collecting data for a data analytics project involves various methods, each
suited to different types of data and analysis objectives. Here are some
common ways to collect data for such projects:

1. Surveys and Questionnaires:

Online Surveys: Use platforms like SurveyMonkey, Google Forms, or Typeform to collect data from respondents over the internet.
Paper Surveys: Distribute physical forms for data collection in locations
without internet access.
Mobile Surveys: Use mobile apps to collect data from users on the go.

2. Interviews:

Structured Interviews: Conduct interviews with a fixed set of questions to ensure consistency.
Semi-structured Interviews: Use a mix of fixed and open-ended questions
for more detailed insights.
Unstructured Interviews: Conduct open-ended, conversational interviews to
explore topics deeply.

3. Observations:

Direct Observation: Observe subjects in their natural environment without interference.
Participant Observation: Engage with subjects as a participant while
observing their behaviors and interactions.
Remote Observation: Use cameras or other devices to observe subjects
without being physically present.

4. Experiments:

Controlled Experiments: Conduct experiments in a controlled setting to test specific hypotheses.
Field Experiments: Test hypotheses in real-world settings to see how
variables interact in natural conditions.

5. Existing Data Sources:

Administrative Data: Use data collected by organizations for operational purposes, such as sales records, HR data, and financial reports.
Public Databases: Access publicly available data from government
agencies, research institutions, or organizations.
Historical Records: Analyze archival data, historical documents, or past
research reports.

6. Digital Data Collection:

Web Scraping: Use automated tools to extract data from websites.


Social Media Mining: Collect data from social media platforms to analyze
trends, sentiments, and user behavior.
IoT Devices and Sensors: Gather data from connected devices like smart
home systems, wearables, and environmental sensors.

7. Transactional Data:

Point-of-Sale Systems: Collect data from retail transactions, including product details, purchase amounts, and customer information.
Online Transactions: Gather data from e-commerce activities, such as
purchase history, clickstream data, and user interactions.

8. Crowdsourcing:

Crowdsourced Data Collection: Leverage platforms like Amazon Mechanical Turk or Zooniverse to gather data from a large group of participants.

9. Focus Groups:

Group Discussions: Facilitate discussions with a selected group of participants to collect qualitative data on opinions, attitudes, and perceptions.

10. Geospatial Data Collection:

GIS and GPS: Use Geographic Information Systems (GIS) and Global Positioning
Systems (GPS) for collecting location-based data.
Satellite Imagery: Utilize remote sensing data from satellites for large-scale
environmental and geographical analysis.

11. APIs (Application Programming Interfaces):

Third-party APIs: Access data from external services and platforms through
their APIs, such as social media analytics, financial data, or weather
information.

Examples of Data Collection Methods in Action:
Online Surveys: A retail company uses an online survey to gather customer
feedback on their latest product launch.
Web Scraping: A real estate analyst uses web scraping tools to collect data on
property prices and trends from various listing websites.
IoT Sensors: A smart agriculture project collects soil moisture and weather data
using IoT sensors to optimize irrigation schedules.
Transactional Data: An e-commerce platform analyzes transaction data to
understand purchasing patterns and customer preferences.
Social Media Mining: A marketing team mines social media data to gauge public
sentiment about their brand.

Data Cleaning and Wrangling

Data cleaning and wrangling techniques are language-specific. Here we will take examples in Python (primarily using the pandas library).

1. Identifying and handling missing data

df.info() helps in identifying columns that have null values.

df.isnull().sum() helps in identifying the count of null values in each column.

With the help of seaborn heatmaps we can visually identify the density of null
values in each column.

Matplotlib pyplot can be used to draw a bar plot which helps in identifying the
number of null values in each column.
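
A minimal sketch of both visualizations, assuming df is an existing pandas DataFrame:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)  # null cells show up as bright bands
plt.show()

df.isnull().sum().plot(kind='bar')    # count of null values per column
plt.ylabel('Number of null values')
plt.show()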

There are two ways in which we can handle null values:


1. Removing Missing Values:

a. Dropping Rows
df_cleaned = df.dropna()

b. Dropping Columns
df_cleaned = df.dropna(axis=1)

2. Imputing missing values:

a. Mean/Median/Mode Imputation
df['column'].fillna(df['column'].mean(), inplace=True)
df['column'].fillna(df['column'].median(), inplace=True)
df['column'].fillna(df['column'].mode()[0], inplace=True)

b. Forward/Backward Fill
df.fillna(method='ffill', inplace=True) # Forward fill
df.fillna(method='bfill', inplace=True) # Backward fill

c. Interpolation
df['column'].interpolate(method='linear', inplace=True)

d. K-Nearest Neighbor(KNN)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)

2. Identify Outliers

Visual Methods
1. Box plots
2. Scatter plots (for outlier detection in 2D data)
3. Histograms

Statistical Methods

1. Z-Score: It is the distance of an observation from the mean in terms of the number of standard deviations. Using the z-score we can determine whether an observation is an outlier or not.

from scipy import stats
import numpy as np

z_scores = np.abs(stats.zscore(df['column']))
outliers = df[z_scores > 3]  # Flag values with a z-score greater than 3

2. Interquartile Range (IQR): The Interquartile Range is a measure of statistical dispersion, i.e., how spread out the values in a dataset are. Observations more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3) are commonly flagged as outliers.

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)]

3. Handling Outliers:

1. Drop outliers:
df_cleaned = df[z_scores <= 3]  # Keep only rows within 3 standard deviations
df_cleaned = df[~((df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR)))]  # Remove data points outside the interquartile fences

2. Replace with Mean/Median:

median = df['column'].median()
df['column'] = np.where((df['column'] < (Q1 - 1.5 * IQR)) | (df['column']
> (Q3 + 1.5 * IQR)), median, df['column']) #Replacing data points outside
interquartile range with median value

3. Interpolation to remove outliers:

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df['column'] = df['column'].apply(lambda x: np.nan if (x < lower) or (x > upper) else x)  # Replace points outside the interquartile fences with null values
df['column'].interpolate(method='linear', inplace=True)
4. Data Transformation Techniques:

1. Normalization:
To scale the data to a specific range, typically [0, 1]. This is particularly useful when features have different ranges, helping to ensure that they contribute equally to the analysis. Normalization is suitable for algorithms that compute distances between data points, such as KNN, K-means, SVM, and neural networks.

2. Standardization:
To transform data so that it has a mean of 0 and a standard deviation of 1. Standardization is useful when data needs to be normally distributed and when features are measured on different scales. It is commonly used in machine learning algorithms that assume or benefit from normally distributed data, such as Linear Regression, Logistic Regression, and Principal Component Analysis (PCA).
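
A minimal sketch of both transformations with scikit-learn, assuming df contains only numeric columns:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

df_normalized = MinMaxScaler().fit_transform(df)      # scales each column to [0, 1]
df_standardized = StandardScaler().fit_transform(df)  # mean 0, standard deviation 1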

3. Encoding Categorical Variables:
To convert categorical data into numerical format, enabling it to be used in machine learning models. For example, if we are predicting car sales and color is one of the attributes, the machine learning model won't be able to understand text labels for colors. To make the model recognize and utilize the color attribute, we need to encode this categorical data into a numerical format.
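
A minimal sketch of one-hot encoding with pandas, using a hypothetical color column:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})
df_encoded = pd.get_dummies(df, columns=['color'])  # one 0/1 column per color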

4. Binning:
To convert continuous data into categorical bins, which can simplify models and improve interpretability. Binning helps models focus on general patterns in the data and also helps in handling outliers.
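
A minimal sketch of binning with pandas, assuming a hypothetical age column:

df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
                         labels=['child', 'young adult', 'adult', 'senior'])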

5. Log Transformation:
Log transformation can help in handling skewed data by compressing the range of values and making data more normally distributed. For example, income data often has a long right tail, with a few individuals earning significantly more than the majority. This skewed distribution can distort statistical analyses and model performance.

6. Box-Cox Transformation:
The Box-Cox transformation is a statistical technique used to stabilize variance and make the data more normally distributed. This is particularly useful in data analysis and modeling because many statistical methods and machine learning algorithms assume that the data follows a normal distribution and has constant variance.
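
A minimal sketch of both transformations, assuming a hypothetical non-negative income column:

import numpy as np
from scipy import stats

df['log_income'] = np.log1p(df['income'])  # log(1 + x) safely handles zeros
transformed, fitted_lambda = stats.boxcox(df['income'] + 1)  # Box-Cox needs strictly positive values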

Exploratory Data Analysis:

Data Distributions

Measures of Central Tendency:

These measures provide an estimate of the center of the distribution.

1. Mean: The average of all data points.
2. Median: The middle value when the data points are ordered.
3. Mode: The most frequently occurring value in the data set.

Measures of Dispersion:

1. Range: The difference between the maximum and minimum value.
2. Variance: The average of the squared distances from the mean.
3. Standard Deviation: The square root of the variance.
4. Interquartile Range (IQR): The difference between the 75th and 25th percentiles.
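
A minimal sketch computing the measures above with pandas, assuming a numeric column:

mean = df['column'].mean()
median = df['column'].median()
mode = df['column'].mode()[0]
value_range = df['column'].max() - df['column'].min()
variance = df['column'].var()
std_dev = df['column'].std()
iqr = df['column'].quantile(0.75) - df['column'].quantile(0.25)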

Shape of the Distribution:

1. Skewness: A measure of the asymmetry of the distribution.

2. Kurtosis: A measure of the “tailedness” of the distribution. A high kurtosis means that rare (extreme) events are more likely.

Visualization:
Visual tools can provide a clear picture of the data distribution.

1. Histograms
2. Box plots
3. Density plots

4. Probability Plots:
Compare the quantiles of the data to the quantiles of a standard distribution (e.g., the normal distribution). The closer the points are to the reference line, the more the distribution resembles a normal distribution.
Refer to https://www.youtube.com/watch?app=desktop&v=okjYjClSjOg for a better understanding.

Analysis

Descriptive analysis

Descriptive analytics is a branch of business analytics focused on understanding and interpreting historical data to identify patterns, trends, and insights. It involves summarizing and describing past events to gain a clear picture of what has happened within a business or system.

The primary goal of descriptive analytics is to answer the question, "What happened?" By providing a detailed and accurate account of past performance, businesses can use these insights to make informed decisions, identify areas for improvement, and set the stage for more advanced analytics like predictive and prescriptive analytics.

For example: retail store sales analysis covering seasonal trends (identification of peak sales periods, such as the holiday season), product trends (noting which products are most popular and which are underperforming), and customer insights (understanding the customer demographics that contribute most to sales).

Diagnostic analysis

Diagnostic analysis is a type of data analysis that goes beyond describing historical data (as done in descriptive analytics) to investigate the reasons behind past performance and events. It aims to answer the question, "Why did this happen?" by identifying causes and relationships within the data. Diagnostic analysis involves a deeper dive into the data to uncover the root causes of observed trends, anomalies, and patterns.

For example:
Observation: An online retail store notices a 20% drop in sales during the last quarter.
Data Collection: The store collects data on sales transactions, website traffic, customer feedback, marketing activities, and competitor pricing.
Data Drilling: Analyzing sales data by product category, region, and customer segment to pinpoint where the drop is most significant. Upon analysis, we identify that the drop is most significant in electronics sales.
Correlation and Causation: Examining website traffic data to see if there was a corresponding drop in visits. Traffic remained steady, suggesting the issue might be related to conversion rates rather than visitor numbers.
Comparison Analysis: Comparing the affected quarter with previous quarters and identifying changes in external factors, such as increased competitor promotions or changes in customer purchasing behavior. Noting that competitors launched aggressive discount campaigns during the same period.

Predictive analytics

Predictive analytics is a branch of advanced analytics that uses historical data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes. It aims to answer the question, "What is likely to happen?" By analyzing past data and identifying patterns, predictive analytics can forecast future trends, behaviors, and events.
For example: Running simulations to evaluate the impact of different supply chain strategies, such as sourcing from different suppliers or adjusting production schedules.

Inferential Statistics

Inferential statistics is a branch of statistics that focuses on drawing conclusions about a population based on a sample of data taken from that population. It involves using various statistical methods to make predictions, inferences, or decisions about a larger group based on the analysis of a subset of data.

Some Key Concepts in Inferential Statistics:

1. Population: The entire group of individuals or instances about whom we want to draw conclusions.
2. Sample: A subset of the population selected for analysis.
3. Parameter: A numerical characteristic of a population such as population
mean and population standard deviation.
4. Statistic: A numerical characteristic of a sample, such as the sample mean (x̄) or sample standard deviation (s).
5. Sampling Methods: Techniques used to select a sample from the population,
such as random sampling, stratified sampling, and cluster sampling.
6. Point Estimation: Using sample data to estimate a population parameter, like
estimating the population mean using the sample mean.
7. Interval Estimation (Confidence Intervals): A range of values derived from
the sample statistic that is likely to contain the population parameter. For
example, a 95% confidence interval for the population mean.
8. Hypothesis Testing: A method for testing a hypothesis about a population
parameter based on sample data.
9. Null Hypothesis (H₀): A statement of no effect or no difference, which we
seek to test.
10. Alternative Hypothesis (H₁ or Ha): A statement that contradicts the null
hypothesis.

11. p-value: The probability of observing the sample data, or something more
extreme, if the null hypothesis is true. A low p-value (typically < 0.05) leads
to rejecting the null hypothesis.
12. Significance Level (α): The threshold for rejecting the null hypothesis,
commonly set at 0.05 or 5%.

Type I Error:

A type I error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis. This means that you report that your findings are significant (accepting the alternative hypothesis) when in fact they have occurred by chance. The probability of making a type I error is represented by your alpha level (α), which is the p-value below which you reject the null hypothesis. A p-value of 0.05 indicates that you are willing to accept a 5% chance that you are wrong when you reject the null hypothesis. You can reduce your risk of committing a type I error by using a lower value for p. For example, a p-value of 0.01 would mean there is a 1% chance of committing a type I error. However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).

Type II Error:

A type II error is also known as a false negative and occurs when a researcher fails to reject a null hypothesis that is actually false. Here a researcher concludes there is not a significant effect when there really is. The probability of making a type II error is called beta (β), and this is related to the power of the statistical test (power = 1 − β). You can decrease your risk of committing a type II error by ensuring your test has enough power, for example by ensuring your sample size is large enough to detect a practical difference when one truly exists.

For example:
H0 (Null Hypothesis): The return from stock A is higher than the return from stock B.
H1 (Alternate): The return from stock B is higher than the return from stock A.
Here, the hypothesis that the statistician wants to prove, based on available historical data of stock prices, is that the return from stock B is higher than the return from stock A. The null hypothesis is either accepted or rejected based on the P-value obtained by performing statistical tests.

Now let's see what the P-value is.

The P-value is the probability of obtaining the observed result (or one more extreme) given that the null hypothesis is true. The significance level associated with the hypothesis test is known as alpha.
Alpha: Alpha is the probability of rejecting the null hypothesis when it is true. The lower the alpha the better, since it is the probability of committing an error.
Beta is the complete opposite of alpha.
Beta: It is the probability of accepting the null hypothesis when it is not true. 1 − β, i.e., the probability of not making a type II error, is known as the power of the test.

Ideally, you would like to keep both errors as low as possible, which is practically not possible as both errors are complementary to each other. Hence, commonly used values of alpha are 0.01, 0.05, and 0.10, which give a good balance between alpha and beta.

So how is the P-value used to test the hypothesis?
When P-value > alpha, the observed data is consistent with the null hypothesis, so we fail to reject (accept) the null hypothesis.
When P-value < alpha, the observed data would be unlikely if the null hypothesis were true, so the null hypothesis is rejected.

Directional Hypothesis Tests


A directional hypothesis is a prediction made by a researcher regarding a positive or negative change, relationship, or difference between two variables of a population. This prediction is typically based on past research, accepted theory, extensive experience, or literature on the topic. Keywords that distinguish a directional hypothesis are higher, lower, more, less, increase, decrease, positive, and negative.

Example:
The salaries of postgraduates are higher than the Salaries of graduates.

Non-Directional Hypothesis Tests

A nondirectional hypothesis differs from a directional hypothesis in that it predicts a change, relationship, or difference between two variables but does not specifically designate the change, relationship, or difference as being positive or negative. Another difference is the type of statistical test that is used.
Example: Salaries of postgraduates are significantly different from the salaries of graduates.

Z-TEST

A z-test is a statistical procedure used to test an alternative hypothesis against a null hypothesis. It is used to determine whether two samples' means are different when the variances are known and the sample is large (n ≥ 30), e.g., a comparison of the means of two independent groups of samples taken from one population with known variance.
Null: The sample mean is the same as the population mean.
Alternate: The sample mean is not the same as the population mean.
Understanding a One-Sample Z-Test
A teacher claims that the mean score of students in his class is greater
than 82 with a standard deviation of 20. If a sample of 81 students was
selected with a mean score of 90 then check if there is enough evidence
to support this claim at a 0.05 significance level.
As the sample size is 81 and population standard deviation is known, this
is an example of a right-tailed one-sample z-test.

Sample mean (x̄) = 90
Sample size (n) = 81
Population mean (μ) = 82
Population standard deviation (σ) = 20

H0: μ = 82
H1: μ > 82
From the z-table, the critical value at α = 0.05 is 1.645.
z = (x̄ − μ) / (σ/√n) = (90 − 82) / (20/√81) ≈ 3.6

Since 3.6 > 1.645, the null hypothesis is rejected and it is concluded that there is enough evidence to support the teacher's claim.
Answer: Reject the null hypothesis
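
A minimal sketch of this calculation in Python:

import numpy as np
from scipy import stats

x_bar, mu, sigma, n = 90, 82, 20, 81
z = (x_bar - mu) / (sigma / np.sqrt(n))  # 3.6
p_value = stats.norm.sf(z)               # right-tailed p-value, well below 0.05
print(z, p_value)                        # reject H0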

Understanding a Two-Sample Z-Test:

Here, let's say we want to know if girls on average score 10 marks more than boys. We have the information that the standard deviation for girls' scores is 100 and for boys' scores is 90. We then collect the data of 20 girls and 20 boys using random samples and record their marks. Finally, we set our α value (significance level) to 0.05.

In this example:
Mean score for girls (sample mean) is 641
Mean score for boys (sample mean) is 613.3
Standard deviation for the population of girls is 100
Standard deviation for the population of boys is 90
Sample size is 20 for both girls and boys
Hypothesized difference between the population means is 10
Putting these values into the two-sample z formula, we get a z-score, and from it we compute a p-value of 0.278, which is greater than 0.05; hence we fail to reject the null hypothesis.
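
A minimal sketch of this calculation:

import numpy as np
from scipy import stats

mean_girls, mean_boys = 641, 613.3
sigma_girls, sigma_boys = 100, 90
n, diff_h0 = 20, 10  # sample size per group and hypothesized difference

se = np.sqrt(sigma_girls**2 / n + sigma_boys**2 / n)
z = ((mean_girls - mean_boys) - diff_h0) / se
p_value = stats.norm.sf(z)  # one-tailed p-value, ≈ 0.278
print(z, p_value)           # p > 0.05, fail to reject H0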

T-TEST

If we have a sample size of less than 30 and do not know the population
variance, then we must use a t-test.
One-sample and Two-sample Hypothesis Tests: The one-sample t-test is a statistical hypothesis test used to determine whether an unknown population parameter is different from a specific value.
In statistical hypothesis testing, a two-sample test is a test performed on the data of two random samples, each of which is independently obtained. The purpose of the test is to determine whether the difference between these two populations is statistically significant.

Understanding a One-Sample t-Test:

Let's say we want to determine if, on average, girls score more than 600 in the exam. We do not have information about the variance (or standard deviation) of girls' scores. To perform the t-test, we randomly collect the data of 10 girls with their marks and choose our α value (significance level) to be 0.05 for hypothesis testing.

In this example:
Mean score for girls is 606.8
The size of the sample is 10
The population mean is 600
Standard deviation for the sample is 13.14

Putting these values into the one-sample t formula, we get a t-score of about 1.64, from which we compute a p-value of 0.06. Since this is greater than 0.05, we fail to reject the null hypothesis and do not have enough evidence to support the claim that, on average, girls score more than 600 in the exam.
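
A minimal sketch of this calculation from the summary statistics:

import numpy as np
from scipy import stats

x_bar, mu, s, n = 606.8, 600, 13.14, 10
t = (x_bar - mu) / (s / np.sqrt(n))  # ≈ 1.64
p_value = stats.t.sf(t, df=n - 1)    # one-tailed p-value, just above 0.05
print(t, p_value)                    # fail to reject H0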

Understanding a Two-Sample t-Test

Here, let's say we want to determine if, on average, boys score 15 marks more than girls in the exam. We do not have information about the variance (or standard deviation) of girls' or boys' scores. To perform the t-test, we randomly collect the data of 10 girls and 10 boys with their marks. We choose our α value (significance level) to be 0.05 as the criterion for hypothesis testing.

In this example:
Mean score for boys is 630.1
Mean score for girls is 606.8
Hypothesized difference between the population means is 15
Standard deviation for boys' scores is 13.42
Standard deviation for girls' scores is 13.14
Putting these values into the two-sample t formula, we get a t-score, from which we compute a p-value of 0.019. Since this is less than 0.05, we reject the null hypothesis and conclude that, on average, boys score 15 marks more than girls in the exam.
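
A minimal sketch of a two-sample t-test with SciPy, using hypothetical mark arrays for the two groups (not the figures above):

import numpy as np
from scipy import stats

boys = np.array([640, 615, 655, 628, 610, 645, 622, 638, 619, 629])   # hypothetical marks
girls = np.array([605, 590, 612, 618, 600, 615, 595, 608, 620, 605])  # hypothetical marks

t_stat, p_two_tailed = stats.ttest_ind(boys, girls, equal_var=False)  # Welch's t-test
print(t_stat, p_two_tailed)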

Supervised learning and unsupervised learning are two main types of machine
learning techniques used for different purposes. Here's a detailed explanation
of each:

Supervised Learning

Supervised learning involves training a model on a labeled dataset, meaning that each training example is paired with an output label. The goal is to learn a mapping from inputs to outputs that can be used to predict the output for new, unseen inputs. Supervised learning is used for tasks where the desired output is known and available during training.

Key Points:

Labeled Data: Requires a dataset where each input is paired with the correct
output.
Objective: Learn a function that maps inputs to the correct output.
Applications: Classification (assigning inputs to predefined categories) and
regression (predicting continuous values).

Examples:

Classification: Email spam detection, image recognition (e.g., classifying images of cats and dogs), sentiment analysis.
Regression: Predicting house prices based on features like size, location, and number of rooms; forecasting stock prices.

Supervised learning can be broadly categorized into two main types:
classification and regression. Each of these types can be further broken down
into specific tasks and associated algorithms.

Classification
Classification involves predicting a categorical label for an input. The goal is to
assign inputs to predefined classes or categories.

Types of Classification:

Binary Classification: The task is to classify the input into one of two possible
classes. Examples include spam detection (spam vs. not spam) and medical
diagnosis (disease vs. no disease).
Multiclass Classification: The task is to classify the input into one of three or
more classes. Examples include digit recognition (0-9) and document
categorization (sports, politics, technology).
Multilabel Classification: Each input can be assigned multiple labels. Examples include tagging multiple objects in an image or categorizing a document into multiple topics.

Regression:

Regression involves predicting a continuous numerical value for an input. The goal is to learn the relationship between the input variables and the continuous output.

Types of Regression:

Simple Linear Regression: Models the relationship between two variables by fitting a linear equation.
Multiple Linear Regression: Models the relationship between one dependent
variable and multiple independent variables.
Polynomial Regression: Models the relationship using a polynomial equation.
Ridge Regression: A type of linear regression that includes a regularization term
to prevent overfitting.
Lasso Regression: Another form of regularized linear regression that performs
both variable selection and regularization.
Elastic Net Regression: Combines the properties of Ridge and Lasso regression.

Unsupervised Learning

Unsupervised learning involves training a model on data that does not have
labeled outputs. The goal is to infer the natural structure present within a set of
data points. This is used for tasks where we do not know the desired output and
want to discover patterns or groupings in the data.

Key Points:
Unlabeled Data: Uses data that does not have associated labels.
Objective: Find hidden patterns, groupings, or structures in the data.

Applications: Clustering (grouping similar items), dimensionality reduction
(reducing the number of features while preserving the important information),
anomaly detection.
Examples:

Clustering: Customer segmentation in marketing, grouping similar news articles, image compression.
Dimensionality Reduction: Principal Component Analysis (PCA) for reducing the number of variables in a dataset, t-SNE for visualizing high-dimensional data in 2D or 3D.
Anomaly Detection: Identifying unusual transactions in fraud detection, spotting defects in manufacturing processes.

Unsupervised learning involves discovering patterns and structures in data without labeled outputs.

Clustering

Clustering aims to group similar data points into clusters based on their
characteristics.

Common Clustering Algorithms:

K-means Clustering: Divides data into k clusters by minimizing the variance within each cluster.
Hierarchical Clustering: Builds a tree of clusters by either merging or splitting
existing clusters.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups
points that are closely packed together and identifies points that lie alone in
low-density regions (outliers).
Gaussian Mixture Models (GMM): Assumes data is generated from a mixture of
several Gaussian distributions and identifies the parameters of these
distributions.
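
A minimal sketch of k-means with scikit-learn on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy 2D data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)     # cluster assignment for each point
centers = kmeans.cluster_centers_  # coordinates of the 3 cluster centers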

Dimensionality Reduction

Dimensionality reduction techniques reduce the number of features in a dataset while preserving as much information as possible.

Common Dimensionality Reduction Algorithms:

Principal Component Analysis (PCA): Projects data to a lower-dimensional space by finding the directions (principal components) that maximize variance.
t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduces dimensions while preserving the local structure of the data, useful for visualization.

Linear Discriminant Analysis (LDA): While primarily used for supervised learning,
LDA can also be used in an unsupervised manner to reduce dimensions.
Autoencoders: Neural network-based models that learn to compress data into a
lower-dimensional space and then reconstruct it.
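
A minimal sketch of PCA with scikit-learn, assuming a numeric feature matrix X:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # keep the two directions of maximum variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance captured by each component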

Linear regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The goal is to predict the value of the dependent variable based on the values of the independent variables.

Y = β0 + β1x1 + β2x2 + ⋯ + βnxn + ϵ


Where:
y is the dependent variable.
x1, x2, …, xn are the independent variables.
β0 is the y-intercept.
β1, β2, …, βn are the coefficients (slopes) associated with each independent variable.
ϵ is the error term.
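
A minimal sketch of fitting a linear regression with scikit-learn on toy data:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # toy independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # toy dependent variable

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimates of β0 and β1
print(model.predict([[6]]))           # prediction for a new observation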

Assumptions of Linear Regression

For the linear regression model to be valid, certain assumptions must be met:

Linearity: The relationship between the dependent and independent variables is linear.
Independence: Observations are independent of each other.
Homoscedasticity: The variance of the error terms is constant
across all levels of the independent variables.
Normality: The error terms are normally distributed.
No Multicollinearity (for multiple regression): Independent
variables are not highly correlated with each other.

Evaluating the Model

Several metrics are used to evaluate the performance of a linear regression model:

R-squared (R²): Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Values range from 0 to 1, with higher values indicating a better fit.
Adjusted R-squared: Adjusts the R² value based on the number of predictors, providing a more accurate measure for multiple regression models.
Mean Squared Error (MSE): The average of the squared differences between observed and predicted values.
Root Mean Squared Error (RMSE): The square root of MSE, providing a measure of the average magnitude of the errors.
Residual Plots: Graphical representations used to check the assumptions of linearity, independence, and homoscedasticity.
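
A minimal sketch of these metrics with scikit-learn, assuming actual values y_test and predictions y_pred:

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)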

Applications
Sales and Revenue Forecasting:
Predicting future sales based on past sales data, economic indicators,
and market trends.
Pricing Strategy:
Determining the optimal price point for products by analyzing the
relationship between price and demand.
Marketing Campaign Analysis:
Evaluating the effectiveness of marketing campaigns by assessing the
impact of advertising spend on sales growth.

Underfitting and Overfitting

Underfitting

Definition:
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both the training data and unseen data (test set) because it fails to learn the relationships in the data adequately.

Causes:
The model is too simple (e.g., using a linear model for a non-
linear problem).
Insufficient training time (in the case of iterative algorithms like
neural networks).
Inadequate features or too few features used in the model.
Too much regularization, which restricts the model's capacity.

Symptoms:
High bias: The model makes strong assumptions about the data,
leading to poor performance.
Low training accuracy.
Low test accuracy.

Ways to Overcome Underfitting:

Increase Model Complexity: Use a more complex model that can capture non-linear relationships (e.g., polynomial regression, decision trees).
Add More Features: Incorporate more relevant features that provide additional information about the problem.
Reduce Regularization: Decrease the regularization parameters (e.g., lower the lambda in Ridge/Lasso regression) to allow the model to fit the data better.
Increase Training Time: Train the model for more epochs (in the case of neural networks) to ensure it has enough time to learn the data patterns.

Overfitting

Definition:

Overfitting occurs when a model is too complex and captures not only
the underlying patterns but also the noise in the training data. It
performs very well on the training data but poorly on unseen data (test
set) because it does not generalize well.

Causes:
The model is too complex relative to the amount of training data.
Too many features, especially if some are irrelevant or noisy.
Insufficient training data.
Lack of regularization, allowing the model to become too flexible.

Symptoms:
High variance: The model is overly sensitive to small fluctuations in
the training data.
High training accuracy.
Low test accuracy.

Ways to Overcome Overfitting:

Simplify the Model: Use a simpler model that is less likely to capture noise in the data (e.g., reduce the number of features, use a less complex algorithm).
Regularization: Add regularization techniques (e.g., L1/L2 regularization, dropout for neural networks) to penalize large coefficients and reduce model complexity.
Cross-Validation: Use cross-validation techniques (e.g., k-fold cross-validation) to ensure the model generalizes well to unseen data (see the sketch after this list).
Pruning (for Decision Trees): Prune the tree to remove nodes that provide little power in predicting the target variable, reducing complexity.
Add More Data: Increasing the size of the training dataset can help the model learn more generalizable patterns.
Feature Selection: Remove irrelevant or redundant features to reduce noise and simplify the model.
Ensemble Methods: Use ensemble methods like bagging (e.g., Random Forests) or boosting (e.g., Gradient Boosting Machines) to improve generalization by combining multiple models.
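
A minimal sketch of k-fold cross-validation with scikit-learn, assuming a feature matrix X and target y:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # L2-regularized linear regression
scores = cross_val_score(model, X, y, cv=5)  # one score per fold
print(scores.mean(), scores.std())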

Logistic Regression

Logistic regression is a statistical method used for binary classification problems, where the outcome variable is categorical and typically represents two classes (e.g., yes/no, true/false, success/failure). Unlike linear regression, which predicts continuous outcomes, logistic regression predicts the probability of a categorical outcome using a logistic function.

Key Concepts of Logistic Regression

Logistic Function:
The logistic function, also known as the sigmoid function, is used to map predicted values to probabilities between 0 and 1. The function is defined as:

σ(z) = 1 / (1 + e^(−z))

where z is a linear combination of the input features, typically represented as:

z = β0 + β1x1 + β2x2 + ⋯ + βnxn

Maximum Likelihood Estimation (MLE):
The parameters β0, β1, …, βn are estimated using maximum likelihood estimation. The likelihood function measures how likely the observed data is given the model parameters. The goal is to find the parameter values that maximize this likelihood function.
Log-Likelihood: The log-likelihood is the natural logarithm of the likelihood function. For logistic regression, it is given by:

ℓ(β) = Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]

where pᵢ is the predicted probability for observation i and yᵢ is its actual label.

Assumptions of Logistic Regression


Binary Outcome: The dependent variable is binary.
Independence: Observations are independent of each other.
Linearity of Logit: The logit (log-odds) of the outcome is a linear
combination of the input features.
No Multicollinearity: Independent variables are not highly correlated with each other.
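
A minimal sketch of fitting a logistic regression with scikit-learn, assuming a feature matrix X and binary labels y:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1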

Evaluating the Model

1. Confusion Matrix
A confusion matrix provides a summary of prediction results on a
classification problem. The matrix shows the number of true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN).

Components:
True Positive (TP): The model correctly predicts the positive class.
True Negative (TN): The model correctly predicts the negative
class.
False Positive (FP): The model incorrectly predicts the positive
class.
False Negative (FN): The model incorrectly predicts the negative
class.

2. Accuracy

Accuracy measures the proportion of correct predictions (both true positives and true negatives) among the total number of cases:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

3. Precision

Precision (also called Positive Predictive Value) measures the proportion of true positive predictions among all positive predictions:

Precision = TP / (TP + FP)

For example: Suppose we are building a machine learning model that predicts whether incoming mail is spam or not. We don't want our model to classify genuine mail as spam, as this can lead to losses for the recipient. We can afford to have some spam emails in the user's inbox. In this case, precision is more important because we want to ensure that the maximum number of emails classified as spam are actually spam.

4. Recall

Recall (also called True Positive Rate or Sensitivity) measures the proportion of true positive predictions among all actual positive cases:

Recall = TP / (TP + FN)

For example: In detecting fraudulent transactions or activities, high recall helps in identifying as many fraudulent instances as possible to prevent financial losses, even if it means investigating some false positives.

5. F1-Score

The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

For example: In scenarios where one class significantly outnumbers the other (class imbalance), accuracy alone can be misleading. The F1-score accounts for this by considering how well the model performs on both the minority and majority classes.
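
A minimal sketch of these metrics with scikit-learn, assuming true labels y_test and predicted labels y_pred:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))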

6. AUC-ROC Curve

ROC Curve (Receiver Operating Characteristic Curve)

Definition:
The ROC curve is a plot that illustrates the diagnostic ability of a binary classification model as its discrimination threshold is varied. It plots two metrics:
True Positive Rate (Recall)
False Positive Rate

Steps to build AUC-ROC Curve

The model outputs probabilities, and different thresholds are used to decide
the class labels. For each threshold, calculate TPR and FPR.
Each threshold results in a point on the ROC curve with FPR on the x-axis
and TPR on the y-axis.
Connect the points to form the ROC curve.

AUC (Area Under the Curve)
Definition:
AUC represents the area under the ROC curve. It provides a single scalar value
to summarize the model's performance across all thresholds.
AUC = 1: Perfect model that distinguishes between positive and negative
classes without any errors.
AUC = 0.5: Model with no discrimination power, equivalent to random
guessing.
0.5 < AUC < 1: The model has some degree of discrimination power.
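
A minimal sketch of these steps, assuming true labels y_test and predicted probabilities probs:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, probs)
print('AUC:', roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # random-guessing baseline (AUC = 0.5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()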

The ROC-AUC curve is a powerful tool for evaluating binary classification
models. It provides insights into the model's performance across all
classification thresholds and helps in understanding the trade-offs between
true positive and false positive rates. By summarizing this information into a
single metric, AUC, it facilitates the comparison of different models and offers
a robust measure of model performance, especially in scenarios where class
distribution is imbalanced.

7. Logarithmic Loss (Log Loss):

Log loss measures the performance of a classification model where the prediction is a probability value between 0 and 1. Lower log-loss values indicate better model performance:

Log Loss = −(1/N) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]

where N is the number of observations, yᵢ is the actual class label, and pᵢ is the predicted probability.

Log loss is particularly valuable when you need to understand and trust the
model's confidence in its predictions. It's not just about whether the model got
the prediction right, but also about how confident it was in that prediction.

Log loss is sensitive to the accuracy of the predicted probabilities, providing a nuanced measure of model performance. Small improvements in probability predictions can lead to a noticeable reduction in log loss, making it a good metric for model tuning and evaluation. It allows for fine-grained evaluation and comparison of models, even when they are very close in performance according to other metrics like accuracy.
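
A minimal sketch, assuming true labels y_test and predicted probabilities probs:

from sklearn.metrics import log_loss

print(log_loss(y_test, probs))  # lower is better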

Decision Tree
Decision Tree: It is a supervised learning algorithm that can be used for
both classification and regression tasks.
How it works: The decision tree works by splitting the data into smaller and
smaller subsets based on a series of questions. The questions are based on
the features of the data. For example, if you are trying to classify a
customer as likely to buy a product, you might ask questions about their
age, income, and past purchase history.
ID3 Algorithm: This is a common algorithm used to build decision trees. It works by recursively splitting the data into subsets based on the attribute that best separates the data, i.e., the attribute with the highest information gain.
Information Gain: This is a measure of how much information is gained by
splitting the data on a particular attribute. The higher the information gain,
the better the attribute is for splitting the data.

Entropy: This is a measure of the randomness of the data. The higher the
entropy, the more random the data is.

Here are the steps involved in building a decision tree using the ID3 algorithm:
Calculate the entropy of the target variable.
For each attribute in the data, calculate the information gain that would be
achieved by splitting the data on that attribute.
Choose the attribute with the highest information gain.
Split the data on the chosen attribute.

Repeat steps 1-4 for each of the resulting subsets, until a stopping
criterion is met.

The stopping criterion is a set of rules that determines when to stop growing
the tree. Common stopping criteria include:

When all of the data in a subset belongs to the same class.


When there are no more attributes left to split on.
When the information gain for splitting on any attribute is below a certain
threshold.

Decision trees are a powerful and versatile machine learning algorithm that can
be used for a wide variety of tasks. They are relatively easy to understand and
interpret, which makes them a good choice for many applications.
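
A minimal sketch of training a decision tree with scikit-learn (which implements CART rather than ID3, though entropy-based splitting captures the same idea):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)  # toy dataset
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
tree.fit(X, y)
print(export_text(tree))  # the learned splits, printed as text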

Here we have an example with many points on a two-dimensional scatter plot. How does a decision tree work? It cuts the plane up into slices over several iterations.
We split the data and construct a decision tree side by side, which we will use later. This task is achieved using various algorithms: the tree is built from a fixed set of examples, and the resulting tree is used to classify future samples.

The resulting tree (obtained by applying algorithms like CART or ID3) is later used to predict outcomes.

PART 3

DATA
VISUALIZATION
Introduction
Data visualization is the art of representing information and data in visual formats like
charts, graphs, maps, and infographics, making complex information quickly and easily
understandable. Instead of deciphering rows of numbers in a spreadsheet, a clear and
colorful chart can effectively reveal trends and patterns. This accessibility allows
everyone, regardless of technical background, to grasp key insights.

There is a story in your data. As the analyst, you know the story within your data, but
how do you communicate it effectively and ensure your audience takes concrete
actions? Data visualization is the final step in your analytical journey, enabling you to tell
your story compellingly and convert insights into decisive measures.

But telling a compelling story is no easy task. Like any other type of communication, the key challenge in data visualization is to identify which elements in your message are signal (the information you want to communicate) and which are noise (unnecessary information polluting your message).

With that in mind, your main goal is to present content to your audience in a way that
highlights what's important, eliminating any distractions. You've probably already spent a
lot of time understanding, cleaning, and modeling your data to reach a conclusion worth
sharing. So don't let this final step get in the way of properly communicating your key
insights.



Memory in Data Visualization

Iconic Memory
Processes visual information very quickly, lasting only a fraction of a second.
Acts as a flash storage for visual stimuli, deciding whether to discard or transfer the
information to short-term memory.

Short-Term Memory
Holds information for a few minutes but has limited capacity.
Can only process a limited amount of data at a time and is easily overwhelmed.

Long-Term Memory
Stores information for an extended period.
Information moves here from short-term memory if retained.

As the creator of data visualizations, your goal is to leverage your audience's iconic
memory to capture attention immediately and minimize the load on their short-term
memory to maintain focus. This approach ensures your key insights are effectively
communicated and more likely to be retained in long-term memory.

Before delving deeper into this, let us first understand what data visualization is.



What is Data Visualization?
Definition: Data visualization is the practice of representing information and data in
visual formats such as charts, graphs, maps, and infographics. It transforms complex
data sets into visual representations that are easier to understand and interpret,
enabling viewers to quickly grasp patterns, trends, and relationships within the data.

Why Visualize Data?


The human brain is wired to process visual information far more effectively than raw
data tables and text. Data visualization bridges this gap, transforming complex datasets
into clear and engaging visuals that unlock deeper insights and revolutionize
communication. Here are the reasons why visualizing data is essential:

Enhanced Comprehension and Memory:


Overcoming Information Overload: Raw data tables can be overwhelming.
Visualizations present information in a way that allows for quicker grasp of patterns,
trends, and relationships between data points.
Unlocking Hidden Insights: Visualizations can reveal patterns and trends that might
be missed in raw data. A spike in a line chart might prompt investigation into an
outlier or a hidden cause.
Improved Retention and Recall: People remember information presented visually
much better than text alone. Visualizations leave a lasting impression and aid in
information recall, making them ideal for presentations and reports.

Effective Communication and Shared Understanding:


Clear and Concise Communication: A well-designed visualization can convey
complex information in a simple and easy-to-understand manner. This promotes
clear communication and reduces the risk of misinterpretation of data.
Shared Understanding: Visualizations provide a common ground for discussions and
presentations. Everyone can see the data presented the same way, fostering a
shared understanding of the information being presented.
Engaging Your Audience: Visualizations can spark interest and capture attention
more effectively than dry reports. Interactive visualizations can further enhance
audience engagement and participation.



Data-Driven Decision Making:
Seeing the Evidence: Visualizations enable viewers to see the evidence behind a
story. This helps in making informed decisions based on factual data rather than
intuition or guesswork. Data visualizations become powerful tools for strategic
planning and resource allocation.
Faster Identification of Issues: Visualizations can quickly highlight areas of concern,
allowing for faster problem identification and response. A heatmap might reveal a
website's sections with low user engagement, prompting redesign efforts.
Identification of Opportunities: Emerging trends and patterns can be readily spotted
through visualizations. This allows businesses to identify and capitalize on new
opportunities, leading to innovation and growth.

Additional Benefits:
Accessibility for Wider Audiences: Data visualizations can cater to a wider audience,
including those with limited data analysis expertise. Complicated data becomes
approachable through clear visuals, making data analysis more inclusive.
Storytelling with Data: Data visualizations are powerful tools for storytelling. By
weaving data into a narrative, you can connect with your audience on an emotional
level, making the information more impactful and memorable.

Data Visualization Process

1. Defining Goals and Audience:


What message do you want to convey? What insights should viewers gain from the
visualization?
Who is your audience? Are they data experts or novices? Understanding their level
of data literacy helps determine the complexity and information density of the
visualization.
What action do you want viewers to take? Is it to understand a trend, compare data
points, or make a decision? Clearly define your goals upfront to guide the entire
visualization process.



2. Data Preparation and Cleaning:
Identify the data source: This could be spreadsheets, databases, APIs, or other data
repositories.
Ensure data accuracy and completeness: Double-check for errors, missing values,
and inconsistencies.
Clean and organize the data: This might involve formatting data types, handling
missing values, and filtering irrelevant information for visualization purposes.
You will learn more about this in the data analysis section of this compendium.

3. Exploratory Data Analysis (EDA):


Get familiar with your data: Perform basic calculations like summary statistics
(mean, median, standard deviation) to understand central tendencies and spread.
Identify initial trends and patterns: Look for relationships, outliers, and areas of
interest within the data.
Consider different visualization options: Explore how different chart types might
represent the data effectively based on its characteristics (numerical, categorical,
relationships).
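
A few lines of pandas cover most of this groundwork. The small sales table below is hypothetical; any tabular data source works the same way:

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "sales":  [120, 95, 140, 80, 300],
})

print(df["sales"].describe())                 # count, mean, std, min, quartiles, max
print(df.groupby("region")["sales"].mean())   # central tendency per category

# A quick outlier check: values beyond 1.5 * IQR from the quartiles
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)])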

4. Choosing the Right Chart Type:


Match the chart to your data type and goals: Bar charts for comparisons, line charts
for trends, scatter plots for relationships, etc. We will discuss in detail in the
following section.
Consider the clarity and effectiveness of the chosen chart: Will it accurately
represent the data and avoid misleading interpretations?

5. Design and Formatting Principles for Clear Communication:


Apply design principles for clarity and aesthetics: Utilize color theory, visual
hierarchy (prioritizing important information), proper labeling, and minimize chart
junk (unnecessary decorative elements).
Fonts, colors, and axes should be chosen to enhance readability and visual appeal.
Focus on clear and concise communication. Don't let visual elements overwhelm the
data story.
We will be discussing in detail about the various crucial principles that are relevant
to data visualization in the following sections.



6. Iteration and Refinement:
Get feedback from your target audience: Show them the draft and see if the
visualization effectively communicates the message.
Revise based on feedback: This might involve adjusting the chart type, design
elements, or adding clarifications.

7. Presenting and Sharing Visualizations:


Choose an appropriate platform: Presentation slides, reports, dashboards, or
interactive tools depending on the context.
Provide context and explanation: Don't leave viewers guessing; explain the data,
methodology, and key takeaways alongside the visualization.

Choosing the right chart

Step 1: Determine the Type of Data


Categorical: Non-numerical values (e.g., product categories)
Quantitative: Numerical values (e.g., sales figures)
Temporal: Time-based data (e.g., monthly sales)
Spatial: Location-based data (e.g., customer addresses)

Step 2: Identify the Relationship Between Variables
Comparison: Show differences between data points (e.g., bar chart)
Distribution: Show how data is spread out (e.g., histogram)
Relationship: Show how variables are connected (e.g., scatter plot)

Step 3: Determine the Purpose of Visualization
Trend: Line chart or area chart
Comparison: Bar chart or column chart
Distribution: Histogram or box plot

Step 4: Identify the Audience
Data-savvy audience: Consider complex charts (heat maps)
Less familiar audience: Use simpler charts (pie charts)

Step 5: Select the Appropriate Chart Type
Experiment with different options to find the best fit
No single chart is perfect; consider using multiple charts for complex messages (a small illustrative sketch follows this list)
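
One way to internalize steps 2 and 3 is as a simple lookup from purpose to candidate chart types. The mapping below merely mirrors the guidance above; it is an illustration, not a standard library:

# Illustrative purpose-to-chart lookup based on the steps above
CHART_GUIDE = {
    "comparison":    ["bar chart", "column chart"],
    "trend":         ["line chart", "area chart"],
    "distribution":  ["histogram", "box plot"],
    "relationship":  ["scatter plot"],
    "part-to-whole": ["pie chart", "stacked column chart"],
}

def suggest_charts(purpose):
    # Return candidate chart types for a stated purpose (empty list if unknown)
    return CHART_GUIDE.get(purpose.lower(), [])

print(suggest_charts("Trend"))  # ['line chart', 'area chart']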



https://www.linkedin.com/pulse/choosing-right-chart-type-data-visualization-strategy-suneel-patel

https://infogram.com/blog/choose-the-right-chart/



https://activewizards.com/blog/how-to-choose-the-right-chart-type-infographic/

The sections that follow organize charts by the objective of your visual:



Comparison Chart

In these charts, we compare one value with another, such as region-wise sales or the
economy rates of bowlers in cricket. We can use the following charts for comparison.

Column charts
It is used to compare values across multiple categories.
Here, the categories appear horizontally (X-axis) and the values vertically (Y-axis).
In column charts, you can also show information about parts of a whole
across different categories, in absolute as well as relative terms. This is
where stacked column charts and 100% stacked column charts come in.



Bar charts
As you’re quite familiar with column charts, you will find that working with bar
charts is very similar.
The only difference is that in a bar chart, values are represented
on the X-axis and categories on the Y-axis.
We typically use a bar chart to show values across categories when the
duration or category text is long.
Stacked bar charts are used to compare parts of a whole (relative and absolute)
and to compare change over categories or time.

Line charts
It is one of the most popular chart types and is used widely across industries.
Whether you are analyzing sales data, looking at year-on-year profit, or
tracking how a person’s salary has changed over the years, line charts are
very helpful in these scenarios.
The line chart is used to show trends over time or categories.
Here, the category appears horizontally (X-axis) and the value vertically (Y-axis).



Scatter plots
An XY (scatter) chart uses numerical values along both axes.
Scatter plots are useful for showing a correlation between data points that
may not be easy to see from the data alone.
They are used for displaying and comparing numerical values, such as
scientific or statistical data.



Distribution charts

These charts are used to show how data values are spread over categories or
continuous ranges. We can use the following charts to visualize the distribution
of the data, for example, the distribution of bugs found across 10 weeks of a
software testing phase.

Histogram
It is used to graph the frequency of values over a distribution. It is a very
useful chart in the analytics world, and many useful insights can be inferred
from it.
Visually, all the bars touch each other, with no space between them.



Box plot
It is also known as Box and whiskers plot.
The line in the middle of the box is the median value. This means that 50% of
the data are above the median value and 50% of the data are below the median
value.
Medians are useful because they are not swayed by outliers the way the mean is.
Within the box itself, there is 25% of data above the median and 25% of data
below the median, so 50% of the data is within the box.

KDE Plot
KDE is an abbreviation for the Kernel Density Estimation plot.
It’s a smooth form of a histogram.
A kernel density estimate (KDE) plot is a method for visualizing the distribution
of observations in a dataset, analogous to a histogram.
Relative to a histogram, KDE can produce a plot that is less cluttered and more
interpretable, especially when drawing multiple distributions.
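
A quick way to compare these three views of the same distribution is to draw them side by side with seaborn, here on synthetic, normally distributed data:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=500)  # synthetic measurements

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.histplot(data, bins=20, ax=axes[0]).set_title("Histogram")
sns.boxplot(x=data, ax=axes[1]).set_title("Box plot")
sns.kdeplot(data, ax=axes[2]).set_title("KDE")
plt.tight_layout()
plt.show()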



The Breakup of a Whole Chart

These charts are used to analyze how various parts comprise the whole. They are
very handy in many scenarios, for example, analyzing the revenue contributed by
different regions, or the sides of the ground on which a batsman scored. The
charts used for this purpose are listed below.



Pie Chart
If you want to represent your categorical data as part of the whole, then you
should use a pie chart.
Each slice represents the percentage that the given category occupies out of
the whole.
It is better to use a pie chart only if you have fewer than five categories.

Donut Chart
It is a variant of a pie chart, with the hole in the center.
It displays the categories as arcs rather than slices.



Stacked Column Chart
A stacked column chart is used when you want to show the contribution of
multiple data series in stacked columns; in the 100% stacked variant, the
total (cumulative) of each stacked column always equals 100%.
The 100% stacked column chart can show part-to-whole proportions over
time, for example, the proportion of quarterly sales per region or the proportion
of a monthly mortgage payment that goes toward interest vs. principal.

Stacked Bar Chart


A Stacked Bar chart is used to show the relative percentage of multiple data
series in a stacked bar.
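
As a short sketch, pandas can produce both the absolute and the 100% stacked variants directly from a small table; the quarterly figures here are made up:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {"North": [30, 35, 40, 45], "South": [20, 25, 22, 30]},
    index=["Q1", "Q2", "Q3", "Q4"],
)

df.plot(kind="bar", stacked=True, title="Sales by region (absolute)")

# 100% stacked: normalize each row so the parts of every column sum to 1
share = df.div(df.sum(axis=1), axis=0)
share.plot(kind="bar", stacked=True, title="Sales by region (share of total)")
plt.show()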



Relationship charts

Relationship charts are very helpful when we want to understand how different
variables relate to one another. The charts used to visualize relationships
between variables are listed below.

Scatter Plot
A scatter chart uses numerical values along both axes.
It uses dots to represent the values for two different numerical values.
The position of each dot on the horizontal and vertical axes signifies the
value of a particular data point.
It is useful for showing a correlation between the data points that may not be
easy to see from the data alone.
It is used for displaying and comparing numerical values, such as scientific or
statistical data.
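
A short numpy/matplotlib sketch, using synthetic orders-versus-sales data, shows how a scatter plot and the Pearson correlation coefficient complement each other:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
orders = rng.integers(10, 100, size=50)            # synthetic order counts
sales = orders * 12 + rng.normal(0, 50, size=50)   # roughly linear relationship

plt.scatter(orders, sales)
plt.xlabel("Number of orders")
plt.ylabel("Sales")
plt.title(f"Pearson r = {np.corrcoef(orders, sales)[0, 1]:.2f}")
plt.show()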



Line Chart
As discussed above, a line chart can also be used to examine the relationship
between two variables.



Trend charts

Trend charts visualize how values change over time or across categories; this is
also known as "time series" data in the data-driven world. Examples include an
over-by-over run-rate tracker or the hourly temperature variation during a day.
Listed below are the charts used to represent time series data.

Line Chart
The best way to visualize trend data is with a line chart.
Line charts are used to track trends in virtually every domain.



Area Chart
It is used to see the magnitude of the values.
It shows the relative importance of values over time.
It is similar to a line chart, but because the area between lines is filled in, the
area chart emphasizes the magnitude of values more than the line chart does.

Column Chart
A column chart as discussed above is also used to show the trends of values
over time and categories.



Data visualization Principles
The Grammar of Graphics by Leland Wilkinson
"The Grammar of Graphics" by Leland Wilkinson is a seminal work in the field of data
visualization. First published in 1999, it proposes a foundational framework for
understanding and creating effective statistical graphics.

Core Idea: A Grammar for Graphics

Wilkinson argues that just like a sentence in language follows grammatical rules,
effective visualizations can be built using a set of core building blocks. This "grammar"
provides a systematic approach to describe and construct various statistical graphics.

Components of the Grammar

Data: The raw information being represented (e.g., numerical values, categories)
Aesthetic Mappings: How data attributes are linked to visual properties like position,
size, color, etc. (e.g., position on x-axis corresponds to time, color represents
category)
Scales: The transformation of data values into visual scales (e.g., linear scale for
temperature, logarithmic scale for earthquake magnitudes)
Geometrical Shapes: The basic visual marks used to represent data points (e.g.,
points, lines, bars)
Statistical Transformations: Techniques for summarizing or transforming data for
visual representation (e.g., means, medians, binning)



Benefits of the Grammar
Clarity and Consistency: The framework promotes clear communication by ensuring
each element in a visualization has a well-defined purpose.
Flexibility and Reusability: By separating data from its visual representation, the
grammar allows for creating different visualizations from the same data set.
Understanding Complexity: The breakdown into components helps analyze and
improve complex visualizations.
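
The grammar maps almost one-to-one onto code in libraries inspired by Wilkinson's work, such as ggplot2 in R and its Python port plotnine. The sketch below assumes plotnine is installed and uses a made-up dataset purely to show how the components compose:

import pandas as pd
from plotnine import aes, geom_point, ggplot, scale_y_log10

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5],
    "value": [10, 40, 90, 400, 1000],
    "category": ["a", "b", "a", "b", "a"],
})

# data + aesthetic mappings + geometry + scale, composed like a sentence
plot = (
    ggplot(df, aes(x="time", y="value", color="category"))  # aesthetic mappings
    + geom_point()                                          # geometrical shape
    + scale_y_log10()                                       # scale transformation
)
print(plot)  # printing the plot object draws the figure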

Preattentive Attributes
What is it?

Our brains are constantly bombarded with visual information. But how do we process it
all so quickly? The answer lies in preattentive processing. This is an automatic,
subconscious ability to pick up on basic visual features like color, size, and position. It
happens within milliseconds, allowing us to grasp the gist of a scene before we even
consciously focus on it.

Think of it like a filter. Preattentive processing sifts through the visual clutter,
highlighting elements that stand out. These "preattentive attributes" act as attention
magnets, drawing our eyes to the most salient or relevant information. For instance, a
bright red bar in a chart can highlight a significant outlier, while using different sizes can
emphasize comparisons between data points.

This preattentive processing plays a crucial role in various fields. From design and
advertising, where capturing attention is key, to education and cognitive science,
understanding how we process visual information unlocks powerful tools for
communication and learning. By leveraging preattentive attributes, we can create
visualizations that guide viewers' attention to the most important details, saving them
time and effort in deciphering complex data.

While it might look like a fuzzy concept at first, the power of these preattentive
attributes is relatively easy to demonstrate. To do so, look at the sequence below and
count how many times the number 9 appears.



Serial processing

The correct answer is five. But in this example, there's no visual indication you can rely
on to help you reach this conclusion. You had to scan each number one by one to see if
it was a 9 or not.

Let's repeat the same exercise with the exact same sequence, but now, let's see what
happens when we make a single visual change.

Preattentive processing

Because we changed the color intensity of these numbers, they now clearly stand out.
Suddenly, there are five 9s in front of you. This is preattentive processing and iconic
memory in action.



What Are Preattentive Visual Properties?

Colin Ware, in his book “Information Visualization: Perception for Design” defines the
four preattentive visual properties as follows:
1. Form
2. Color
3. Spatial Position
4. Movement

1. Form
The form applies to various attributes listed below. In design, the form can be used
either to increase attention to specific elements or to reduce attention to it.
Form attributes include:
Collinearity
Curvature
Length, breadth, and width
Marks added to objects
Numerosity
Shape
Size
Spatial grouping
Spatial orientation

2. Color
Color is one of the most common properties used to call attention. Color can be
expressed in many different ways:
RGB (Red, Green, Blue)
CMYK (Cyan, Magenta, Yellow, Key/Black)
HSL (Hue, Saturation, Lightness)



But in terms of pre-attentiveness, the HSL scale is useful to us when we examine color.

HSL Scale in Pre-Attentive Processing:


Hue:
Refers to pure spectrum colors (e.g., Red).
Important for naming and identifying colors.
Saturation and Lightness:
Measure the intensity and brightness of colors.
Affect how vibrant or dull a color appears.

3. Movement
Movement can be used very effectively to call someone’s attention to a design or image.
Attributes of Movement:
Flicker
Motion
While these attributes are among the most attention-grabbing, they have some
negative effects too. Motion or flicker elements can become annoying and can
distract users from the information presented. A designer should use these
elements carefully in a design or image.

4. Spatial Position
Our ability to perceive the location of objects in space, both relative to ourselves and to
each other, is called spatial position perception.
The Gestalt principle of figure-ground is a fundamental concept in visual perception that
explains how we see objects in relation to their background. It essentially boils down to
this:
Our brains automatically separate a scene into two parts: a figure (the object in
focus) and the ground (the background).
These are mutually exclusive – you can't perceive both the figure and ground at the
same time.
The relationship between figure and ground is crucial for understanding the visual
scene. Changing one element (e.g., making the background brighter) affects how we
perceive the other.



Gestalt Psychology Categories:
Proximity
Closure
Continuity
Connectedness
Similarity

Examples of Spatial Positioning

Closure: The logo of WWF, exemplifying the Gestalt principle of closure,
appears to be a black and white silhouette of a panda bear. Despite missing
complete outlines for the panda's body, ears, and legs, our brains
automatically fill in the gaps based on our prior experience with panda bears.
This ability to perceive a recognizable shape, like the familiar black and
white fur pattern and characteristic bear-like posture, allows our brains to
process visual information quickly and efficiently, even when presented with
incomplete data.

Rubin Vase: This classic example allows you to see either a white vase in the
center of a black background or two black profiles facing each other.

Dalmatian Dog: The black spots on the white fur create the figure of a dog,
while the white fur recedes into the background.



Pre-Attentive Attributes use in Data Visualization
Now let's see how to integrate this concept in Data Visualization, using the following
example, analyzing the correlation between the number of orders and sales.

Note how, without any visual indication, you are left to process all the information by
yourself. You might be able to find an insight on your own from this chart, but you'll
have to make good use of your short-term memory for that, which will take time.

Now check out what happens when we include preattentive attributes to the same
graph.


By modifying the color hue of these four data points, you make them stand out, and you
now clearly see a pattern you might have missed in the previous example.
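
In code, this kind of highlighting amounts to drawing the context in a muted color and the points of interest in a salient hue. A minimal matplotlib sketch with synthetic data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
x, y = rng.random(50), rng.random(50)

highlight = y > 0.9                       # the few points we want to stand out
plt.scatter(x[~highlight], y[~highlight], color="lightgray")  # muted context
plt.scatter(x[highlight], y[highlight], color="crimson")      # salient hue
plt.title("Preattentive attribute: color hue")
plt.show()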



Time to Insight
The time to insight corresponds to the time it takes to draw insights from a graph or a
visualization. The lower, the better. You want your audience to get insights from a
visualization as quickly as possible.

Pie charts are an excellent example to illustrate this concept, and while they are still
widely used, you really want to stay away from them.

High time to insight 📈


Using the pie chart example above, you can notice that the time it takes to get insights
from this type of chart is very high. You need to go back and forth between the slices
and the legend to understand it. Here, you're making your audience work hard to get
your message.
Now let's look at how we could improve it and reduce the time to insight.

Time to insight reduced 📉


Convert your visualization to a simple horizontal bar chart, and voila! Your eyes naturally
scan down through each country. They don't need to move around in the chart as they
did with the previous example.

Want to focus your audience's attention on the top-performing European market? You
can use the preattentive attribute concepts seen above to reduce the time to insight
even more.

Your audience is now starting to see your story. And it took them only a few seconds for
that.
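
A minimal sketch of this pie-to-bar conversion, using hypothetical country shares, shows how little code the improvement takes; sorting the bars lets the eye scan straight from largest to smallest:

import matplotlib.pyplot as plt

countries = ["Germany", "France", "UK", "Spain", "Italy"]  # hypothetical shares
shares = [42, 28, 15, 9, 6]

# Sort ascending so barh() draws the largest bar at the top
values, labels = zip(*sorted(zip(shares, countries)))

plt.barh(labels, values)  # horizontal bars: no legend round-trips needed
plt.xlabel("Sales share (%)")
plt.show()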



Data-Ink Ratio
The larger the share of a graphic’s ink devoted to data, the better — Edward Tufte

Your graphs are made of ink. Some of this ink represents what's important, and some
doesn’t. Edward Tufte's book, The Visual Display of Quantitative Information, introduces
the data-ink ratio as a concept that says you should dedicate as much ink as possible to
the data. In other words, you should eliminate all the unnecessary information distracting
your audience from the message you're trying to convey.

To maximize your data-ink ratio in your graphs, you should ask yourself, 'Would the data
suffer any loss if this were eliminated?' If the answer is 'no,' get rid of it.

Take a moment to look at the combo line chart below, measuring two critical mobile app
performance metrics.

Let's see how to maximize the data-ink ratio in just a few steps.



By applying a set of simple actions, you have eliminated all the noise in this graph and
reduced your audience's cognitive load. Your message is now hitting them faster.
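
In matplotlib terms, erasing non-data ink usually means removing spines, tick marks, gridlines, and the legend. A minimal sketch with made-up session counts:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sessions = [120, 135, 150, 170, 160, 190]

fig, ax = plt.subplots()
ax.plot(months, sessions, color="steelblue")

# Erase non-data ink: chart borders and tick marks
for spine in ["top", "right", "left"]:
    ax.spines[spine].set_visible(False)
ax.tick_params(left=False, bottom=False)

# Label the line directly instead of adding a legend
ax.text(len(months) - 1, sessions[-1], "  Sessions", va="center", color="steelblue")
plt.show()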



Data Visualization Glossary: Key Terms
Data & Sources:
Dataset: A collection of data, often presented in a table format, that serves as the
foundation for a visualization.
Data Source: The origin of the data, such as spreadsheets, databases, APIs, or
surveys.

Chart Types:
Bar Chart: Uses rectangular bars to represent data values, often used for comparisons
between categories.

Line Chart: Connects data points with a line to show trends or changes over time.



Pie Chart: A circular chart divided into slices representing proportions of a whole (best
for limited categories).

Scatter Plot: Uses dots to represent data points, revealing relationships between two
variables.



Heatmap: Uses color intensity to represent data values in a grid, often used for large
datasets.

Box and Whisker Plot: Summarizes data distribution, showing median, quartiles, and
outliers.



Histogram: Represents the frequency distribution of data points, visually similar to bar
charts but for continuous data.

Area Chart: Similar to a line chart, but the space between the line and the x-axis is filled
with color, emphasizing the magnitude of change over time.



Stacked Bar Chart: Extends the bar chart concept by layering bars on top of each other,
allowing visualization of component parts that contribute to a whole.

Sankey Diagram (Ribbon Charts): Represents flows between different categories or
stages in a process, using arrows with varying widths to depict the volume of flow.



Treemap: A hierarchical chart that displays data as nested rectangles, with the area of
each rectangle proportional to the value it represents.

Components & Design:


Axis: Axes in data visualization represent the reference lines that define the scale
and boundaries of a chart.
X-axis (horizontal): This axis most commonly represents the independent
variable, the variable you are changing or controlling in your experiment or
analysis.
Y-axis (vertical): This axis most commonly represents the dependent variable,
the variable you are measuring and whose changes you are observing in
response to the independent variable on the x-axis.
Scale: The range of values represented on an axis.
Legend: Explains the meaning of colors, symbols, or patterns used in the
visualization.
Labels: Text descriptions for data points, axes, and chart elements.
Color Theory: The use of colors to evoke emotions, differentiate data points, and
enhance visual appeal.
Visual Hierarchy: Arranging elements to guide viewers' attention towards the most
important information.
Chart Junk: Unnecessary decorative elements that distract from the data.



Data Analysis & Exploration:
Exploratory Data Visualization: Creating visual representations of data to gain
insights, identify patterns, and uncover trends or anomalies. It is often used in the
early stages of data analysis.
Frequency Distribution: Shows the frequency or count of different data values
within a data set. Histograms and bar charts are commonly used to display
frequency distributions.
Normalization: Scaling data to a standard range to facilitate fair comparisons. It
ensures that data from different scales can be plotted together.
Outlier: A data point that significantly deviates from the rest of the data. Outliers
can impact the accuracy of visualizations and may need special treatment.
Quantitative Visualization: Represents data with numerical values, allowing precise
measurements and comparisons. Charts such as bar charts and scatter plots are
examples of quantitative visualization.
Regression Analysis: A statistical technique used to model variables' relationships. It
helps understand how one variable depends on another.
Time Series Visualization: Displays data points collected over successive intervals. It
is used to analyze trends and patterns in time-dependent data.

Interactivity:
Interactive Visualization: Allows users to engage with and manipulate data
visualizations in real time. It enables users to explore different aspects of the data
and customize their viewing experience.
Jitter: A technique used to add a small amount of random variation to data points,
especially in scatter plots, to avoid overlapping points.
User Interface (UI): The visual layout and controls that allow users to interact with
and explore data visualizations. A well-designed UI enhances the user experience.
Zooming: Allows users to magnify specific areas of a chart or plot for closer
examination. It helps explore fine details in large datasets.



Other Terms:
Color Palette: A set of colors used to represent different data elements or
categories in a chart. A well-chosen color palette enhances visual appeal and aids in
conveying information effectively.
Data Labels: Text elements that provide specific values associated with data points
in a chart.
Gridlines: Horizontal and vertical lines forming a chart grid, aiding in reading data
values and aligning data points.
Key (Legend): In data visualization, a part of the chart that explains the meaning of
different colors, symbols, or patterns used to represent data categories.
Matplotlib: A popular Python library for creating static, interactive, and animated
data visualizations. It provides a wide range of customizable charts and plots.
Seaborn: Built on top of Matplotlib, Seaborn is a Python library specifically designed
for creating high-level statistical graphics. It offers a user-friendly interface and a
streamlined approach to creating visually appealing and informative statistical charts
and plots.

