ANALYTICS
COMPENDIUM
2024
Created By
P. Madhav Charan
Tanmay Malhotra
Sarthak Singh
Prep Team
TABLE OF CONTENTS

1 BUSINESS ANALYSIS
What is Business Analysis?
Scope and Roles & Responsibilities of a Business Analyst
Essential Skills for Business Analysts
The Business Analysis Process
Business Analysis Techniques and Tools (by Phase)
Business Analysis Lifecycle Models
Identifying Opportunities for Process Improvement
Requirements Gathering and Management
Real-World Applications of Business Analysis
Marketing
Case Study 1: Spotify Uses Data Analytics to Dominate Music Streaming
Case Study 2: Target Discovers Teen Pregnancy with Analytics
Case Study 3: How Netflix Used Business Analytics for Understanding Their Users
Sales
Case Study 1: Amazon Recommends Products Based on Analytics
Case Study 2: Netflix Optimizes Content Delivery with Business Analytics
Operations
Case Study 1: Walmart Optimizes Inventory Management with Analytics: A Deeper Dive
Case Study 2: Amazon Fine-Tunes Delivery Operations with Machine Learning: Efficiency at Scale
Finance
Case Study 1: JPMorgan Chase Uses Analytics for Fraud Detection
Case Study 2: Wells Fargo Leverages Business Analysis to Improve Loan Default Prediction
Human Resources (HR)
Case Study 1: Google Analyzes Employee Data to Reduce Turnover
Case Study 2: Walmart Optimizes Scheduling with Workforce Analytics
Business Analysis Tools and Technologies
Building a Business Analytics Mindset
Career Path for Business Analysts
The Future of Business Analysis

2 DATA ANALYSIS
What is Data Analytics?
Data Analysis Project Lifecycle
Data Collection
Types of Data
Data Collection Methods
Data Cleaning and Wrangling
Measures of Central Tendency
AUC Curve

3 DATA VISUALIZATION
Introduction
Memory in Data Visualization
What is Data Visualization?
Why Visualize Data?
Data Visualization Process
Choosing the Right Chart
Data Visualization Principles
"The Grammar of Graphics" by Leland Wilkinson
Preattentive Attributes
Time to Insight
Data-Ink Ratio: "The larger the share of a graphic's ink devoted to data, the better" (Edward Tufte)
Data Visualization Glossary: Key Terms
PART 1
BUSINESS
ANALYSIS
What is Business Analysis?
Business analysis is a discipline focused on understanding an organization's business
needs and recommending solutions to improve efficiency, effectiveness, and overall
success. It involves a systematic approach to analyzing problems, identifying
opportunities, and designing solutions that bridge the gap between business needs and
technology.
Scope and Roles & Responsibilities of a Business Analyst
Business analysts are the bridge between the business world and the technical world.
While roles and responsibilities may vary depending on industry, company size, and
project methodology, the core functions remain broadly the same. Here is a breakdown of
the core functions of this position and the value a BA brings to the table:
Needs assessment: The BA identifies and analyzes business needs, challenges, and
goals, drawing on sources such as interviews, workshops, and data analysis. They
answer the "why" questions to get to the root of problems and find solutions that
address the actual needs.
Prioritization and Validation: Admittedly, not all requirements are equal. BAs work
with stakeholders to prioritize requirements based on importance, feasibility, and
alignment with business objectives. They also validate that the prioritized
requirements are achievable and genuinely meet business needs.
Feasibility Analysis: BAs don't just identify problems; they also help find solutions.
They evaluate potential solutions considering factors like cost, time, technical
feasibility, and potential risks. This ensures chosen solutions are practical and
deliver value.
Working with Development Teams: BAs collaborate closely with developers and IT
professionals throughout the development process. They ensure the solution
aligns with the design and requirements, answer questions, and provide
clarifications.
Data Analysis and Interpretation: Data is king in today's world. The BA's
responsibility is to make sense of the data by uncovering trends, patterns, and
insights for informed decision-making.
Process Mapping and Improvement: Business Analysts (BAs) go beyond merely
identifying problems; they actively seek and implement solutions. By meticulously
mapping out processes, they can streamline operations, remove bottlenecks, and
significantly boost overall efficiency.
Solution Design and Evaluation: Business Analysts (BAs) collaborate closely with
teams to craft solutions that align with business objectives. They carefully evaluate
various options, ensuring that the selected solutions are practical, cost-effective,
and deliver substantial value.
Quality Assurance and Testing: BAs are involved in the quality assurance process.
They collaborate with the testing team to identify and rectify issues before the final
solution is implemented.
Continuous Monitoring and Feedback: A BA's job doesn't end with implementation.
They monitor system performance, gather user feedback, and identify
opportunities for improvement.
Data Analysis: A strong foundation in data analysis is crucial. You'll need to utilize
tools like spreadsheets, SQL, and more advanced data analysis software such as
Python and R.
Basic Understanding of Programming: While extensive coding skills may not always
be required, a foundational understanding of programming logic can be beneficial.
This allows you to better understand technical limitations and collaborate
effectively with developers.
The following are the various phases, techniques, tools and lifecycle models of Business
Process Analysis (BPA):
Flowcharts: These visual tools outline the steps involved in a process, clearly
marking decision points and the flow of information.
Identifying Bottlenecks: Analyze steps that cause delays or slow down the process.
Redundancy: Look for steps that can be eliminated or combined for efficiency.
Lack of Automation: Explore opportunities to automate repetitive tasks.
Ineffective Communication: Improve communication flow between process
participants.
Data Accuracy: Ensure data used in the process is accurate and reliable.
This is where you, as the BA, act as a bridge between stakeholders with varying needs
and the development team. Here are some effective techniques:
Use Cases: Develop scenarios outlining user interactions with the project
deliverable. This helps visualize user journeys, identify specific functionalities, and
ensure the solution addresses actual needs.
End Users: Focus on their needs and pain points. What tasks should the project
solve for them? How will it improve their work experience?
Development Team: Consider their technical expertise and limitations. What can be
realistically developed within the project timeframe?
Once you have identified potential information system solutions, the next step is to
assess their viability. Feasibility analysis involves evaluating a proposed solution based
on three key factors:
Cost Feasibility: This involves estimating the costs associated with developing,
implementing, and maintaining the system. Does the organization have the budget
to support this solution?
Time Feasibility: Can the system be developed and implemented within the required
timeframe? This analysis considers the complexity of the system and the available
resources.
Technical Feasibility: Does the organization have, or can it realistically acquire, the
technical capability and infrastructure needed to build and operate the system?
A thorough feasibility analysis helps avoid pursuing unrealistic solutions and ensures
the chosen system aligns with the organization's practical limitations.
Here are some examples of how business analytics has benefited different core
business functions:
Marketing
Sales
Operations
Finance
Human Resources (HR)
Note: It is highly recommended that students conduct their own research to understand how these large corporations use
analytics to improve their business.
Challenge:
Solution:
Customer Segmentation: This data allows them to segment users into distinct
groups based on genre preferences, listening times, and device usage.
Results:
Key Takeaways:
The power of user data: When leveraged strategically, user data can provide a deep
understanding of customer preferences and behavior.
This example might surprise you, but it demonstrates the unexpected yet impactful results
business analytics can bring.
Challenge:
Solution:
Target worked with a data analytics firm to analyze purchasing data. This included
factors like browsing history, past purchases, and demographics.
The analysis revealed a peculiar pattern: specific groceries and household items
were being purchased together more frequently by individuals who hadn't
previously purchased baby products.
Deeper Dive:
Ethical Dilemma:
Target faced a major ethical dilemma. Should they leverage this knowledge for
targeted marketing to pregnant teens?
Solution:
Target chose not to exploit this information for marketing purposes. However, the
insights did prove valuable.
Results:
Key Takeaways:
Business analytics can uncover hidden patterns and connections that traditional
marketing methods might miss.
Netflix, the streaming giant, wasn't always the leader it is today. In its early days, they
relied on traditional methods like surveys and focus groups to understand customer
preferences. While these provided some insights, they lacked the depth and granularity
needed to truly thrive.
The Challenge:
The Solution:
Netflix embraced business analytics and built a data-driven culture. Here's how they did
it:
Data Gathering: Netflix collects massive amounts of data on user behavior, including:
Data Analysis: Advanced analytics tools analyze this data to identify patterns and
trends. Here are some examples of what they uncover:
Actionable Insights: The data provides valuable insights that fuel strategic decisions:
Content acquisition: They prioritize acquiring shows and movies that cater to
specific micro-genres with high engagement.
Original content creation: Data helps identify themes and storylines likely to
resonate with their audience, guiding their original content production.
The Results:
Original content like "Stranger Things" and "Squid Game" have become global
phenomena.
Key Takeaways:
Netflix's success story exemplifies the power of business analytics in sales and marketing.
By leveraging data, they were able to:
Past purchase history: Analyzing what a customer has bought previously helps
predict their future interests.
Browsing behaviour: Tracking what products users view and for how long
provides insights into their current interests.
Demographic data: While age, location, and similar data can be less personal, it
can still suggest broader product categories relevant to user segments.
Increased sales: Customers are more likely to buy products they are
recommended based on their past behavior and interests
Key Takeaways:
The Power of Data Analytics: By analyzing user data, businesses can gain deep
insights into customer behaviour, enabling them to personalize product
recommendations effectively.
Data-Driven Decision Making: Utilizing data analytics allows businesses to make well-
informed decisions regarding product placement and marketing strategies, ultimately
driving better outcomes.
Challenge: Netflix needed to ensure a smooth streaming experience for all users,
especially during peak hours. This was critical to avoid user frustration and potential
churn.
Analyzing User Data: Netflix gathers data on user viewing habits, including location,
time of day, and preferred device.
Predicting Demand: Based on historical data and current trends, they predict when
and where demand for specific content will be high.
Network Optimization: Data analysis helps identify network bottlenecks and allows
for targeted infrastructure improvements.
Reduced Buffering: Prepositioning content close to users minimizes the distance data
needs to travel, resulting in faster loading times and less buffering.
Scalability: By anticipating demand, Netflix can efficiently manage traffic spikes and
prevent service disruptions.
Increased Sales and Customer Retention: A smooth and reliable streaming experience
reduces frustration, keeps users subscribed, and even encourages them to watch
more content, ultimately leading to increased sales.
Key Takeaways:
Proactive Data Analysis: By analyzing user data, businesses can take proactive steps
to ensure a seamless user experience.
Operations
Case Study 1: Walmart Optimizes Inventory Management with Analytics: A Deeper Dive
Order the right amount of inventory: They won't be caught understocked when the
beach season hits, frustrating customers.
Optimize storage space: They can allocate space based on predicted demand,
avoiding unnecessary storage costs for off-season items.
Minimized Stockouts: By predicting demand more precisely, they ensured shelves are
stocked with the right products at the right time.
Reduced Overstocking Costs: They optimized inventory levels, freeing up capital and
storage space for other uses.
The benefits extended beyond Walmart's bottom line. Customers enjoyed a smoother
shopping experience with fewer stockouts. Additionally, suppliers benefited from clearer
demand forecasts, allowing them to optimize their own production and deliveries. This
case study highlights how business analytics can transform traditional business practices.
By embracing data-driven decision making, companies like Walmart can achieve greater
efficiency, improve customer satisfaction, and gain a competitive edge.
Case Study 2: Amazon Fine-Tunes Delivery Operations with Machine Learning: Efficiency
at Scale.
Unforeseen Delays: Traffic congestion, weather events, and other factors can disrupt
delivery schedules.
Last-Mile Delivery Challenges: The final leg of the delivery process, getting packages
to customers' doorsteps, presents unique logistical hurdles.
These challenges can lead to late deliveries, frustrated customers, and increased
operational costs.
Amazon utilizes machine learning algorithms to analyze vast amounts of data related to
deliveries, including:
Historical Delivery Data: Past delivery times, routes taken, and encountered obstacles
are used to identify patterns and predict future delivery times.
Route Optimization: Algorithms suggest the most efficient delivery routes for drivers,
considering factors like traffic patterns, package sizes, and delivery locations.
Predictive Delivery Times: Machine learning helps predict accurate delivery windows,
setting realistic expectations for customers.
Dynamic Route Adjustments: In case of unforeseen delays, the system can reroute
deliveries in real-time to minimize disruptions.
Reduced Delivery Times: Optimized routes and predictive models lead to faster
deliveries, exceeding customer expectations.
Lower Operational Costs: Efficient route planning minimizes fuel consumption and
driver time, leading to cost savings.
Enhanced Customer Satisfaction: Predictable delivery times and fewer delays improve
customer experience and loyalty.
Beyond Delivery:
Machine learning finds applications beyond just delivery optimization at Amazon. It's used
in areas like:
Challenge: JPMorgan Chase, like many financial institutions, faces the constant threat of
fraudulent transactions. Their primary objective was to:
This analysis focuses on identifying anomalies in spending habits that deviate from a
customer's typical spending patterns.
As a result, they are able to prevent significant financial losses and protect customer
accounts from fraudulent activity.
Key Takeaways:
The Power of Data Analytics: Analyzing customer data allows for proactive
identification of fraudulent behavior.
Case Study 2: Wells Fargo Leverages Business Analysis to Improve Loan Default
Prediction
Data Gathering and Analysis: Wells Fargo's business analysts collaborated with data
scientists to gather and analyze a comprehensive dataset. This included customer
demographics, financial history, loan details, and historical default data.
Identifying Key Factors: By analyzing the data, they identified key factors that
significantly influence loan default rates. These could include credit score, debt-to-
income ratio, employment history, and loan purpose.
Process Improvement: The model was then integrated into the loan approval process,
enabling a more data-driven approach.
Streamlined Loan Approval Process: By focusing on the most relevant factors, the
process became more efficient, improving customer experience.
Key Takeaways:
Improved Risk Management: Business analysis can contribute to more robust risk
management practices, safeguarding financial institutions and their customers.
By analyzing the data, Google was able to identify patterns and trends associated with
employee dissatisfaction. This led to:
Key Takeaways:
People Analytics are Powerful: Analyzing employee data provides valuable insights
into employee sentiment and factors impacting retention.
Challenge: Walmart, the world's largest retailer, faced challenges with inefficient
scheduling that led to:
Employee dissatisfaction: Inconsistent schedules and long hours could lead to fatigue
and low morale.
Increased costs: Overstaffing during peak hours and understaffing during off-peak
hours impacted profitability.
Predictive Analytics: By analyzing historical data and sales forecasts, they predicted
customer traffic and staffing needs for various times and departments.
Optimized Scheduling: Using the insights from analytics, they created data-driven
schedules that:
A list of commonly used software categories and specific tools within each, along with a
brief explanation of their purpose:
Microsoft Visio: The go-to tool for creating professional flowcharts, process maps,
and UML diagrams. BAs use it to visualize workflows, identify bottlenecks, and
communicate processes clearly to stakeholders.
Requirements Management:
Jira: A popular agile project management tool with robust features for user story
management, requirements traceability, and issue tracking. BAs utilize Jira to capture,
track, and prioritize requirements throughout the development lifecycle.
Microsoft Word: While not a dedicated tool, Word can be used effectively, especially
for smaller projects. BAs can document requirements in a structured format,
facilitating clear communication with stakeholders.
Microsoft Excel: A versatile tool for data manipulation and analysis. BAs use power
queries, pivot tables, and charts to clean data, identify trends, and extract insights.
SQL: Essential for interacting with relational databases. BAs use SQL to retrieve and
integrate data for further analysis.
Python: A powerful programming language with libraries like pandas and NumPy for
advanced data manipulation, analysis, and visualization.
Microsoft Teams: A versatile platform for chat, video conferencing, file sharing, and
task management. Ideal for BAs to collaborate with team members and stakeholders,
particularly in remote settings
Slack: A real-time messaging tool that supports file sharing and project-specific
channels. BAs use Slack for quick updates, brainstorming, and maintaining project
communication.
Microsoft Power BI: A business intelligence (BI) tool for creating interactive
dashboards and reports. BAs can leverage Power BI to transform data into visually
appealing insights for stakeholders, enabling better data-driven decision-making.
Tableau: Another leading data visualization platform known for its ease of use and
rich visual capabilities. BAs can create clear and compelling dashboards, charts, and
maps to communicate complex data insights to both technical and non-technical
audiences.
Google Data Studio: A free data visualization tool from Google, offering integrations
with various Google products and a user-friendly interface. BAs can use Data Studio
to create basic to complex data visualizations, depending on project needs.
Mind Mapping Tools (e.g., MindMeister, XMind): For brainstorming ideas, capturing
requirements, and visually organizing information.
User Interface (UI) Prototyping Tools (e.g., Figma, InVision): To create mockups and
prototypes of user interfaces, helping stakeholders visualize potential solutions.
Project Management Tools (e.g., Asana, Trello): For managing tasks, setting
deadlines, and tracking project progress, especially for smaller projects or for
personal task organization.
The heart of a Business Analyst (BA) lies not just in the technical tools, but in fostering a
unique way of thinking. A strong Business Analytics Mindset equips you to be an
insightful translator, transforming raw data into actionable strategies.
Think Strategically: Move beyond the tactical. Consider how data insights can inform
long-term business goals and strategic decision-making. Ask questions that bridge
the gap between data and business objectives, like "How can customer data
analytics improve our competitive advantage?"
Identify the "So What?": Don't get lost in the data jungle. Every analysis should lead
to a clear conclusion or actionable recommendation. Ask yourself "So what does this
data tell us? How can we leverage these insights to make a positive impact?"
Quantify Whenever Possible: Not everything can be a number, but strive to quantify
aspects whenever possible. This strengthens the foundation of your analysis and
adds objectivity to your recommendations.
Know Your Audience: Tailor your communication style and level of technical detail
to resonate with your audience. Speak in clear, concise language for non-technical
stakeholders, and provide more technical details when presenting to data-savvy
audiences.
Focus on Storytelling: Data visualizations and compelling narratives can breathe life
into insights. Use charts, graphs, and real-world examples to make your message
impactful and memorable. Think of yourself as a translator, transforming complex
data sets into a story that everyone can understand.
Data Privacy: Be aware of data privacy regulations (e.g., GDPR, CCPA) and ensure
data collection and analysis comply with all relevant laws and ethical codes. Respect
user privacy and prioritize data security.
Data Bias: Data can carry biases that reflect the real world it was collected from. Be
mindful of potential biases in data sets and how they might influence your analysis.
Present findings with transparency and acknowledge any limitations.
Consider the Long-Term Impact: Think beyond immediate solutions. Consider the
long-term implications of your recommendations. How might they impact
stakeholders, business processes, and even society as a whole?
Stay Updated on Ethical Issues: The data landscape is constantly evolving. Stay
updated on emerging ethical considerations in data collection, analysis, and AI to
ensure your practices remain responsible.
IT Business Analyst
Artificial Intelligence (AI) and Machine Learning (ML): AI and ML are transforming
numerous industries. BAs will need to understand how these technologies can be
leveraged to automate tasks, generate insights from data, and support better decision-
making. This might involve working with data scientists to identify opportunities for AI
integration or ensuring ethical considerations are addressed in AI-powered solutions.
Big Data and Data Analytics: The ever-increasing volume of data presents both
challenges and opportunities. BAs will need to be familiar with data analysis techniques
and tools to extract meaningful insights from data sets. This might involve learning to
work with tools like SQL for data querying or Python for data manipulation.
Agile and DevOps Methodologies: Agile and DevOps approaches are gaining traction
across organizations. BAs will need to adapt to these faster-paced development cycles,
focusing on iterative requirements gathering, continuous feedback loops, and
collaboration with development teams.
Cloud Computing: The shift towards cloud-based solutions requires BAs to understand
cloud platforms and their capabilities. They might be involved in evaluating cloud-based
solutions, ensuring data security in the cloud, and adapting existing business processes
for cloud environments.
PART 2
DATA
ANALYSIS
What is Data Analytics?
Data analytics is the process of examining raw data to uncover patterns, draw
conclusions, and support decision-making. It involves various techniques and
tools to transform, organize, and model data in meaningful ways.
It's the science of examining raw data with the goal of making informed
conclusions about the information it contains.
Why is it Important?
Predictive Analytics: Data analytics isn't just about looking at the past; it's
about predicting the future. Businesses can use advanced analytics
techniques to forecast future trends, customer behavior, and market
demands. This allows them to be proactive and make strategic decisions
that position them for success.
Let's walk through the data analysis project lifecycle with an example project
1. Problem Definition
Objective: Identify customers who are likely to stop purchasing from the
company and understand the factors influencing customer churn.
Activities: Meet with stakeholders to define "churn," gather requirements,
and set the project goal to reduce churn rate by 10% over the next six
months.
2. Data Collection
Objective: Gather data relevant to customer churn.
Activities: Collect data from various sources,
such as transaction records, customer service interactions,
loyalty program data, and demographic information.
This data could be stored in the company’s CRM system, databases, and
external data sources.
Types of Data
1. By Nature:
Qualitative Data: Descriptive data that cannot be measured numerically. It is
often used to categorize or classify objects. Examples include colors,
names, labels, and opinions.
Quantitative Data: Numerical data that can be measured and quantified. It is
used to describe quantities and includes both discrete and continuous data.
Examples include age, height, weight, and temperature.
2. By Format:
Structured Data: Organized in a predefined manner, often in rows and
columns (e.g., databases, spreadsheets). It is easily searchable and
analyzable. Examples include SQL databases and Excel files.
Unstructured Data: Not organized in a predefined structure, making it more
challenging to analyze. Examples include text documents, emails, videos,
social media posts, and images.
Semi-structured Data: Does not fit into a rigid structure like structured
data but contains tags or markers to separate data elements. Examples
include JSON, XML, and HTML files.
3. By Source:
Primary Data: Collected directly from the source or original data that has
not been altered or manipulated. Examples include survey responses,
experimental results, and sensor readings.
Secondary Data: Collected from existing sources that have been previously
gathered, processed, and published by others. Examples include research
papers, reports, and datasets from government agencies.
4. By Measurement Scale:
Nominal Data: Categorical data without a specific order. Examples include
gender, race, and types of cuisine.
Ordinal Data: Categorical data with a specific order but no fixed interval
between categories. Examples include rankings (e.g., first, second, third)
and satisfaction levels (e.g., satisfied, neutral, dissatisfied).
Interval Data: Numerical data with ordered categories and a fixed interval
between values but no true zero point. Examples include temperature in
Celsius and calendar dates.
Ratio Data: Numerical data with ordered categories, a fixed interval, and a
true zero point. Examples include height, weight, age, and income.
5. By Temporal Characteristics:
Cross-sectional Data: Collected at a single point in time, representing a
snapshot. Examples include census data collected on a specific date.
Time Series Data: Collected over different time periods, showing how data
points change over time. Examples include stock prices, monthly sales
figures, and daily temperatures.
Longitudinal Data: Similar to time series data but often involves repeated
observations of the same subjects over time. Examples include panel
studies and cohort studies.
6. By Sensitivity:
Public Data: Openly available and not sensitive. Examples include open
government data and public datasets.
Private Data: Sensitive and restricted data requiring authorization for
access. Examples include personal identifiable information (PII), financial
records, and medical records.
Collecting data for a data analytics project involves various methods, each
suited to different types of data and analysis objectives. Here are some
common ways to collect data for such projects:
2. Interviews:
4. Experiments:
7. Transactional Data:
8. Crowdsourcing:
GIS and GPS: Use Geographic Information Systems (GIS) and Global Positioning
Systems (GPS) for collecting location-based data.
Satellite Imagery: Utilize remote sensing data from satellites for large-scale
environmental and geographical analysis.
Third-party APIs: Access data from external services and platforms through
their APIs, such as social media analytics, financial data, or weather
information.
Data cleaning and wrangling techniques are language-specific. Here we will take
examples in Python.
Matplotlib pyplot can be used to draw a bar plot which helps in identifying the
number of null values in each column.
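For example, a short sketch (assuming the working DataFrame is named df, as in the snippets that follow):
import matplotlib.pyplot as plt
df.isnull().sum().plot(kind='bar')   # one bar per column, height = number of null values
plt.ylabel('Number of null values')
plt.title('Missing values per column')
plt.show()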
a. Dropping Rows
df_cleaned = df.dropna()
b. Dropping Columns
df_cleaned = df.dropna(axis=1)
a. Mean/Median/Mode Imputation
df['column'].fillna(df['column'].mean(), inplace=True)
df['column'].fillna(df['column'].median(), inplace=True)
b. Forward/Backward Fill
df.fillna(method='ffill', inplace=True) # Forward fill
df.fillna(method='bfill', inplace=True) # Backward fill
c. Interpolation
df['column'].interpolate(method='linear', inplace=True)
d. K-Nearest Neighbor(KNN)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)
2. Identify Outliers
Visual Methods
1.Box plots:
3. Histogram:
Statistical Methods
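A minimal sketch of two common statistical checks, z-scores and the IQR rule (assuming a numeric column named 'column' in the working DataFrame df); it also defines the variables used in the handling snippets below:
import numpy as np
from scipy import stats

# Z-score method: flag points more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(df['column']))
print(df[z_scores > 3])                     # potential outliers

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
print(df[(df['column'] < Q1 - 1.5 * IQR) | (df['column'] > Q3 + 1.5 * IQR)])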
3. Handling outliers:
1. Drop outliers:
df_cleaned = df[(z_scores <= 3)]  # keep rows within 3 standard deviations of the mean
df_cleaned = df[~((df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR)))]  # remove rows outside the interquartile-range fences
2. Replace with the median:
median = df['column'].median()
df['column'] = np.where((df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR)), median, df['column'])  # replace values outside the IQR fences with the median
1. Normalization:
To scale the data to a specific range, typically [0, 1]. This is particularly
useful when features have different ranges, helping to ensure that they
contribute equally to the analysis. Normalization is suitable for algorithms
that compute distances between data points, such as KNN, K-Means, SVM,
and neural networks.
2. Standardization:
To transform data so that it has a mean of 0 and a standard deviation of
1. Standardization is useful when data needs to be normally distributed
and when features are measured on different scales. Commonly used in
machine learning algorithms that assume or benefit from normally
distributed data, such as Linear Regression, Logistic Regression, and
Principal Component Analysis (PCA).
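A minimal scikit-learn sketch of both techniques on an illustrative two-column table:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = pd.DataFrame({'age': [22, 35, 58, 41], 'income': [30000, 52000, 110000, 64000]})

# Normalization: rescale each feature to the [0, 1] range
normalized = pd.DataFrame(MinMaxScaler().fit_transform(data), columns=data.columns)

# Standardization: rescale each feature to mean 0 and standard deviation 1
standardized = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)
print(normalized, standardized, sep='\n')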
5. Log Transformation:
Log transformation can help in handling skewed data by compressing the
range of values and making the data more normally distributed. For example,
income data often has a long right tail, with a few individuals earning
significantly more than the majority. This skewed distribution can distort
statistical analyses and model performance.
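A small sketch applying a log transformation to an illustrative right-skewed income column (np.log1p handles zero values safely):
import numpy as np
import pandas as pd

income = pd.Series([20000, 25000, 30000, 32000, 45000, 60000, 900000])  # long right tail
income_log = np.log1p(income)               # log(1 + x) compresses the large values
print(income.skew(), income_log.skew())     # skewness drops after the transform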
Data Distributions
Measures of central tendency:
1. Mean: The average of the data points.
2. Median: The middle value when the data points are ordered.
3. Mode: The most frequently occurring value in the data set.
Measures of dispersion:
4. Interquartile Range (IQR): The difference between the 75th and 25th
percentiles.
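A quick pandas sketch computing these measures on an illustrative series:
import pandas as pd

values = pd.Series([4, 7, 7, 9, 12, 15, 21, 40])
print("Mean:", values.mean())
print("Median:", values.median())
print("Mode:", values.mode().tolist())
print("IQR:", values.quantile(0.75) - values.quantile(0.25))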
Visualization:
Visual tools can provide a clear picture of the data distribution.
1. Histogram:
2. Box Plot:
4. Probability Plots:
Compares the quantiles of the data to the quantiles of a standard
distribution (e.g., the normal distribution). The closer the points are to the
reference line, the more the data resembles that distribution.
Refer https://www.youtube.com/watch?app=desktop&v=okjYjClSjOg for
better understanding.
Descriptive analysis
Diagnostic analysis
Predictive analytics
Type I error:
A Type I error is also known as a false positive and occurs when a researcher
incorrectly rejects a true null hypothesis. This means that you report your
findings as significant (accepting the alternative hypothesis) when in fact they
occurred by chance. The probability of making a Type I error is represented by
your alpha level (α), which is the p-value below which you reject the null
hypothesis. A p-value of 0.05 indicates that you are willing to accept a 5%
chance that you are wrong when you reject the null hypothesis. You can reduce
your risk of committing a type I error by using a lower value for p. For example,
a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.
However, using a lower value for alpha means that you will be less likely to
detect a true difference if one really exists (thus risking a type II error).
Type II error:
A type II error is also known as a false negative and occurs when a researcher
fails to reject a null hypothesis that is false. Here a researcher concludes there
is not a significant effect when there really is one. The probability of making a Type
II error is called Beta (β), and this is related to the power of the statistical test
(power = 1- β). You can decrease your risk of committing a type II error by
ensuring your test has enough power. You can do this by ensuring your sample
size is large enough to detect a practical difference when one truly exists.
Ideally, you would like to keep both errors as low as possible which is
practically not possible as both errors are complementary to each other. Hence,
commonly used values of alpha are 0.01, 0.05, and 0.10 which gives a good
balance between alpha and beta.
Example:
The salaries of postgraduates are higher than the salaries of graduates.
Sample mean (x̄) = 90
Sample size (n) = 81
Population mean (μ) = 82
Population standard deviation (σ) = 20
H0: μ = 82
H1: μ > 82
From the z-table, the critical value at α = 0.05 is 1.645.
z = (x̄ - μ) / (σ / √n) = (90 - 82) / (20 / √81) = 3.6
Since 3.6 > 1.645, we reject the null hypothesis.
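A small Python sketch of this calculation, using the values above:
from math import sqrt
from scipy import stats

x_bar, mu, n, sigma = 90, 82, 81, 20
z = (x_bar - mu) / (sigma / sqrt(n))     # (90 - 82) / (20 / 9) = 3.6
p_value = 1 - stats.norm.cdf(z)          # one-tailed p-value (about 0.00016)
print(z, p_value)
# z = 3.6 exceeds the critical value 1.645 at alpha = 0.05, so we reject H0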
Here, let's say we want to know if girls on average score 10 marks more
than boys. We have the information that the standard deviation for girls'
scores is 100 and for boys' scores is 90. Then we collect the data of 20
girls and 20 boys by using random samples and recording their marks.
Finally, we also set our α value (significance level) to be 0.05.
In this example:
Mean Score for Girls (SampleMean) is 641
Mean Score for Boys (SampleMean) is 613.3
Standard Deviation for the Population of Girls is 100
Standard deviation for the Population of Boys is 90
Sample Size is 20 for both Girls and Boys
Hypothesized difference between the population means is 10
Putting these values into the above formula, we get a z-score, and from it we
compute a p-value of 0.278, which is greater than 0.05; hence we fail
to reject the null hypothesis.
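A similar sketch for this two-sample z-test, which reproduces the p-value of about 0.278:
from math import sqrt
from scipy import stats

mean_girls, mean_boys = 641, 613.3
sd_girls, sd_boys = 100, 90          # population standard deviations
n = 20                               # sample size per group
hypothesized_diff = 10

z = ((mean_girls - mean_boys) - hypothesized_diff) / sqrt(sd_girls**2 / n + sd_boys**2 / n)
p_value = 1 - stats.norm.cdf(z)      # one-tailed test
print(round(z, 3), round(p_value, 3))   # z is about 0.588, p is about 0.278, so we fail to reject H0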
If we have a sample size of less than 30 and do not know the population
variance, then we must use a t-test.
One-sample and Two-sample Hypothesis Tests: The one-sample t-test is a
statistical hypothesis test used to determine whether an unknown population
parameter differs from a specific value.
In statistical hypothesis testing, a two-sample test is a test performed on the
data of two random samples, each of which is independently obtained. The
purpose of the test is to determine whether the difference between these two
populations is statistically significant.
Here, let's say we want to determine if, on average, boys score 15 marks more
than girls in the exam. We do not have information about the variance (or
standard deviation) of girls' or boys' scores. To perform a t-test, we
randomly collect the data of 10 girls and 10 boys with their marks. We choose our
α value (significance level) to be 0.05 as the criterion for hypothesis testing.
In this example:
Mean Score for Boys is 630.1
Mean Score for Girls is 606.8
Hypothesized difference between the population means is 15
Standard Deviation for Boys’ score is 13.42
Standard Deviation for Girls’ score is 13.14
Putting these values into the above formula, we get a t-score, and from it we
compute a p-value of 0.019, which is less than 0.05; hence
we reject the null hypothesis and conclude that, on average, boys
score 15 marks more than girls in the exam.
Supervised Learning
Key Points:
Labeled Data: Requires a dataset where each input is paired with the correct
output.
Objective: Learn a function that maps inputs to the correct output.
Applications: Classification (assigning inputs to predefined categories) and
regression (predicting continuous values).
Examples:
Classification
Classification involves predicting a categorical label for an input. The goal is to
assign inputs to predefined classes or categories.
Types of Classification:
Binary Classification: The task is to classify the input into one of two possible
classes. Examples include spam detection (spam vs. not spam) and medical
diagnosis (disease vs. no disease).
Multiclass Classification: The task is to classify the input into one of three or
more classes. Examples include digit recognition (0-9) and document
categorization (sports, politics, technology).
Multilabel Classification: Each input can be assigned multiple labels. Examples
include tagging multiple objects in an image or categorizing a document into
multiple topics
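A minimal scikit-learn sketch of the first type above, binary classification, on synthetic data (all names and parameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for, e.g., spam vs. not spam
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))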
Types of Regression:
Unsupervised Learning
Unsupervised learning involves training a model on data that does not have
labeled outputs. The goal is to infer the natural structure present within a set of
data points. This is used for tasks where we do not know the desired output and
want to discover patterns or groupings in the data.
Key Points:
Unlabeled Data: Uses data that does not have associated labels.
Objective: Find hidden patterns, groupings, or structures in the data.
Clustering
Clustering aims to group similar data points into clusters based on their
characteristics.
Dimensionality Reduction
Linear Discriminant Analysis (LDA): While primarily used for supervised learning,
LDA can also be used in an unsupervised manner to reduce dimensions.
Autoencoders: Neural network-based models that learn to compress data into a
lower-dimensional space and then reconstruct it.
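A brief scikit-learn sketch combining clustering and dimensionality reduction on synthetic data:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # group similar points
X_2d = PCA(n_components=2).fit_transform(X)                               # compress 5 features to 2
print(labels[:10], X_2d[:3])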
Application
Sales and Revenue Forecasting:
Predicting future sales based on past sales data, economic indicators,
and market trends.
Pricing Strategy:
Determining the optimal price point for products by analyzing the
relationship between price and demand.
Marketing Campaign Analysis:
Evaluating the effectiveness of marketing campaigns by assessing the
impact of advertising spend on sales growth.
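For instance, a hedged sketch of a simple regression for sales forecasting, with made-up advertising-spend and sales figures:
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [20], [30], [40], [50]])     # illustrative advertising spend
sales = np.array([120, 190, 270, 330, 410])             # corresponding sales (illustrative)

model = LinearRegression().fit(ad_spend, sales)
print(model.coef_[0], model.intercept_)                  # estimated sales lift per unit of spend
print(model.predict([[60]]))                             # forecast sales at a new spend level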
Causes:
The model is too simple (e.g., using a linear model for a non-
linear problem).
Insufficient training time (in the case of iterative algorithms like
neural networks).
Inadequate features or too few features used in the model.
Too much regularization, which restricts the model's capacity.
Symptoms:
High bias: The model makes strong assumptions about the data,
leading to poor performance.
Low training accuracy.
Low test accuracy.
Increase Training Time: Train the model for more epochs (in the
case of neural networks) to ensure it has enough time to learn the
data patterns.
Overfitting
Definition:
Overfitting occurs when a model is too complex and captures not only
the underlying patterns but also the noise in the training data. It
performs very well on the training data but poorly on unseen data (test
set) because it does not generalize well.
Symptoms:
High variance: The model is overly sensitive to small fluctuations in
the training data.
High training accuracy.
Low test accuracy.
Logistic Regression
Logistic Function:
The logistic function, also known as the sigmoid function, is used to
map predicted values to probabilities between 0 and 1. The function is
defined as σ(z) = 1 / (1 + e^(-z)), where z is the linear combination of the
input features.
1. Confusion Matrix
A confusion matrix provides a summary of prediction results on a
classification problem. The matrix shows the number of true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN).
Components:
True Positive (TP): The model correctly predicts the positive class.
True Negative (TN): The model correctly predicts the negative
class.
False Positive (FP): The model incorrectly predicts the positive
class.
False Negative (FN): The model incorrectly predicts the negative
class.
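A small scikit-learn sketch on made-up labels:
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)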
2. Accuracy: the proportion of all predictions that are correct, i.e., (TP + TN) / (TP + TN + FP + FN).
4. Recall: the proportion of actual positive cases the model correctly identifies, i.e., TP / (TP + FN).
6. AUC-ROC Curve
The model outputs probabilities, and different thresholds are used to decide
the class labels. For each threshold, calculate TPR and FPR.
Each threshold results in a point on the ROC curve with FPR on the x-axis
and TPR on the y-axis.
Connect the points to form the ROC curve.
AUC Curve
Definition:
AUC represents the area under the ROC curve. It provides a single scalar value
to summarize the model's performance across all thresholds.
AUC = 1: Perfect model that distinguishes between positive and negative
classes without any errors.
AUC = 0.5: Model with no discrimination power, equivalent to random
guessing.
0.5 < AUC < 1: The model has some degree of discrimination power.
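A minimal scikit-learn sketch computing the ROC points and the AUC for illustrative scores:
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]                       # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]     # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)        # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, y_scores))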
Log loss is particularly valuable when you need to understand and trust the
model's confidence in its predictions. It's not just about whether the model got
the prediction right, but also about how confident it was in that prediction.
Entropy: This is a measure of the randomness of the data. The higher the
entropy, the more random the data is.
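A short illustrative sketch of how entropy and the information gain of a candidate split can be computed (the toy labels are made up):
import numpy as np

def entropy(labels):
    # Shannon entropy of a list of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = ['yes', 'yes', 'no', 'no', 'yes', 'no']          # target before the split
left, right = ['yes', 'yes', 'yes'], ['no', 'no', 'no']   # subsets after splitting on an attribute

gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(round(gain, 3))   # higher information gain = better split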
Here are the steps involved in building a decision tree using the ID3 algorithm:
Calculate the entropy of the target variable.
For each attribute in the data, calculate the information gain that would be
achieved by splitting the data on that attribute.
Choose the attribute with the highest information gain.
Split the data on the chosen attribute.
The stopping criterion is a set of rules that determines when to stop growing
the tree. Common stopping criteria include:
Decision trees are a powerful and versatile machine learning algorithm that can
be used for a wide variety of tasks. They are relatively easy to understand and
interpret, which makes them a good choice for many applications.
Here we have an example with many points on a two-dimensional scatter
plot. How does a decision tree work? It cuts the space up
into slices over several iterations.
The resulting Tree (obtained by applying algorithms like CART, ID3) which will
be later used to predict the outcomes.
PART 3
DATA
VISUALIZATION
Introduction
Data visualization is the art of representing information and data in visual formats like
charts, graphs, maps, and infographics, making complex information quickly and easily
understandable. Instead of deciphering rows of numbers in a spreadsheet, a clear and
colorful chart can effectively reveal trends and patterns. This accessibility allows
everyone, regardless of technical background, to grasp key insights.
There is a story in your data. As the analyst, you know the story within your data, but
how do you communicate it effectively and ensure your audience takes concrete
actions? Data visualization is the final step in your analytical journey, enabling you to tell
your story compellingly and convert insights into decisive measures.
But telling a compelling story is no easy task. Like any other type of communication, the
key challenge in Data Visualization is to identify which elements in your message signal
— the information you want to communicate, and which are noise — unnecessary
information polluting your message.
With that in mind, your main goal is to present content to your audience in a way that
highlights what's important, eliminating any distractions. You've probably already spent a
lot of time understanding, cleaning, and modeling your data to reach a conclusion worth
sharing. So don't let this final step get in the way of properly communicating your key
insights.
Memory in Data Visualization
Iconic Memory
Processes visual information very quickly, lasting only a fraction of a second.
Acts as a flash storage for visual stimuli, deciding whether to discard or transfer the
information to short-term memory.
Short-Term Memory
Holds information for a few minutes but has limited capacity.
Can only process a limited amount of data at a time and is easily overwhelmed.
Long-Term Memory
Stores information for an extended period.
Information moves here from short-term memory if retained.
As the creator of data visualizations, your goal is to leverage your audience's iconic
memory to capture attention immediately and minimize the load on their short-term
memory to maintain focus. This approach ensures your key insights are effectively
communicated and more likely to be retained in long-term memory.
Before delving deeper into this let us understand what Data Visualization is.
Additional Benefits:
Accessibility for Wider Audiences: Data visualizations can cater to a wider audience,
including those with limited data analysis expertise. Complicated data becomes
approachable through clear visuals, making data analysis more inclusive.
Storytelling with Data: Data visualizations are powerful tools for storytelling. By
weaving data into a narrative, you can connect with your audience on an emotional
level, making the information more impactful and memorable.
https://infogram.com/blog/choose-the-right-chart/
Comparison charts compare one value with another, such as region-wise sales or the
economy rates of different bowlers in cricket. We can use the following charts for comparison.
Column charts
It is used to compare values across multiple categories.
Here, the category appears horizontally (X-axis) and values vertically (Y-axis).
In the column charts, you can also show information about parts of a whole
across different categories, and you can show this in absolute value as well as
relative terms. Here comes the concept of a stacked column chart and 100%
stacked column charts.
Line charts
It is one of the most popular charts and is widely used across industries.
Whether you're analyzing sales data, looking at year-on-year profit, or tracking
how a person's salary has grown over the last year, line charts are very helpful
in these scenarios.
The line chart is used to show trends over time or categories.
Here, the category appears horizontally (X-axis) and the value vertically (Y-axis).
These charts are used to show the spread of data values over categories or
continuous values. We can use the following charts to visualize the distribution
of the data, for example the distribution of bugs found across 10 weeks of a
software testing phase.
Histogram
It is used to graph the frequency of values over a distribution. It is a very useful
chart in the analytics world, and many useful insights can be inferred from it.
Visually, all the bars touch each other, with no space between them.
KDE Plot
KDE is an abbreviation for the Kernel Density Estimation plot.
It’s a smooth form of a histogram.
A kernel density estimate (KDE) plot is a method for visualizing the distribution
of observations in a dataset, analogous to a histogram.
Relative to a histogram, KDE can produce a plot that is less cluttered and more
interpretable, especially when drawing multiple distributions.
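A brief seaborn sketch drawing a histogram with a KDE overlay on synthetic data:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

values = np.random.default_rng(0).normal(loc=50, scale=10, size=500)   # synthetic data
sns.histplot(values, kde=True)   # histogram of frequencies with a KDE curve overlaid
plt.show()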
These charts are used to analyze how various parts comprise the whole. They are
very handy in scenarios such as analyzing revenue contribution by region, or how
many runs a batsman scored on each side of the ground. Charts used to
represent these are listed below.
Donut Chart
It is a variant of a pie chart, with the hole in the center.
It displays the categories as arcs rather than slices.
Relationship charts are very helpful when we want to know what the
relationship between different variables is. Charts used to visualize the relationship
between the variables are listed below.
Scatter Plot
A scatter chart uses numerical values along both axes.
It uses dots to represent the values for two different numerical values.
The position of each dot on the horizontal axis and the vertical axis signifies the
values of a particular data point.
It is useful for showing a correlation between the data points that may not be
easy to see from the data alone.
It is used for displaying and comparing numerical values, such as scientific or
statistical data.
This is used to visualize trends of values over time and categories; it is also known as
"Time Series" data in the data-driven world. Examples include an over-by-over run-rate
tracker or hourly temperature variation during a day. Listed below are the charts used
to represent time series data.
Line Chart
The best way to visualize trend data is by line chart.
Line charts are also used to see the trends in various domains.
Column Chart
A column chart as discussed above is also used to show the trends of values
over time and categories.
Wilkinson argues that just like a sentence in language follows grammatical rules,
effective visualizations can be built using a set of core building blocks. This "grammar"
provides a systematic approach to describe and construct various statistical graphics.
Data: The raw information being represented (e.g., numerical values, categories)
Aesthetic Mappings: How data attributes are linked to visual properties like position,
size, color, etc. (e.g., position on x-axis corresponds to time, color represents
category)
Scales: The transformation of data values into visual scales (e.g., linear scale for
temperature, logarithmic scale for earthquake magnitudes)
Geometrical Shapes: The basic visual marks used to represent data points (e.g.,
points, lines, bars)
Statistical Transformations: Techniques for summarizing or transforming data for
visual representation (e.g., means, medians, binning)
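In Python, one implementation of this grammar is the plotnine library (a port of R's ggplot2, which builds on Wilkinson's ideas). A minimal sketch composing data, aesthetic mappings, a scale, and a geometric shape:
import pandas as pd
from plotnine import ggplot, aes, geom_point, scale_y_log10

df = pd.DataFrame({'time': [1, 2, 3, 4],
                   'magnitude': [10, 100, 50, 1000],
                   'category': ['A', 'B', 'A', 'B']})

# Data + aesthetic mappings + scale + geometric shape, composed as in the grammar
plot = (ggplot(df, aes(x='time', y='magnitude', color='category'))
        + scale_y_log10()
        + geom_point())
plot.save('grammar_example.png')   # or simply display `plot` in a notebook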
Preattentive Attributes
What is it?
Our brains are constantly bombarded with visual information. But how do we process it
all so quickly? The answer lies in preattentive processing. This is an automatic,
subconscious ability to pick up on basic visual features like color, size, and position. It
happens within milliseconds, allowing us to grasp the gist of a scene before we even
consciously focus on it.
Think of it like a filter. Preattentive processing sifts through the visual clutter,
highlighting elements that stand out. These "preattentive attributes" act as attention
magnets, drawing our eyes to the most salient or relevant information. For instance, a
bright red bar in a chart can highlight a significant outlier, while using different sizes can
emphasize comparisons between data points.
This preattentive processing plays a crucial role in various fields. From design and
advertising, where capturing attention is key, to education and cognitive science,
understanding how we process visual information unlocks powerful tools for
communication and learning. By leveraging preattentive attributes, we can create
visualizations that guide viewers' attention to the most important details, saving them
time and effort in deciphering complex data.
While it might look like a fuzzy concept at first, the power of these preattentive
attributes is relatively easy to demonstrate. To do so, look at the sequence below and
count how many times the number 9 appears.
The correct answer is five. But in this example, there's no visual indication you can rely
on to help you reach this conclusion. You had to scan each number one by one to see if
it was a 9 or not.
Let's repeat the same exercise with the exact same sequence, but now, let's see what
happens when we make a single visual change.
Preattentive processing
Because we changed the color intensity of these numbers, they now clearly stand out.
Suddenly, there are five 9s in front of you. This is preattentive processing and iconic
memory in action.
Colin Ware, in his book “Information Visualization: Perception for Design” defines the
four preattentive visual properties as follows:
1. Form
2. Color
3. Spatial Position
4. Movement
1. Form
The form applies to various attributes listed below. In design, the form can be used
either to increase attention to specific elements or to reduce attention to it.
Form attributes include:
Collinearity
Curvature
length, breadth, and width
Marks added to objects
Numerosity
Shape
Size
Spatial grouping
Spatial orientation
2. Colour
Color is one of the most common
properties used to call attention. Color can be expressed in many different ways:
3. Movement
Movement can be used very effectively to call someone’s attention to a design or image.
Attributes of Movement:
Flicker
Motion
While these attributes are the most attention-grabbing, they have some negative effects
too. Motion or flicker elements can become annoying and distract users from the
information presented, so a designer should use these elements carefully in a design
or image.
4. Spatial Position
Our ability to perceive the location of objects in space, both relative to ourselves and to
each other, is called spatial position perception.
The Gestalt principle of figure-ground is a fundamental concept in visual perception that
explains how we see objects in relation to their background. It essentially boils down to
this:
Our brains automatically separate a scene into two parts: a figure (the object in
focus) and the ground (the background).
These are mutually exclusive – you can't perceive both the figure and ground at the
same time.
The relationship between figure and ground is crucial for understanding the visual
scene. Changing one element (e.g., making the background brighter) affects how we
perceive the other.
Note how, without any visual indication, you are left to process all the information by
yourself. You might be able to find an insight on your own from this chart, but you'll
have to make good use of your short-term memory for that, which will take time.
Now check out what happens when we include preattentive attributes to the same
graph.
By modifying the color hue of these four data points, you make them stand out, and you
now clearly see a pattern you might have missed in the previous example.
Pie charts are an excellent example to illustrate this concept, and while they are still
widely used, you really want to stay away from them.
Want to focus your audience's attention on the top-performing European market? You
can use the preattentive attribute concepts seen above to reduce the time to insight
even more.
Your audience is now starting to see your story. And it took them only a few seconds for
that.
Your graphs are made of ink. Some of this ink represents what's important, and some
doesn’t. Edward Tufte's book, The Visual Display of Quantitative Information, introduces
the data-ink ratio as a concept that says you should dedicate as much ink as possible to
the data. In other words, you should eliminate all the unnecessary information distracting
your audience from the message you're trying to convey.
To maximize your data-ink ratio in your graphs, you should ask yourself, 'Would the data
suffer any loss if this were eliminated?' If the answer is 'no,' get rid of it.
Take a moment to look at the combo line chart below, measuring two critical mobile app
performance metrics.
Let's see how to maximize the data-ink ratio in just a few steps.
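As a rough sketch of typical de-inking steps in matplotlib (the crash-rate numbers are made up), removing spines and gridlines that carry no data:
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
crash_rate = [4.1, 3.8, 3.5, 2.9, 2.6, 2.2]        # illustrative crash-rate values (%)

fig, ax = plt.subplots()
ax.plot(months, crash_rate, color='steelblue')
for side in ['top', 'right']:
    ax.spines[side].set_visible(False)             # drop borders that carry no data
ax.grid(False)                                     # remove gridlines that add no meaning
ax.set_ylabel('Crash rate (%)')
plt.show()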
Chart Types:
Bar Chart: Uses rectangular bars to represent data values, often used for comparisons
between categories.
Line Chart: Connects data points with a line to show trends or changes over time.
Scatter Plot: Uses dots to represent data points, revealing relationships between two
variables.
Box and Whisker Plot: Summarizes data distribution, showing median, quartiles, and
outliers.
Area Chart: Similar to a line chart, but the space between the line and the x-axis is filled
with color, emphasizing the magnitude of change over time.
Interactivity:
Interactive Visualization: Allows users to engage with and manipulate data
visualizations in real time. It enables users to explore different aspects of the data
and customize their viewing experience.
Jitter: A technique used to add a small amount of random variation to data points,
especially in scatter plots, to avoid overlapping points.
User Interface (UI): The visual layout and controls that allow users to interact with
and explore data visualizations. A well-designed UI enhances the user experience.
Zooming: Allows users to magnify specific areas of a chart or plot for closer
examination. It helps explore fine details in large datasets.