Data Analytics Unit1-4


Unit 1- Introduction to Data Analytics

The word analytics has come to the foreground in the last decade or so. The growth
of the internet and information technology has made analytics highly relevant in the
current age. Analytics is a field that combines data, information technology,
statistical analysis, quantitative methods and computer-based models.
Together, these give decision makers a view of the possible scenarios so that they can
make a well-thought-out and well-researched decision. The computer-based models let
decision makers see how a decision performs under various scenarios.

Business analytics (BA) is a set of disciplines and technologies for solving business
problems using data analysis, statistical models and other quantitative methods. It
involves an iterative, methodical exploration of an organization's data, with an
emphasis on statistical analysis, to drive decision-making.
At its core, business analytics involves a combination of the following:

identifying new patterns and relationships with data mining;

using quantitative and statistical analysis to design business models;

conducting A/B and multi-variable testing based on findings;

forecasting future business needs, performance, and industry trends with
predictive modelling; and

communicating your findings in easy-to-digest reports to colleagues,
management, and customers.

Business analytics (BA) refers to the skills, technologies, and practices for
continuous iterative exploration and investigation of past business performance to
gain insight and drive business planning. Business analytics focuses on developing
new insights and understanding of business performance based on data and statistical
methods. Business analytics is the process of transforming data into insights to improve
business decisions. Data management, data visualization, predictive modelling, data
mining, forecasting, simulation, and optimization are some of the tools used to create
insights from data.

Evolution of Business Analytics

Business analytics has been in existence for a long time and has evolved with the
availability of newer and better technologies. It has its roots in operations research,
which was used extensively during World War II.

Operations research was an analytical way of looking at data in order to conduct
military operations. Over time, this technique came to be used in business, where
operations research evolved into management science. The basis of
management science remained the same as that of operations research: data,
decision-making models, etc.

Analytics has been used in business since the management exercises put
into place by Frederick Winslow Taylor in the late 19th century.

Henry Ford measured the time of each component in his newly established
assembly line. But analytics began to command more attention in the late 1960s,
when computers were used in decision support systems.

Since then, analytics has changed and developed along with enterprise
resource planning (ERP) systems, data warehouses, and a large number of other
software tools and processes.

In later years, business analytics expanded rapidly with the introduction of
computers. This change has brought analytics to a whole new level and opened up
many new possibilities. Given how far analytics has come and what the
field is today, many people would never think that analytics
started in the early 1900s with Mr. Ford himself.

As economies developed and companies became more and more
competitive, management science evolved into business intelligence, decision
support systems and PC software.

Overview of Data Analytics

• Data analytics is the science of analyzing raw data to draw conclusions about
that information.
• Data analytics helps a business optimize its performance, perform more
efficiently, maximize profit, or make more strategically guided decisions.
• The techniques and processes of data analytics have been automated into
mechanical processes and algorithms that work over raw data for human
consumption.
• Various approaches to data analytics include looking at what happened
(descriptive analytics), why something happened (diagnostic analytics), what
is going to happen (predictive analytics), or what should be done next
(prescriptive analytics).
• Data analytics relies on a variety of software tools including spreadsheets,
data visualization, reporting tools, data mining programs, and open-source
languages for more advanced data manipulation.

Understanding Data Analytics

Data analytics is a broad term that encompasses many diverse types of data analysis.
Any type of information can be subjected to data analytics techniques to get insight
that can be used to improve things. Data analytics techniques can reveal trends and
metrics that would otherwise be lost in the mass of information. This information
can then be used to optimize processes to increase the overall efficiency of a
business or system.

For example, manufacturing companies often record the runtime, downtime, and
work queue for various machines and then analyze the data to better plan workloads
so the machines operate closer to peak capacity.

Data analytics can do much more than point out bottlenecks in production. Gaming
companies use data analytics to set reward schedules for players that keep the
majority of players active in the game. Content companies use many of the same
data analytics to keep you clicking, watching, or re-organizing content to get
another view or another click.

Data analytics is important because it helps businesses optimize their performance.
Implementing it into the business model means companies can help reduce costs by
identifying more efficient ways of doing business and by storing large amounts of
data. A company can also use data analytics to make better business decisions and help
analyse customer trends and satisfaction, which can lead to new and better products
and services.

Data Analysis Steps

The process of data analysis involves several steps:
1. The first step is to determine the data requirements or how the data is grouped.
Data may be separated by age, demographic, income, or gender. Data values
may be numerical or divided by category.
2. The second step is collecting the data. This can be
done through a variety of sources such as computers, online sources, cameras,
environmental sources, or through personnel.
3. The data must be organized after it's collected so it can be analyzed. This may
take place on a spreadsheet or other form of software that can take statistical
data.
4. The data is then cleaned up before analysis. It's scrubbed and checked to
ensure that there's no duplication or error and that it is not incomplete. This
step helps correct any errors before the data goes on to a data analyst to be analyzed.
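
As an illustration of the cleaning step, here is a minimal Python sketch. The records, the field names (id, age, income) and the valid age range of 0 to 120 are assumptions made up for this example; a real project would define its own rules.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},   # incomplete: age is missing
    {"id": 1, "age": 34, "income": 52000},     # exact duplicate of the first record
    {"id": 3, "age": -5, "income": 61000},     # error: impossible age
    {"id": 4, "age": 45, "income": 75000},
]


def clean(records):
    """Drop duplicate, incomplete, and obviously invalid records."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))       # hashable fingerprint of the record
        if key in seen:                        # duplication check
            continue
        seen.add(key)
        if any(value is None for value in rec.values()):   # completeness check
            continue
        if not 0 <= rec["age"] <= 120:         # assumed validity rule for age
            continue
        cleaned.append(rec)
    return cleaned


cleaned_records = clean(raw_records)
print([rec["id"] for rec in cleaned_records])
```

Only the records that pass all three checks are handed on for analysis. In practice, incomplete records are often corrected or filled in rather than dropped.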

Types of Business Analytics

There are four main types of business analytics, each more
complex than the last. They bring us progressively closer to real-time and
forward-looking insight. Each of these types is
discussed below.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics

1. Descriptive Analytics
It summarizes an organization’s existing data to understand what has
happened in the past or is happening currently. Descriptive Analytics is the
simplest form of analytics as it employs data aggregation and mining
techniques. It makes data more accessible to members of an organization such
as the investors, shareholders, marketing executives, and sales managers.

It can help identify strengths and weaknesses and provides an insight into
customer behavior too. This helps in forming strategies that can be developed
in the area of targeted marketing.
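
The data aggregation mentioned above can be sketched in a few lines of Python. The sales records and region names below are invented for illustration, and the choice of count, total and average as summary figures is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records as (region, amount); the values are invented.
sales = [
    ("North", 120.0),
    ("South", 80.0),
    ("North", 200.0),
    ("South", 100.0),
    ("East", 50.0),
]


def summarize(records):
    """Aggregate past sales per region: count, total and average."""
    by_region = defaultdict(list)
    for region, amount in records:
        by_region[region].append(amount)
    return {
        region: {
            "count": len(amounts),
            "total": sum(amounts),
            "average": mean(amounts),
        }
        for region, amounts in by_region.items()
    }


summary = summarize(sales)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

The summary describes what has already happened. It makes no attempt to explain or predict, which is what separates descriptive analytics from the other three types.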

2. Diagnostic Analytics
This type of analytics helps shift focus from past performance to current
events and determine which factors are influencing trends. To uncover the
root cause of events, techniques such as data discovery, data mining and
drill-down are employed. Diagnostic analytics makes use of probabilities and
likelihoods to understand why events may occur. Techniques such as
sensitivity analysis and training algorithms are employed for classification
and regression.

3. Predictive Analytics
This type of analytics is used to forecast the possibility of a future event with
the help of statistical models and ML techniques. It builds on the results of
descriptive analytics to devise models that extrapolate the likelihood of events.
To run predictive analysis, machine learning experts are employed. They can
achieve a higher level of accuracy than business intelligence alone.

One of the most common applications is sentiment analysis. Here, existing
data collected from social media is used to provide a comprehensive
picture of a user’s opinion. This data is analyzed to predict their sentiment
(positive, neutral or negative).

4. Prescriptive Analytics
Going a step beyond predictive analytics, it provides recommendations for the
next best action to be taken. It suggests all favorable outcomes according to a
specific course of action and also recommends the specific actions needed to
deliver the most desired result. It mainly relies on two things, a strong
feedback system and a constant iterative analysis. It learns the relation
between actions and their outcomes. One common use of this type of analytics
is to create recommendation systems.

How business analytics works

Before any data analysis takes place, BA starts with several foundational processes:

Determine the business goal of the analysis.

Select an analysis methodology.

Get business data to support the analysis, often from various systems and sources.
Cleanse and integrate data into a single repository, such as a data warehouse or
data mart.

Need/Importance of Business Analytics

Business analytics is a methodology or tool for making sound commercial decisions.
Hence it affects the functioning of the whole organization. Business
analytics can therefore help improve the profitability of the business, increase market
share and revenue, and provide a better return to shareholders.

Facilitates better understanding of available primary and secondary data, which
in turn affects the operational efficiency of several departments.

Provides a competitive advantage to companies. In this digital age, the flow of
information is almost equal for all players. It is how this information is utilized
that makes a company competitive. Business analytics combines available data with
various well-thought-out models to improve business decisions.

Converts available data into valuable information. This information can be
presented in any required format that is comfortable for the decision maker.

For starters, business analytics is the tool your company needs to make accurate
decisions. These decisions are likely to impact your entire organization as they help
you to improve profitability, increase market share, and provide a greater return to
potential shareholders.

While some companies are unsure what to do with large amounts of data, business
analytics works to combine this data with actionable insights to improve the
decisions you make as a company.

Essentially, the four main ways business analytics is important, no
matter the industry, are:

Improves performance by giving your business a clear picture of what is and isn’t
working

Provides faster and more accurate decisions

Minimizes risks as it helps a business make the right choices regarding consumer
behavior, trends, and performance

Inspires change and innovation by answering questions about the consumer.

Benefits of Data analytics:

1. Improved Decision Making

Foremost among the top data analytics benefits is better decision-making. It offers
insightful, data-driven information that aids organizations in understanding their
customers, operations, and markets. They can spot patterns, trends, and correlations.
Moreover, they use this knowledge to make well-informed choices supported by data
and metrics rather than mere guesswork. Businesses can boost productivity, cut
costs, find new opportunities, and reduce risks by optimizing their strategies and
making more informed decisions. Because these decisions are based on actual data,
data analytics also enables organizations to make them more transparent and
dependable.

2. Increased Efficiency and Productivity

Data analytics enables organizations to increase efficiency and productivity by
automating and streamlining processes, maximizing resource allocation, and
minimizing manual labour. Businesses can streamline their workflows by locating
bottlenecks and getting rid of duplication. Additionally, data analytics assists
businesses in identifying areas where productivity can be increased, such as waste
reduction, better inventory control, and supply chain optimization.

3. Enhanced Customer Experience

By giving organizations useful insights into customer behaviour, preferences, and
needs, data analytics enables businesses to identify areas where they can improve
their customer experience, such as lowering wait times, enhancing customer service,
or streamlining user interfaces. Data analytics also helps businesses tailor their
offerings to meet consumers’ unique needs, thus forging closer ties with them and
fostering greater customer loyalty.

4. Improved Risk Management
Businesses can find patterns and correlations in data from various sources that point
to potential risks. Data analytics can, for instance, assist companies in identifying
potential fraud, online threats, or operational risks. Businesses can also take
preventative action to mitigate potential risks by monitoring data in real-time. By
utilizing data analytics to enhance risk management, they can lessen the possibility
of monetary losses, reputational damage, and other negative outcomes.

5. Competitive Advantage
Businesses can gain a competitive edge using data analytics to make more informed,
data-driven decisions. Analysing data from various sources allows businesses to
understand market trends, consumer behaviour, and competitor activities.
Businesses can use this information to improve their strategies, spot new
opportunities, and set themselves apart from the competition. Data analytics can, for
instance, aid companies in identifying underserved market segments, anticipating
client needs, and enhancing product offerings. Simply put, businesses can increase
their market share, spur revenue growth, and fortify their brand by utilizing data
analytics to gain a competitive advantage.

BI tools are required in almost all industries and functions. The nature of the
information and the speed of action may be different across businesses, but every
manager today needs access to BI tools to have up-to-date metrics about business
performance. Businesses need to embed new insights into their operating processes
to ensure that their activities continue to evolve with more efficient practices. The
following are some areas of applications of BI and data mining.

1. Healthcare and Wellness

Healthcare is one of the biggest sectors in advanced economies. Evidence-based
medicine is the newest trend in data-based healthcare management. BI applications
can help apply the most effective diagnoses and prescriptions for various ailments.
They can also help manage public health issues and reduce waste and fraud.

These systems take away most of the guesswork done by doctors in diagnosing
ailments.

Treatment Effectiveness: The prescription of medication and treatment is
also a difficult choice out of so many possibilities. For example, there are more than
100 medications for hypertension (high blood pressure) alone. There are also
interactions in terms of which drugs work well with others and which drugs do not.
Decision trees can help doctors learn about and prescribe more effective treatments.
Thus, the patients can recover their health faster with a lower risk of complications
and cost.

2. Wellness Management:
This includes keeping track of patients' health records, analysing customer health
trends and proactively advising them to take any needed precautions.

Manage Fraud and Abuse: Some medical practitioners have unfortunately been
found to conduct unnecessary tests, and/or overbill the government and health
insurance companies. Exception reporting systems can identify such providers and
action can be taken against them.

3. Public Health Management:

The management of public health is one of the important responsibilities of any
government. By using effective forecasting tools and techniques, governments can
better predict the onset of diseases in certain areas in real time. They can thus be
better prepared to fight against diseases. Google has been known to predict the
movement of certain diseases by tracking the search terms (like flu, vaccine) used in
different parts of the world.

As higher education becomes more expensive and competitive, it becomes a great
user of data-based decision-making. There is a strong need for efficiency, increasing
revenue, and improving the quality of student experience at all levels of education.

Student Enrolment (Recruitment and Retention): Marketing to new potential students
requires schools to develop profiles of the students that are most likely to attend.
Schools can develop models of what kinds of students are attracted to the school,
and then reach out to those students. The students at risk of not returning can be
flagged, and corrective measures can be taken in time.

Course Offerings: Schools can use the class enrolment data to develop models of
which new courses are likely to be more popular with students. This can help
increase class size, reduce costs, and improve student satisfaction.

Banks make loans and offer credit cards to millions of customers. They are interested in
improving the quality of loans and reducing bad debts. They want to retain better
customers, and sell more services to them.

Automate the Loan Application Process: Decision models can be generated from past
data that predict the likelihood of a loan proving successful. These can be inserted in
business processes to automate the financial loan approval process.

Detect Fraudulent Transactions: Billions of financial transactions happen around the
world every day. Exception-seeking models can identify patterns of fraudulent
transactions. For example, if money is being transferred to an unrelated account for
the first time, it could be a fraudulent transaction.
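
The first-time-transfer example can be written as a simple exception rule. The sketch below is only an illustration: the account names, the amounts and the function name flag_new_recipients are hypothetical, and a real exception-seeking model would weigh many more signals than a single rule.

```python
def flag_new_recipients(known_pairs, new_transfers):
    """Flag transfers to a recipient the sender has never paid before.

    known_pairs   -- set of (sender, recipient) pairs seen in past data
    new_transfers -- iterable of (sender, recipient, amount) tuples
    """
    known = set(known_pairs)      # copy, so the caller's history is not modified
    flagged = []
    for sender, recipient, amount in new_transfers:
        pair = (sender, recipient)
        if pair not in known:     # first transfer between these two accounts
            flagged.append((sender, recipient, amount))
            known.add(pair)       # later transfers to the same recipient pass
    return flagged


# Hypothetical history and incoming transfers; all names are invented.
history = {("alice", "bob"), ("dave", "erin")}
incoming = [
    ("alice", "bob", 75),
    ("alice", "carol", 9000),
    ("dave", "erin", 20),
    ("alice", "carol", 30),
]

flagged = flag_new_recipients(history, incoming)
print(flagged)
```

A flagged transfer is only a candidate for review, since a first payment to a new account is usually legitimate.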

Maximize Customer Value (Cross-selling, Up-selling): Selling more products and
services to existing customers is often the easiest way to increase revenue. A
checking account customer in good standing could be offered home, auto, or
educational loans on more favourable terms than other customers, and thus, the value
generated from that customer could be increased.

Optimize Cash Reserves with Forecasting: Banks have to maintain certain liquidity
to meet the needs of depositors who may like to withdraw money. Using past data
and trend analysis, banks can forecast how much to keep and invest the rest to earn
interest.
6. Financial Services
Stock brokerages are an intensive user of BI systems. Fortunes can be made or lost
based on access to accurate and timely information.

Predict Changes in Bond and Stock Prices: Forecasting the price of stocks and bonds
is a favourite pastime of financial experts as well as lay people. Stock transaction
data from the past, along with other variables, can be used to predict future price
patterns. This can help traders develop long-term trading strategies.

Assess the Effect of Events on Market Movements: Decision models using decision
trees can be created to assess the impact of events on changes in market volume and
prices. Monetary policy changes (such as a Federal Reserve interest rate change) or
geopolitical changes (such as war in a part of the world) can be factored into the
predictive model to help take action with greater confidence and less risk.

Retail organizations grow by meeting customer needs with quality products in a
convenient, timely, and cost-effective manner. Understanding emerging customer
shopping patterns can help retailers organize their products, inventory, store layout,
and web presence in order to delight their customers, which in turn would help
increase revenue and profits. Retailers generate a lot of transaction and logistics data
that can be used to diagnose and solve problems.

Optimize Inventory Levels at Different Locations: Retailers need to manage their
inventories carefully. Carrying too much inventory imposes carrying costs, while
carrying too little inventory can cause stock-outs and lost sales opportunities.
Predicting sales trends dynamically helps retailers move inventory to where it is most
in demand. Retail organizations can provide their suppliers with real-time
information about sales of their items, so the suppliers can deliver their product to
the right locations and minimize stock-outs.

Improve Store Layout and Sales Promotions: A market basket analysis can develop
predictive models of the products often sold together. This knowledge of affinities
between products can help retailers co-locate those products. Alternatively, those
affinity products could be located farther apart to make the customer walk the length
and breadth of the store, and thus be exposed to other products. Promotional
discounted product bundles can be created to push a non-selling item along with a
set of products that sell well together.
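
A first step in market basket analysis is counting how often pairs of products appear in the same transaction. The sketch below uses plain co-occurrence counting rather than a full association-rule algorithm, and the baskets and product names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each basket is the set of items bought together.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_counts(transactions):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in transactions:
        # sorted() gives every pair a canonical order, e.g. ("bread", "butter")
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
# Support: the fraction of all baskets that contain the pair.
support = {pair: n / len(baskets) for pair, n in counts.items()}
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count, support[top_pair])
```

Pairs with high support are candidates for co-location or bundling. A fuller analysis would also look at measures such as confidence and lift before acting on them.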
Optimize Logistics for Seasonal Effects: Seasonal products offer tremendously
profitable short-term sales opportunities, yet they also carry the risk of unsold
inventories at the end of the season. Understanding which products are in season
in which market can help retailers dynamically manage prices to ensure their
inventory is sold during the season. If it is raining in a certain area, then the inventory
of umbrellas and ponchos could be rapidly moved there from non-rainy areas to help
increase sales.

Minimize Losses due to Limited Shelf-Life: Perishable goods offer challenges in
terms of disposing of the inventory in time. By tracking sales trends, the perishable
products at risk of not selling before the sell-by date can be suitably discounted and
promoted.
This industry is a prolific user of prediction models in pricing insurance proposals
and managing losses from claims against insured assets.
Forecast Claim Costs for Better Business Planning: When natural disasters, such as
hurricanes and earthquakes, strike, loss of life and property occurs. By using the best
available data to model the likelihood (or risk) of such events happening, the insurer
can plan for losses and manage resources and profits effectively.

Determine Optimal Rate Plans: Pricing an insurance rate plan requires covering the
potential losses and making a profit. Insurers use actuarial tables to project life spans
and disease tables to project mortality rates, and thus price themselves competitively
yet profitably.

Optimize Marketing to Specific Customers: By micro-segmenting potential
customers, a data-savvy insurer can cherry-pick the best customers and leave the less
profitable customers to its competitors. Progressive Insurance is a US-based
company that is known to actively use data mining to cherry-pick customers and
increase its profitability.

Identify and Prevent Fraudulent Claim Activities: Patterns can be identified as to
where and what kinds of fraud are more likely to occur. Decision-tree based models
can be used to identify and flag fraudulent claims.

Manufacturing operations are complex systems with interrelated subsystems. From
machines working right, to workers having the right skills, to the right components
arriving with the right quality at the right time, to money to source the components,
many things have to go right. Toyota's famous lean manufacturing system works
on just-in-time inventory to optimize investments in inventory and to
improve flexibility in its product mix.

Discover Novel Patterns to Improve Product Quality: The quality of a product can also be
tracked, and this data can be used to create a predictive model of product quality
deteriorating. Many companies, such as automobile companies,

BI in telecom can help the customer side as well as network side of the operations.
Key BI applications include churn management, marketing/customer profiling,
network failure, and fraud detection.

Churn Management: Telecom customers have shown a tendency to switch their
providers in search of better deals. Telecom companies tend to respond with many
incentives and discounts to hold on to customers. However, they need to determine
which customers are at a real risk of switching and which others are just negotiating
for a better deal. The level of risk should be factored into the kind of deals and
discounts that should be given. Millions of such customer calls happen every month.
The telecom companies need to provide a consistent and data-based way to predict
the risk of the customer switching, and then make an operational decision in real
time while the customer call is taking place. A decision-tree or neural-network-based
system can be used to guide the customer service call operator to make the
right decisions for the company, in a consistent manner.

Marketing and Product Creation: In addition to customer data, telecom companies
also store call detail records (CDRs), which can be analysed to precisely describe
the calling behaviour of each customer. This unique data can be used to profile
customers and then can be used for creating new product/service bundles for
marketing purposes. An American telecom company, MCI, created a program called
Friends & Family that allowed free calls with one's friends and family on that
network, and thus, effectively locked many people into their network.

Network Failure Management: Failure of telecom networks due to technical faults or
malicious attacks can have devastating impacts on people, businesses, and society.
In telecom infrastructure, some equipment will likely fail with a certain mean time
between failures. Modelling the failure pattern of various components of the network can
help with preventive maintenance.

11. Public Sector
Governments gather a large amount of data by virtue of their regulatory function.
That data could be analysed for developing models of effective functioning. There
are innumerable applications that can benefit from mining that data. A couple of
sample applications are shown here.

Law Enforcement: Social behaviour is a lot more patterned and predictable than one
would imagine. For example, the Los Angeles Police Department (LAPD) mined the
data from its 13 million crime records over 80 years and developed models of what
kind of crime was likely to happen when and where. By increasing patrolling in those
particular areas, LAPD was able to reduce property crime by 27 percent. Internet
chatter can be analysed to learn about and prevent any evil designs.

Scientific Research: Any large collection of research data is amenable to being mined
for patterns and insights. Protein folding (microbiology), nuclear reaction analysis
(sub-atomic physics), and disease control (public health) are some examples where data
mining can yield powerful new insights.

12. Customer Relationship Management
A business exists to serve a customer. A happy customer becomes a repeat customer.
A business should understand the needs and sentiments of the customer, sell more of
its offerings to the existing customers, and also expand the pool of customers it
serves. BI applications can impact many aspects of marketing.

Business Intelligence is a comprehensive set of IT tools to support decision making
with imaginative solutions for a variety of problems. BI can help improve the
performance in nearly all industries and applications.

Text analytics and Web analytics

Text analytics work is focused on extracting information from unstructured text to
create structured data patterns. Our web analytics research is focused on collecting,
analysing and reporting web data for the purpose of understanding and optimising
web usage. This work provides new and exciting business insights into customer and
online activities.

Ai is undertaking text and web analytics in the areas of:

Text Analytics
• Entity extraction, text categorization and text clustering

• Document summarisation

• Topic model and latent semantic analysis

• Topic discovery and public event detection

• Search, retrieval and ranking

• Microblog and Twitter mining

• Short text analysis and semantic enhancement

• Opinion mining and sentiment analysis

• Social spammer detection and social influence analysis

Web analytics
• Customer behavior and access pattern mining

• Customer profiling and segmentation

• Customer retention and churn analysis

• Sales trend analysis and sales forecasting

• Marketing segmentation and cross-sale strategies

• Link analysis and link prediction

• User community detection and evolution

• Spatial-temporal analysis.

BI Skills
• Sample space is the universal set that consists of all possible outcomes of
an experiment. Sample space is usually represented using the letter ‘S’,
and individual outcomes are called the elementary events.
• The sample space can be finite or infinite.
• Definition: A sample space is the set of possible outcomes of a random
experiment.
• Example: Sample space = S = {all people in class}
A few random experiments and their sample spaces are
discussed below:
Experiment 1: Outcome of a football match
• Sample Space = S = {Win, Draw, Lose}
Experiment 2: Predicting customer churn at an individual customer level
• Sample Space = S = {Churn, No Churn}
Experiment 3: Predicting percentage of customer churn
• Sample Space = S = {X | X ∈ R, 0 ≤ X ≤ 100}, that is, X is a real
number that can take any value between 0 and 100 percent.
Experiment 4: Life of a turbine blade used in an aircraft engine
• Sample Space = S = {X | X ∈ R, 0 ≤ X < ∞}, that is, X is a real
number that can take any value from 0 upward.
• Example: When we flip a coin, the sample space is
S = {H, T}, where
• H denotes that the coin lands "Heads up" and
• T denotes that the coin lands "Tails up".
• For a "fair coin" we expect H and T to have the same "chance" of
occurring, i.e., if we flip the coin many times then about 50% of the
outcomes will be H.
• We say that the probability of H occurring is 0.5 (or 50%). The
probability of T occurring is then also 0.5.
Problem 1:
• When we roll a fair die, the sample space is
S = {1, 2, 3, 4, 5, 6}
• The probability that the die lands with k up is 1/6 for each k in {1, 2, 3, 4, 5, 6},
and when we roll it 1200 times we expect a 5 up about 200 times.
• The probability that the die lands with an even number up is
1/6 + 1/6 + 1/6 = 1/2
Problem : 2
 EXAMPLE : When we toss a coin 3 times and record the results in the sequence that
they occur, then the sample space is
 S = { HHH , HHT , HTH , HTT , THH , THT , TTH , TTT } .
 Elements of S are ”vectors ”, ”sequences ”, or ”ordered outcomes ”.
 We may expect each of the 8 outcomes to be equally likely. Thus the probability of the
sequence HTT is 1/8 .
 The probability of a sequence to contain precisely two Heads is
 1/ 8 + 1 /8 + 1/ 8 = 3 /8
 Problem 3 : When we toss a coin 3 times and record the results without
paying attention to the order in which they occur, e.g., if we only record the
number of Heads, then the sample space is
S = ‘ {H, H, H } , {H, H, T } , {H, T, T } , {T, T, T } ‘
The outcomes in S are now sets ; i.e., order is not important.
 Recall that the ordered outcomes are
 { HHH , HHT , HTH , HTT , THH , THT , TTH , TTT } .
Note that
 {H, H, H } Corresponds to one of the ordered outcomes,
{H, H, T } Corresponds to three of the ordered outcomes,
{H, T, T } Corresponds to three of the ordered outcomes,
{T, T, T } Corresponds to one of the ordered outcomes ,
Thus {H, H, H } and {T, T, T } each occur with probability 1 /8 ,
while {H, H, T } and {H, T, T } each occur with probability 3 /8 .
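The counting argument above can be checked by enumeration. The following is a minimal sketch, assuming a fair coin so that each of the 8 ordered sequences has probability 1/8; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 ordered outcomes of three tosses; each is assumed equally likely.
ordered = ["".join(t) for t in product("HT", repeat=3)]

# Ignore order by grouping the sequences on their number of heads.
heads_count = Counter(seq.count("H") for seq in ordered)
unordered_prob = {k: Fraction(c, len(ordered)) for k, c in heads_count.items()}

for k in sorted(unordered_prob, reverse=True):
    print(f"{k} heads: probability {unordered_prob[k]}")
```

Here 3 heads corresponds to {H, H, H}, 2 heads to {H, H, T}, 1 head to {H, T, T}, and 0 heads to {T, T, T}.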
 Definition: A probability event can be defined as a set of outcomes of an
experiment. In other words, an event in probability is the subset of the
respective sample space.
 Pick a person in this class at random.
 Sample space: = {all people in class}
 Event A: A = {all males in class}.
 Event B: B = {all females in class}.
 Thus, an event is a subset of the sample space, i.e., E is a subset of S.
 In the example above, event A occurs if the person we pick is male.
 The entire set of possible outcomes of a random experiment is the sample
space. The likelihood of occurrence of an event is known as its probability.
 The probability of occurrence of any event lies between 0 and 1.
 The sample space for the tossing of three coins simultaneously is given by:

 S = {(T , T , T) , (T , T , H) , (T , H , T) , (T , H , H ) , (H , T , T ) , (H , T , H) ,
(H , H, T) ,(H , H , H)}
 Suppose, if we want to find only the outcomes which have at least two heads;
then the set of all such possibilities can be given as:
 E = { (H , T , H) , (H , H ,T) , (H , H ,H) , (T , H , H)}

 Thus, an event is a subset of the sample space, i.e., E is a subset of S.

What is the Probability of Occurrence of an Event?

 When all outcomes are equally likely, the probability of occurrence of an event
is defined as the ratio of the number of favorable outcomes to the total number
of outcomes. So, the probability that an event will occur is given as:
 P(E) = Number of Favorable Outcomes / Total Number of Outcomes
 For the event E above (at least two heads), P(E) = 4/8 = 1/2.
1. Simple Events
 Any event consisting of a single point of the sample space is known as a simple
event in probability. For example, if S = {56 , 78 , 96 , 54 , 89} and E = {78}
then E is a simple event.
2. Compound Events
 if any event consists of more than one single point of the sample space then such an
event is called a compound event. Considering the same example again, if S =
{56 ,78 ,96 ,54 ,89}, E1 = {56 ,54 }, E2 = {78 ,56 ,89 } then, E1 and E2 represent
two compound events.
3. Independent Events and Dependent Events
 If the occurrence of an event is completely unaffected by the occurrence of any
other event, such events are known as independent events in probability, and
events which are affected by other events are known as dependent events.
Example of Independent Events :
 Tossing a coin twice
 Sample Space (S) of each coin toss = {H, T}
 Getting H on the first toss and getting H on the second toss are independent
events, because the result of the first toss does not affect the second.
4. Mutually Exclusive Events
 If the occurrence of one event excludes the occurrence of another event,
such events are mutually exclusive events i.e. two events don’t have any
common point.
 For example, if S = {1 , 2 , 3 , 4 , 5 , 6} and E1, E2 are two events such
that E1 consists of numbers less than 3 and E2 consists of numbers greater
than 4.
 So, E1 = {1,2} and E2 = {5,6} .
 Then, E1 and E2 are mutually exclusive.
5. Exhaustive Events
 A set of events is called exhaustive if all the events together cover the
entire sample space.
Ex: Let us consider the experiment of throwing a die.
 Sample space S = {1, 2, 3, 4, 5, 6}
 A be the event of getting a number greater than 3
 B be the event of getting a number greater than 2 but less than 5
 C be the event of getting a number less than 3
 We can write these events as:
 A = {4, 5, 6}
 B = {3, 4}
 and C = {1, 2}
 We observe that
 A ⋃ B ⋃ C = {4, 5, 6} ⋃ {3, 4} ⋃ {1, 2} = {1, 2, 3, 4, 5, 6} = S
 Therefore, A, B, and C are called exhaustive events.
6. Complementary Events
 For any event E1 there exists another event E1' which consists of the remaining
elements of the sample space S.
E1' = S − E1
 If a die is rolled then the sample space S is given as S = {1, 2, 3, 4, 5, 6}.
If event E1 represents all the outcomes which are greater than 4, then
E1 = {5, 6} and E1' = {1, 2, 3, 4}.
 Thus E1' is the complement of the event E1.

Events Associated with “OR”

 If two events E1 and E2 are associated with OR then it means that either E1 or
E2 or both occur. The union symbol (∪) is used to represent OR in probability.
 Thus, the event E1 ∪ E2 denotes E1 OR E2.
 Events Associated with “AND”
 If two events E1 and E2 are associated with AND then it means the intersection of
elements which is common to both the events. The intersection symbol (∩) is used
to represent AND in probability.
 Thus, the event E1 ∩ E2 denotes E1 and E2.
Measures of probability :
 A probability measure gives probabilities to sets of experimental outcomes
(events). It is a function on a collection of events that assigns a probability between 0 and 1
to every event, meeting certain conditions.
Probability Measure Examples
 For a roll of one six-faced die, the sample space = {1, 2, 3, 4, 5, 6}.

 If A = {1, 3, 5} is the event that the roll is odd, then P(A) = ½.
According to axiomatic theory of probability, the probability of an event
E satisfies the following axioms:
1. The probability of event E always lies between 0 and 1. That is, 0 ≤
P(E) ≤ 1.
2. The probability of the universal set S is 1. That is, P(S) = 1.
3. P(X ∪ Y) = P(X) + P(Y), where X and Y are two mutually exclusive events.
 The following elementary rules of probability are directly deduced from the
original three axioms of probability, using the set theory relationships:
1. For any event A, the probability of the complementary event, written A^c, is
given by
P(A^c) = 1 – P(A)
 If P(A) is the probability of observing a fraudulent transaction at an e-commerce
portal, then P(A^c) is the probability of observing a genuine transaction.
2. The probability of an empty or impossible event, ∅, is zero:
 P(∅) = 0
3. If occurrence of an event A implies that an event B also occurs, so that the event
class A is a subset of event class B, then the probability of A is less than or equal
to the probability of B:
 P(A) ≤ P(B)

 If A is students with more than 3.5 CGPA (cumulative grade point average) out of 4 and
B is students with a CGPA of more than 3.0, then P(A) ≤ P(B)
4. The probability that either event A or event B occurs, or both occur, is given by
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
5. If A and B are mutually exclusive events, so that P(A ∩ B) = 0, then
P(A ∪ B) = P(A) + P(B)
6. If A1, A2, …, An are n events that form a partition of sample space S,
then their probabilities must add up to 1:
P(A1) + P(A2) + … + P(An) = 1
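The elementary rules can be checked numerically on a small sample space. This is a minimal sketch, assuming a fair six-sided die; the events A and B and the helper `prob` are hypothetical choices made for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # sample space of one fair die roll


def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))


A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

# Complement rule: P(A^c) = 1 - P(A)
complement_ok = prob(S - A) == 1 - prob(A)
# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
addition_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)
# The six single-outcome events partition S, so their probabilities sum to 1.
partition_total = sum(prob({k}) for k in S)

print(complement_ok, addition_ok, partition_total)
```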

Joint Probability :
Let A and B be two events in a sample space. Then the joint probability of the two events,
written as P(A ∩ B), is the probability that A and B occur together. For example, with counts
taken out of 1000 observations:
P(Divorced ∩ Default) = 13/1000 = 0.013
P(Single ∩ Default) = 42/1000 = 0.042
P(Divorced) = 50/1000 = 0.05
P(Single) = 300/1000 = 0.3
1. Let there be a bag containing 5 white and 4 red balls. Two balls are
drawn from the bag one after the other without replacement. Consider
the following events.
A= Drawing a white ball in the first draw
B= Drawing a red ball in the Second draw.
Sol: P(B/A)= Probability of drawing a red ball in second draw given
that a white ball has already been drawn in the first draw.
P(B/A)= Probability of drawing a red ball from a bag containing 4
white and 4 red balls.
P(B/A)= 4/8 =1/2
For this Random Experiment P(A/B) is not meaningful because A
cannot occur after the occurrence of event B.
2. A die is thrown twice and the sum of the numbers appearing is observed
to be 6. What is the conditional probability that the number 4 has appeared
at least once?
A = the sum of the numbers appearing is 6
 = {(1,5), (2,4), (3,3), (4,2), (5,1)}
B = the number 4 has appeared at least once
 = {(1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (4,1), (4,2), (4,3), (4,5), (4,6)}
Sol: Out of the 36 equally likely outcomes, n(A) = 5 and n(A ∩ B) = 2, since
A ∩ B = {(2,4), (4,2)}. So P(A) = 5/36 and P(A ∩ B) = 2/36.
Required probability = P(B/A)
= P(A ∩ B)/P(A) = (2/36)/(5/36) = 2/5
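The conditional probability for the two-dice problem can be verified by listing all 36 outcomes. This is a minimal sketch, assuming two fair dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two throws.
rolls = list(product(range(1, 7), repeat=2))

A = [r for r in rolls if sum(r) == 6]   # the sum is 6
B = [r for r in rolls if 4 in r]        # a 4 appears at least once
A_and_B = [r for r in A if 4 in r]      # both conditions hold

# P(B | A) = P(A ∩ B) / P(A); the common factor 1/36 cancels.
p_B_given_A = Fraction(len(A_and_B), len(A))
print(p_B_given_A)
```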

Question 3:
 Fifteen cards are numbered from 1 to 15, and two cards are
chosen at random such that the sum of the numbers on both the
cards is even. Find the probability that the chosen cards are
both odd-numbered.
 Let A ≡ event of selecting two odd-numbered cards
 B ≡ event of selecting cards whose sum is even.
Sol: There are 8 odd-numbered and 7 even-numbered cards. Then,
 n(B) = number of ways of choosing two numbers whose sum is even
= 8C2 + 7C2 = 28 + 21 = 49.
 n(A ∩ B) = number of ways of choosing two odd-numbered cards (their
sum is always even)
 = 8C2 = 28.
 Now, P(A|B) = P(A ∩ B)/P(B) = n(A ∩ B)/n(B)
 = 8C2 / (8C2 + 7C2) = 28/49 = 4/7.
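The card problem can be checked both by direct enumeration and with the combination formula. This is a minimal sketch that assumes the cards are numbered 1 to 15, which is what the counts 8C2 and 7C2 in the solution imply.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Every unordered pair of distinct cards from 1..15.
pairs = list(combinations(range(1, 16), 2))

even_sum = [p for p in pairs if sum(p) % 2 == 0]                       # event B
both_odd = [p for p in even_sum if p[0] % 2 == 1 and p[1] % 2 == 1]    # event A ∩ B

p_enumerated = Fraction(len(both_odd), len(even_sum))
p_formula = Fraction(comb(8, 2), comb(8, 2) + comb(7, 2))
print(p_enumerated, p_formula)
```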
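 Bayes' theorem is one of the most important concepts in analytics
since several problems are solved using Bayesian statistics. Consider
two events A and B. We can write the following two conditional
probabilities:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ and $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
 Using the two equations, we can show that
$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$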
 Bayes' theorem helps data scientists to update the probability of an
event (B) when any additional information is provided.
 The following terminologies are used to describe the various terms in
Bayes' theorem:
1. P(B) is called the prior probability (estimate of the probability
without any additional information).
2. P(B|A) is called the posterior probability (that is, given that the
event A has occurred, what is the probability of occurrence of event
B). That is, post the additional information (or additional evidence)
that A has occurred, what is estimated probability of occurrence of B.
3. P(A|B) is called the likelihood of observing evidence A if B is true.
4. P(A) is the prior probability of A.
 A great example of the human difficulty in making decisions under uncertainty
is the famous Monty Hall problem, in which the contestant of a game show is
shown three doors. Behind one of the doors is an expensive item
(such as a car or gold), while there are inexpensive items behind the
remaining two doors (such as a goat).
 The contestant is asked to choose one of the doors. Assume that the
contestant chooses door 1; the game host would then open one of
the remaining two doors. Assume that the game host opens door 2,
which has a goat behind it. Now the contestant is given a chance to
change his initial choice (from door 1 to door 3).
 In this problem, the contestant — the decision maker — has two
choices: he/she can either change his/her initial choice or stick with
his/her initial choice.
 Let C1 , C2 , and C3 be the events that the car is behind door 1, 2, and 3,
respectively. Let D1 , D2 , and D3 be the events that Monty opens door 1, 2,
and 3, respectively.
 Prior probabilities of C1 , C2 , and C3 are P(C1 ) = P(C2 ) = P(C3 ) = 1/3
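 Assume that the player has chosen door 1 and Monty opens door 2 to reveal a
goat. Monty never opens the player's door or the door hiding the car, so
P(D2 | C1) = 1/2 (assuming he picks at random when both remaining doors hide goats),
P(D2 | C2) = 0, and P(D2 | C3) = 1.
 The posterior probability P(C1 | D2), using Bayes' theorem, is
$P(C_1 \mid D_2) = \frac{P(D_2 \mid C_1) P(C_1)}{\sum_{i=1}^{3} P(D_2 \mid C_i) P(C_i)} = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2} \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3}} = \frac{1}{3}$
 Similarly, P(C3 | D2) = 2/3, so changing the initial choice doubles the chance
of winning the car.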
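The posterior probabilities in the Monty Hall problem can be computed directly from the priors and likelihoods. This is a minimal sketch, assuming the player has picked door 1 and that Monty chooses at random between doors 2 and 3 when the car is behind door 1 (the likelihood 1/2 is that assumption, not something fixed by the problem statement).

```python
from fractions import Fraction

# Prior probabilities P(C1), P(C2), P(C3): the car is equally likely behind each door.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Likelihoods P(D2 | Ci) when the player holds door 1:
#   car behind 1 -> Monty may open 2 or 3 (assumed 50/50)
#   car behind 2 -> Monty cannot open door 2
#   car behind 3 -> Monty must open door 2
likelihood_D2 = {1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(1)}

# Total probability of the evidence D2, then Bayes' theorem for each door.
evidence = sum(likelihood_D2[i] * prior[i] for i in prior)
posterior = {i: likelihood_D2[i] * prior[i] / evidence for i in prior}

print(posterior[1], posterior[3])
```

Under these assumptions, sticking with door 1 wins with probability 1/3 and switching to door 3 wins with probability 2/3.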
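 Generalization of Bayes' Theorem: If A1, A2, …, An form a partition of the
sample space and B is any event with P(B) > 0, then
$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}$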
 Three machines A, B, and C produce identical items. Of their respective
outputs, 5%, 4%, and 3% of the items are defective. On a certain day A has
produced 25% of the total output, B has produced 30%, and C the
balance. An item is selected at random and is found defective. What is
the probability that it was produced by the machine with greatest
Sol: Let E1, E2, E3 denote the events that an item selected at random is
manufactured by machines A, B, and C respectively, and let D be the event of its
being defective. Then we have P(E1) = 25/100, P(E2) = 30/100, P(E3) = 45/100.
The probability of drawing a defective item manufactured by machine A is
P(D/E1) = 5/100 = 0.05
Similarly P(D/E2) = 4% = 0.04 and P(D/E3) = 3% = 0.03
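The remaining step is to combine these priors and likelihoods with Bayes' theorem. The sketch below computes the posterior probability for every machine, so the answer can be read off for whichever machine the question intends; the dictionary names are illustrative.

```python
from fractions import Fraction

# Share of total output per machine: P(E1), P(E2), P(E3).
prior = {"A": Fraction(25, 100), "B": Fraction(30, 100), "C": Fraction(45, 100)}
# Defect rate per machine: P(D | Ei).
defect_rate = {"A": Fraction(5, 100), "B": Fraction(4, 100), "C": Fraction(3, 100)}

# Total probability of a defective item: P(D) = sum of P(D | Ei) P(Ei).
p_defective = sum(defect_rate[m] * prior[m] for m in prior)

# Posterior P(Ei | D) for each machine.
posterior = {m: defect_rate[m] * prior[m] / p_defective for m in prior}

for m, p in posterior.items():
    print(f"P(machine {m} | defective) = {p} ≈ {float(p):.4f}")
```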
 A random variable is any function that assigns a numerical value to each possible outcome.
 The numerical value should be chosen to quantify an important characteristic of the
outcome. Random variables are denoted by capital letters X,Y, and so on, to
distinguish them from their possible values given in lowercase x, y.
 Suppose that a coin is tossed twice so that the sample space is S = {HH, HT, TH,
TT}. Let X represent the number of heads that can come up. for example, in the
case of HH (i.e., 2 heads), X = 2 while for TH (1 head), X =1. It follows that X is a
random variable.

Sample point   HH   HT   TH   TT
X              2    1    1    0
 Random variables can be classified as discrete or continuous depending on the values that
the random variable can take.
Discrete Random Variables :
 A random variable which takes a finite or at most countable (finite or countably infinite)
number of values is known as a discrete random variable. In other words, a discrete random
variable takes a countable number of possible values.
 Ex: i) Marks obtained by a student in a test
ii) Number of Defective nuts in a lot
iii) The number of cars that pass through a given intersection in an
iv) Number of errors on a page of a book
v) Number of accidents taking place on a busy road.
 Thus, X = {1, 2, 3, 4, 5, 6}
 Another popular example of a discrete random variable is the number of heads when
tossing of two coins. In this case, the random variable X can take only one of the three
choices i.e., 0, 1, and 2.
Continuous Random variable :
 A random variable which takes all the possible values in an interval is called
a continuous random variable.
 Examples i) Waiting time for a bus
ii) Weight, height of the students
Generally, discrete random variables represent counted data while continuous random
variables represent measured data.
Probability Mass Function and Cumulative Distribution Function of a
Discrete Random Variable :
 For a discrete random variable, the probability that the random variable X takes a
specific value xi, P(X = xi), is called the probability mass function P(xi).
Probability Mass Function :
 For the number of heads X in two tosses of a coin, P(X = 0) = 1/4, P(X = 1) = 1/2,
and P(X = 2) = 1/4, so the total probability is
= 1/4 + 1/2 + 1/4 = 1
 The cumulative distribution function, F(xi), is the probability that the random
variable X takes values less than or equal to xi. That is, F(xi) = P(X ≤ xi).
 From the above problem,
 P(X < 2), the probability that the number of heads is less than two, is
 F(1) = P(X = 0) + P(X = 1)
= 1/4 + 1/2
= 0.75
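The probability mass function and cumulative distribution function of the two-coin example can be built by counting outcomes. This is a minimal sketch, assuming a fair coin; `pmf` and `cdf` are illustrative names.

```python
from fractions import Fraction
from itertools import product

# Sample space of two tosses: HH, HT, TH, TT (equally likely).
outcomes = list(product("HT", repeat=2))

# PMF of X = number of heads: add 1/4 for every outcome with that count.
pmf = {}
for outcome in outcomes:
    x = outcome.count("H")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(outcomes))


def cdf(x):
    """F(x) = P(X <= x): sum the PMF over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))


print(pmf)
print(cdf(1), cdf(2))
```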
 Example 2:
 The Cumulative Distribution Function (CDF) is another important concept in
probability theory and statistics, especially when dealing with random variables, whether
discrete or continuous. The CDF provides the probability that a random variable X takes
on a value less than or equal to a specific point x.
 The cumulative distribution function is denoted by F(x) and its formula is given by:
 F(x)=P(X≤x)
 Probability Density Function and Cumulative Distribution Function of a
Continuous Random Variable :

See the below figure .

 A probability distribution is a mathematical function that describes
the probabilities of the different possible values of
a random variable. Probability distributions are often depicted using
graphs or probability tables.
 Example: Probability distribution
 We can describe the probability distribution of one coin flip using a
probability table:
 Outcome    Probability
Heads      0.5
Tails      0.5
Again the probability Distributions are Classified into two types.
1) Discrete probability Distribution
2) Continuous probability Distribution
 A distribution is said to be discrete, if the value taken by the corresponding
random variable are discrete, whereas a distribution is said to be Continuous,
if the random variable takes any value in a specified interval.
 In this chapter we discuss the following distributions:
 1) Discrete probability distributions
a) Binomial Distribution
b) Poisson Distribution
c) Geometric Distribution
d) Bernoulli Distribution
 2) Continuous probability distributions
a) Normal Distribution
b) Exponential Distribution
c) Weibull Distribution
Binomial Distribution :
 Binomial distribution is one of the most important discrete probability
distribution due to its applications in several contexts. A random variable X is said
to follow a Binomial distribution when
1. The random variable can have only two outcomes success and failure (also
known as Bernoulli trials).
2. The objective is to find the probability of getting k successes out of n trials.
3. The probability of success is p and thus the probability of failure is (1 − p).
4. The probability p is constant and does not change between trials.
5. Success and failure are generic terminologies used in binomial distribution;
based on the context, the interpretation will change (winning a lottery can be
considered as success and not winning as failure).
Probability Mass Function (PMF) of Binomial Distribution : The PMF of the
Binomial distribution (probability that the number of successes will be exactly x out of
n trials) is given by
$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \quad x = 0, 1, 2, \dots, n$
 In Microsoft Excel, the function 'BINOM.DIST(x, n, p, false)' can be used for
calculating the probability mass function of a binomial distribution.
Cumulative Distribution Function (CDF) of Binomial Distribution : The CDF of a
binomial distribution, F(a), representing the probability that the random
variable X takes a value less than or equal to a, is given by
$F(a) = P(X \le a) = \sum_{x=0}^{a} \binom{n}{x} p^{x} (1 - p)^{n - x}$
 In Microsoft Excel, the function ‘BINOM.DIST(x, n, p, true)’ can be used for
calculating the cumulative distribution function of a binomial distribution.
Mean and Variance of Binomial Distribution:
 The mean of a binomial distribution is given by
$E(X) = np$
 The variance of a binomial distribution is given by
$V(X) = np(1 - p)$
 Approximation of Binomial Distribution using Normal Distribution: If the
number of trials (n) in a binomial distribution is large, then it can be
approximated by a normal distribution with mean np and variance npq, where
q = 1 - p.
Binomial Probability:
 Let X be a binomial random variable. Then, its probability mass function is:
$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$
 for x = 0, 1, 2, . . . , n. The values of n and p are called the parameters of the
binomial distribution.
Consider an exam that contains 10 multiple-choice questions
with 4 possible choices for each question, only one of which
is correct.
 Suppose a student is to select the answer for every question randomly.
Let X be the number of questions the student answers correctly. Then,
X has a binomial distribution with parameters n = 10 and p = 0.25.
(Convince yourself that all assumptions for a binomial distribution are
reasonable in this setting.)
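 What is the probability for the student to get no answer correct?
What is the probability for the student to get two answers correct?
 Answer: P(X = 0) = (0.75)^10 ≈ 0.0563 and
P(X = 2) = 10C2 (0.25)^2 (0.75)^8 ≈ 0.2816.
 What is the probability for the student to fail the test (i.e., to have less
than 6 correct answers)?
 Answer: P(X ≤ 5) = P(X = 0) + P(X = 1) + … + P(X = 5) ≈ 0.9803.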
 Binomial Mean and Variance:
 Mean= np
 Variance=np(1-p)
Binomial Mean E(X) = 10 * 0.25 = 2.5.
Variance V (X) = 10 * (0.25) * (1 − 0.25) = 1.875.
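The exam example can be worked through in code using only the binomial PMF. This is a minimal sketch with n = 10 and p = 0.25 as given in the text; the helper names `binom_pmf` and `binom_cdf` are illustrative and play the same role as Excel's BINOM.DIST with the last argument false or true.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(a, n, p):
    """P(X <= a): sum of the PMF from 0 up to and including a."""
    return sum(binom_pmf(x, n, p) for x in range(a + 1))


n, p = 10, 0.25
p_none = binom_pmf(0, n, p)   # no answer correct
p_two = binom_pmf(2, n, p)    # exactly two answers correct
p_fail = binom_cdf(5, n, p)   # fewer than 6 correct
mean = n * p
variance = n * p * (1 - p)

print(f"P(X=0)={p_none:.4f}  P(X=2)={p_two:.4f}  P(X<=5)={p_fail:.4f}")
print(f"mean={mean}  variance={variance}")
```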
 Poisson Distribution
 Poisson Distribution is a Probability distribution that is used to show how many times
an event occurs over a specific period.
 It is the discrete probability distribution of the number of events occurring in a given
time period, given the average number of times the event occurs over that time
period. It is the distribution related to probabilities of events that are extremely rare
but have a large number of independent opportunities for occurrence.
 Poisson Distribution Definition
 Poisson distribution is used to model the number of events that occur in a fixed
interval of time or space, given the average rate of occurrence, assuming that the
events happen independently and at a constant rate
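Poisson distribution formula:
$P(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \quad k = 0, 1, 2, \dots$
where λ is the average number of events in the interval.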
Mean and Variance of Poisson distribution:
 The Poisson distribution has only one parameter, called λ.

 The mean of a Poisson distribution is λ or (µ)

 The variance of a Poisson distribution is also λ or (σ²)

 In most distributions, the mean is represented by µ (mu) and the variance
is represented by σ² (sigma squared). Because these two parameters are
the same in a Poisson distribution, we use the λ symbol to represent both.
1. An average of 0.61 soldiers died by horse kicks per year in each Prussian army corps. You
want to calculate the probability that exactly two soldiers died in the VII Army Corps in
1898, assuming that the number of horse kick deaths per year follows a Poisson
distribution.
Sol: With λ = 0.61, P(X = 2) = e^(−0.61) (0.61)^2 / 2! ≈ 0.101.
2. The number of typographical errors in a "big" textbook is Poisson
distributed with a mean of 1.5 per 100 pages.
Suppose 100 pages of the book are randomly selected. What is the
probability that there are no typos?
 Sol: With λ = 1.5, P(X = 0) = e^(−1.5) ≈ 0.2231.
 Suppose 400 pages of the book are randomly selected. What are the
probabilities for having no typos and for having five or fewer typos?
 Sol: For 400 pages the mean is λ = 4 × 1.5 = 6, so P(X = 0) = e^(−6) ≈ 0.0025
and P(X ≤ 5) ≈ 0.4457.
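Both Poisson problems can be evaluated with a few lines of code. This is a minimal sketch using the standard Poisson probability e^(−λ) λ^k / k!; the helper names are illustrative, and the 400-page case assumes the mean scales proportionally to 4 × 1.5 = 6.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k): sum of the PMF from 0 up to and including k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


# Problem 1: horse-kick deaths, lambda = 0.61 per corps per year.
p_two_deaths = poisson_pmf(2, 0.61)

# Problem 2: typos, lambda = 1.5 per 100 pages, so 6 per 400 pages.
p_no_typos_100 = poisson_pmf(0, 1.5)
p_no_typos_400 = poisson_pmf(0, 6.0)
p_five_or_fewer_400 = poisson_cdf(5, 6.0)

print(f"{p_two_deaths:.4f} {p_no_typos_100:.4f} "
      f"{p_no_typos_400:.4f} {p_five_or_fewer_400:.4f}")
```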
 The normal distribution is the most widely known and used of all
distributions. Because the normal distribution approximates many natural
phenomena so well, it has developed into a standard of reference for many
probability problems.
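 Let X be a continuous random variable. It is said to follow a normal
distribution if its probability density function is given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
 Here μ and σ are the mean and standard deviation of X.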

Properties Of Normal Distribution :
 It is a two-parameter distribution, where the parameter μ is the mean
(location parameter) and the parameter σ is the standard deviation (scale
parameter).
1. The normal curve is always centered at the mean.
2. Mean, median and mode coincide (i.e., are equal).
3. It is unimodal.
4. It is a symmetrical, bell-shaped curve.
5. The X-axis is an asymptote to the normal curve.
6. The total area under the normal curve from −∞ to ∞ is 1.
7. The points of inflection of the normal curve are at μ ± σ.
8. The area under the normal curve between
μ − σ and μ + σ = 68.27%
μ − 2σ and μ + 2σ = 95.45%
μ − 3σ and μ + 3σ = 99.73%
 Standard Normal Variable: A normal variable with mean 0 and variance 1 is
said to be a standard normal variable.
Standard Normal Distribution :
 The normal distribution with mean 0 and variance 1 is said to be the standard normal
distribution. Its probability density function is defined by
$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty$
 By using the following transformation, any normal random variable X can be
converted into a standard normal variable:
$Z = \frac{X - \mu}{\sigma}$
 The random variable X can be written in the form of a standard normal
random variable using the relationship
$X = \mu + \sigma Z$
 Thus, any normal random variable X can be expressed using the standard
normal random variable Z.
Solved Examples
1. Calculate the probability density of a normal distribution with population mean
2 and standard deviation 3 at the value x = 5 of the random variable.
Mean = μ = 2
Standard Deviation = σ = 3
We solve the question with the help of the above normal
probability density formula:
f(5) = (1/(3√(2π))) e^(−(5 − 2)²/(2 · 3²)) = (1/(3√(2π))) e^(−0.5) ≈ 0.0807
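The normal density and the area properties can be evaluated with the standard library alone. This is a minimal sketch for μ = 2, σ = 3, x = 5; `normal_pdf` and `normal_cdf` are illustrative helper names, and the CDF is expressed through the error function `math.erf`.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std dev sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu, sigma):
    """P(X <= x), computed from the standardized value z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))


mu, sigma = 2, 3
density_at_5 = normal_pdf(5, mu, sigma)

# Area between mu - k*sigma and mu + k*sigma for k = 1, 2, 3.
within = {
    k: normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)
    for k in (1, 2, 3)
}

print(f"f(5) = {density_at_5:.4f}")
for k, area in within.items():
    print(f"within {k} sigma: {area:.4%}")
```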
 Definition: A portion of the population which is examined with a
view to determining the population characteristics is called a sample.
 In other words, a sample is a subset of the population. The size of the sample
is denoted by n. The process of selecting a sample is called sampling.
 There are different methods of sampling:
Probability Sampling Methods
Non-Probability Sampling Methods
Probability Sampling Methods :
a) Random Sampling (Probability Sampling): It is the process of drawing a sample from a
population in such a way that each member of the population has an equal chance of being included in
the sample.
Example: A hand of cards from a well shuffled pack of cards is a random sample.
Note: If N is the size of the population and n is the size of the sample, then
the number of samples with replacement = N^n
the number of samples without replacement = NCn
b) Stratified Sampling : In this , the population is first divided into several smaller groups called strata
according to some relevant characteristics .
 From each stratum, samples are selected at random, and all the samples are combined together to form the
stratified sample.
c) Cluster Sampling :
 In cluster sampling, the population is divided into mutually exclusive clusters.
 For example, assume that a researcher is interested in analyzing the life of smart phone batteries from a
specific manufacturer. The manufacturer may have different models (each model in this case will be a
cluster).
d) Systematic Sampling (Quasi Random Sampling): In this method, all the units of the population
are arranged in some order. If the population size is N and the sample size is n, then we first define
the sampling interval as N/n.
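The probability sampling methods can be illustrated on a small made-up population. This is a minimal sketch, assuming a hypothetical population of 100 unit IDs, two hypothetical strata, and a fixed random seed for reproducibility; systematic sampling here takes a random start in the first interval and then every (N/n)-th unit, which is one common convention.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is reproducible
population = list(range(1, 101))   # hypothetical population, N = 100
n = 10                             # desired sample size

# Simple random sampling without replacement: every unit is equally likely.
simple_sample = rng.sample(population, n)

# Stratified sampling: split into strata, sample at random within each, combine.
strata = {"low": population[:50], "high": population[50:]}  # hypothetical strata
stratified_sample = [
    unit for units in strata.values() for unit in rng.sample(units, n // 2)
]

# Systematic sampling: sampling interval N/n, random start, then every k-th unit.
interval = len(population) // n
start = rng.randrange(interval)
systematic_sample = population[start::interval]

print(simple_sample)
print(stratified_sample)
print(systematic_sample)
```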
Non Probability Sampling Methods:
 Sample units are selected based on convenience and/or on voluntary basis.
Ex: Assume that a data scientist is interested in studying attrition and factors
influencing attrition. For this study, he/she may collect data from his friends and
colleagues which may not be true representation of the population. Such
sampling procedures come under the category of non-probability sampling.
Convenience Sampling :
Convenience sampling is a non-probability sampling technique in which the sample
units are not selected according to a probability distribution. For example, a
researcher may collect data from his/her school or work place and from his/her
friends since the cost of data collection in such cases is minimal. Convenience
sampling is not recommended since it is likely to result in biased estimates.
Voluntary Sampling : Under voluntary sampling, the data is collected from people
who volunteer for such data collection. For example, customer feedback in many
contexts falls under this sampling procedure. There could be bias in the case of voluntary
sampling. Many organizations such as Amazon and Trip Advisor collect customer
feedback. Many times the feedback is provided by customers who had a bad
experience with the product/service; many customers who were happy with the
product/service may not give feedback.
Purposive (Judgment) Sampling : In this method, the members constituting the
sample are chosen not according to some definite scientific procedure, but
according to the convenience and personal choice of the individual who selects the
sample. The choice of the individual items of a sample depends entirely on the
individual judgment of the investigator.
Sequential Sampling: It consists of a sequence of samples drawn one after another
from the population. If the result of the first sample is not acceptable, then a
second sample is drawn, and the process continues until a proper decision can be
taken. But if the first sample is acceptable, then no
new sample is drawn.
Classification of Samples:
 Large Samples : If the size of the sample n ≥ 30 , then it is said to
be large sample.
 Small Samples : If the size of the sample n < 30 ,then it is said to
be small sample or exact sample.
Parameters and Statistics:
 A parameter is a statistical measure based on all the units of a population.
 A statistic is a statistical measure based on only the units selected in a sample.
 Note: In this unit, parameter refers to the population and statistic
refers to the sample.
 Sampling distribution refers to the probability distribution of a
statistic such as sample mean and sample standard deviation
computed from several random samples of same size.
 Understanding the sampling distribution is important for
hypothesis testing. Test statistic in hypothesis testing is derived
based on the knowledge of sampling distribution.
 In this example, the population is the weight of six pumpkins (in
pounds) displayed in a carnival "guess the weight" game booth. You
are asked to guess the average weight of the six pumpkins by taking
a random sample without replacement from the population.
Since we know the weights from the population, we can find the population
mean.
To demonstrate the sampling distribution, let’s start with obtaining all of the
possible samples of size n=2 from the populations, sampling without
replacement. The table below shows all the possible samples, the weights for the
chosen pumpkins, the sample mean and the probability of obtaining each sample.
 The mean of the sample means is :
 =9.5(1/15)+11.5(1/15)+12(2/15)+12.5(1/15)+13(1/15)+13.5(1
 = 14
 Now, let's do the same thing as above but with sample size n=5
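The key property of the sampling distribution of the mean can be checked by listing every possible sample. This is a minimal sketch: the six weights below are an assumption, chosen only to be consistent with the sample means and the population mean of 14 quoted above, since the original table of weights is not reproduced here.

```python
from fractions import Fraction
from itertools import combinations

# Assumed pumpkin weights in pounds (illustrative; see the note above).
weights = [19, 14, 15, 9, 10, 17]
population_mean = Fraction(sum(weights), len(weights))


def sample_means(n):
    """Means of all samples of size n drawn without replacement."""
    return [Fraction(sum(s), n) for s in combinations(weights, n)]


def mean_of_sample_means(n):
    """Average of the sample means when every sample is equally likely."""
    means = sample_means(n)
    return sum(means) / len(means)


print(population_mean)
print(len(sample_means(2)), mean_of_sample_means(2))
print(len(sample_means(5)), mean_of_sample_means(5))
```

For both n = 2 (15 possible samples) and n = 5 (6 possible samples), the mean of the sample means equals the population mean, while the spread of the sample means shrinks as n grows.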
 Central Limit Theorem: If x̄ is the mean of a random sample of size n
drawn from a population having mean μ and standard deviation σ, then
the sampling distribution of the sample mean x̄ is approximately a normal
distribution with mean μ and SD = S.E. of x̄ = σ/√n, provided the
sample size n is large.
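The theorem can be illustrated with a small simulation. This is a minimal sketch, assuming a uniform population on [0, 1) (mean 0.5, standard deviation √(1/12)), a sample size of n = 30, 2000 repetitions, and a fixed seed; these choices are illustrative and not taken from the text.

```python
import random
import statistics

rng = random.Random(0)      # fixed seed for reproducibility
n = 30                      # sample size
repetitions = 2000          # number of samples drawn

mu = 0.5                    # mean of the uniform(0, 1) population
sigma = (1 / 12) ** 0.5     # standard deviation of the uniform(0, 1) population

# Draw many samples and record each sample mean.
sample_means = [
    statistics.fmean([rng.random() for _ in range(n)]) for _ in range(repetitions)
]

observed_mean = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = sigma / n**0.5   # standard error sigma / sqrt(n)

print(f"mean of sample means = {observed_mean:.4f} (population mean {mu})")
print(f"SD of sample means   = {observed_se:.4f} (sigma/sqrt(n) = {theoretical_se:.4f})")
```

Even though the population is far from bell-shaped, the sample means cluster around μ with a spread close to σ/√n.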
 Estimate : An estimate is a statement made to find an unknown population
parameter.
 Estimator : The procedure or rule to determine an unknown population
parameter is called an estimator.
Example: The sample proportion is an estimate of the population proportion, because
with the help of the sample proportion value we can estimate the population
proportion value.
Types of Estimation:
 Point Estimation: If the estimate of the population parameter is given by a
single value, then the estimate is called a point estimate of the parameter.
 Interval Estimation: If the estimate of the population parameter is given by
two different values between which the parameter is expected to lie, then the estimate is
called an interval estimate of the parameter.
 A hypothesis is a claim or belief. Hypothesis testing is a statistical process of
either rejecting or retaining a claim, belief, or association related to a
business context, product, service, process, etc.
 Hypothesis testing consists of two complementary statements called null
hypothesis and alternative hypothesis, and only one of them is true.
 Null hypothesis is the claim that is assumed to be true initially. That is at the
beginning we assume that the null hypothesis is true and try to retain it
unless there is strong evidence against null hypothesis.
 The alternative hypothesis, usually denoted as HA (or H1), is the complement
of the null hypothesis. The alternative hypothesis is what the researcher believes to
be true; the researcher would therefore like to reject the null hypothesis.
 Hypothesis testing is an integral part of many predictive analytics
techniques such as multiple linear regression and logistic regression.
 In business, many claims are made by organizations. Few examples of such
claims are listed below:
 1. Children who drink the health drink Complan (a health drink owned by
the company Heinz in India) are likely to grow taller.
 2. If you drink Horlicks, you can grow taller, stronger, and sharper (3 in 1).
 3. Using fair and lovely (fair and handsome) cream can make one fair and
lovely (fair and handsome).
 4. Wearing perfume (such as Axe) will help to attract opposite gender
(known as Axe effect).
 5. Women use camera phone more than men (Freier, 2016).
 There are many such claims and beliefs; many business rules and strategies
are generated based on these hypotheses. The question is how can we check
whether these are actually true. Hypothesis testing is used for checking the
validity of the claim using evidence found in a sample data.
 Decide the criteria for rejection and retention of the null hypothesis. This is called
the significance value, traditionally denoted by the symbol α. The value of α will
depend on the context; usually 0.1, 0.05, and 0.01 are used.
 Calculate the p-value (probability value), which is the conditional probability
of observing a test statistic value at least as extreme as the one obtained from the
sample when the null hypothesis is true. In simple terms, a small p-value means the
sample is hard to reconcile with the null hypothesis.
 Take the decision to reject or retain the null hypothesis based on the p-value
and significance value α. The null hypothesis is rejected when the p-value is less
than α and the null hypothesis is retained when the p-value is greater than or equal
to α.
 Equivalently, the calculated statistic can be compared with a critical value. In a
left-tailed test, if the calculated statistic value is less than the critical value (the
p-value will be less than α), then we reject the null hypothesis, whereas if the statistic
value is greater than the critical value (the p-value will be greater than α), then we
retain the null hypothesis.
 In hypothesis test we end up with the following two decisions:
1. Reject null hypothesis.
2. Fail to reject (or retain) null hypothesis.
 Type I Error: Rejecting a null hypothesis when it is true is called a
Type I Error or False Positive (falsely believing
that the claim made in the alternative hypothesis is true).
 A type I error (false positive) occurs if an investigator rejects a null
hypothesis that is actually true in the population.
 The significance value α is the probability of a Type I error.
 Type I Error = α = P(Rejecting null hypothesis | H0 is true)
 The probability value (p-value) is computed from the observed sample and measures
how compatible it is with the null hypothesis, whereas the significance value α is
the error rate based on repetitive sampling.
 Type II Error: Conditional probability of failing to reject a null
hypothesis (or retaining a null hypothesis) when the alternative hypothesis
is true is called Type II Error or False Negative (falsely believing that there
is no relationship).
 A type II error (false-negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population.
 Usually Type II error is denoted by the symbol ß.
 Type II Error = ß = P(Retain null hypothesis | H0 is false)
 The value (1 − ß ) is known as the power of hypothesis test.
 Power of the test = 1 − ß = 1 − P(Retain null hypothesis | H0 is false)
 Alternatively, the power of test = 1 − ß = P(Reject null hypothesis | H0 is false)
 False-positive and false-negative results can also occur because of bias.
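The decision rule and the error probabilities above can be illustrated with a short sketch. It assumes a right-tailed z-test with a known population standard deviation, which the notes do not specify; the function names and all numbers are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def right_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, decision) for H0: mu = mu0 vs HA: mu > mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p_value = 1.0 - STD_NORMAL.cdf(z)  # P(Z >= z | H0 is true)
    decision = "reject H0" if p_value < alpha else "retain H0"
    return z, p_value, decision


def power_right_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power = 1 - beta when the true mean is mu1 (assumed mu1 > mu0)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)   # critical value for alpha
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # standardized true effect
    beta = STD_NORMAL.cdf(z_crit - shift)      # P(retain H0 | H0 is false)
    return 1.0 - beta


# Hypothetical numbers: H0 mean 50, sample mean 52, sigma 5, n = 25.
z, p_value, decision = right_tailed_z_test(52, 50, 5, 25, alpha=0.05)
power = power_right_tailed(50, 52, 5, 25, alpha=0.05)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
print(f"power when the true mean is 52: {power:.3f}")
```

Lowering α in this sketch reduces the chance of a Type I error but also lowers the power, i.e. it raises ß.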
T-test :
 The t-test is used when the population follows a normal distribution and the population standard
deviation σ is unknown and is estimated from the sample. The t-test is robust to violations of
normality as long as the data are close to symmetric and there are no outliers.
 Let S be the standard deviation estimated from the sample of size n. Then the statistic
$$ t = \frac{\bar{X} - \mu}{S / \sqrt{n}} $$
will follow a t-distribution with (n − 1) degrees of freedom if the sample is drawn from a
population that follows a normal distribution. Here \bar{X} is the sample mean and μ is the
hypothesized population mean. One degree of freedom is lost since the standard
deviation is estimated from the sample. Thus, we use the t-statistic (hence the test is called t-test) to
test the hypothesis when the population standard deviation is unknown.
 The t-test is a statistical test procedure that tests whether there is a
significant difference between the means of two groups.
EX: The two groups could be, for example, patients who received drug
A once and drug B once, and you want to know if there is a difference in
blood pressure between these two groups.
Types of t-test :
 There are three different types of t-tests.
One-sample t-test
 We use the one-sample t-test when we want to compare the mean of a sample with a known
reference mean.
 Example : A manufacturer of chocolate bars claims that its chocolate bars weigh 50 grams on
average. To verify this, a sample of 30 bars is taken and weighed. The mean value of this sample is
48 grams.
Independent-sample t-test
 We use the t-test for independent samples when we want to compare the means of two
independent groups or samples. We want to know if there is a significant difference between these
two groups.
 Example : We would like to compare the effectiveness of two painkillers, drug A and drug B.
Paired-sample t-test
 The t-test for dependent samples is used to compare the means of two dependent groups.
Example : We want to know how effective a diet is. To do this, we weigh 30 people before the diet
and exactly the same people after the diet.
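The three types of t-test can be sketched as follows. This is a minimal illustration using only the Python standard library; the data values are hypothetical (the notes give no raw data), the independent-sample version assumes equal variances (pooled estimate), and only the t-statistic and degrees of freedom are computed. The p-value would then be read from a t-table or obtained from a statistics library such as SciPy.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_t(sample, mu0):
    """t = (mean - mu0) / (S / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


def independent_t(group_a, group_b):
    """Two independent samples, assuming equal variances (pooled)."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * variance(group_a)
                  + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def paired_t(before, after):
    """Paired samples: a one-sample t-test on the differences against 0."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0.0)


# Hypothetical data for illustration only.
bar_weights = [48, 49, 47, 50, 46]            # claimed mean: 50 grams
t_one, df_one = one_sample_t(bar_weights, 50)

drug_a = [1, 2, 3]                            # made-up pain scores
drug_b = [4, 5, 6]
t_ind, df_ind = independent_t(drug_a, drug_b)

weight_before = [80, 75, 90, 85]              # same people, before the diet
weight_after = [78, 74, 86, 83]               # and after the diet
t_pair, df_pair = paired_t(weight_before, weight_after)

print(f"one-sample:  t = {t_one:.3f}, df = {df_one}")
print(f"independent: t = {t_ind:.3f}, df = {df_ind}")
print(f"paired:      t = {t_pair:.3f}, df = {df_pair}")
```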
Chi-Square Goodness of Fit Tests
 Goodness of fit tests are hypothesis tests that are used for comparing the
observed distribution of data with expected distribution of the data to
decide whether there is any statistically significant difference between the
observed distribution and a theoretical distribution based on comparison
of observed frequencies in the data and the expected frequencies if the data
follows a specified theoretical distribution.
 The null and alternative hypotheses in chi-square goodness of fit tests are
H0: There is no statistically significant difference between the observed
frequencies and the expected frequencies from a hypothesized distribution.
HA: There is a statistically significant difference between the observed
frequencies and the expected frequencies from a hypothesized distribution.
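 Let Z be a standard normal random variable. Then Z² follows a chi-square
distribution with 1 degree of freedom.
 If we have k independent normal random variables, namely, X1, X2, …, Xk, with
means μi and standard deviations σi, then a chi-square distribution with
k degrees of freedom is given by
$$ \chi^2_k = \sum_{i=1}^{k} \left( \frac{X_i - \mu_i}{\sigma_i} \right)^2 $$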
 Consider a binomial random variable with parameter p (probability of
success) and number of trials n.
 Then for a large sample, the standardized random variable
$$ Z = \frac{X - np}{\sqrt{np(1 - p)}} $$
where X is the number of successes, follows approximately a standard normal
distribution (central limit theorem).
 Note that np and n(1 − p) are the expected values of two categories (success
and failure) of the binomial distribution.

 Thus, the chi-square statistic for goodness of fit test is given by
$$ \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
 where Oij is the observed frequency in category (i, j) and Eij is the expected
frequency in the category (i, j). Thus, chi-square test is always a right-tailed
test.
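As a rough illustration of the goodness of fit statistic, the sketch below sums (observed − expected)² / expected over the categories. It uses a single category index, and the die-rolling data, the function name and the choice of α = 0.05 are assumptions made only for this example. The critical value 11.070 is the standard chi-square table value for 5 degrees of freedom at α = 0.05.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: a die rolled 60 times; H0 says every face is equally likely.
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1          # categories minus 1 (no parameters estimated)
critical_value = 11.070         # chi-square table value for df = 5, alpha = 0.05

# Right-tailed test: reject H0 only when the statistic exceeds the critical value.
decision = "reject H0" if chi2 > critical_value else "retain H0"
print(f"chi-square = {chi2:.2f}, df = {df}, decision: {decision}")
```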
 The objective of ANOVA is to check simultaneously whether the population
means of more than two populations are different.
 ANOVA stands for Analysis of Variance. It is a statistical method used to
analyze the differences between the means of two or more groups.
 It is often used to determine whether there are any statistically significant
differences between the means of different groups.
 ANOVA is used to compare treatments, analyze factors impact on a
variable, or compare means across multiple groups.
 Types of ANOVA include one-way (for comparing means of groups) and
two-way (for examining effects of two independent variables on a
dependent variable).
 One-way analysis of variance (ANOVA) : It is a statistical method
for testing for differences in the means of three or more groups.
 In statistics, ANOVA also uses a null hypothesis and an alternate hypothesis.
 The null hypothesis in ANOVA states that all the group means are
equal, i.e. they do not have any significant difference.
 On the other hand, the alternate hypothesis states that at least one of
the group means is different from the rest. In
mathematical form, they can be represented as:
$$ H_0: \mu_1 = \mu_2 = \dots = \mu_k \qquad H_A: \text{not all } \mu_i \text{ are equal} $$
 where μi is the mean of the i-th level of the factor.
Ex for One –way ANOVA:
 Suppose you are studying the effectiveness of three different drugs (Drug
A, Drug B, and Drug C) in reducing blood pressure. You randomly assign
90 patients to one of the three drug groups and measure their blood
pressure after one month of treatment. The blood pressure measurements
(in mmHg) for each patient are recorded and prepared as a dataset.
 In this dataset, each drug group represents a separate treatment or
condition, and the blood pressure measurements for each patient in that
group are recorded.
 To analyze this dataset using ANOVA, you would compare the means of
the blood pressure measurements among the three drug groups to
determine if there is a statistically significant difference.
Two-Way ANOVA : The two-way ANOVA technique is used
when the data are classified on the basis of two factors.
 Ex: the agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of the different varieties of
fertilizers used.
 It is a statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable.
 Two way ANOVA test analyzes the effect of the independent variables
on the expected outcome along with their relationship to the
outcome itself.
Ex for TWO –way ANOVA
 Two-way (or two factor) analysis of variance tests whether there is a
difference between more than two independent samples split between
two variables or factors.
 A factor is, for example, the gender of a person with the characteristics
male and female, the form of therapy used for a disease with therapy A,
B and C or the field of study with, for example, medicine, business
administration, psychology and math.
 In addition to gender, the highest level of education also has an influence
on salary.
 besides therapy, gender also has an influence on blood pressure.
 In addition to the field of study, the university attended also has an
influence on the duration of studies.
Now in all three cases you would not have one factor, but two factors
each. And since you now have two factors, you use the two-way
analysis of variance.
Formulas of ANOVA:
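 Sum of Squares of Total Variation (SST):
$$ SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu)^2 $$
where Y_ij is the j-th observation in group i, k is the number of groups, n_i is
the number of observations in group i, and μ is the overall mean of all n observations.
 Mean Square Total (MST) variation is given by
$$ MST = \frac{SST}{n - 1} $$
 Sum of Squares of Between (SSB) Group Variation:
$$ SSB = \sum_{i=1}^{k} n_i (\mu_i - \mu)^2 $$
 Mean square between variation (MSB) is given by
$$ MSB = \frac{SSB}{k - 1} $$
 Sum of Squares of Within (SSW) Group Variation:
$$ SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \mu_i)^2 $$
 The mean square of variation within the group is
$$ MSW = \frac{SSW}{n - k} $$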
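The sums of squares for a one-way ANOVA can be computed with a short sketch like the one below. The three small groups are hypothetical data, the function name is an assumption, and the final ratio F = MSB / MSW is the usual one-way ANOVA test statistic, which would be compared with an F-table value for (k − 1, n − k) degrees of freedom.

```python
from statistics import mean


def one_way_anova(groups):
    """Return SST, SSB, SSW, MSB, MSW and F for a list of groups."""
    all_values = [y for group in groups for y in group]
    n, k = len(all_values), len(groups)
    overall_mean = mean(all_values)

    # Total variation: every observation against the overall mean.
    sst = sum((y - overall_mean) ** 2 for y in all_values)
    # Between-group variation: group means against the overall mean.
    ssb = sum(len(g) * (mean(g) - overall_mean) ** 2 for g in groups)
    # Within-group variation: observations against their own group mean.
    ssw = sum((y - mean(g)) ** 2 for g in groups for y in g)

    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return {"SST": sst, "SSB": ssb, "SSW": ssw,
            "MSB": msb, "MSW": msw, "F": msb / msw}


# Hypothetical reductions in blood pressure for three drug groups.
drug_a = [1, 2, 3]
drug_b = [2, 3, 4]
drug_c = [5, 6, 7]

result = one_way_anova([drug_a, drug_b, drug_c])
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

A useful check on any such calculation is that SST equals SSB + SSW.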

Correlation Analysis

1.Simple Correlation Coefficient

Interpretation, Scatter plot.
 Correlation is a statistical measure of an association
relationship between two random variables.
 A correlation coefficient is a statistical measure of the degree
to which changes to the value of one variable predict change
to the value of another.
 "Correlation means that between two series or groups of
data, there exists some causal connection."
 Ex: Mobile service providers collect data on variables
such as call duration, number of calls, numbers to which the calls are
made, number of calls received, the device that was used to make the
call, location (and mobile tower that the phone was attached to), time
between calls, last recharge (in case of pre-paid mobile services),
recharge amount, service plan (in case of post-paid connection),
number of messages sent, number of messages received, apps
downloaded, time spent on surfing internet, and so on. The number of
variables collected and new variables generated may exceed several
thousands. The idea behind collecting all these variables is to find
answer to questions such as
 1. Which customer is likely to churn?
 2. What is the customer lifetime value?
 3. What is the best service plan for a customer?
 4. What recommendations can be made to a customer?
Importance of Correlation:
 The study of Correlation shows the direction and degree of relationship
between the variables .
 It is very helpful in understanding economic behaviour .
 Study of correlation reduces the range of uncertainty in matters of prediction.
 Helpful in investigation and research.
 It is also helpful in policy formulation
Types of Correlation:
Correlation can be:
 Positive and Negative Correlation
 Linear and Non- Linear Correlation
 Simple ,Multiple and Partial Correlation
Positive Correlation :
 When two variables X and Y move in the same direction, i.e. when one
increases the other also increases and when one decreases the other also
decreases, the correlation between the two is positive.
Negative Correlation:
 If the two variables vary in opposite directions, the correlation is said to be
negative. It means that if one variable increases while the other variable
decreases, or if one variable decreases while the other variable increases,
then the correlation is said to be negative correlation.
Linear Correlation :
 If the ratio of change between two variables is uniform, it is called linear
correlation. If the changes are plotted on graph paper, their relationship
will be indicated by a straight line.
Non- Linear Correlation :
 If the ratio of change between two variables is not uniform, it is called
non-linear correlation. If these changes are plotted on graph paper,
they will not form a straight line but a curve.
Simple Correlation:
 Relationship between two variables is known as simple correlation. For
example, the relationship between price and demand of a commodity.
 Ex 2: Yield of paddy and the use of fertilizers is an example of simple
correlation, as the yield of paddy depends on the use of fertilizers, i.e. the presence
of one variable affects another variable.
Multiple Correlation:
 When the relationship among three or more variables is studied
simultaneously, it is called multiple correlation. For example, agricultural
production depends on rainfall, amount of manures, seeds etc. This will be
called multiple correlation.
Partial Correlation:
 Relationship between two variables is established keeping other variables
constant. For example, if we study the relationship between degree of
rainfall and agricultural production, assuming the amount of fertilizers and the quality
of seeds as constant, it will be known as partial correlation.
 Degree Of Correlation :
Karl Pearson’s Coefficient of Correlation:
 A mathematical method for measuring the linear relationship
between the variable X and Y was suggested by the great
biologist and statistician Karl Pearson.
 This method is also called Product Moment Method.
 The coefficient of correlation is denoted by the symbol “r”.
 If the two variables under study are X and Y, the following
formula suggested by Karl Pearson can be used for measuring
the degree of relationship of correlation.
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$
Here, r = coefficient of correlation, and \bar{X} and \bar{Y} are the means of X and Y.
 The Karl Pearson correlation coefficient lies between -1 and +1, i.e., -1 ≤ r ≤ +1.
 If r = 0, there is no linear correlation between the variables.
 If r = +1, the correlation is perfect positive.
 If r = -1, the correlation is perfect negative.
Merits:
 Practical and popular method.
 Meaningful conclusion.
 Measurement of degree and direction simultaneously.
Demerits:
 Greater influence of extreme values.
 Calculation process is long and time consuming.
 Possibility of wrong interpretation.
 Assumption of linear relationship between the variables.
 Example – Correlation of statistics and science tests
1. A study is conducted involving 10 students to investigate the association
between statistics and science tests. The question arises here: is there a
relationship between the marks obtained by the 10 students in the statistics and
science tests?
As per the calculation, the correlation coefficient is r = 0.761,
so there is a high degree of positive correlation.
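A minimal sketch of the product moment calculation is given below. The marks are hypothetical (they are not the data of the 10 students in the example above), and the function name is an assumption made for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Karl Pearson's coefficient: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    sum_dxdy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_dx2 = sum((x - x_bar) ** 2 for x in xs)
    sum_dy2 = sum((y - y_bar) ** 2 for y in ys)
    return sum_dxdy / sqrt(sum_dx2 * sum_dy2)


# Hypothetical marks of five students in two tests.
statistics_marks = [1, 2, 3, 4, 5]
science_marks = [2, 4, 5, 4, 5]

r = pearson_r(statistics_marks, science_marks)
print(f"r = {r:.3f}")
```

A value of r close to +1 indicates a high degree of positive correlation, in the same way as the r = 0.761 reported in the example.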
Spearman’s Rank Coefficient of Correlation:
 When quantification of variables becomes difficult, such as beauty,
leadership ability, knowledge of a person etc., then this
method of rank correlation is useful. It was developed by the
British psychologist Charles Edward Spearman in 1904. In this
method ranks are allotted to each element either in ascending or
descending order.
 To find out correlation under this method, the following
formula is used.
$$ R = 1 - \frac{6 \sum D^2}{N (N^2 - 1)} $$
 Here, R = rank coefficient of correlation, ΣD² = the total of
squares of differences of corresponding ranks.
 N = number of pairs of observations.
 As in the case of r, -1 ≤ R ≤ +1.
Merits:
 Its calculation is easier as compared to Karl Pearson's method.
 This method can be used as a measure of degree of association
between qualitative variables.
Demerits:
 This method is not suitable for calculating the coefficient of
correlation of a grouped frequency distribution.
 If the number of items is large, this method becomes difficult.
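The rank correlation formula R = 1 − 6ΣD² / (N(N² − 1)) can be sketched as follows. The scores given by the two judges are hypothetical, the helper names are assumptions, and the sketch assumes there are no tied values (the simple formula does not handle ties).

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_rank_correlation(xs, ys):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), valid when there are no ties."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical scores given by two judges to the same five candidates.
judge_1 = [86, 72, 95, 60, 78]
judge_2 = [80, 70, 90, 65, 85]

R = spearman_rank_correlation(judge_1, judge_2)
print(f"R = {R:.2f}")
```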
Kendall's Tau rank correlation:
 Kendall rank correlation is a
non-parametric test that measures the strength of dependence
between two variables. If we consider two samples, x and y,
where each sample size is n, we know that the total number of
pairings of x with y is n(n − 1)/2.
 The following formula is used to calculate the value of Kendall
rank correlation:
$$ \tau = \frac{C - D}{C + D} $$
where C is the number of concordant pairs and D is the number of discordant
pairs; when there are no ties, C + D = n(n − 1)/2.
Refer Datatab website for problems

• Suppose two doctors rank 6 patients by descending physical health. One of the
two doctors, in this case the female doctor, is defined as the reference, and the
patients are sorted from 1 to 6 according to her ranking.
• Now it is possible to compare the sorted ranks with the ranks of the second
doctor, e.g. the patient who is ranked 3 by the female doctor is ranked 4 by the
male doctor.
• We want to know if there is a correlation between the two assessments using
Kendall's Tau. To calculate it, we only need the ranks from the male doctor,
which in this order are 3, 1, 4, 2, 6, 5.
• We now look at each rank and note whether the values below it are smaller
or larger than itself.
• We start with the first rank, corresponding to the value 3. 1 is smaller than 3,
so it gets a minus; 4 is larger, so it gets a plus; 2 is smaller, so it gets a minus;
6 is larger, so it gets a plus; and 5 is also larger, so it also gets a plus.
• The same procedure is then followed for 1, 4, 2, 6 and 5.
 We get the number of concordant pairs by counting all "+" signs: C = 11.
 We get the number of discordant pairs by counting all "-" signs: D = 4.
 C is 11 and D is 4, so Kendall's Tau is (11 − 4) divided by (11 + 4),
resulting in a value of 0.47.
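The counting of "+" and "-" signs in the doctors example can be reproduced with a short sketch. The function name is an assumption; the input is the male doctor's ranks listed in the order of the female doctor's ranking, as in the example, and ties are assumed not to occur.

```python
def kendall_tau(ranks):
    """Count concordant (+) and discordant (-) pairs, then tau = (C - D) / (C + D).

    `ranks` are the second rater's ranks, ordered by the reference rater's ranking.
    """
    concordant = discordant = 0
    for i in range(len(ranks)):
        for j in range(i + 1, len(ranks)):
            if ranks[j] > ranks[i]:
                concordant += 1      # a later value is larger: "+"
            elif ranks[j] < ranks[i]:
                discordant += 1      # a later value is smaller: "-"
    tau = (concordant - discordant) / (concordant + discordant)
    return concordant, discordant, tau


male_doctor_ranks = [3, 1, 4, 2, 6, 5]
C, D, tau = kendall_tau(male_doctor_ranks)
print(f"C = {C}, D = {D}, tau = {tau:.2f}")  # C = 11, D = 4, tau = 0.47
```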
Scatter Diagram Method:
 The existence of correlation between
variables can be shown graphically by means of a scatter diagram.
 It is obtained by plotting the values on graph paper.
 The chart is prepared by measuring the X variable on the horizontal axis and the
Y variable on the vertical axis, and all the observations are plotted on the graph.
 The cluster of points so obtained on the graph paper is called the scatter
diagram or dot diagram. By observing the points we can know the degree
and direction of correlation.
 If the trend of the plotted points is upward, rising from the bottom left and
going up towards the top right, the correlation is positive.
 On the other hand, if the plotted points show a downward trend from the
top left to the bottom right, the correlation is negative.
 If the plotted points do not show any trend, the two variables are not
correlated.
 Closeness of the dots to each other in a particular direction indicates a
higher degree of correlation.
Probable error of Coefficient correlation and interpretation.
• Regression Model/Analysis:
• Regression analysis is a predictive modelling technique that analyses the relation between the target or
dependent variable and independent variable in a dataset. The regression technique gets used mainly to
determine the predictor strength, forecast trend, time series, and in case of cause & effect relation.
• Example1: Examine the relationship between sales and advertising expenditures for a corporation.
• Purpose of a regression model: Regression analysis is used for one of two purposes: 1. Predicting the value
of the dependent variable when information about the independent variables is known (forecasting). 2. Estimating
the effect of an independent variable on the dependent variable.
• Types of Regression Models: Popularly used regression models are Linear Regression, Logistic Regression, and
Polynomial Regression.
1. Linear Regression: The most extensively used modelling technique is linear regression, which assumes a
linear connection between a dependent variable (Y) and an independent variable (X). It employs a regression line,
also known as a best-fit line.
• Y = c + m*X + e, where c denotes the intercept (a regression coefficient),
• m denotes the regression coefficient (slope of the line), and e is the error term or residual.
• $c = \bar{Y} - m\bar{X}$ and $m = \frac{\sum (x - \bar{X})(y - \bar{Y})}{\sum (x - \bar{X})^2}$, where \bar{Y} = mean of Y and \bar{X} = mean of X.
1. Simple Linear Regression: Here we have one dependent variable and one independent variable so
the formula is
Y=c+m*X where c= intercept value and m=slope value.
Simple linear regression can be used:
• To find the intensity of dependency between two variables. Such as the rate of carbon emission and
global warming.
2. Multiple Linear Regression: Here we have one dependent variable and more than one
independent variable, so the formula is
Y = c + m1*X1 + m2*X2 + … + mn*Xn, where c = intercept value and m1, …, mn = slope values of the
independent variables X1, …, Xn.
Multiple linear regression can be used: To estimate how strongly two or more independent
variables influence the single dependent variable, such as how location, time, condition, and area can
influence the price of a property.
Note that: Linear Regression deals with dependent variables that are continuous numeric data in nature.
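The intercept and slope of a simple linear regression can be estimated with the least squares formulas, as in the sketch below. The function name and the data (advertising spend against sales, in made-up units) are assumptions used only for illustration.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Least squares estimates for Y = c + m*X: returns (c, m)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope
    c = y_bar - m * x_bar    # intercept
    return c, m


# Hypothetical data: advertising spend (X) and sales (Y).
advertising = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

c, m = fit_simple_linear_regression(advertising, sales)
predicted_sales = c + m * 6   # forecast for an advertising spend of 6
print(f"c = {c}, m = {m}, prediction at X = 6: {predicted_sales}")
```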
• 2. Logistic Regression: Logistic regression is useful when the dependent variable
is categorical, i.e. yes/no, true/false, or valid/invalid kind of data, which is
discrete in nature. For this kind of data linear regression cannot be used for
predicting the output of the dependent variable. Hence logistic regression is used
to predict the categorical dependent variable with the help of the
independent variables.
• The overall aim of logistic regression is to classify outputs; the predicted
value is a probability, which can only be between 0 and 1.
• In logistic regression, the sigmoid curve (S-curve) represents the connection to the
independent variable, and the probability has a value between 0 and 1.
• The weighted sum of inputs is passed through an activation function called the sigmoid
function, which maps values between 0 and 1. The formula for the sigmoid function is:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
where z is the weighted sum of the inputs.
• The change in regression coefficients (present with the independent variable) has an
impact on the curve direction and its steepness. Thus, one can infer that a positive
slope results in an S-shaped curve, and a negative slope reveals a Z-shaped curve.
• To classify Y-values into two categories, you need to set a threshold value (0.5)
between 0 and 1. Values of Y above this threshold will be classified as category 1,
and values below the threshold will be classified as category 0.
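The sigmoid function and the 0.5 threshold can be sketched as follows. The coefficients c and m are hypothetical values, not fitted from data, and treating a probability exactly equal to the threshold as category 0 is an assumption, since the notes only describe values above and below it.

```python
from math import exp


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def classify(x, c, m, threshold=0.5):
    """Return (probability, category) for a single input x."""
    z = c + m * x                      # weighted sum of the input
    probability = sigmoid(z)
    category = 1 if probability > threshold else 0
    return probability, category


# Hypothetical coefficients: a positive slope gives an S-shaped curve.
c, m = -4.0, 1.0
for x in (2, 4, 6):
    probability, category = classify(x, c, m)
    print(f"x = {x}: probability = {probability:.3f}, category = {category}")
```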
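• R-Squared: It is important to know how strong the relationship between the
values of the x- and y-axis is; if there is no relationship, the
regression cannot be used to predict anything. The relationship is measured
with a value called the r-squared.
• The formula is:
$$ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} $$
where \hat{y}_i is the value predicted by the model and \bar{y} is the mean of the observed values.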
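A minimal sketch of the r-squared calculation is shown below, using the common definition R² = 1 − (residual sum of squares) / (total sum of squares). The observed and predicted values are hypothetical and the function name is an assumption.

```python
from statistics import mean


def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = mean(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions.
observed = [1.0, 2.0, 3.0]
predicted = [1.1, 1.9, 3.2]

r2 = r_squared(observed, predicted)
print(f"R-squared = {r2:.2f}")
```

A value close to 1 means the predictions follow the observed values closely, while a value close to 0 means the model explains very little of the variation.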
Data Analytics
Probability and Statistics of Data Analytics
BCA 5th Semester

Dr. Rashmi M
Department of Computer Science,
GFGC T. Dasarahall.

Case Study: Netflix

1. Introduction

Netflix is a global leader in streaming entertainment, providing on-demand video content that
includes movies, TV shows, documentaries, and original programming. It has become a
transformative force in the entertainment industry, altering how audiences consume media.

 Founded: 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California.
 Headquarters: Los Gatos, California.
 Key Milestones:
o 1998: Launched as a DVD rental service via mail.
o 2007: Transitioned to online streaming.
o 2013: Entered content production with Netflix Originals, starting with House of Cards.

Netflix has grown from a DVD rental service to a global entertainment powerhouse, available in
over 190 countries with millions of subscribers.

2. Business Model Evolution

Initial Model: DVD Rentals

 Netflix began as a subscription-based DVD rental service, allowing users to select DVDs
online and receive them via mail.
 Competitive Edge:
o No late fees.
o Flat monthly subscription rate.
o Large inventory of movies.

Transition to Streaming

 2007: Netflix launched its streaming service, enabling users to watch content instantly
over the internet.
 This move capitalized on advancements in broadband internet and changing consumer habits.

Original Content



 2013: Netflix debuted its first original series, House of Cards, which marked its entry as a
content producer.
 Since then, Netflix has heavily invested in original programming to differentiate itself
from competitors.

Current Model

 Subscription-based streaming with tiered pricing:

o Basic Plan: Limited features and lower quality.
o Standard Plan: HD quality and simultaneous viewing on two devices.
o Premium Plan: 4K quality and simultaneous viewing on up to four devices.

3. Growth Strategy
Content Investment

 Netflix spends billions annually on creating and acquiring content.

 Focus Areas:
o Original series, movies, and documentaries.
o Licensing agreements for popular shows and movies.

Global Expansion

 Netflix is available in over 190 countries, tailoring its content to local tastes.
 Key initiatives:
o Subtitling and dubbing content in multiple languages.
o Producing localized content such as Sacred Games (India) and Dark (Germany).

Data-Driven Decision-Making

 Netflix uses advanced data analytics to:

o Predict audience preferences.
o Optimize recommendations.
o Guide decisions on content creation and acquisition.


 Collaborates with device manufacturers, telecom operators, and smart TV companies to
enhance accessibility.
 Example: Netflix button on remote controls.



4. Technology and Innovation


 Netflix’s recommendation system accounts for over 80% of user activity.

 Utilizes machine learning algorithms to:
o Recommend content based on viewing history and preferences.
o Create personalized thumbnails for each user.

Cloud Infrastructure

 Operates on Amazon Web Services (AWS) to ensure scalability and reliability.

 Robust infrastructure supports millions of concurrent streams worldwide.

Streaming Optimization

 Advanced compression technology allows high-quality streaming on limited bandwidth.

 Tailored mobile plans and optimization for developing markets.

Mobile Strategy

 Affordable mobile-only subscription plans in markets like India, addressing
price-sensitive consumers.

5. Challenges
Intense Competition

 Competes with platforms like Amazon Prime Video, Disney+, Hulu, HBO Max, and
regional players.
 Rivals offer competitive pricing and exclusive content.

Rising Costs

 High investment in original content increases financial pressure.

 Rising competition has driven up content licensing costs.

Subscriber Saturation

 Growth is slowing in mature markets like North America and Europe.

Password Sharing

 Widespread account sharing affects revenue growth.

 Netflix is exploring measures to address this issue.

Regulatory Challenges

 Local regulations on content and censorship can pose barriers in certain markets.

6. Impact
Cultural Influence

 Popularized binge-watching and changed traditional TV consumption habits.

 Produced globally recognized content, such as Stranger Things, The Crown, and Squid Game.

Industry Disruption

 Accelerated the decline of traditional cable TV and physical media.

 Encouraged other entertainment companies to launch their own streaming platforms.

Global Content

 Increased the accessibility of foreign-language content worldwide.

 Promoted inclusivity by investing in diverse stories and voices.

Economic Contribution

 Generates thousands of jobs in content production and distribution.

 Boosts local economies by shooting in various global locations.

7. Case Study: Netflix in India

Localized Content

 Netflix’s Indian Originals include Sacred Games, Delhi Crime, and Lust Stories.
 Focuses on stories resonating with Indian audiences.

Pricing Strategy



 Launched a mobile-only plan priced at INR 149/month to cater to price-sensitive consumers.

 Competitively priced compared to Amazon Prime Video and Disney+ Hotstar.

Challenges in India

 Competing with well-established local platforms like Disney+ Hotstar, Zee5, and MX Player.
 Navigating strict regulations and censorship policies.

Success Metrics

 Increased subscriber base in urban areas.

 High engagement with localized and international content.

8. Conclusion
Key Takeaways

 Netflix’s success is driven by its adaptability, technological innovation, and focus on
content quality.
 Investments in global expansion and local content have solidified its position as a market
leader.

Future Outlook

 Expansion into emerging markets with affordable plans.

 Experimenting with ad-supported tiers to attract budget-conscious viewers.
 Maintaining leadership through innovation in content and technology.

Case Study: Amazon

1. Introduction




Amazon, founded by Jeff Bezos in 1994, began as an online bookstore and has since evolved
into one of the world’s largest multinational technology companies. It operates in e-commerce,
cloud computing, digital streaming, and artificial intelligence.

 Founded: 1994 by Jeff Bezos in Seattle, Washington.

 Headquarters: Seattle, Washington.
 Key Milestones:
o 1995: Launched as an online bookstore.
o 2000s: Expanded into retail, technology, and logistics.
o 2006: Launched Amazon Web Services (AWS), becoming a leader in cloud computing.
o 2014: Entered digital streaming with Amazon Prime Video.

Amazon’s mission is "to be Earth's most customer-centric company," focusing on innovation and
operational excellence.

2. Business Model
Core Business Segments

1. E-Commerce:
o Operates a global online marketplace offering millions of products.
o Revenue streams include product sales, third-party seller fees, and advertising.
2. Amazon Web Services (AWS):
o Provides scalable cloud computing services.
o Core offerings: storage, computing power, AI tools, and machine learning.
3. Subscription Services:
o Amazon Prime: Offers free shipping, streaming services, and exclusive deals.
o Other subscriptions: Kindle Unlimited, Audible, and Amazon Music.
4. Hardware and Devices:
o Develops and sells devices like Kindle, Echo, Fire tablets, and Ring cameras.
5. Logistics and Delivery:
o Owns extensive warehousing and delivery networks, including Amazon Air and
Prime delivery services.

Revenue Streams

 Product sales (e-commerce and third-party marketplace).

 AWS.
 Advertising and sponsored listings.
 Subscription fees.

3. Growth Strategy
Customer-Centric Approach

 Focuses on enhancing customer satisfaction through:

o Wide product selection.
o Competitive pricing.
o Reliable delivery and returns policies.

Global Expansion

 Operates in multiple countries with localized marketplaces.

 Tailored strategies to meet local needs, e.g., Amazon India focuses on regional products
and local language support.

Technological Innovation

 Heavy investment in technology for:

o Warehouse automation with robotics.
o Voice-enabled AI (Alexa).
o Machine learning for personalized recommendations.


 Expansion into new sectors:

o Entertainment (Amazon Studios and Prime Video).
o Healthcare (acquisition of PillPack and development of Amazon Clinic).

Partnerships and Acquisitions

 Acquired companies like Whole Foods, Zappos, and MGM to diversify offerings.
 Partnerships with logistics and delivery companies to enhance last-mile delivery.

4. Technology and Innovation

AI and Machine Learning

 Personalized Recommendations:
o Uses algorithms to suggest products and content based on user behavior.
 AI-driven voice assistant Alexa powers smart home devices.

Cloud Computing

 AWS is a leader in cloud services, serving businesses, governments, and startups

 AWS innovations include tools for AI, big data, and serverless computing.

Logistics and Automation

 Automated warehouses using robots for efficiency.

 Advanced supply chain systems for faster delivery.

Digital Transformation

 Integrates technology into every aspect of its business, from e-commerce platforms to
fulfillment centers.

5. Challenges
Regulatory Scrutiny

 Subject to antitrust investigations in the U.S., EU, and other markets.

 Concerns over monopolistic practices and data privacy.

Workplace Practices

 Criticized for labor conditions in warehouses.

 Faces unionization efforts in various countries.

Intense Competition

 Competes with Walmart, Alibaba, Microsoft Azure, Google Cloud, Netflix, and others.
 Need to maintain leadership across diverse industries.

Global Market Barriers

 Regulatory and cultural challenges in certain markets like China and India.

6. Impact
Economic Contribution

 Created millions of jobs worldwide in technology, logistics, and retail.

 Supports small businesses through its third-party seller platform.

Consumer Behavior

 Revolutionized online shopping with fast delivery, ease of access, and product diversity.
 Encouraged the growth of the subscription economy through Amazon Prime.

Industry Disruption

 Transformed retail, cloud computing, and digital streaming industries.

 Competitors forced to adopt innovative practices to keep pace.

Sustainability Efforts

 Committed to achieving net-zero carbon emissions by 2040.

 Investments in renewable energy and electric delivery vehicles.

7. Case Study: Amazon in India

Localized Strategy

 Adapted to India’s diverse and price-sensitive market by:

o Offering regional language support.
o Introducing "Amazon Pay Later" for financial inclusivity.
o Selling low-cost products on Amazon Basics.

Innovations for India

 Launched services like Prime Video Mobile Edition for smartphone users.
 Developed partnerships with Indian sellers and brands.

Challenges in India

 Intense competition from Flipkart, Reliance JioMart, and local e-commerce platforms.
 Regulatory hurdles regarding data localization and FDI norms.

Success Metrics

 Rapid growth in Prime membership.

 High adoption of Amazon Pay and Amazon Fresh.
 Increasing share in the grocery and fashion segments.



8. Conclusion
Key Takeaways

 Amazon’s success is driven by its relentless focus on innovation, customer satisfaction,
and market adaptability.
 By diversifying its offerings and leveraging technology, it has maintained a competitive
edge across industries.

Future Outlook

 Continued global expansion with tailored strategies for emerging markets.

 Investments in sustainability, AI, and logistics to future-proof its operations.
 Exploring new industries such as healthcare and fintech to sustain long-term growth.

Case Study: Twitter

1. Introduction

Twitter is a microblogging and social networking platform that allows users to post and interact
through short messages known as "tweets." Since its launch, Twitter has become a significant
tool for communication, marketing, activism, and real-time information sharing.

 Founded: March 21, 2006, by Jack Dorsey, Biz Stone, Evan Williams, and Noah Glass.
 Headquarters: San Francisco, California.
 Key Milestones:
o 2006: Initial launch as a microblogging platform.
o 2013: Listed as a public company on the New York Stock Exchange (NYSE).
o 2022: Acquired by Elon Musk, leading to significant operational changes.

Twitter’s mission is to "serve the public conversation," facilitating open and real-time exchange
of information globally.



2. Business Model
Core Revenue Streams

1. Advertising:
o Promoted tweets, trends, and accounts.
o Accounts for the majority of Twitter’s revenue.
2. Subscription Services:
o Twitter Blue: Offers features like verified badges, longer tweets, and edit options.
3. Data Licensing:
o Monetizes its vast database by selling access to public data (APIs) for research
and analysis.

User Base

 Monthly Active Users (MAUs): Hundreds of millions globally, with a significant
portion from the U.S., India, and Japan.
 Diverse user base comprising individuals, businesses, celebrities, and government bodies.

3. Growth Strategy
Platform Features

 Real-Time Engagement:
o Key differentiator: Real-time sharing of news, events, and public discourse.
o Popular for live events, breaking news, and trending topics.
 New Features:
o Spaces: Live audio chat rooms.
o Threads: Organized long-form content.
o Communities: Groups with shared interests.

Global Expansion

 Localized versions of the platform with support for multiple languages.

 Partnerships with telecom providers for free or discounted access in emerging markets.


Partnerships

 Collaborates with media houses and brands for event-specific promotions.



 Partnerships with governments for public information campaigns.


Acquisitions

 Acquired startups like Periscope (live streaming) and Revue (newsletter services) to
expand its feature set.

4. Technology and Innovation

Algorithmic Feeds

 Uses machine learning to curate personalized timelines based on user preferences and
activity.
 Trending topics are tailored by location and interests.
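
The idea of location-specific trending topics can be illustrated with a small counting exercise. The sketch below is not Twitter's algorithm: it assumes a hypothetical input of (location, text) pairs and simply ranks hashtags by how many tweets in each location mention them. The function name `trending_by_location` and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def trending_by_location(tweets, top_n=3):
    """Return the top_n most frequent hashtags for each location.

    `tweets` is an iterable of (location, text) pairs. This input shape is
    a hypothetical simplification, not Twitter's actual data model.
    """
    counts = defaultdict(Counter)
    for location, text in tweets:
        # Count each hashtag once per tweet, ignoring letter case.
        tags = {
            word.lower()
            for word in text.split()
            if word.startswith("#") and len(word) > 1
        }
        counts[location].update(tags)
    return {
        location: [tag for tag, _ in counter.most_common(top_n)]
        for location, counter in counts.items()
    }


# Made-up sample data for illustration only.
sample_tweets = [
    ("Delhi", "Great match today #Cricket #India"),
    ("Delhi", "What a finish #cricket"),
    ("Delhi", "Traffic again #Monsoon"),
    ("Tokyo", "Cherry blossoms #Sakura"),
    ("Tokyo", "Weekend plans #sakura #Hanami"),
]

print(trending_by_location(sample_tweets, top_n=1))
```

A production system would also weigh recency and growth rate rather than raw counts, but the grouping-then-ranking structure is the same.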

API Ecosystem

 Provides APIs for developers and researchers to build tools, analyze data, and track
trends.

Content Moderation

 Relies on a combination of AI and human reviewers to monitor for harmful or abusive
content.

 Policies on misinformation, hate speech, and spam evolve to address changing
conditions.


Infrastructure

 Scalable architecture to support hundreds of millions of tweets per day.

 Focus on low-latency delivery for real-time communication.

5. Challenges
Content Moderation

 Struggles to balance free speech with preventing harmful content.

 Criticized for inconsistent enforcement of policies.


Competition

 Competes with platforms like Facebook, Instagram, TikTok, and emerging decentralized
networks.


Profitability

 Historically struggled to achieve consistent profitability.

 Heavy reliance on advertising revenue makes it vulnerable to market shifts.

User Retention

 Faces challenges in retaining active users, especially with competitors offering more
engaging content formats.

Regulatory Issues

 Subject to scrutiny over data privacy, political influence, and misinformation.

 Regional laws (e.g., India’s IT regulations) impact operations and content policies.

6. Impact
Global Communication

 Facilitates real-time communication during major events like natural disasters, protests,
and elections.
 A platform for political discourse, activism, and citizen journalism.

Cultural Influence

 Popularized hashtags, which have become a tool for movements like #MeToo and
#BlackLivesMatter.
 Redefined how news breaks, with many organizations relying on Twitter for updates.

Economic Contributions

 Provides advertising opportunities for businesses of all sizes.

 Supports influencers and content creators in monetizing their audience.

Societal Challenges

 Spread of misinformation and polarizing content.

 Amplification of echo chambers and online harassment.



7. Case Study: Twitter’s Role in Social Movements

#ArabSpring (2010-2012)

 Used extensively to mobilize protests and share information during the Arab Spring.
 Enabled activists to organize and communicate despite government censorship.


#BlackLivesMatter

 Became a central platform for raising awareness of police brutality and racial injustice.
 Amplified grassroots campaigns and public discussions globally.

Election Campaigns

 Utilized by candidates and political parties to directly engage with voters.

 Challenges: Use of bots and misinformation campaigns.

8. Twitter in India
Localized Features

 Supported Indian languages like Hindi, Tamil, and Bengali.

 Partnerships with government bodies for public awareness campaigns.

Political Engagement

 Widely used by politicians, journalists, and activists to engage the public.

 Challenges with government regulations and demands for content takedowns.

Growth Strategy in India

 Focused on increasing penetration in tier-2 and tier-3 cities.

 Collaborated with brands for localized campaigns.

9. Conclusion
Key Takeaways

 Twitter’s strength lies in real-time communication and its role as a global public square.
 Despite challenges, it has had a profound impact on communication, activism, and media.

Future Outlook

 Focus on monetization through subscriptions and new features.

 Tackling content moderation and misinformation challenges.
 Exploring opportunities in decentralized and blockchain-based networks.

Case Study: Uber

1. Introduction

Uber is a global leader in ride-hailing, food delivery, and freight services, revolutionizing how
people and goods move. By leveraging technology, Uber connects drivers, riders, and businesses
seamlessly, creating a disruptive force in traditional transportation industries.

 Founded: 2009 by Garrett Camp and Travis Kalanick in San Francisco, California.
 Headquarters: San Francisco, California.
 Key Milestones:
o 2010: Official launch in San Francisco.
o 2011: Began international expansion with a launch in Paris.
o 2020: Acquired Postmates to enhance its food delivery services.

Uber’s mission is "to ignite opportunity by setting the world in motion."

2. Business Model
Core Services

1. Ride-Hailing:
o On-demand rides through mobile apps.
o Options include UberX, Uber Pool, Uber Comfort, and Uber Black.
2. Uber Eats:
o Food delivery service connecting restaurants, couriers, and customers.
3. Uber Freight:
o Matches trucking companies with shippers, optimizing logistics.
4. Other Ventures:
o Micro-mobility options like e-scooters and bikes.
o Partnerships in autonomous vehicle research.



Revenue Streams

 Commission from rides and deliveries.

 Service fees for drivers and restaurants.
 Subscriptions like Uber One for premium benefits.

Platform Ecosystem

 A two-sided marketplace involving:

o Drivers/Couriers: Partners earning income through rides and deliveries.
o Customers: Riders and businesses using Uber’s services for convenience.

3. Growth Strategy
Global Expansion

 Entered markets across six continents by adapting to local regulations and consumer
preferences.
 Focused on partnerships and acquisitions to accelerate growth.

Technology Innovation

 Dynamic pricing algorithms to balance supply and demand.

 Machine learning for route optimization and fraud detection.
 Investment in autonomous driving and electric vehicles.


Diversification

 Expanded from ride-hailing to food delivery, logistics, and public transportation

 Uber Health: Transportation for healthcare appointments.

Marketing and Branding

 Aggressive marketing campaigns and referral incentives.

 Partnerships with events, businesses, and local organizations.

4. Technology and Innovation

Mobile App

 Core interface for customers and drivers, featuring:

o Real-time tracking.
o Seamless payments.
o Safety tools like trip sharing and emergency assistance.

Dynamic Pricing

 Surge pricing adjusts ride costs based on demand and supply, maximizing driver
availability.
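
A surge rule of this kind can be sketched in a few lines. The example below is an assumption for teaching purposes, not Uber's pricing algorithm: it uses a hypothetical rule in which the multiplier equals the ratio of ride requests to available drivers, never drops below 1.0, and is capped at 3.0. The function names `surge_multiplier` and `fare` are illustrative.

```python
def surge_multiplier(ride_requests, available_drivers, cap=3.0):
    """Return a price multiplier from a demand/supply ratio.

    Hypothetical rule: no surge while drivers cover all requests;
    otherwise the multiplier grows with the ratio, limited by `cap`.
    """
    if ride_requests < 0 or available_drivers < 0:
        raise ValueError("counts must be non-negative")
    if available_drivers == 0:
        # With no drivers at all, apply the cap only if someone is waiting.
        return cap if ride_requests > 0 else 1.0
    ratio = ride_requests / available_drivers
    return round(min(cap, max(1.0, ratio)), 2)


def fare(base_fare, ride_requests, available_drivers):
    """Apply the surge multiplier to a base fare."""
    return round(base_fare * surge_multiplier(ride_requests, available_drivers), 2)


# 150 requests against 100 drivers gives a 1.5x multiplier.
print(surge_multiplier(150, 100))
print(fare(200, 150, 100))
```

The cap reflects a common design choice: an unbounded multiplier would track demand more closely but risks extreme fares during emergencies.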

Autonomous Vehicles

 Invested heavily in self-driving technology through Uber ATG (Advanced Technologies
Group).

 Sold ATG to Aurora in 2020 but retains a stake.

Data Analytics

 Uses data to predict demand patterns, optimize routes, and enhance user experiences.
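
As a simple illustration of demand prediction, the sketch below forecasts the next value of a series as the average of its most recent observations (a moving-average baseline). This is a deliberately basic technique chosen for clarity, not a description of Uber's models; the ride counts and the function name `forecast_next` are hypothetical.

```python
from statistics import mean


def forecast_next(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return mean(history[-window:])


# Hypothetical ride requests observed at 6 pm on five consecutive days.
rides_at_6pm = [120, 135, 128, 140, 150]

print(forecast_next(rides_at_6pm, window=3))
```

Baselines like this are useful as a yardstick: a more complex model is only worth deploying if it beats the moving average on held-out data.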

5. Challenges
Regulatory Hurdles

 Faced bans and restrictions in markets like London, Germany, and India.
 Classified drivers as independent contractors, leading to legal disputes over labor rights.

Workforce Issues

 Criticized for treatment of drivers, including low pay and lack of benefits.
 Unionization efforts and strikes in various countries.


Competition

 Competes with local ride-hailing apps (e.g., Ola, Grab) and global players like Lyft.


Profitability

 Struggled to achieve consistent profitability due to high operating costs and subsidies.



6. Impact
Economic Contribution

 Created income opportunities for millions of drivers globally.

 Boosted local economies by increasing mobility and accessibility.

Urban Mobility Transformation

 Reduced dependence on private car ownership.

 Provided last-mile connectivity in underserved areas.

Social Challenges

 Raised concerns over safety, driver exploitation, and data privacy.

 Led to traffic congestion in some cities due to increased ride volumes.

7. Case Study: Uber in India

Localized Strategies

 Adapted to Indian market needs with options like Uber Auto (rickshaws) and cash
payments.
 Launched regional language support in the app.


Partnerships

 Collaborated with governments for smart city initiatives.

 Partnered with local organizations for driver training and skilling.

Challenges in India

 Intense competition from Ola, a homegrown ride-hailing giant.

 Navigating complex regulatory environments and pricing caps.

Success Metrics

 Significant market share in metropolitan cities.

 Growth of Uber Eats in India before the business was sold to Zomato in 2020.



8. Conclusion
Key Takeaways

 Uber’s success is built on leveraging technology and a flexible business model to disrupt
traditional industries.
 Its ability to adapt and innovate has enabled global expansion despite challenges.

Future Outlook

 Focus on sustainability with electric vehicles and carbon-neutral goals.

 Continued investment in autonomous driving and AI.
 Expanding Uber Freight and other enterprise solutions for diversified growth.

Case Study: LinkedIn

1. Introduction

LinkedIn is the world’s largest professional networking platform, enabling individuals and
businesses to connect, share, and grow their professional networks. It has become a vital tool for
career development, recruitment, and professional content sharing.

 Founded: December 2002 by Reid Hoffman, Allen Blue, Konstantin Guericke, Eric Ly,
and Jean-Luc Vaillant.
 Launched: May 5, 2003.
 Headquarters: Sunnyvale, California.
 Ownership: Acquired by Microsoft in 2016 for $26.2 billion.

LinkedIn’s mission is "to connect the world’s professionals to make them more productive and
successful."

2. Business Model
Core Offerings

1. Networking:
o Allows professionals to build connections, share updates, and collaborate.
2. Recruitment and Talent Solutions:
o Tools for companies to post jobs, search for candidates, and manage hiring.
3. LinkedIn Learning:
o Online courses and training programs to help users upskill.
4. Marketing Solutions:
o Advertising options like sponsored content, InMail, and display ads targeting
professional audiences.
5. Premium Subscriptions:
o Offers features like advanced search filters, direct messaging, and insights for
career growth or sales.

Revenue Streams

 Talent Solutions: The largest revenue source, driven by recruitment services.

 Marketing Solutions: Advertising revenue from targeted campaigns.
 Premium Subscriptions: Paid tiers for job seekers, recruiters, and businesses.

3. Growth Strategy
User Growth

 Grew from about 4,500 members in its first month to more than 1 billion members
globally as of 2024.
 Strong presence in developed and emerging markets.

Product Diversification

 Expanded services from networking to include e-learning, job boards, and marketing
solutions.
 Continuous feature updates, like video posts, events, and newsletters.

Global Expansion

 Localized versions for countries with language and cultural adaptations.

 Offices in major cities worldwide to support regional operations.

Microsoft Integration

 Integration with Microsoft products like Office 365 and Dynamics CRM enhances
LinkedIn’s utility for professionals and enterprises.



4. Technology and Innovation

Data-Driven Approach

 Leverages user data to recommend connections, jobs, and content.

 Uses AI to enhance profile matching, content personalization, and skill assessments.
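
One classic way to recommend connections is to count mutual connections: people who share many contacts with a member are likely to know that member. The sketch below illustrates that idea only; it is not LinkedIn's recommendation system, and the network data and the function name `recommend_connections` are hypothetical.

```python
from collections import Counter


def recommend_connections(network, member, top_n=3):
    """Rank people `member` is not connected to by number of shared connections.

    `network` maps each member name to the set of that member's connections.
    """
    direct = network.get(member, set())
    mutual = Counter()
    for friend in direct:
        for candidate in network.get(friend, set()):
            # Skip the member and anyone already connected to them.
            if candidate != member and candidate not in direct:
                mutual[candidate] += 1
    # Sort by mutual count (descending), then by name for a stable order.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Made-up network for illustration only.
network = {
    "asha": {"bilal", "chen"},
    "bilal": {"asha", "divya", "emeka"},
    "chen": {"asha", "divya"},
    "divya": {"bilal", "chen"},
    "emeka": {"bilal"},
}

print(recommend_connections(network, "asha"))
```

Here "divya" ranks first because she shares two connections with "asha", while "emeka" shares one. Real systems combine such graph signals with profile data such as employer, school, and skills.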

AI and Machine Learning

 Algorithms to recommend relevant job opportunities, courses, and professional
connections.

 Tools like Resume Builder and Skill Assessments help users showcase their expertise.

Content Sharing and Engagement

 Newsfeed for professional updates, articles, and thought leadership.

 Features like LinkedIn Live and Newsletters foster deeper engagement.

5. Challenges
Data Privacy and Security

 Faces scrutiny over handling user data and maintaining privacy.

 Requires robust measures to prevent data breaches and misuse.

Market Competition

 Competes with platforms like Indeed, Glassdoor, and emerging niche networks.
 Must continuously innovate to maintain its leadership in professional networking.

Engagement Levels

 Balancing its roles as a job search platform and a professional content-sharing space.
 Challenges in retaining active user engagement outside job-seeking phases.

Regulatory Compliance

 Adapts to diverse global regulations on employment, advertising, and data protection.

6. Impact

Economic Impact

 Facilitates job creation and workforce development.

 Supports businesses by connecting them with talent and customers.

Professional Development

 Enables users to build their personal brand and showcase skills.

 LinkedIn Learning democratizes access to high-quality training resources.

Business Growth

 Provides a powerful platform for B2B marketing and lead generation.

 Drives recruitment efficiency for organizations of all sizes.

Global Connectivity

 Breaks geographical barriers, enabling professionals to connect and collaborate
globally.


7. Case Study: LinkedIn’s Role in Recruitment

Streamlining Hiring

 Advanced search filters and AI-powered recommendations reduce hiring timelines.

 Tools like "Easy Apply" and applicant tracking simplify the process for candidates and
recruiters.

Employer Branding

 Company pages allow businesses to showcase culture, values, and achievements.

 Sponsored content enhances visibility among potential hires.

Success Metrics

 Over 40 million users leverage LinkedIn weekly for job searches.

 A preferred platform for Fortune 500 companies to find talent.

8. LinkedIn in India

Localized Features

 Provides content and job recommendations in regional languages.

 Partnerships with educational institutions to promote LinkedIn Learning.

Job Market Influence

 Major platform for Indian professionals across IT, finance, and consulting sectors.
 Growing presence among small and medium enterprises (SMEs) for recruitment.

Challenges in India

 Competing with local job boards like

 Adapting to the needs of a highly price-sensitive market.

9. Conclusion
Key Takeaways

 LinkedIn’s success stems from its ability to evolve with professional needs and leverage
data effectively.
 Its diversification into learning and marketing solutions has solidified its position as more
than just a networking platform.

Future Outlook

 Continued growth in e-learning and enterprise solutions.

 Focus on integrating AI tools to enhance user experience.
 Expanding its presence in emerging markets with localized strategies.

Case Study: COVID-19 Pandemic

1. Introduction



COVID-19, caused by the novel coronavirus SARS-CoV-2, emerged as one of the most
significant global health crises of the 21st century. First identified in December 2019 in Wuhan,
China, the virus rapidly spread worldwide, resulting in widespread illness, economic disruption,
and unprecedented global responses.

 First Identified: December 2019, Wuhan, China.

 Declared a Pandemic: March 11, 2020, by the World Health Organization (WHO).
 Global Impact: More than 700 million confirmed cases and roughly 7 million reported
deaths globally.

COVID-19 underscored the importance of healthcare systems, global collaboration, and adaptive
responses to crises.

2. Epidemiology
Transmission

 Primarily spreads through respiratory droplets and aerosols.

 Secondary transmission via surfaces and close contact.
 Highly contagious, with an R0 (basic reproduction number) estimated between 2 and 3
during the initial phases.
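
To see why an R0 between 2 and 3 matters, the sketch below counts new cases per generation of infection under strong simplifying assumptions: every case infects exactly R0 others, the whole population is susceptible, and no interventions slow the spread. Real epidemics deviate from this quickly, so the numbers show the shape of early growth rather than a forecast. The function name `cases_per_generation` is illustrative.

```python
def cases_per_generation(r0, generations, initial_cases=1):
    """New cases in each generation if every case infects exactly r0 others."""
    return [initial_cases * r0 ** g for g in range(generations + 1)]


# Compare early growth at the low and high end of the quoted range.
for r0 in (2, 3):
    print(f"R0 = {r0}: {cases_per_generation(r0, 5)}")
```

After five generations, a single case leads to 32 new cases when R0 is 2 but 243 when R0 is 3, which is why small reductions in transmission have large effects.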


Symptoms

 Common: Fever, cough, fatigue, loss of taste or smell.

 Severe: Difficulty breathing, organ failure, and death.
 Asymptomatic cases contributed significantly to its spread.


Variants

 Mutations led to the emergence of variants like Alpha, Delta, and Omicron, each with
varying transmissibility and severity.

3. Global Response
Containment Measures

1. Lockdowns and Social Distancing:
o Implemented worldwide to reduce transmission.
o Significant socio-economic consequences.
2. Testing and Contact Tracing:
o Widespread testing campaigns to identify and isolate cases.
o Mobile apps for contact tracing (e.g., Aarogya Setu in India).
3. Travel Restrictions:
o Bans on international and domestic travel to curb cross-border spread.

Healthcare Systems

 Rapid establishment of field hospitals and quarantine centers.

 Overburdened healthcare infrastructure in many regions.
 Shortages of medical supplies, ventilators, and personal protective equipment (PPE).

Vaccination Campaigns

 Vaccines from Pfizer-BioNTech, Moderna, AstraZeneca, and Sinovac were developed at
unprecedented speed.
 Global vaccination drives prioritized vulnerable populations.
 Initiatives like COVAX aimed to ensure equitable vaccine distribution.

4. Economic Impact
Global Recession

 Global GDP contracted by about 3% in 2020.

 Sectors like tourism, hospitality, and aviation hit hardest.
 Millions of job losses and business closures.

Government Stimulus Packages

 Economic relief packages to support businesses and individuals.

 Examples: CARES Act in the US, Atmanirbhar Bharat in India.

Supply Chain Disruptions

 Lockdowns disrupted manufacturing and logistics.

 Shortages of essential goods, including medical supplies.

5. Social and Psychological Effects

Mental Health



 Increased anxiety, depression, and stress due to isolation, fear, and economic uncertainty.
 Rise in domestic violence and substance abuse.


Education

 Shift to remote learning disrupted traditional education.

 Unequal access to technology exacerbated the digital divide.

Community Resilience

 Rise of community support initiatives like food distribution and mental health hotlines.
 Strengthened focus on public health awareness.

6. Technological and Scientific Advancements

Vaccine Development

 Rapid mRNA vaccine development set new benchmarks in biotechnology.

 Global collaboration in clinical trials and approvals.

Digital Transformation

 Remote work adoption accelerated globally.

 Increased reliance on e-commerce, telemedicine, and virtual communication tools.

Data-Driven Responses

 Use of AI and big data for predicting outbreaks and managing resources.
 Real-time dashboards for tracking cases (e.g., Johns Hopkins University COVID-19
Dashboard).
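
Case-tracking dashboards commonly smooth noisy daily counts with a 7-day rolling average, since reporting often dips on weekends. The sketch below shows that calculation on made-up numbers; it is a generic illustration and not the code behind any particular dashboard. The function name `rolling_average` is illustrative.

```python
def rolling_average(values, window=7):
    """Trailing moving average with one output per full window of inputs."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


# Hypothetical daily new-case counts for eight consecutive days.
daily_cases = [100, 120, 90, 150, 130, 160, 170, 200]

for average in rolling_average(daily_cases, window=7):
    print(round(average, 1))
```

Eight days of data yield two averages, one for each complete 7-day window. A shorter window reacts faster to change but passes through more day-to-day noise.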

7. Case Study: India’s Response to COVID-19

Initial Measures

 Nationwide lockdown imposed in March 2020.

 Massive testing and contact tracing efforts.




Challenges

 Overcrowded healthcare facilities.

 Migrant labor crisis during lockdowns.
 Vaccine hesitancy and distribution inequities.


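Vaccination Drive

 Development of the indigenous vaccine Covaxin and local manufacture of Covishield (the
Oxford-AstraZeneca vaccine) by the Serum Institute of India.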

 Record-breaking vaccination drives with over 2 billion doses administered by 2022.
 Use of digital platforms like CoWIN for vaccination registration and tracking.

8. Lessons Learned

Pandemic Preparedness

 Importance of investing in healthcare infrastructure and pandemic preparedness.

 Need for robust global early warning systems.

Global Collaboration

 Successes of international efforts like vaccine development.

 Challenges of vaccine nationalism and supply inequities.

Resilience and Innovation

 Rapid adaptation of businesses and governments to new realities.

 Enhanced focus on sustainability and digital solutions.

9. Conclusion
Key Takeaways

 COVID-19 was a watershed moment for public health, global cooperation, and societal
resilience.
 Highlighted vulnerabilities in systems while driving innovation and change.

Future Outlook

 Continued vigilance against future pandemics.

 Strengthening healthcare and socio-economic systems to build resilient communities.
 Leveraging technological advancements for equitable and efficient responses.
