Data Analytics Unit1-4
Introduction
The word analytics has come into the foreground in the last decade or so. The growth
of the internet and information technology has made analytics very relevant in the
current age. Analytics is a field that combines data, information technology,
statistical analysis, quantitative methods and computer-based models.
These are combined to give decision makers all the possible scenarios so they can make
a well-thought-out and well-researched decision. The computer-based model ensures that
decision makers are able to see the performance of a decision under various scenarios.
Meaning
Business analytics (BA) is a set of disciplines and technologies for solving business
problems using data analysis, statistical models and other quantitative methods. It
involves an iterative, methodical exploration of an organization's data, with an
emphasis on statistical analysis, to drive decision-making.
At its core, business analytics involves a combination of the following:
Definition
Business analytics (BA) refers to the skills, technologies, and practices for
continuous iterative exploration and investigation of past business performance to
gain insight and drive business planning. Business analytics focuses on developing
new insights and understanding of business performance based on data and statistical
methods.
Business Analytics is the process of transforming data into insights to improve
business decisions. Data management, data visualization, predictive modelling, data
mining, forecasting, simulation, and optimization are some of the tools used to create
insights from data.
Analytics have been used in business since the management exercises were put
into place by Frederick Winslow Taylor in the late 19th century.
Henry Ford measured the time of each component in his newly established
assembly line. But analytics began to command more attention in the late 1960s
when computers were used in decision support systems.
Since then, analytics has evolved with the development of enterprise
resource planning (ERP) systems, data warehouses, and a large number of other
software tools and processes.
In later years, business analytics has exploded with the widespread adoption of
computers. This change has brought analytics to a whole new level and has opened up
many new possibilities. Given how far analytics has come and what the
field is today, many people would never guess that analytics
started in the early 1900s with Mr. Ford himself.
As the economies started developing and companies became more and more
competitive, management science evolved into business intelligence, decision
support systems and into PC software.
• The techniques and processes of data analytics have been automated into
mechanical processes and algorithms that work over raw data for human
consumption.
• Various approaches to data analytics include looking at what happened
(descriptive analytics), why something happened (diagnostic analytics), what
is going to happen (predictive analytics), or what should be done next
(prescriptive analytics).
• Data analytics relies on a variety of software tools including spreadsheets,
data visualization, reporting tools, data mining programs, and open-source
languages for more flexible data manipulation.
For example, manufacturing companies often record the runtime, downtime, and
work queue for various machines and then analyze the data to better plan workloads
so the machines operate closer to peak capacity.
Data analytics can do much more than point out bottlenecks in production. Gaming
companies use data analytics to set reward schedules for players that keep the
majority of players active in the game. Content companies use many of the same
data analytics to keep you clicking, watching, or re-organizing content to get
another view or another click.
4. The data is then cleaned up before analysis. It's scrubbed and checked to
ensure that there's no duplication or error and that it is not incomplete. This
step helps correct any errors before it goes on to a data analyst to be analyzed.
4. Prescriptive Analytics
1. Descriptive Analytics
It summarizes an organization’s existing data to understand what has
happened in the past or is happening currently. Descriptive Analytics is the
simplest form of analytics as it employs data aggregation and mining
techniques. It makes data more accessible to members of an organization such
as the investors, shareholders, marketing executives, and sales managers.
It can help identify strengths and weaknesses and provides an insight into
customer behavior too. This helps in forming strategies that can be developed
in the area of targeted marketing.
2. Diagnostic Analytics
This type of Analytics helps shift focus from past performance to the current
events and determine which factors are influencing trends. To uncover the
root cause of events, techniques such as data discovery, data mining and drill-
down are employed. Diagnostic analytics makes use of probabilities, and
likelihoods to understand why events may occur. Techniques such as
sensitivity analysis and training algorithms are employed for classification
and regression.
3. Predictive Analytics
This type of Analytics is used to forecast the possibility of a future event with
the help of statistical models and ML techniques. It builds on the result of
descriptive analytics to devise models to extrapolate the likelihood of items.
To run predictive analysis, Machine Learning experts are employed. They can
achieve a higher level of accuracy than by business intelligence alone.
4. Prescriptive Analytics
Going a step beyond predictive analytics, it provides recommendations for the
next best action to be taken. It suggests all favorable outcomes according to a
specific course of action and also recommends the specific actions needed to
deliver the most desired result. It mainly relies on two things, a strong
feedback system and a constant iterative analysis. It learns the relation
between actions and their outcomes. One common use of this type of analytics
is to create recommendation systems.
For starters, business analytics is the tool your company needs to make accurate
decisions. These decisions are likely to impact your entire organization as they help
you to improve profitability, increase market share, and provide a greater return to
potential shareholders.
While some companies are unsure what to do with large amounts of data, business
analytics works to combine this data with actionable insights to improve the
decisions you make as a company.
5. Competitive Advantage
Businesses can gain a competitive edge using data analytics to make more informed,
data-driven decisions. Analysing data from various sources allows businesses to
understand market trends, consumer behaviour, and competitor activities.
Businesses can use this information to improve their strategies, spot new
opportunities, and set themselves apart from the competition. Data analytics can, for
instance, aid companies in identifying underserved market segments, anticipating
client needs, and enhancing product offerings. Simply put, businesses can increase
their market share, spur revenue growth, and fortify their brand by utilizing data
analytics to gain a competitive advantage.
BI APPLICATIONS:
BI tools are required in almost all industries and functions. The nature of the
information and the speed of action may be different across businesses, but every
manager today needs access to BI tools to have up-to-date metrics about business
performance. Businesses need to embed new insights into their operating processes
to ensure that their activities continue to evolve with more efficient practices. The
following are some areas of applications of BI and data mining.
These systems take away most of the guesswork done by doctors in diagnosing
ailments.
Treatment Effectiveness: The prescription of medication and treatment is
also a difficult choice out of so many possibilities. For example, there are more than
100 medications for hypertension (high blood pressure) alone. There are also
interactions in terms of which drugs work well with others and which drugs do not.
Decision trees can help doctors learn about and prescribe more effective treatments.
Thus, the patients can recover their health faster with a lower risk of complications
and cost.
2. Wellness Management:
This includes keeping track of patients' health records, analysing customer health
trends and proactively advising them to take any needed precautions.
Manage Fraud and Abuse: Some medical practitioners have unfortunately been
found to conduct unnecessary tests, and/or overbill the government and health
insurance companies. Exception reporting systems can identify such providers and
action can be taken against them.
4. Education
As higher education becomes more expensive and competitive, it becomes a great
user of data-based decision-making. There is a strong need for efficiency, increasing
revenue, and improving the quality of student experience at all levels of education.
5. Banking
Banks make loans and offer credit cards to millions of customers. They are interested in
improving the quality of loans and reducing bad debts. They want to retain better
customers and sell more services to them.
Automate the Loan Application Process: Decision models can be generated from past
data to predict the likelihood of a loan proving successful. These can be inserted into
business processes to automate the loan approval process.
Optimize Cash Reserves with Forecasting: Banks have to maintain certain liquidity
to meet the needs of depositors who may like to withdraw money. Using past data
and trend analysis, banks can forecast how much to keep and invest the rest to earn
interest.
6. Financial Services
Stock brokerages are an intensive user of BI systems. Fortunes can be made or lost
based on access to accurate and timely information.
Predict Changes in Bond and Stock Prices: Forecasting the price of stocks and bonds
is a favourite pastime of financial experts as well as lay people. Stock transaction
data from the past, along with other variables, can be used to predict future price
patterns. This can help traders develop long-term trading strategies.
Assess the Effect of Events on Market Movements: Decision models using decision
trees can be created to assess the impact of events on changes in market volume and
prices. Monetary policy changes (such as a Federal Reserve interest rate change) or
geopolitical changes (such as war in a part of the world) can be fed into the predictive
model to help take action with greater confidence and less risk.
7. Retail
Retail organizations grow by meeting customer needs with quality products in a
convenient, timely, and cost-effective manner. Understanding emerging customer
shopping patterns can help retailers organize their products, inventory, store layout,
and web presence in order to delight their customers, which in turn would help
increase revenue and profits. Retailers generate a lot of transaction and logistics data
that can be used to diagnose and solve problems.
Improve Store Layout and Sales Promotions: A market basket analysis can develop
predictive models of the products often sold together. This knowledge of affinities
between products can help retailers co-locate those products. Alternatively, those
affinity products could be located farther apart to make the customer walk the length
and breadth of the store, and thus be exposed to other products. Promotional
discounted product bundles can be created to push a non-selling item along with a
set of products that sell well together.
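The affinity idea behind market basket analysis can be sketched by counting how often pairs of products appear in the same transaction. The transactions and product names below are made up for illustration, and plain pair counts with a support figure are only a simplified stand-in for a full association-rule method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (illustrative data, not from the text).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"milk", "bread"},
]

# Count how many transactions contain each unordered pair of products.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = share of transactions that contain both products.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}

# The pair with the highest support is the strongest co-location candidate.
top_pair, top_support = max(support.items(), key=lambda item: item[1])
print(top_pair, top_support)
```

Pairs with high support are candidates for co-location or bundling; a real analysis would normally also look at measures such as confidence and lift.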
Optimize Logistics for Seasonal Effects: Seasonal products offer tremendously
profitable short-term sales opportunities, yet they also carry the risk of unsold
inventories at the end of the season. Understanding which products are in season
in which market can help retailers dynamically manage prices to ensure their
inventory is sold during the season. If it is raining in a certain area, then the inventory
of umbrellas and ponchos could be rapidly moved there from non-rainy areas to help
increase sales.
Minimize Losses due to Limited Shelf-Life: Perishable goods offer challenges in
terms of disposing of the inventory in time. By tracking sales trends, the perishable
products at risk of not selling before the sell-by date can be suitably discounted and
promoted.
8. Insurance
This industry is a prolific user of prediction models in pricing insurance proposals
and managing losses from claims against insured assets.
Forecast Claim Costs for Better Business Planning: When natural disasters, such as
hurricanes and earthquakes, strike, loss of life and property occurs. By using the best
available data to model the likelihood (or risk) of such events happening, the insurer
can plan for losses and manage resources and profits effectively.
Determine Optimal Rate Plans: Pricing an insurance rate plan requires covering the
potential losses and making a profit. Insurers use actuarial tables to project life spans
and disease tables to project mortality rates, and thus price themselves competitively
yet profitably.
9. Manufacturing
Manufacturing operations are complex systems with interrelated subsystems. From
machines working right, to workers having the right skills, to the right components
arriving with the right quality at the right time, to money to source the components,
many things have to go right. Toyota's famous lean manufacturing system works
on just-in-time inventory systems to optimize investments in inventory and to
improve flexibility in their product-mix.
Discover Novel Patterns to Improve Product Quality: Quality of a product can also be
tracked, and this data can be used to create a predictive model of product quality
deteriorating. Many companies, such as automobile companies,
10. Telecom
BI in telecom can help the customer side as well as network side of the operations.
Key BI applications include churn management, marketing/customer profiling,
network failure, and fraud detection.
11. Public Sector
Governments gather a large amount of data by virtue of their regulatory function.
That data could be analysed for developing models of effective functioning. There
are innumerable applications that can benefit from mining that data. A couple of
sample applications are shown here.
Law Enforcement: Social behaviour is a lot more patterned and predictable than one
would imagine. For example, the Los Angeles Police Department (LAPD) mined the
data from its 13 million crime records over 80 years and developed models of what
kind of crime is going to happen when and where. By increasing patrolling in those
particular areas, LAPD was able to reduce property crime by 27 percent. Internet
chatter can be analysed to learn about and prevent any evil designs.
Scientific Research: Any large collection of research data is amenable to being mined
for patterns and insights. Protein folding (microbiology), nuclear reaction analysis
(sub-atomic physics), disease control (public health) are some examples where data
mining can yield powerful new insights.
12. Customer Relationship Management
A business exists to serve a customer. A happy customer becomes a repeat customer.
A business should understand the needs and sentiments of the customer, sell more of
its offerings to the existing customers, and also expand the pool of customers it
serves. BI applications can impact many aspects of marketing.
Conclusion
Business Intelligence is a comprehensive set of IT tools to support decision making
with imaginative solutions for a variety of problems. BI can help improve the
performance in nearly all industries and applications.
Text Analytics
• Entity extraction, text categorization and text clustering
• Document summarisation
• Spatial-temporal analysis.
BI Skills
UNIT: 2
Sample space is the universal set that consists of all possible outcomes of
an experiment. Sample space is usually represented using the letter ‘S’
and individual outcomes are called the elementary events.
The sample space can be finite or infinite.
S = {(T , T , T) , (T , T , H) , (T , H , T) , (T , H , H ) , (H , T , T ) , (H , T , H) ,
(H , H, T) ,(H , H , H)}
Suppose we want to find only the outcomes which have at least two heads;
then the set of all such outcomes can be given as:
E = {(H, T, H), (H, H, T), (H, H, H), (T, H, H)}
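The sample space S and the event E listed above can be enumerated programmatically. This is a minimal sketch that assumes three tosses of a fair coin, so that all eight outcomes are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of three coin tosses: the sample space S.
sample_space = list(product("HT", repeat=3))

# Event E: outcomes with at least two heads.
event = [outcome for outcome in sample_space if outcome.count("H") >= 2]

# With equally likely outcomes, P(E) = |E| / |S|.
p_event = Fraction(len(event), len(sample_space))

print(len(sample_space), len(event), p_event)
```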
If A is the set of students with a CGPA (cumulative grade point average) of more than 3.5 out of 4 and
B is the set of students with a CGPA of more than 3.0, then A is a subset of B and P(A) ≤ P(B).
4. The probability that either event A or event B occurs, or both occur, is given by
P(A U B) = P(A) + P(B) - P(A ∩ B)
5. If A and B are mutually exclusive events, so that P(A ∩ B) = 0, then
P(A U B) = P(A) + P(B)
6. If A1, A2, …, An are n events that form a partition of sample space S,
then their probabilities must add up to 1:
P(A1) + P(A2) + … + P(An) = 1
Joint Probability :
Let A and B be two events in a sample space. Then the joint probability of the two events,
written as P(A ∩ B), is given by
P(Divorced ∩ Default) = 13/1000 = 0.013
P(Single ∩ Default) = 42/1000 = 0.042
P(Divorced) = 50/1000 = 0.05
P(Single) = 300/1000 = 0.3
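Dividing a joint probability by the matching marginal probability gives the conditional probability of default within each group. The sketch below uses only the counts quoted above (13, 42, 50, 300 out of 1000); the dictionary and function names are illustrative.

```python
from fractions import Fraction

total_customers = 1000
customers = {"Divorced": 50, "Single": 300}          # marginal counts
defaulters = {"Divorced": 13, "Single": 42}          # joint counts with Default


def conditional_default(status):
    """P(Default | status) = P(status and Default) / P(status)."""
    joint = Fraction(defaulters[status], total_customers)
    marginal = Fraction(customers[status], total_customers)
    return joint / marginal


for status in customers:
    print(status, float(conditional_default(status)))
```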
1. Let there be a bag containing 5 white and 4 red balls. Two balls are
drawn from the bag one after the other without replacement. Consider
the following events.
A= Drawing a white ball in the first draw
B= Drawing a red ball in the Second draw.
Sol: P(B/A)= Probability of drawing a red ball in second draw given
that a white ball has already been drawn in the first draw.
P(B/A)= Probability of drawing a red ball from a bag containing 4
white and 4 red balls.
P(B/A)= 4/8 =1/2
For this Random Experiment P(A/B) is not meaningful because A
cannot occur after the occurrence of event B.
2. A die is thrown twice and the sum of the numbers appearing is observed
to be 6. What is the conditional probability that the number 4 has appeared
at least once?
A = the sum of the numbers appearing is 6
B = the number 4 has appeared at least once
Required probability = P(B/A)
Sol: A = {(1,5), (2,4), (3,3), (4,2), (5,1)}, so n(A) = 5 and P(A) = 5/36
B = {(1,4), (4,1), (2,4), (4,2), (3,4), (4,3), (4,4), (4,5), (5,4), (4,6), (6,4)}
A ∩ B = {(2,4), (4,2)}, so n(A ∩ B) = 2 and P(A ∩ B) = 2/36
P(B/A) = P(A ∩ B)/P(A) = (2/36)/(5/36) = 2/5
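The result 2/5 can be checked by listing all 36 outcomes of two throws. This is a small sketch that assumes a fair six-sided die; the set names A and B follow the events defined in the problem.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of throwing a die twice.
rolls = list(product(range(1, 7), repeat=2))

A = {roll for roll in rolls if sum(roll) == 6}   # sum of the two numbers is 6
B = {roll for roll in rolls if 4 in roll}        # the number 4 appears at least once

# P(B/A) = P(A and B) / P(A); the common denominator 36 cancels.
p_b_given_a = Fraction(len(A & B), len(A))

print(sorted(A & B), p_b_given_a)
```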
Question 3:
Fifteen cards are numbered from 1 to 15, and two cards are
chosen at random such that the sum of the numbers on both the
cards is even. Find the probability that the chosen cards are
odd-numbered.
Let, A ≡ event of selecting two odd-numbered cards
B ≡ event of selecting cards whose sum is even.
Sol: Then,
n(B) = number of ways of choosing two numbers whose sum is even
= 8C2 + 7C2.
n(A ∩ B) = number of ways of choosing odd-numbered cards such that
their sum is even
= 8C2.
Now, P(A|B) = P(A ∩ B)/P(B) = n(A ∩ B)/n(B)
= 8C2 / (8C2 + 7C2) = 28/49 = 4/7.
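The counting argument can be verified by brute force. The sketch below assumes the cards are numbered 1 to 15 (eight odd, seven even) and that every pair of cards is equally likely to be drawn.

```python
from fractions import Fraction
from itertools import combinations

cards = range(1, 16)  # cards numbered 1..15
pairs = list(combinations(cards, 2))

# B: pairs whose sum is even (both odd or both even).
even_sum = [pair for pair in pairs if sum(pair) % 2 == 0]
# A and B: pairs where both cards are odd (their sum is then even).
both_odd = [pair for pair in even_sum if pair[0] % 2 == 1 and pair[1] % 2 == 1]

p_a_given_b = Fraction(len(both_odd), len(even_sum))
print(len(both_odd), len(even_sum), p_a_given_b)
```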
Bayes’ theorem is one of the most important concepts in analytics
since several problems are solved using Bayesian statistics. Consider
two events A and B. We can write the following two conditional
probabilities:
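The two conditional probabilities referred to here are, by the definition of conditional probability, P(A|B) = P(A ∩ B)/P(B) and P(B|A) = P(A ∩ B)/P(A). Eliminating P(A ∩ B) gives Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it with made-up numbers: the prior and the two likelihoods are hypothetical values chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical inputs (not from the text).
p_a = Fraction(1, 100)              # prior P(A)
p_b_given_a = Fraction(9, 10)       # P(B | A)
p_b_given_not_a = Fraction(1, 20)   # P(B | not A)

# Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b, p_a_given_b, float(p_a_given_b))
```

Even with a high P(B|A), the posterior P(A|B) stays modest here because the assumed prior P(A) is small.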
Outcome               HH   HT   TH   TT
X (number of heads)    2    1    1    0
Random variables can be classified as discrete or continuous depending on the values that
the random variable can take.
Discrete Random Variables :
A random variable which takes a finite or at most countable (possibly countably infinite)
number of values is known as a discrete random variable. In other words, a discrete random variable
takes a countable number of possible outcomes.
Ex: i) Marks obtained by a student in a test
ii) Number of defective nuts in a lot
iii) The number of cars that pass through a given intersection in an
hour.
iv) Number of errors on a page of a book
v) Number of accidents taking place on a busy road.
Thus, X = {1, 2, 3, 4, 5, 6}
Another popular example of a discrete random variable is the number of heads when
tossing two coins. In this case, the random variable X can take only one of three
values, i.e., 0, 1, or 2.
Continuous Random variable :
A random variable which takes all the possible values in an interval is called
Continuous variable.
Examples i) Waiting time for a bus
Σ P(X = x) = P(X = 0) + P(X = 1) + P(X = 2)
= 1/4 + 1/2 + 1/4
= 1
Cumulative distribution function, F(xi), is the probability that the random
variable X takes values less than or equal to xi. That is, F(xi) = P(X ≤ xi).
From the above problem,
P(X < 2) is the probability that the number of heads is less than
two, which equals P(X ≤ 1):
F(1) = P(X = 0) + P(X = 1)
= 1/4 + 1/2
= 0.75
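The probability mass function and cumulative distribution function for the number of heads in two coin tosses can be built directly from the sample space. This is a minimal sketch assuming fair coins; `pmf` and `cdf` are illustrative names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two coin tosses and the number of heads in each outcome.
outcomes = list(product("HT", repeat=2))
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# PMF: P(X = x) for x = 0, 1, 2.
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(head_counts.items())}


def cdf(x):
    """F(x) = P(X <= x), the running sum of the PMF."""
    return sum(p for value, p in pmf.items() if value <= x)


print(pmf)
print(cdf(1), cdf(2))
```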
Example 2:
The Cumulative Distribution Function (CDF) is another important concept in
probability theory and statistics, especially when dealing with random variables, whether
discrete or continuous. The CDF provides the probability that a random variable X takes
on a value less than or equal to a specific point x.
The cumulative distribution function is denoted by F(x) and its formula is given by:
F(x)=P(X≤x)
Probability Density Function and Cumulative Distribution Function of a
Continuous Random Variable :
where
What is the probability for the student to fail the test (i.e., to have less
than 6 correct answers)?
Answer:
Binomial Mean and Variance:
Mean= np
Variance=np(1-p)
Binomial Mean E(X) = 10 * 0.25 = 2.5.
Variance V (X) = 10 * (0.25) * (1 − 0.25) = 1.875.
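The failure probability asked for above can be computed from the binomial probability mass function, P(X = k) = nCk p^k (1 − p)^(n − k). The sketch below takes n = 10 and p = 0.25 from the mean and variance calculation, and treats "fail" as fewer than 6 correct answers, as stated in the question; the function name is illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.25

# P(fail) = P(X < 6) = P(X = 0) + ... + P(X = 5).
p_fail = sum(binomial_pmf(k, n, p) for k in range(6))

mean = n * p                # E(X) = np
variance = n * p * (1 - p)  # V(X) = np(1 - p)

print(round(p_fail, 4), mean, variance)
```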
Poisson Distribution
Poisson Distribution is a Probability distribution that is used to show how many times
an event occurs over a specific period.
It is the discrete probability distribution of the number of events occurring in a given
time period, given the average number of times the event occurs over that time
period. It is the distribution related to probabilities of events that are extremely rare
but have a large number of independent opportunities for occurrence.
Poisson Distribution Definition
Poisson distribution is used to model the number of events that occur in a fixed
interval of time or space, given the average rate of occurrence, assuming that the
events happen independently and at a constant rate.
Poisson distribution formula
Mean and Variance of Poisson distribution:
The Poisson distribution has only one parameter, called λ.
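The standard Poisson probability mass function is P(X = k) = e^(−λ) λ^k / k! for k = 0, 1, 2, …, and both the mean and the variance equal λ. The sketch below evaluates it for a hypothetical rate λ = 2, which is an assumption chosen for illustration and is not a value taken from the text.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return exp(-lam) * lam ** k / factorial(k)


lam = 2.0  # hypothetical average number of events per interval

p_zero = poisson_pmf(0, lam)
p_five_or_fewer = sum(poisson_pmf(k, lam) for k in range(6))

# Numerical check that mean and variance both equal lam
# (the tail beyond k = 60 is negligible for this rate).
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
variance = sum((k - lam) ** 2 * poisson_pmf(k, lam) for k in range(60))

print(round(p_zero, 4), round(p_five_or_fewer, 4), round(mean, 4), round(variance, 4))
```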
Suppose 400 pages of the book are randomly selected. What are the
probabilities for having no typos and for having five or fewer typos?
Sol:
NORMAL DISTRIBUTION (GAUSSIAN DISTRIBUTION) :
The normal distribution is the most widely known and used of all
distributions. Because the normal distribution approximates many natural
phenomena so well, it has developed into a standard of reference for many
probability problems.
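Let X be a continuous random variable. X is said to follow a normal
distribution with mean μ and standard deviation σ if its probability density function is given by
f(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²)), −∞ < x < ∞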
Thus, any normal random variable X can be expressed using the standard
normal random variable Z.
Solved Examples
1. Calculate the value of the normal probability density for the random variable x = 5,
with population mean 2 and standard deviation 3.
Solution:
x=5
Mean = μ = 2
Standard Deviation = σ = 3
We will solve the questions with the help of the above normal
probability distribution formula:
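The calculation can be sketched numerically using the standard normal density f(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²)) with the given values x = 5, μ = 2 and σ = 3. The helper name `normal_pdf` is illustrative; Python's `statistics.NormalDist` is used only as a cross-check and to show the cumulative probability P(X ≤ 5).

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal density at x, written via the standardised value z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


x, mu, sigma = 5, 2, 3

density = normal_pdf(x, mu, sigma)            # height of the density curve at x = 5
cumulative = NormalDist(mu, sigma).cdf(x)     # P(X <= 5), i.e. P(Z <= 1)

print(round(density, 4), round(cumulative, 4))
```

Note that the density value is a height of the curve, not a probability; probabilities for a continuous variable come from the cumulative distribution function.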
SAMPLING
Definition: A portion of the population which is examined with a
view to determining the population characteristics is called a
sample.
In other words, sample is a subset of population. Size of the sample
is denoted by n. The process of selection of a sample is called
Sampling.
There are different methods of sampling
Probability Sampling Methods
Non-Probability Sampling Methods
Probability Sampling Methods :
a) Random Sampling (Probability Sampling): It is the process of drawing a sample from a
population in such a way that each member of the population has an equal chance of being included in
the sample.
Example: A hand of cards from a well shuffled pack of cards is a random sample.
Note: If N is the size of the population and n is the size of the sample, then
the no. of samples with replacement = N^n
the no. of samples without replacement = NCn
b) Stratified Sampling: In this method, the population is first divided into several smaller groups called strata
according to some relevant characteristics.
From each stratum, samples are selected at random; all the samples are then combined to form the
stratified sample.
c) Cluster Sampling :
In cluster sampling, the population is divided into mutually exclusive clusters.
For example, assume that a researcher is interested in analyzing life of smart phone batteries from a
specific manufacturer. The manufacturer may have different models (each model in this case will be a
cluster).
d) Systematic Sampling (Quasi Random Sampling): In this method, all the units of the population
are arranged in some order. If the population size is N and the sample size is n, then we first define the
sample interval, denoted by k = N/n.
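The probability sampling methods above can be sketched with Python's `random` module. Everything here is illustrative: the population of 20 numbered units, the two strata, and the sample sizes are assumptions, and the systematic sample assumes the usual procedure of a random start followed by every k-th unit.

```python
import random
from math import comb

random.seed(0)  # fixed seed so the sketch is repeatable

population = list(range(1, 21))  # hypothetical population of N = 20 units
N, n = len(population), 5

# Number of possible samples of size n.
samples_with_replacement = N ** n           # N^n
samples_without_replacement = comb(N, n)    # NCn

# a) Simple random sampling without replacement.
simple_random = random.sample(population, n)

# b) Stratified sampling: draw 2 units at random from each (assumed) stratum.
strata = {"low": population[:10], "high": population[10:]}
stratified = [unit for units in strata.values() for unit in random.sample(units, 2)]

# d) Systematic sampling: interval k = N / n, random start, then every k-th unit.
k = N // n
start = random.randrange(k)
systematic = population[start::k]

print(simple_random, stratified, systematic)
```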
Non Probability Sampling Methods:
Sample units are selected based on convenience and/or on voluntary basis.
Ex: Assume that a data scientist is interested in studying attrition and factors
influencing attrition. For this study, he/she may collect data from his/her friends and
colleagues, which may not be a true representation of the population. Such
sampling procedures come under the category of non-probability sampling.
Convenience Sampling :
Convenience sampling is a non-probability sampling technique in which the sample
units are not selected according to a probability distribution. For example, a
researcher may collect data from his school or the work place and from his/her
friends since the cost of data collection in such cases is minimal. Convenience
sampling is not recommended since it is likely to result in biased estimates.
Voluntary Sampling: Under voluntary sampling, the data is collected from people
who volunteer for such data collection. For example, customer feedback in many
contexts falls under this sampling procedure. There could be bias in the case of voluntary
sampling. Many organizations such as Amazon and Trip Advisor collect customer
feedback. Often the feedback is provided by customers who had a bad
experience with the product/service; many customers who were happy with the
product/service may not give feedback.
Purposive (Judgment) Sampling: In this method, the members constituting the
sample are chosen not according to some definite scientific procedure, but
according to the convenience and personal choice of the individual who selects the
sample. The choice of the individual items of a sample depends entirely on the
individual judgment of the investigator.
Sequential Sampling: It consists of a sequence of samples drawn one after another
from the population. If the result of
the first sample is not acceptable, then a second sample is drawn, and the process
continues until a proper decision can be taken. But if the first sample is acceptable, then no
new sample is drawn.
Classification of Samples:
Large Samples : If the size of the sample n ≥ 30 , then it is said to
be large sample.
Small Samples : If the size of the sample n < 30 ,then it is said to
be small sample or exact sample.
Parameters and Statistics:
Parameter is a statistical measure based on all the units of a
population.
Statistic is a statistical measure based on only the units selected in a
sample.
Note: In this unit, Parameter refers to the population and Statistic
refers to sample.
SAMPLING DISTRIBUTION
Sampling distribution refers to the probability distribution of a
statistic such as sample mean and sample standard deviation
computed from several random samples of same size.
Understanding the sampling distribution is important for
hypothesis testing. Test statistic in hypothesis testing is derived
based on the knowledge of sampling distribution.
In this example, the population is the weight of six pumpkins (in
pounds) displayed in a carnival "guess the weight" game booth. You
are asked to guess the average weight of the six pumpkins by taking
a random sample without replacement from the population.
Since we know the weights from the population, we can find the population
mean.
To demonstrate the sampling distribution, let’s start with obtaining all of the
possible samples of size n=2 from the population, sampling without
replacement. The table below shows all the possible samples, the weights for the
chosen pumpkins, the sample mean and the probability of obtaining each sample.
The mean of the sample means is :
= 9.5(1/15) + 11.5(1/15) + 12(2/15) + 12.5(1/15) + 13(1/15) + 13.5(1/15)
+ 14(1/15) + 14.5(2/15) + 15.5(1/15) + 16(1/15) + 16.5(1/15) + 17(1/15)
+ 18(1/15)
= 14
Now, let's do the same thing as above but with sample size n=5
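The enumeration for both sample sizes can be sketched as follows. The six pumpkin weights used here (19, 14, 15, 9, 10 and 17 pounds) are an assumption: they are not stated in this section, but they reproduce exactly the fifteen sample means listed above for n = 2 and a population mean of 14.

```python
from fractions import Fraction
from itertools import combinations

# Assumed population weights in pounds (chosen to match the listed sample means).
weights = [19, 14, 15, 9, 10, 17]


def sample_means(population, n):
    """Means of all possible samples of size n drawn without replacement."""
    return [Fraction(sum(sample), n) for sample in combinations(population, n)]


population_mean = Fraction(sum(weights), len(weights))

means_n2 = sample_means(weights, 2)   # 15 equally likely samples
means_n5 = sample_means(weights, 5)   # 6 equally likely samples

mean_of_means_n2 = sum(means_n2) / len(means_n2)
mean_of_means_n5 = sum(means_n5) / len(means_n5)

print(population_mean, mean_of_means_n2, mean_of_means_n5)
print(float(min(means_n2)), float(max(means_n2)))
print(float(min(means_n5)), float(max(means_n5)))
```

Under these assumed weights the mean of the sample means equals the population mean for both sample sizes, while the sample means for n = 5 are spread over a narrower range than those for n = 2.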
Central Limit Theorem: If x̄ is the mean of a random sample of size n
drawn from a population having mean μ and standard deviation σ, then
the sampling distribution of the sample mean x̄ is approximately a normal
distribution with mean μ and SD = S.E. of x̄ = σ/√n, provided the
sample size n is large.
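A small simulation can illustrate the theorem. The sketch below assumes a uniform population on [0, 1), which is not normal and has mean 0.5 and standard deviation √(1/12), and draws 5000 samples of size n = 36; these choices are illustrative, not from the text.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

n = 36               # sample size (large sample, n >= 30)
num_samples = 5000   # number of repeated samples

mu = 0.5                    # mean of the uniform [0, 1) population
sigma = (1 / 12) ** 0.5     # its standard deviation

# Draw many samples and record each sample mean.
sample_means = [
    statistics.fmean([random.random() for _ in range(n)])
    for _ in range(num_samples)
]

observed_mean = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = sigma / n ** 0.5   # standard error sigma / sqrt(n)

print(round(observed_mean, 3), round(observed_se, 4), round(theoretical_se, 4))
```

The observed mean and standard deviation of the sample means should land close to μ and σ/√n, which is what the theorem predicts.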
Estimate : An estimate is a statement made to find an unknown population
parameter.
Estimator : The procedure or rule to determine an unknown population
parameter is called estimator.
Example: Sample proportion is an estimate of population proportion , because
with the help of sample proportion value we can estimate the population
proportion value.
Types of Estimation:
Point Estimation: If the estimate of the population parameter is given by a
single value, then the estimate is called a point estimate of the parameter.
Interval Estimation: If the estimate of the population parameter is given by
two different values between which the parameter is expected to lie, then the estimate is
called an interval estimate of the parameter.
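The difference between the two types of estimation can be sketched for a population mean. The data below are simulated (a hypothetical sample of 40 values), and the interval uses the large-sample normal approximation at an assumed 95% confidence level; neither the data nor the confidence level comes from the text.

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 40 observations (illustrative data only).
sample = [random.gauss(50, 10) for _ in range(40)]
n = len(sample)

# Point estimate: a single value for the population mean.
point_estimate = statistics.fmean(sample)

# Interval estimate: point estimate plus/minus z times the standard error.
standard_error = statistics.stdev(sample) / n ** 0.5
confidence = 0.95  # assumed confidence level
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print(round(point_estimate, 2), (round(lower, 2), round(upper, 2)))
```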
INTRODUCTION TO HYPOTHESIS TESTING:
A hypothesis is a claim or belief. Hypothesis testing is a statistical process of
either rejecting or retaining a claim, belief, or association related to a
business context, product, service, process, etc.
Hypothesis testing consists of two complementary statements called null
hypothesis and alternative hypothesis, and only one of them is true.
Null hypothesis is the claim that is assumed to be true initially. That is at the
beginning we assume that the null hypothesis is true and try to retain it
unless there is strong evidence against null hypothesis.
Alternative hypothesis, usually denoted as HA (or H1 ), is the complement
of null hypothesis. Alternative hypothesis is what the researcher believes to
be true and would like to reject the null hypothesis.
Hypothesis testing is an integral part of many predictive analytics
techniques such as multiple linear regression and logistic regression.
In business, many claims are made by organizations. A few examples of such
claims are listed below:
1. Children who drink the health drink Complan (a health drink owned by
the company Heinz in India) are likely to grow taller.
2. If you drink Horlicks, you can grow taller, stronger, and sharper (3 in 1).
3. Using Fair and Lovely (Fair and Handsome) cream can make one fair and
lovely (fair and handsome).
4. Wearing perfume (such as Axe) will help to attract the opposite gender
(known as the Axe effect).
5. Women use camera phone more than men (Freier, 2016).
There are many such claims and beliefs; many business rules and strategies
are generated based on these hypotheses. The question is how can we check
whether these are actually true. Hypothesis testing is used for checking the
validity of the claim using evidence found in a sample data.
Decide the criteria for rejection and retention of the null hypothesis. This is called
the significance value, traditionally denoted by the symbol α. The value of α will
depend on the context; usually 0.1, 0.05, and 0.01 are used.
Calculate the p-value (probability value), which is the conditional probability
of observing a test statistic value at least as extreme as the one obtained, given that
the null hypothesis is true. In simple terms, the p-value is the evidence in support of
the null hypothesis.
Take the decision to reject or retain the null hypothesis based on the p-value
and the significance value α. The null hypothesis is rejected when the p-value is less
than α and retained when the p-value is greater than or equal
to α.
if the calculated statistic value is less than the critical value (p-value will be less
than α-value) then we reject the null hypothesis, whereas, if the statistic value
is greater than the critical value(p-value will be greater than then we retain
the null hypothesis.
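The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: it assumes a z-test with a known population standard deviation, and the function names and the numbers passed to them are hypothetical.

```python
from statistics import NormalDist


def z_test_p_value(sample_mean, mu0, sigma, n, tail="two-sided"):
    """p-value of a z-test for H0: mu = mu0, assuming sigma is known."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    cdf = NormalDist().cdf(z)  # P(Z <= z) under the null hypothesis
    if tail == "left":
        return cdf
    if tail == "right":
        return 1.0 - cdf
    return 2.0 * min(cdf, 1.0 - cdf)  # two-sided: both tails count as extreme


def decide(p_value, alpha=0.05):
    """Reject H0 when p-value < alpha, otherwise retain it."""
    return "reject H0" if p_value < alpha else "retain H0"


# Hypothetical numbers chosen only to exercise the rule.
p = z_test_p_value(sample_mean=48.0, mu0=50.0, sigma=4.0, n=30)
print(round(p, 4), decide(p, alpha=0.05))
```

Note that a p-value exactly equal to α leads to retaining the null hypothesis, matching the rule stated above.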
TYPE I ERROR, TYPE II ERROR
In a hypothesis test we end up with one of the following two decisions:
1. Reject null hypothesis.
2. Fail to reject (or retain) null hypothesis.
Type I Error: Rejecting a null hypothesis when it is true is called a
Type I Error or False Positive (falsely believing that the claim made in
the alternative hypothesis is true).
A type I error (false-positive) occurs if an investigator rejects a null
hypothesis that is actually true in the population.
The significance value α is the probability of a Type I error.
Type I Error = α = P(Rejecting null hypothesis | H0 is true)
The probability value (p-value) is computed from the observed sample and
measures the evidence against the null hypothesis, whereas the significance
value α is the Type I error rate under repeated sampling.
Type II Error: Failing to reject a null hypothesis (or retaining a null
hypothesis) when the alternative hypothesis is true is called a Type II
Error or False Negative (falsely believing that there is no relationship).
A type II error (false-negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population.
Usually the probability of a Type II error is denoted by the symbol β.
Type II Error = β = P(Retain null hypothesis | H0 is false)
The value (1 − β) is known as the power of the hypothesis test.
Power of the test = 1 − β = 1 − P(Retain null hypothesis | H0 is false)
Alternatively, the power of the test = 1 − β = P(Reject null hypothesis | H0 is
false).
False-positive and false-negative results can also occur because of bias.
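To make α, β and power concrete, the following sketch computes β and the power for one specific situation. It is an illustration under stated assumptions rather than a general recipe: a right-tailed z-test with known σ, a single assumed true mean mu1 under the alternative, and hypothetical numbers throughout.

```python
from statistics import NormalDist


def beta_and_power(mu0, mu1, sigma, n, alpha=0.05):
    """Type II error and power of a right-tailed z-test of H0: mu = mu0,
    when the true mean is mu1 (> mu0) and sigma is known."""
    se = sigma / n ** 0.5
    # Sample means above this cut-off lead to rejecting H0 at level alpha.
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta = P(retain H0 | true mean is mu1)
    beta = NormalDist(mu1, se).cdf(critical_mean)
    return beta, 1 - beta


beta, power = beta_and_power(mu0=50, mu1=52, sigma=4, n=16)
print(round(beta, 3), round(power, 3))

# A larger sample shrinks the standard error, so power goes up.
_, power_large_n = beta_and_power(mu0=50, mu1=52, sigma=4, n=64)
```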
T-test :
The t-test is used when the population follows a normal distribution and the population standard
deviation σ is unknown and is estimated from the sample. The t-test is robust to violations of
normality as long as the data is close to symmetric and there are no outliers.
Let S be the standard deviation estimated from a sample of size n, let X̄ be the sample mean, and
let μ be the hypothesized population mean. Then the statistic

$$ t = \frac{\bar{X} - \mu}{S / \sqrt{n}} $$

will follow a t-distribution with (n − 1) degrees of freedom if the sample is drawn from a
population that follows a normal distribution. Here 1 degree of freedom is lost since the standard
deviation is estimated from the sample. Thus, we use the t-statistic (hence the test is called the
t-test) to test the hypothesis when the population standard deviation is unknown.
The t-test is a statistical test procedure that tests whether there is a
significant difference between the means of two groups, or between a
sample mean and a reference value.
EX: The two groups could be, for example, patients who received drug
A and patients who received drug B, and you want to know if there is a
difference in blood pressure between these two groups.
Types of t-test :
There are three different types of t-tests.
One-sample t-test
We use the one-sample t-test when we want to compare the mean of a sample with a known
reference mean.
Example : A manufacturer of chocolate bars claims that its chocolate bars weigh 50 grams on
average. To verify this, a sample of 30 bars is taken and weighed. The mean value of this sample is
48 grams.
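A minimal sketch of the one-sample t-statistic for a case like the chocolate bar example follows. The notes give only the sample mean, so the ten weights below are hypothetical values chosen to average 48 grams (a smaller sample than the 30 bars mentioned), and the critical value is the usual table value for 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t-statistic and degrees of freedom for H0: population mean = mu0."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


# Hypothetical weights in grams; the claimed mean is 50 grams.
weights = [48, 47, 49, 50, 46, 48, 49, 47, 48, 48]
t_stat, df = one_sample_t(weights, mu0=50)

T_CRITICAL = 2.262  # two-tailed t table value for alpha = 0.05, df = 9
decision = "reject H0" if abs(t_stat) > T_CRITICAL else "retain H0"
print(round(t_stat, 3), df, decision)
```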
Independent-sample t-test
We use the t-test for independent samples when we want to compare the means of two
independent groups or samples. We want to know if there is a significant difference between these
means.
Example : We would like to compare the effectiveness of two painkillers, drug A and drug B.
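The comparison of two independent groups can be sketched as below. This version assumes equal variances in the two groups (the pooled-variance form of the test); the pain scores are hypothetical and the function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(a, b):
    """Pooled-variance t-statistic for two independent samples."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical pain scores (lower is better) for two separate patient groups.
drug_a = [4, 5, 6, 5, 5]
drug_b = [6, 7, 8, 7, 7]
t_stat, df = independent_t(drug_a, drug_b)

T_CRITICAL = 2.306  # two-tailed t table value for alpha = 0.05, df = 8
decision = "reject H0" if abs(t_stat) > T_CRITICAL else "retain H0"
print(round(t_stat, 3), df, decision)
```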
Paired-sample t-test
The t-test for dependent samples is used to compare the means of two dependent groups.
Example : We want to know how effective a diet is. To do this, we weigh 30 people before the diet
and exactly the same people after the diet.
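For dependent samples, one way to sketch the calculation is to take the per-person differences and apply a one-sample t-test against zero. The weights below are hypothetical, and six people are used instead of 30 only to keep the example short.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t-statistic for paired samples, computed on the differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical weights in kg for the same six people before and after a diet.
before = [82, 90, 77, 85, 95, 88]
after = [80, 87, 76, 82, 91, 86]
t_stat, df = paired_t(before, after)

T_CRITICAL = 2.571  # two-tailed t table value for alpha = 0.05, df = 5
decision = "reject H0" if abs(t_stat) > T_CRITICAL else "retain H0"
print(round(t_stat, 3), df, decision)
```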
Chi-Square Goodness of Fit Tests
Goodness of fit tests are hypothesis tests that compare the observed
distribution of data with an expected (theoretical) distribution. They
decide whether there is any statistically significant difference between
the two by comparing the observed frequencies in the data with the
frequencies that would be expected if the data followed the specified
theoretical distribution.
The null and alternative hypotheses in chi-square goodness of fit tests are
H0 : There is no statistically significant difference between the observed
frequencies and the expected frequencies from a hypothesized
distribution.
HA: There is a statistically significant difference between the observed
frequencies and the expected frequencies from a hypothesized
distribution.
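Let Z be a standard normal random variable. Then Z² follows a chi-square
distribution with 1 degree of freedom.
If we have k independent standard normal random variables, namely, X1 , X2 , …, Xk ,
then a chi-square distribution with k degrees of freedom is given by

$$ \chi^2_k = X_1^2 + X_2^2 + \cdots + X_k^2 $$

The chi-square test statistic is

$$ \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

where Oij is the observed frequency in category (i, j) and Eij is the expected
frequency in the category (i, j). Since the statistic is a sum of squared terms that
grows as the observed frequencies move away from the expected ones, the chi-square
test is always a right-tailed test.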
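The test statistic can be sketched in Python as follows. This is an illustrative example, not from the notes: it uses a single index over categories instead of the (i, j) pair, the die-roll counts are hypothetical, and the critical value is the usual table value for 5 degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 60 rolls of a die; H0: the die is fair.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)  # 10 per face

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1

CHI2_CRITICAL = 11.070  # right-tail chi-square table value, alpha = 0.05, df = 5
decision = "reject H0" if chi2 > CHI2_CRITICAL else "retain H0"
print(chi2, df, decision)
```

Only large values of the statistic count against the null hypothesis, which is why the comparison uses the right tail.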
INTRODUCTION TO ANALYSIS OF VARIANCE (ANOVA)
The objective of ANOVA is to check simultaneously whether the population
means of more than two populations are different.
ANOVA stands for Analysis of Variance. It is a statistical method used to
analyze the differences between the means of two or more groups or
treatments.
It is often used to determine whether there are any statistically significant
differences between the means of different groups.
ANOVA is used to compare treatments, analyze the impact of factors on a
variable, or compare means across multiple groups.
Types of ANOVA include one-way (for comparing group means across the
levels of one factor) and two-way (for examining effects of two
independent variables on a dependent variable).
One-way analysis of variance (ANOVA) : It is a statistical method
for testing for differences in the means of three or more groups.
ANOVA also uses a null hypothesis and an alternative hypothesis.
The null hypothesis in ANOVA states that all the group (population)
means are equal, that is, they do not differ significantly.
The alternative hypothesis states that at least one of the group means
is different from the rest. In mathematical form, they can be
represented as:

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_k $$
$$ H_A: \text{not all } \mu_i \text{ are equal} $$

where μi is the mean of the i-th level of the factor.
Example of one-way ANOVA:
Suppose you are studying the effectiveness of three different drugs (Drug
A, Drug B, and Drug C) in reducing blood pressure. You randomly assign
90 patients to one of the three drug groups and measure their blood
pressure after one month of treatment. The blood pressure measurements
(in mmHg) for each patient are observed and prepared as a dataset.
In this dataset, each drug group represents a separate treatment or
condition, and the blood pressure measurements for each patient in that
group are recorded.
To analyze this dataset using ANOVA, you would compare the means of
the blood pressure measurements among the three drug groups to
determine if there is a statistically significant difference.
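The comparison described above can be sketched as a one-way ANOVA computed by hand. The readings are hypothetical and use three patients per drug instead of 30 to keep the example small; the names SSB, SSW and SST are the usual between-group, within-group and total sums of squares.

```python
from statistics import mean


def one_way_anova(groups):
    """Sums of squares and F-statistic for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group variation: group means around the grand mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: observations around their own group mean.
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    f_stat = (ssb / (k - 1)) / (ssw / (n - k))
    return {"SSB": ssb, "SSW": ssw, "SST": sst, "F": f_stat, "df": (k - 1, n - k)}


# Hypothetical blood pressure readings (mmHg) after one month of treatment.
drug_a = [120, 121, 122]
drug_b = [121, 122, 123]
drug_c = [124, 125, 126]
result = one_way_anova([drug_a, drug_b, drug_c])

F_CRITICAL = 5.14  # F table value for alpha = 0.05 with (2, 6) degrees of freedom
decision = "reject H0" if result["F"] > F_CRITICAL else "retain H0"
print(result, decision)
```

The total variation splits exactly into the between-group and within-group parts (SST = SSB + SSW), which is a useful check on the arithmetic.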
Two-Way ANOVA : The two-way ANOVA technique is used
when the data are classified based on two factors.
Ex: agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of the different varieties of
fertilizers used.
It is a statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable.
The two-way ANOVA test analyzes the effect of the independent variables
on the expected outcome along with their relationship to the
outcome itself.
Example of two-way ANOVA
Two-way (or two-factor) analysis of variance tests whether there is a
difference between more than two independent samples split between
two variables or factors.
A factor is, for example, the gender of a person with the characteristics
male and female, the form of therapy used for a disease with therapy A,
B and C, or the field of study with, for example, medicine, business
administration, psychology and math.
In addition to gender, the highest level of education also has an influence
on salary.
Besides therapy, gender also has an influence on blood pressure.
In addition to the field of study, the university attended also has an
influence on the duration of studies.
Now in all three cases you would not have one factor, but two factors
each. And since you now have two factors, you use the two-way
analysis of variance.
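As a rough sketch of how two factors are handled, the code below runs a two-way ANOVA on the seed and fertilizer setting mentioned earlier. It makes a simplifying assumption that is not stated in the notes: there is one observation per seed and fertilizer combination, so no interaction term is estimated. The yields are hypothetical.

```python
from statistics import mean


def two_way_anova_no_replication(table):
    """Two-way ANOVA with one observation per cell (no interaction term).
    Rows are levels of factor A, columns are levels of factor B."""
    r, c = len(table), len(table[0])
    grand = mean([x for row in table for x in row])
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols  # what the two factors leave unexplained
    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "F_rows": (ss_rows / df_rows) / ms_error,
        "F_cols": (ss_cols / df_cols) / ms_error,
        "df": (df_rows, df_cols, df_error),
    }


# Hypothetical yields: rows are seed varieties, columns are fertilizer types.
yields = [
    [10, 12, 14],
    [12, 13, 14],
    [14, 17, 20],
]
result = two_way_anova_no_replication(yields)

F_CRITICAL = 6.94  # F table value for alpha = 0.05 with (2, 4) degrees of freedom
seed_effect = result["F_rows"] > F_CRITICAL
fertilizer_effect = result["F_cols"] > F_CRITICAL
print(result, seed_effect, fertilizer_effect)
```

Each factor gets its own F-statistic, so the test can report separately whether seed variety and fertilizer type affect the yield.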
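Formulas of ANOVA:
Sum of Squares of Total Variation (SST):

$$ SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 $$

where x_ij is the j-th observation in group i, n_i is the number of observations in
group i, k is the number of groups, and x̄ is the overall (grand) mean.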
Dr. Rashmi M
Department of Computer Science,
GFGC T. Dasarahall.
DATA ANALYTICS UNIT 4 2024-2025
Netflix is a global leader in streaming entertainment, providing on-demand video content that
includes movies, TV shows, documentaries, and original programming. It has become a
transformative force in the entertainment industry, altering how audiences consume media.
Founded: 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California.
Headquarters: Los Gatos, California.
Key Milestones:
o 1998: Launched as a DVD rental service via mail.
o 2007: Transitioned to online streaming.
o 2013: Entered content production with Netflix Originals, starting with House of
Cards.
Netflix has grown from a DVD rental service to a global entertainment powerhouse, available in
over 190 countries with millions of subscribers.
Netflix began as a subscription-based DVD rental service, allowing users to select DVDs
online and receive them via mail.
Competitive Edge:
o No late fees.
o Flat monthly subscription rate.
o Large inventory of movies.
Transition to Streaming
2007: Netflix launched its streaming service, enabling users to watch content instantly
over the internet.
This move capitalized on advancements in broadband internet and changing consumer
preferences.
Original Content
2013: Netflix debuted its first original series, House of Cards, which marked its entry as a
content producer.
Since then, Netflix has heavily invested in original programming to differentiate itself
from competitors.
Current Model
3. Growth Strategy
Content Investment
Global Expansion
Netflix is available in over 190 countries, tailoring its content to local tastes.
Key initiatives:
o Subtitling and dubbing content in multiple languages.
o Producing localized content such as Sacred Games (India) and Dark (Germany).
Data-Driven Decision-Making
Partnerships
Cloud Infrastructure
Streaming Optimization
Mobile Strategy
5. Challenges
Intense Competition
Competes with platforms like Amazon Prime Video, Disney+, Hulu, HBO Max, and
regional players.
Rivals offer competitive pricing and exclusive content.
Rising Costs
Subscriber Saturation
Password Sharing
BU/GIMS/BCA-V SEM Sumathi G K
Regulatory Challenges
Local regulations on content and censorship can pose barriers in certain markets.
6. Impact
Cultural Influence
Industry Disruption
Global Content
Economic Contribution
Netflix’s Indian Originals include Sacred Games, Delhi Crime, and Lust Stories.
Focuses on stories resonating with Indian audiences.
Pricing Strategy
Challenges in India
Competing with well-established local platforms like Disney+ Hotstar, Zee5, and MX
Player.
Navigating strict regulations and censorship policies.
Success Metrics
8. Conclusion
Key Takeaways
Future Outlook
Overview
Amazon, founded by Jeff Bezos in 1994, began as an online bookstore and has since evolved
into one of the world’s largest multinational technology companies. It operates in e-commerce,
cloud computing, digital streaming, and artificial intelligence.
Amazon’s mission is "to be Earth's most customer-centric company," focusing on innovation and
operational excellence.
2. Business Model
Core Business Segments
1. E-Commerce:
o Operates a global online marketplace offering millions of products.
o Revenue streams include product sales, third-party seller fees, and advertising.
2. Amazon Web Services (AWS):
o Provides scalable cloud computing services.
o Core offerings: storage, computing power, AI tools, and machine learning.
3. Subscription Services:
o Amazon Prime: Offers free shipping, streaming services, and exclusive deals.
o Other subscriptions: Kindle Unlimited, Audible, and Amazon Music.
4. Hardware and Devices:
o Develops and sells devices like Kindle, Echo, Fire tablets, and Ring cameras.
5. Logistics and Delivery:
o Owns extensive warehousing and delivery networks, including Amazon Air and
Prime delivery services.
Revenue Streams
3. Growth Strategy
Customer-Centric Approach
Global Expansion
Technological Innovation
Diversification
Acquired companies like Whole Foods, Zappos, and MGM to diversify offerings.
Partnerships with logistics and delivery companies to enhance last-mile delivery.
Personalized Recommendations:
o Uses algorithms to suggest products and content based on user behavior.
AI-driven voice assistant Alexa powers smart home devices.
Cloud Computing
Digital Transformation
Integrates technology into every aspect of its business, from e-commerce platforms to
fulfillment centers.
5. Challenges
Regulatory Scrutiny
Workplace Practices
Intense Competition
Competes with Walmart, Alibaba, Microsoft Azure, Google Cloud, Netflix, and others.
Need to maintain leadership across diverse industries.
Regulatory and cultural challenges in certain markets like China and India.
6. Impact
Economic Contribution
Consumer Behavior
Revolutionized online shopping with fast delivery, ease of access, and product diversity.
Encouraged the growth of the subscription economy through Amazon Prime.
Industry Disruption
Sustainability Efforts
Launched services like Prime Video Mobile Edition for smartphone users.
Developed partnerships with Indian sellers and brands.
Challenges in India
Intense competition from Flipkart, Reliance JioMart, and local e-commerce platforms.
Regulatory hurdles regarding data localization and FDI norms.
Success Metrics
8. Conclusion
Key Takeaways
Future Outlook
Twitter is a microblogging and social networking platform that allows users to post and interact
through short messages known as "tweets." Since its launch, Twitter has become a significant
tool for communication, marketing, activism, and real-time information sharing.
Founded: March 21, 2006, by Jack Dorsey, Biz Stone, Evan Williams, and Noah Glass.
Headquarters: San Francisco, California.
Key Milestones:
o 2006: Initial launch as a microblogging platform.
o 2013: Listed as a public company on the New York Stock Exchange (NYSE).
o 2022: Acquired by Elon Musk, leading to significant operational changes.
Twitter’s mission is to "serve the public conversation," facilitating open and real-time exchange
of information globally.
2. Business Model
Core Revenue Streams
1. Advertising:
o Promoted tweets, trends, and accounts.
o Accounts for the majority of Twitter’s revenue.
2. Subscription Services:
o Twitter Blue: Offers features like verified badges, longer tweets, and edit options.
3. Data Licensing:
o Monetizes its vast database by selling access to public data (APIs) for research
and analysis.
User Base
3. Growth Strategy
Platform Features
Real-Time Engagement:
o Key differentiator: Real-time sharing of news, events, and public discourse.
o Popular for live events, breaking news, and trending topics.
New Features:
o Spaces: Live audio chat rooms.
o Threads: Organized long-form content.
o Communities: Groups with shared interests.
Global Expansion
Partnerships
Acquisitions
Acquired startups like Periscope (live streaming) and Revue (newsletter services) to
expand its feature set.
Uses machine learning to curate personalized timelines based on user preferences and
activity.
Trending topics are tailored by location and interests.
API Ecosystem
Provides APIs for developers and researchers to build tools, analyze data, and track
trends.
Content Moderation
Infrastructure
5. Challenges
Content Moderation
Competition
Competes with platforms like Facebook, Instagram, TikTok, and emerging decentralized
networks.
Profitability
User Retention
Faces challenges in retaining active users, especially with competition offering more
engaging content formats.
Regulatory Issues
6. Impact
Global Communication
Facilitates real-time communication during major events like natural disasters, protests,
and elections.
A platform for political discourse, activism, and citizen journalism.
Cultural Influence
Popularized hashtags, which have become a tool for movements like #MeToo and
#BlackLivesMatter.
Redefined how news breaks, with many organizations relying on Twitter for updates.
Economic Contributions
Societal Challenges
Used extensively to mobilize protests and share information during the Arab Spring.
Enabled activists to organize and communicate despite government censorship.
#BlackLivesMatter
Became a central platform for raising awareness of police brutality and racial injustice.
Amplified grassroots campaigns and public discussions globally.
Election Campaigns
8. Twitter in India
Localized Features
Political Engagement
9. Conclusion
Key Takeaways
Twitter’s strength lies in real-time communication and its role as a global public square.
Despite challenges, it has had a profound impact on communication, activism, and media.
Future Outlook
Uber is a global leader in ride-hailing, food delivery, and freight services, revolutionizing how
people and goods move. By leveraging technology, Uber connects drivers, riders, and businesses
seamlessly, creating a disruptive force in traditional transportation industries.
Founded: 2009 by Garrett Camp and Travis Kalanick in San Francisco, California.
Headquarters: San Francisco, California.
Key Milestones:
o 2010: Official launch in San Francisco.
o 2014: Expanded into international markets.
o 2020: Acquired Postmates to enhance its food delivery services.
2. Business Model
Core Services
1. Ride-Hailing:
o On-demand rides through mobile apps.
o Options include UberX, Uber Pool, Uber Comfort, and Uber Black.
2. Uber Eats:
o Food delivery service connecting restaurants, couriers, and customers.
3. Uber Freight:
o Matches trucking companies with shippers, optimizing logistics.
4. Other Ventures:
o Micro-mobility options like e-scooters and bikes.
o Partnerships in autonomous vehicle research.
Revenue Streams
Platform Ecosystem
3. Growth Strategy
Global Expansion
Entered markets across six continents by adapting to local regulations and consumer
preferences.
Focused on partnerships and acquisitions to accelerate growth.
Technology Innovation
Diversification
Dynamic Pricing
Surge pricing adjusts ride costs based on demand and supply, maximizing driver
availability.
Autonomous Vehicles
Data Analytics
Uses data to predict demand patterns, optimize routes, and enhance user experiences.
5. Challenges
Regulatory Hurdles
Faced bans and restrictions in markets like London, Germany, and India.
Classified drivers as independent contractors, leading to legal disputes over labor rights.
Workforce Issues
Criticized for treatment of drivers, including low pay and lack of benefits.
Unionization efforts and strikes in various countries.
Competition
Competes with local ride-hailing apps (e.g., Ola, Grab) and global players like Lyft.
Profitability
Struggled to achieve consistent profitability due to high operating costs and subsidies.
6. Impact
Economic Contribution
Social Challenges
Adapted to Indian market needs with options like Uber Auto (rickshaws) and cash
payments.
Launched regional language support in the app.
Partnerships
Challenges in India
Success Metrics
8. Conclusion
Key Takeaways
Uber’s success is built on leveraging technology and a flexible business model to disrupt
traditional industries.
Its ability to adapt and innovate has enabled global expansion despite challenges.
Future Outlook
LinkedIn is the world’s largest professional networking platform, enabling individuals and
businesses to connect, share, and grow their professional networks. It has become a vital tool for
career development, recruitment, and professional content sharing.
Founded: December 2002 by Reid Hoffman, Allen Blue, Konstantin Guericke, Eric Ly,
and Jean-Luc Vaillant.
Launched: May 5, 2003.
Headquarters: Sunnyvale, California.
Ownership: Acquired by Microsoft in 2016 for $26.2 billion.
LinkedIn’s mission is "to connect the world’s professionals to make them more productive and
successful."
2. Business Model
Core Offerings
1. Networking:
o Allows professionals to build connections, share updates, and collaborate.
Revenue Streams
3. Growth Strategy
User Growth
Grew from 4,500 members at launch to over 950 million users globally as of 2024.
Strong presence in developed and emerging markets.
Product Diversification
Expanded services from networking to include e-learning, job boards, and marketing
solutions.
Continuous feature updates, like video posts, events, and newsletters.
Global Expansion
Microsoft Integration
Integration with Microsoft products like Office 365 and Dynamics CRM enhances
LinkedIn’s utility for professionals and enterprises.
5. Challenges
Data Privacy and Security
Market Competition
Competes with platforms like Indeed, Glassdoor, and emerging niche networks.
Must continuously innovate to maintain its leadership in professional networking.
Engagement Levels
Balancing between being a job search platform and a professional content-sharing space.
Challenges in retaining active user engagement outside job-seeking phases.
Regulatory Compliance
6. Impact
Economic Impact
Professional Development
Business Growth
Global Connectivity
Employer Branding
Success Metrics
8. LinkedIn in India
Localized Features
Major platform for Indian professionals across IT, finance, and consulting sectors.
Growing presence among small and medium enterprises (SMEs) for recruitment.
Challenges in India
9. Conclusion
Key Takeaways
LinkedIn’s success stems from its ability to evolve with professional needs and leverage
data effectively.
Its diversification into learning and marketing solutions has solidified its position as more
than just a networking platform.
Future Outlook
COVID-19, caused by the novel coronavirus SARS-CoV-2, emerged as one of the most
significant global health crises of the 21st century. First identified in December 2019 in Wuhan,
China, the virus rapidly spread worldwide, resulting in widespread illness, economic disruption,
and unprecedented global responses.
COVID-19 underscored the importance of healthcare systems, global collaboration, and adaptive
responses to crises.
2. Epidemiology
Transmission
Symptoms
Variants
Mutations led to the emergence of variants like Alpha, Delta, and Omicron, each with
varying transmissibility and severity.
3. Global Response
Containment Measures
Healthcare Systems
Vaccination Campaigns
4. Economic Impact
Global Recession
Increased anxiety, depression, and stress due to isolation, fear, and economic uncertainty.
Rise in domestic violence and substance abuse.
Education
Community Resilience
Rise of community support initiatives like food distribution and mental health hotlines.
Strengthened focus on public health awareness.
Digital Transformation
Data-Driven Responses
Use of AI and big data for predicting outbreaks and managing resources.
Real-time dashboards for tracking cases (e.g., Johns Hopkins University COVID-19
tracker).
Challenges
Successes
8. Lessons Learned
Preparedness
Global Collaboration
9. Conclusion
Key Takeaways
COVID-19 was a watershed moment for public health, global cooperation, and societal
resilience.
Highlighted vulnerabilities in systems while driving innovation and change.
Future Outlook