
Session 1 - 2

1. The document discusses two approaches to data analysis - statistical/data models and data mining/algorithmic models. It notes their differences in emphasis on statistical inference versus predictive accuracy. 2. It also covers topics like multivariate analysis, types of data measurement scales, hypothesis testing techniques like z-test, t-test, and their appropriate uses. Examples include testing processing times of passport applications and comparing mutual fund returns through direct purchase versus brokers. 3. Key differences between statistical and data mining models, challenges in choosing appropriate modeling approaches, and importance of data type and measurement scale in determining the right multivariate technique are summarized.


Advanced Methods of Data Analysis

Session 1 – 2

Program: PT MBA
Trim: V
Instructor: Dr. Abhinav Sharma
Two Different Approaches to Data Analysis

Statistical/data models – analysis where a specific model is proposed (e.g., dependent and independent variables to be analyzed by the general linear model), the model is then estimated and a statistical inference is made as to its generalizability to the population through statistical tests.

Data mining/algorithmic models – models based on algorithms (e.g., neural networks, decision trees, support vector machine). Their emphasis is on predictive accuracy rather than statistical inference and explanation.

Advanced Methods of Data Analysis, Dr. A.K. Sharma, NMIMS Mumbai 2


Two Different Approaches to Data Analysis

• There is no “best” approach; each has strengths and weaknesses.

• Analysts today must assess each research situation and identify the best modeling approach for that specific situation (i.e., objective, data, etc.).



Is data analysis everything?
• American Express - Default Prediction: Predict if a customer will default in the future!



What is Multivariate Analysis?

• All statistical techniques that simultaneously analyze multiple measurements on individuals or objects under investigation. Thus, any simultaneous analysis of more than two variables can be loosely considered multivariate analysis.

• Many multivariate techniques are extensions of univariate procedures
  • ANOVA → MANOVA

• Many other techniques are uniquely multivariate
  • Factor analysis, cluster analysis, discriminant analysis



Types of Data and Measurement Scales

Data
• Nonmetric or Qualitative
  • Nominal Scale
  • Ordinal Scale
• Metric or Quantitative
  • Interval Scale
  • Ratio Scale

NOTE: The level of measurement is critical in determining the appropriate multivariate technique to use!
A multivariate data problem

1. How to distinguish between a forged and a genuine bill?

2. Attributes?

3. Compare the bills!



A multivariate data problem

x1: length of bill
x2: width of bill, measured on left
x3: width of bill, measured on right
x4: width of margin at the bottom
x5: width of margin at top
x6: length of image diagonal



Mahalanobis Distance
The Mahalanobis distance between the centroid $\bar{x}$ and data point $x_i$ is given as:

$$MD_i = (x_i - \bar{x})\, S_x^{-1}\, (x_i - \bar{x})^T$$
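
The formula above can be sketched in a few lines of NumPy. This is an illustration only: the helper name `mahalanobis_sq` and the six two-attribute measurements are made up here and are not the bill data from the slides. The quadratic form is computed exactly as written; some texts report its square root as the distance instead.

```python
import numpy as np

def mahalanobis_sq(X):
    """Return (x_i - x_bar) S^{-1} (x_i - x_bar)^T for every row x_i of X."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)        # x_i - x_bar, with x_bar the centroid
    S = np.cov(X, rowvar=False)          # sample covariance matrix S_x
    S_inv = np.linalg.inv(S)
    # Row-wise quadratic form: one value per observation
    return np.einsum("ij,jk,ik->i", centered, S_inv, centered)

# Made-up measurements (rows = observations, columns = two attributes).
X = np.array([
    [214.8, 131.0],
    [214.6, 129.7],
    [214.8, 129.7],
    [215.0, 130.4],
    [214.4, 130.1],
    [216.5, 128.0],
])

md = mahalanobis_sq(X)
print(np.round(md, 2))
print("most unusual observation:", int(np.argmax(md)))
```

Observations with a large value lie far from the centroid once the correlation between attributes is taken into account, which is why the measure is useful for flagging unusual cases.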



Hypothesis Testing

People who did their MBA during 2019-2021 (online mode) are the most intelligent, as the average grade was 3.76/4 with a standard deviation of 0.04.

Really?



Hypothesis Testing

If you drink Horlicks, you can grow taller, stronger and sharper (3 in 1).

Wearing deodorant makes you attractive to the opposite gender (known as the Axe effect).



Hypothesis Testing

Women take more selfies compared to men.

Smokers are better salespeople.



Hypothesis Testing
Setting up hypothesis:
Step 1: Describe the hypothesis in words. A few examples of hypotheses are:
(a) The average time spent by women using social media is greater than that spent by men.
(b) On average, women upload more photos on social media than men.
(c) Customers of mobile phone service providers with more than one mobile handset are more likely to churn.
(d) The average mortality rate due to coronavirus is higher for males than for females.

Step 2: Based on the claim made in Step 1, define the null and alternative hypotheses. Initially, we believe that the null hypothesis is true.



Hypothesis Testing
Setting up hypothesis:
What will be the null and alternate hypotheses for the claim ‘women use social media more than men’?

Null: There is no relationship between gender and average time spent on social media.
Alternate: There is a relationship between gender and average time spent on social media.

Remember: Null and alternative hypotheses are defined using a population parameter.
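
Because the claim is directional, it can also be written as a one-sided pair of hypotheses on population means. This is a sketch under an assumption: the symbols $\mu_W$ and $\mu_M$ (mean time spent on social media by women and by men) are introduced here and do not appear in the slides.

$$H_0: \mu_W \le \mu_M \qquad \text{versus} \qquad H_A: \mu_W > \mu_M$$

The null and alternate statements given above correspond to the two-sided version, $H_0: \mu_W = \mu_M$ against $H_A: \mu_W \ne \mu_M$.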



Hypothesis Testing
Setting up hypothesis:
Step 3: Identify the test to be used.
Step 4: Determine the p-value and compare it with the level of significance.
Step 5: Take the decision to reject or retain the null hypothesis.

Remember:
1. If the p-value is low, the null must go.
2. The alternate hypothesis is the one of interest to us, and we believe it to be true.
3. The null hypothesis is the claim that is assumed to be true initially. That is, at the beginning we assume that the null hypothesis is true and retain it unless there is strong evidence against it.



1-Sample Z Test
When can it be used?
1. When the sample size is large.
2. When the population standard deviation is known.

A passport office claims that passport applications are processed within 30 days of submitting the application form along with the necessary documents. A sample of the processing times of 40 applications is provided in the Excel file ‘Passport time.xlsx’. The population standard deviation of the processing time is 12.5. Verify the claim made by the passport office.
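
A minimal sketch of the test using only the Python standard library follows. Several things here are assumptions rather than part of the slides: the 40 processing times are made up (they are not the contents of ‘Passport time.xlsx’), the significance level is taken as 0.05, and the office's claim is placed in the alternative hypothesis (H0: mean ≥ 30, HA: mean < 30). The helper `one_sample_z` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z(sample, mu0, sigma, alternative="two-sided"):
    """Return (z, p) for H0: mu = mu0 when the population sigma is known."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)            # standard normal CDF at z
    if alternative == "greater":
        p = 1 - cdf
    elif alternative == "less":
        p = cdf
    else:
        p = 2 * min(cdf, 1 - cdf)
    return z, p

# Made-up processing times in days (NOT the data from 'Passport time.xlsx').
times = [
    16, 16, 30, 37, 25, 22, 19, 35, 27, 32,
    34, 28, 24, 35, 24, 21, 32, 29, 24, 35,
    28, 29, 18, 31, 28, 33, 32, 24, 25, 22,
    21, 27, 41, 23, 23, 16, 24, 38, 26, 28,
]

alpha = 0.05                             # assumed level of significance
z, p = one_sample_z(times, mu0=30, sigma=12.5, alternative="less")
decision = "reject H0" if p < alpha else "retain H0"
print(f"z = {z:.3f}, p = {p:.4f}: {decision}")
```

With the real data, only the `times` list would change; the `alternative` argument controls which tail of the normal distribution supplies the p-value.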



1-Sample t-Test
When can it be used?
• When the population standard deviation is unknown

Aravind Productions (AP) is a newly formed movie production house based out of Mumbai, India. AP was interested in understanding the production cost of producing a Bollywood movie. The industry believes the production house will require at least Rs.500 million (50 crore) on average. It is assumed that Bollywood movie production costs follow a normal distribution. The production costs of 40 Bollywood movies in millions of rupees are given in the Excel file ‘Movie budget.xlsx’. Conduct an appropriate hypothesis test at alpha = 0.05 to check whether the belief about average production cost is correct.
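
One way to run this test is with SciPy's `ttest_1samp`, sketched below. The twelve cost figures are made up for illustration (they are not the 40 values in ‘Movie budget.xlsx’). Placing the industry belief in the alternative (HA: mean > 500) is an assumption that follows the convention stated earlier in the slides; switch `alternative` if the hypotheses are set up the other way.

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

# Made-up production costs in millions of rupees (NOT 'Movie budget.xlsx').
costs = [520, 610, 455, 498, 720, 385, 560, 640, 430, 515, 590, 475]

alpha = 0.05
# H0: mu <= 500 versus HA: mu > 500 (assumed direction, see lead-in)
result = stats.ttest_1samp(costs, popmean=500, alternative="greater")

# The same statistic by hand: sample standard deviation replaces sigma
t_manual = (mean(costs) - 500) / (stdev(costs) / sqrt(len(costs)))

decision = "reject H0" if result.pvalue < alpha else "retain H0"
print(f"t = {result.statistic:.3f} (manual {t_manual:.3f}), "
      f"p = {result.pvalue:.4f}: {decision}")
```

The only difference from the z-test is that the standard deviation is estimated from the sample, so the statistic is compared against a t distribution with n − 1 degrees of freedom.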



2-Sample t-Test
When can it be used?
• When comparing two independent samples

Millions of investors buy mutual funds, choosing from thousands of possibilities. Some funds can be purchased directly from banks or other financial institutions, whereas others must be purchased through brokers, who charge a fee for this service. This raises the question: can investors do better by buying mutual funds directly than by purchasing mutual funds through brokers? To help answer this question, a group of researchers randomly sampled the annual returns from mutual funds that can be acquired directly and mutual funds that are bought through brokers and recorded the net annual returns, which are the returns on investment after deducting all relevant fees. These are given in the Excel file ‘Mutual fund 2 sample t.xlsx’.
Remember to test for equality of variances.
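
The two-step procedure (check variances, then compare means) could look like the sketch below. Assumptions to note: the returns are made-up numbers, not the contents of ‘Mutual fund 2 sample t.xlsx’; Levene's test is used as one common variance check because the slides do not say which test is intended; and `two_sample_t` is a hypothetical helper name.

```python
from scipy import stats

def two_sample_t(a, b, alpha=0.05, alternative="greater"):
    """Levene's test for equal variances, then a pooled or Welch t-test."""
    levene_p = stats.levene(a, b).pvalue
    # Retain the equal-variance assumption unless Levene's test rejects it
    equal_var = bool(levene_p > alpha)
    res = stats.ttest_ind(a, b, equal_var=equal_var, alternative=alternative)
    return levene_p, equal_var, res.statistic, res.pvalue

# Made-up net annual returns in percent (illustration only).
direct = [9.3, 6.5, 4.4, 7.9, 11.2, 3.8, 8.6, 5.1, 10.4, 6.9]
broker = [3.2, 6.4, 10.6, -2.1, 7.8, 4.9, 1.5, 8.3, 5.7, 0.4]

# HA: mean return of direct funds > mean return of broker funds
levene_p, equal_var, t, p = two_sample_t(direct, broker)
print(f"Levene p = {levene_p:.4f}, equal variances assumed: {equal_var}")
print(f"t = {t:.3f}, one-sided p = {p:.4f}")
```

When the variances differ, `equal_var=False` makes `ttest_ind` perform Welch's t-test, which does not pool the two sample variances.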



Paired t-Test
When can it be used?
• In a paired t-test, data related to a parameter is captured twice from the same subject.

Data related to alcohol consumption before and after a break-up is provided in the Excel file ‘Alcohol consumption.xlsx’. Conduct an appropriate test to check whether alcohol consumption is higher after the break-up.
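
A paired test works on the within-subject differences, as the sketch below shows with SciPy's `ttest_rel`. The before/after values and their unit are made up for illustration and are not taken from ‘Alcohol consumption.xlsx’.

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

# Made-up consumption figures for the same 10 subjects (hypothetical units).
before = [4, 6, 2, 5, 3, 7, 4, 5, 1, 6]
after = [6, 7, 2, 8, 5, 7, 6, 4, 3, 9]

alpha = 0.05
# H0: mean difference (after - before) <= 0 versus HA: mean difference > 0
res = stats.ttest_rel(after, before, alternative="greater")

# Equivalent view: a one-sample t-test on the paired differences against 0
diffs = [a - b for a, b in zip(after, before)]
t_manual = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

decision = "reject H0" if res.pvalue < alpha else "retain H0"
print(f"t = {res.statistic:.3f} (manual {t_manual:.3f}), "
      f"p = {res.pvalue:.4f}: {decision}")
```

Treating the two columns as independent samples would ignore the pairing and usually lose power, which is why the differences are analyzed instead.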

