Slides Sessions 1-2


Statistics and

Data Analysis
BBA & BBA-BIR
Academic Year 2023-2024
Prof. Maud Pindard-Lejarraga

1
SESSION 1
Introduction to the course
and to its methodology

2
AGENDA FOR TODAY

1. Introducing ourselves
2. Syllabus: structure and rules
3. What is statistics? An overview of the course

3
1. INTRODUCING OURSELVES

4
Maud Pindard-Lejarraga Office 22.19
https://www.linkedin.com/maud-pindard-lejarraga

Before 2003: Born in Paris. Bachelors in Hispanic Studies (U. Paris 3) and Econometrics (ENS Ulm and U. Paris 1). Bachelor thesis in Economics of Organizations (on football club OL).

2003 & 2004: Moved to Madrid. French Language Assistant at UC3M. Economic Advisor (in macroeconomics) at the French Embassy in Madrid.

2004-2010: Master and Ph.D. in Business Administration and Quantitative Methods (UC3M). Visiting researcher at Rotman School of Management (U. of Toronto) and HEC Paris.

Since 2010: Full-time professor in Strategy (most of my teaching is Statistics) and collective advisor at IE University. Researcher in organization theory and entrepreneurship (quantitative research).

5
What about you?

6
2. SYLLABUS
Structure and rules

7
Structure of the course
35 sessions, of which:
• 20 plenary sessions
• 15 ‘lab’ sessions (although most sessions are practical)
• Labs are held in half-groups; I will upload the calendar later today or
tomorrow
• If you need to be in a specific group for justified reasons, let
me know TODAY (send me an email or a message through
Blackboard)

8
Rules of the IE game
• Be on time → if you’re more than 5 minutes late: absence
• Do not leave the room in the middle of the session (if you have
real urgent reasons, let me know in advance) → absence
• Refrain from side conversations
• Use your laptop for course-related purposes
• Do not use your phone
• Do not eat in the classroom

IE attendance policy: We seek 100% attendance. If you miss more than
20% of the sessions, you fail the course in both the May and June calls
and need to re-enroll next academic year
9
Rules of the game in my course
• I consider you adults and expect you to behave accordingly
• The less I must enforce rules, the better
• Try not to distract me
• Participation: YES!
• Teamwork: YES!
• “Free Riders”: NO!
• “Cold Calling”: YES

• THE COURSE IS NOT DIFFICULT IF YOU WORK REGULARLY:


• Most of the course work is done in class, stay on top of it, review
what you did not understand the same day or week, request office
hours if you still don’t understand
• Trying to catch up in April IS hard

10
Evaluation system
• Final Exam: 40% of the final grade. You need a minimum
grade of 3.5/10 in the final exam to pass the course. Date: May 20th
• Group project: 25% (includes peer grading to prevent free-riding)
• Individual Assignments: 20% (quizzes, exercises, etc…)
• Participation: 15% (ACTIVE participation)
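
As a rough illustration of how these weights combine, the sketch below computes a weighted final grade in Python. The function name, the example marks and the assumption that every component is graded out of 10 are illustrative only; the syllabus remains the authoritative source.

```python
# Weights taken from the evaluation system above.
WEIGHTS = {
    "final_exam": 0.40,
    "group_project": 0.25,
    "individual_assignments": 0.20,
    "participation": 0.15,
}
MIN_FINAL_EXAM = 3.5  # minimum final-exam grade (out of 10) required to pass


def final_grade(marks):
    """Return (weighted grade, whether the exam minimum is met).

    `marks` maps each component to a grade out of 10 (assumed scale).
    """
    weighted = sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)
    meets_exam_minimum = marks["final_exam"] >= MIN_FINAL_EXAM
    return round(weighted, 2), meets_exam_minimum


# Hypothetical marks, for illustration only.
example = {
    "final_exam": 6.0,
    "group_project": 8.0,
    "individual_assignments": 7.0,
    "participation": 9.0,
}
grade, exam_ok = final_grade(example)
print(grade, exam_ok)
```

Note that a high weighted grade does not help if the final exam is below 3.5: the second value returned would then be False.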

• More info in the syllabus

Do you have any questions?

11
Group Project
Identification of a real-world problem, taken
from business or any other field of interest:
problem-solving will entail the selection of a
database (for example on Kaggle), the
statistical analysis of the data, and the final
interpretation of the obtained results.
Further instructions will be given around
mid-course
80% of the project grade is awarded for the project
itself and is the same for all group
members; 20% is based on peer
evaluation of your involvement in the group
work.
Deadline: May 19th, 7pm

12
Teamwork
• A team is a small number of
people with
complementary skills who
are committed to a
common purpose, set of
performance goals, and
approach for which they
hold themselves mutually
accountable
Participation

• Passive attendance DOES NOT count as participation
• Participation includes
working on the exercises in
class, not only raising your
hands
• I will need you to connect to
Zoom in ALL sessions
• Grade updated weekly

14
Materials

• Optional textbooks, for the concepts, additional data and exercises:
“An Introduction to Statistics with Python”, Thomas Haslwanter, 2016,
Springer (I recommend the electronic version), and OpenIntro Statistics,
4th edition (free PDF at www.openintro.org)
• Additional contents on Blackboard

15
Why Python?

General-purpose language
Free, elegant and powerful
In this course, we will use an environment
that is included in the Google suite,
Colaboratory: no need to configure, install,
or set up anything; compatible with Mac and
Windows; one click to set in English
You will be given many examples of code.
Your task is to understand when we use it
and what to change
This is NOT a programming course: you
only need basic Python “reading and writing”
skills

16
Remember

17
3. WHAT IS STATISTICS?
An overview of the course

18
Statistics and Data Analysis

Introduction

Session 1: General Concepts

19
Session Goals
Define Statistics and their uses
Explain how decisions are often based on
incomplete information
Explain key definitions:
• Population vs. Sample
• Parameter vs. Statistic
• Descriptive vs. Inferential Statistics

Describe random sampling

20
Why do we study Statistics?

• Statistics are about data and numbers
• Etymologically, statistics are
numbers about the State or
our community. They are
data about us, “on average”,
as a society
• We are surrounded by data, and the
volume is growing exponentially: 90%
of all the data in the world
has been generated in the
past 2 to 5 years (calculations
vary based on methodology).

21
22
How do we interpret data?
• We live in an era of post-truth: the more
data we have, the more we distrust them.

• Distrust in official data covers areas such as
vaccine efficacy and coverage or economic
statistics (e.g. inflation rate, unemployment)

• We need to learn to question data and
numbers, to know what we can or cannot
believe. If we understand where the data
come from, how they have been collected and
generated, and whether they are correctly
interpreted, we’ll be a step closer to that goal
23
Dealing with
Uncertainty
Everyday decisions are based on incomplete
information. Because of uncertainty, we cannot
make such statements as:
The job market will be strong when I graduate
Interest rates will increase for the rest of the year
My energy bills will be stable in 2024

Data are used to assist decision making:

Statistics is a tool to help us process,
summarize, analyze and interpret data for
the purpose of making better decisions in
an uncertain environment
The main skill for you to acquire is “statistical
literacy”
24
A Case Study

Objective: Evaluate the effectiveness of
cognitive-behavior therapy for treating
chronic fatigue syndrome.
Participant pool: 142 patients who were
recruited from referrals by primary care
physicians and consultants to a hospital
clinic specializing in chronic fatigue
syndrome.
Actual participants: Only 60 of the 142
referred patients entered the study. Some
were excluded because they didn’t meet
the diagnostic criteria, some had other
health issues, and some refused to be a
part of the study.
Source: Deale et al. Cognitive behavior therapy for chronic fatigue
syndrome: A randomized controlled trial. The American Journal of
Psychiatry 154.3 (1997).

25
Study Design

Patients were randomly assigned to treatment
and control groups, 30 patients in each
group:
Treatment: Cognitive behavior therapy:
collaborative, educative, and with a
behavioral emphasis. Patients were shown
how activity could be increased steadily
and safely without exacerbating
symptoms.
Control: Relaxation: no advice was given
about how activity could be increased.
Instead, progressive muscle relaxation,
visualization, and rapid relaxation skills
were taught.
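
To make the idea of random assignment concrete, here is a minimal sketch of how 60 participants could be split at random into two groups of 30. The patient identifiers, the seed and the shuffle-and-split approach are assumptions for illustration; the slides do not say how the original study carried out its randomization.

```python
import random

random.seed(2024)  # fixed seed so the illustration is reproducible

# Hypothetical patient identifiers 1..60 (the study had 60 participants).
patients = list(range(1, 61))

# Shuffle, then split in half: every patient has the same chance of
# ending up in either group.
shuffled = patients[:]
random.shuffle(shuffled)
treatment = sorted(shuffled[:30])
control = sorted(shuffled[30:])

print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance alone decides who goes where, the two groups should be similar on average in everything except the treatment they receive.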

26
Results
The table below shows the distribution of patients with
good outcomes at 6-month follow-up.
Note that 7 patients dropped out of the study: 3 from the
treatment and 4 from the control group.

Proportion with good outcomes in treatment group: 19/27 ≈ 70%
Proportion with good outcomes in control group: 5/26 ≈ 19%
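
The two proportions can be checked with a few lines of Python. The sketch below uses only the counts stated on this slide (19 of 27 and 5 of 26); the "not good" counts are derived by subtraction, and the variable names are illustrative.

```python
# Counts stated on the slide: 19 of 27 (treatment) and 5 of 26 (control)
# had a good outcome at the 6-month follow-up.
results = {
    "Treatment": {"good": 19, "total": 27},
    "Control": {"good": 5, "total": 26},
}

print(f"{'Group':<10}{'Good':>6}{'Not good':>10}{'Total':>7}{'% good':>8}")
proportions = {}
for group, counts in results.items():
    not_good = counts["total"] - counts["good"]  # derived, not on the slide
    proportions[group] = counts["good"] / counts["total"]
    print(f"{group:<10}{counts['good']:>6}{not_good:>10}"
          f"{counts['total']:>7}{proportions[group]:>8.0%}")

difference = proportions["Treatment"] - proportions["Control"]
print(f"Difference: {difference * 100:.0f} percentage points")
```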

27
Understanding the results

Do the data show a “real” difference
between the groups?
Suppose you flip a coin 100 times. While
the chance a coin lands heads in any given
coin flip is 50%, we probably won’t observe
exactly 50 heads. This type of fluctuation is
part of almost any type of data generating
process.
The observed difference between the two
groups (70% - 19% = 51 percentage points) may
be real or may be due to natural variation.
Since the difference is quite large, it is
more believable that the difference is real.
We need statistical tools to determine if the
difference is so large that we should reject
the notion that it was due to chance.
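
Both ideas on this slide can be explored numerically. The sketch below first computes the exact chance of getting exactly 50 heads in 100 fair flips, then runs a simple randomization (shuffling) test on the case-study counts. The randomization test is one possible tool, chosen here for illustration and not necessarily the method used later in the course; the seed and the number of simulations are arbitrary assumptions.

```python
import math
import random

# Exact probability of exactly 50 heads in 100 fair coin flips.
p_exactly_50 = math.comb(100, 50) / 2**100
print(f"P(exactly 50 heads) = {p_exactly_50:.4f}")

# Randomization test for the case study (counts from the Results slide).
random.seed(1)
outcomes = [1] * 24 + [0] * 29   # 24 good outcomes among 53 patients
n_treatment, n_control = 27, 26
observed = 19 / n_treatment - 5 / n_control

n_simulations = 10_000
as_extreme = 0
for _ in range(n_simulations):
    # If the treatment had no effect, group labels would be arbitrary,
    # so we reshuffle outcomes across the two groups.
    random.shuffle(outcomes)
    good_t = sum(outcomes[:n_treatment])
    good_c = sum(outcomes[n_treatment:])
    simulated = good_t / n_treatment - good_c / n_control
    if simulated >= observed - 1e-12:
        as_extreme += 1

p_value = as_extreme / n_simulations
print(f"Observed difference: {observed:.3f}")
print(f"Share of shuffles at least as large: {p_value:.4f}")
```

If chance alone rarely produces a difference as large as the observed one, the "due to chance" explanation becomes hard to believe.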

28
Key Definitions

• A population is the collection of all items of
interest or under investigation
• N represents the population size
• A sample is an observed subset of the
population
• n represents the sample size

• A parameter is a specific characteristic of a
population
• A statistic is a specific characteristic of a
sample
Population vs. Sample

Population: values calculated using population data are called parameters
Sample: values computed from sample data are called statistics
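
A small Python sketch can make the distinction tangible. The population below is simulated (hypothetical heights in cm), and the names `mu` and `x_bar` are illustrative labels for the population mean (a parameter) and the sample mean (a statistic).

```python
import random
import statistics

random.seed(42)

# Hypothetical population: heights (cm) of N = 10,000 individuals.
N = 10_000
population = [random.gauss(170, 10) for _ in range(N)]

# Parameter: computed from the whole population.
mu = statistics.mean(population)

# Statistic: computed from a sample of size n.
n = 50
sample = random.sample(population, n)
x_bar = statistics.mean(sample)

print(f"N = {N}, population mean (parameter) = {mu:.2f}")
print(f"n = {n}, sample mean (statistic)     = {x_bar:.2f}")
```

In practice the parameter is usually unknown, and the statistic is what we can actually compute.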
Examples of Populations

All registered voters in Spain
All families living in Chamartin district
All stocks traded on the New York
Stock Exchange
All genetically modified crops in the
European Union
All students at IE University

31
Random Sampling
Simple random sampling is a
procedure in which
• each member of the population is chosen
strictly by chance,
• each member of the population is equally
likely to be chosen,
and
• every possible sample of n objects is
equally likely to be chosen
The resulting sample is called a random
sample
We often use non-random sampling
(for a variety of reasons).
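
As a sketch of what simple random sampling looks like in code, Python's standard `random.sample` draws n distinct members without replacement, with every subset of size n equally likely. The tiny population and the number of repetitions below are assumptions chosen so the check runs quickly.

```python
import random
from collections import Counter

random.seed(7)

population = list(range(20))   # hypothetical population, N = 20
n = 5                          # sample size
n_draws = 20_000

# random.sample draws without replacement; every subset of size n
# is equally likely, which is what simple random sampling requires.
counts = Counter()
for _ in range(n_draws):
    counts.update(random.sample(population, n))

# Each member should appear in about n / N = 25% of the samples.
for member in population:
    print(member, round(counts[member] / n_draws, 3))
```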
32
Descriptive and Inferential Statistics

Two branches of statistics:
• Descriptive Statistics (“Data Analysis”)
– Collecting, summarizing, and processing data to transform
data into information
• Inferential Statistics
– Provides the bases for predictions, forecasts, and estimates
that are used to transform information into knowledge

Inference is the process of drawing conclusions or
making decisions about a population based on
sample results
In order to move from descriptive statistics to inferential statistics,
we need to understand probability rules and distributions
Data Collection
Anecdotal evidence and early smoking
research
Anti-smoking research started in the 1930s
and 1940s when cigarette smoking became
increasingly popular. While some smokers
seemed to be sensitive to cigarette smoke,
others were completely unaffected.
Anti-smoking research was faced with
resistance based on anecdotal evidence such
as “My uncle smokes three packs a day and
he’s in perfectly good health”, evidence
based on a limited sample size that might
not be representative of the population.
It was concluded that “smoking is a complex
human behavior, by its nature difficult to
study, confounded by human variability.”
In time researchers were able to examine
larger samples of cases (smokers), and trends
showing that smoking has negative health
impacts became much clearer.

Brandt, The Cigarette Century (2009), Basic Books


34
Census
Wouldn’t it be better to just include everyone and “sample” the
entire population?
This is called a census.
There are problems with taking a census:
It can be difficult to complete a census: there always
seem to be some individuals who are hard to locate or
hard to measure. And these difficult-to-find people may
have certain characteristics that distinguish them from
the rest of the population.
Populations rarely stand still. Even if you could take a
census, the population changes constantly, so it’s never
possible to get a perfect measure.
Taking a census may be more complex than sampling.

35
From exploratory analysis to inference
Sampling is natural.
Think about sampling something you are
cooking - you taste (examine) a small part of
what you’re cooking to get an idea about the
dish as a whole.
When you taste a spoonful of soup and decide
the spoonful you tasted isn’t salty enough, that’s
exploratory analysis.
If you generalize and conclude that your entire
soup needs salt, that’s an inference.
For your inference to be valid, the spoonful you
tasted (the sample) needs to be representative of
the entire pot (the population).
If your spoonful comes only from the surface and
the salt is collected at the bottom of the pot, what
you tasted is probably not representative of the
whole pot.
If you first stir the soup thoroughly before you taste,
your spoonful will more likely be representative of
the whole pot.
36
Sampling bias

• Large samples are better but…
• Sampling bias can yield wrong
results even with a huge sample
size
– Back to the soup analogy: If
the soup is not well stirred, it
doesn’t matter how large a
spoon you have, it will still
not taste right. If the soup is
well stirred, a small spoon
will suffice to test the soup.
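
The soup analogy can be simulated. In the sketch below, the "pot" is a made-up population in which salt has settled at the bottom; a huge sample taken only from the surface is compared with a small simple random sample from the whole pot. All numbers are assumptions for illustration.

```python
import random
import statistics

random.seed(3)

# Hypothetical "pot": salt concentration is low near the surface
# and high near the bottom because the soup was not stirred.
surface = [random.gauss(1.0, 0.2) for _ in range(50_000)]
bottom = [random.gauss(3.0, 0.2) for _ in range(50_000)]
pot = surface + bottom
true_mean = statistics.mean(pot)

# Huge but biased sample: 20,000 values taken from the surface only.
big_biased = random.sample(surface, 20_000)

# Small simple random sample: 100 values from the whole pot.
small_random = random.sample(pot, 100)

biased_error = abs(statistics.mean(big_biased) - true_mean)
random_error = abs(statistics.mean(small_random) - true_mean)
print(f"True mean:            {true_mean:.2f}")
print(f"Biased, n = 20,000:   error {biased_error:.2f}")
print(f"Random, n = 100:      error {random_error:.2f}")
```

The biased sample is 200 times larger, yet its estimate is far worse: size does not compensate for bias.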

37
Types of Data

Data
• Categorical (defined categories or groups)
Examples: Marital Status; Do you own a car?; Eye Color
• Numerical
– Discrete (counted items)
Examples: Number of Children; Defects per hour
– Continuous (measured characteristics)
Examples: Weight; Voltage
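
When reading a data set in Python, a first guess at a variable's type can be made from how its values are stored. The function below is a rough heuristic written for illustration (its name and rules are assumptions, not a standard library feature), and it can be wrong: for example, a postal code stored as an integer is still categorical.

```python
def variable_type(values):
    """Guess the type of a variable from the Python types of its values.

    Heuristic only: text or booleans -> categorical,
    integers -> discrete, other numbers -> continuous.
    """
    if all(isinstance(v, (str, bool)) for v in values):
        return "categorical"
    if all(isinstance(v, int) for v in values):
        return "numerical (discrete)"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical (continuous)"
    raise ValueError("mixed or unsupported value types")


# Hypothetical observations for the examples listed above.
data = {
    "marital_status": ["single", "married", "single"],
    "owns_car": [True, False, True],
    "number_of_children": [0, 2, 1],
    "defects_per_hour": [3, 0, 5],
    "weight_kg": [70.5, 82.0, 64.3],
    "voltage": [1.5, 9.0, 3.3],
}

for name, values in data.items():
    print(f"{name:<20}{variable_type(values)}")
```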
Explanatory and response variables
To identify the explanatory variable in a pair of variables, identify
which of the two is suspected of affecting the other:

explanatory variable → (might affect) → response variable

Labeling variables as explanatory and response does not guarantee
the relationship between the two is actually causal, even if there is
an association identified between the two variables. We use these
labels only to keep track of which variable we suspect affects the
other.

39
Observational studies and experiments
Observational study: Researchers collect data in a way that does not
directly interfere with how the data arise, i.e. they merely “observe”, and
can only establish an association between the explanatory and response
variables.

Experiment: Researchers randomly assign subjects to various treatments
in order to establish causal connections between the explanatory and
response variables.
If you’re going to walk away with one thing from this class, let it be
“correlation does not imply causation”.

40
http://xkcd.com/552/
A Study
A Study (continued)

What type of study is this, observational study or experiment?
• “Girls who regularly ate breakfast, particularly one that includes cereal, were
slimmer than those who skipped the morning meal, according to a study that
tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were
asked once a year what they had eaten during the previous three days.” This
is an observational study since the researchers merely observed the behavior
of the girls (subjects) as opposed to imposing treatments on them.

What is the conclusion of the study?
• There is an association between girls eating breakfast and being slimmer.

Who sponsored the study?
• General Mills.
Three possible explanations
Types of studies

44
Obtaining good samples
More experimental design methodology:
Placebo: fake treatment, often used as the control group
for medical studies
Placebo effect: experimental units showing improvement
simply because they believe they are receiving a special
treatment
Blinding: when experimental units do not know whether
they are in the control or treatment group
Double-blind: when both the experimental units and the
researchers who interact with the patients do not know
who is in the control and who is in the treatment group

Almost all statistical methods are based on the notion of
implied randomness.

45
Statistics – Topic 1

Descriptive Statistics

Describing Data using Tables and Charts

46
Data in raw form…
… vs. tables and graphical presentations
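
As a minimal sketch of the step from raw data to a table, the code below turns a list of raw answers into a frequency table with counts and shares. The survey question and the answers are made up for illustration.

```python
from collections import Counter

# Hypothetical raw data: answers to "How do you commute to campus?"
raw = ["metro", "bus", "walk", "metro", "bike", "metro", "bus",
       "walk", "metro", "bus", "metro", "walk"]

freq = Counter(raw)
total = len(raw)

# Frequency table, most common category first.
print(f"{'Category':<10}{'Count':>6}{'Share':>8}")
for category, count in freq.most_common():
    print(f"{category:<10}{count:>6}{count / total:>8.0%}")
```

The same twelve answers are much easier to read as four rows than as a raw list.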
The “Gen Z” version
Principles of Data Visualization

• “Good data visualization is:
1. Trustworthy
2. Accessible
3. Elegant”

Andy Kirk, Data Visualisation: A Handbook for Data Driven Design.
Sage, 2016

50
