Lecture 1


APPLIED DATA ANALYSIS IN BUSINESS WITH R

Introduction to statistical analysis for business and the program R

Denis Marinšek
BASIC CONCEPTS

Before the tax changes are introduced, the Ministry of Finance wants to analyze the
wealth of citizens over the age of 18. The data shows that 75% of citizens own stocks
and bonds worth up to EUR 380, that the average net income is EUR 1670 and that 36%
of citizens own real estate.

Determine the population.
Determine the observation.
Determine the variables.
Determine the parameters.

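Table of the input data:

\[
\begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1k} \\
y_{21} & y_{22} & \cdots & y_{2k} \\
y_{31} & y_{32} & \cdots & y_{3k} \\
\vdots & \vdots & \ddots & \vdots \\
y_{N1} & y_{N2} & \cdots & y_{Nk}
\end{bmatrix}
\]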
Types of variables

• Categorical
- Nominal
- Ordinal

• Numeric (continuous and discrete)
- Interval
- Ratio

Examples:
- Gender.
- How much do you agree with the statement "I'm addicted to YouTube"?
- What is your favorite way to prepare your steak?
- Number of household members.
- Revenue of a company.
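
One way to see the measurement scales is to look at how such variables could be stored in R. The sketch below uses made-up answers from five respondents; the column names, the values, and the classification in the comments are illustrative assumptions, not part of the lecture data.

```r
# Hypothetical answers from five respondents (illustrative values only)
survey <- data.frame(
  # Nominal: categories without a natural order
  gender = factor(c("F", "M", "F", "F", "M")),
  # Ordinal: categories with a meaningful order
  agreement = factor(c("agree", "neutral", "disagree", "agree", "agree"),
                     levels = c("disagree", "neutral", "agree"),
                     ordered = TRUE),
  # Treated as nominal here (one could argue for an order by doneness)
  steak = factor(c("rare", "medium", "well done", "medium", "rare")),
  # Numeric, discrete, ratio scale
  household = c(2L, 4L, 1L, 3L, 5L),
  # Numeric, continuous, ratio scale
  revenue = c(120.5, 80.0, 310.2, 55.9, 99.0)
)

str(survey)

# Comparisons such as ">" are defined only for the ordered factor
survey$agreement[1] > survey$agreement[2]
```

An unordered factor does not support such comparisons, which mirrors the difference between the nominal and the ordinal scale.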

Research methods
• Qualitative and quantitative methods.

Descriptive and inferential statistics

• Descriptive statistics is a group of statistical methods used to describe the basic
characteristics of the data under study. The methods include frequency
distributions, tabulations, means, quantiles, measures of variability, and simple
graphical representations that form the basis for any quantitative analysis of the
population or parts of it.

• Inferential statistics methods attempt to go beyond the insights that follow directly
from the data. For example, they attempt to draw inferences about the parameters
of the population from the sample data.
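
The difference between the two groups of methods can be sketched in a few lines of R. The data below are simulated (an assumed sample of 50 net incomes), so the numbers carry no meaning beyond illustration.

```r
set.seed(1)
# Hypothetical sample of 50 net incomes (simulated, not real data)
income <- rnorm(50, mean = 1650, sd = 300)

# Descriptive statistics: summarize the sample itself
summary(income)
sd(income)

# Inferential statistics: a 95% confidence interval for the
# population mean, based on the same sample
ci <- t.test(income)$conf.int
ci
```

The first two calls only describe the 50 observed values; the confidence interval makes a statement about the unknown population mean.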

Parameters
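• Actual (population) value of the parameter: \(\Gamma_y\)
• Parameter estimate: \(g_y\)

\[
\mu = \frac{1}{N}\sum_{i=1}^{N} y_i
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\]

\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2
\]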
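
A minimal sketch of the distinction between a parameter and its estimate, assuming a simulated population of N = 10000 values from which a sample of n = 100 is drawn. All numbers are hypothetical.

```r
set.seed(42)
# Hypothetical population (simulated)
population <- rnorm(10000, mean = 1670, sd = 400)
N <- length(population)

# Parameters: computed from all N units, divisor N
mu <- sum(population) / N
sigma2 <- sum((population - mu)^2) / N

# Estimates: computed from a sample of n units, divisor n - 1 for the variance
y <- sample(population, size = 100)
n <- length(y)
y_bar <- sum(y) / n
s2 <- sum((y - y_bar)^2) / (n - 1)

c(mu = mu, y_bar = y_bar, sigma2 = sigma2, s2 = s2)
```

Note that R's var() uses the divisor n - 1, so it returns the sample estimate, not the population variance.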

Before the tax changes are introduced, the Ministry of Finance wants to
analyze the wealth of citizens over the age of 18. The data shows that 75% of
citizens own stocks and bonds worth up to EUR 395, that the average net
income is EUR 1620 and that 34% of citizens own real estate.

Determine the population.
Determine the sample.
Determine the observation.
Determine the variables.
Determine the estimate of the parameters.

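Table of the input data:

\[
\begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1k} \\
y_{21} & y_{22} & \cdots & y_{2k} \\
y_{31} & y_{32} & \cdots & y_{3k} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{nk}
\end{bmatrix}
\]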
Distributions

Frequency distribution (histogram).

Symmetric distribution.
Skewed distribution.

[Figure: histogram of a symmetric distribution; x-axis: Score, y-axis: Frequency]
[Figure: histograms of skewed distributions; panels: Positive skew, Negative skew; y-axis: Frequency]
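
The shapes of symmetric and skewed distributions can be reproduced with simulated data. This is only a sketch: the normal and exponential distributions are assumed here as convenient examples of a symmetric and a skewed shape.

```r
set.seed(7)
symmetric <- rnorm(10000)   # symmetric around 0
pos_skew  <- rexp(10000)    # long right tail: positive skew
neg_skew  <- -rexp(10000)   # long left tail: negative skew

# Three histograms side by side
op <- par(mfrow = c(1, 3))
hist(symmetric, main = "Symmetric", xlab = "Score")
hist(pos_skew, main = "Positive skew", xlab = "Score")
hist(neg_skew, main = "Negative skew", xlab = "Score")
par(op)

# In a skewed distribution the mean is pulled toward the long tail
c(mean = mean(pos_skew), median = median(pos_skew))
c(mean = mean(neg_skew), median = median(neg_skew))
```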

Estimates of parameters

• Central tendency: mode
- What is mode?
- Bimodal distribution
- Multimodal distribution

[Figure: histograms of a bimodal and a multimodal distribution; panels: Bimodal, Multimodal; y-axis: Frequency]

• Central tendency: median
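
Both measures are easy to obtain in R, with one caveat: the built-in mode() function returns the storage mode of an object, not the statistical mode. A small sketch with hypothetical values:

```r
# Hypothetical observations (illustrative values only)
y <- c(5, 6, 6, 6, 7, 8, 9, 10, 10)

# Mode: the most frequent value(s), taken from a frequency table
freq <- table(y)
mode_value <- as.numeric(names(freq)[freq == max(freq)])
mode_value

# Median: the middle value of the sorted data
median_value <- median(y)
median_value
```

If several values share the highest frequency, mode_value contains all of them, which is how a bimodal or multimodal result would show up.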

• Central tendency: arithmetic mean

• Variability: range

• Variability: interquartile range (IQR)


- What are quantiles?
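
A short sketch of the range, the quartiles, and the interquartile range, using hypothetical values. The quantile() function uses R's default interpolation method (type 7); other methods can give slightly different quartiles.

```r
# Hypothetical observations (illustrative values only)
y <- c(5, 6, 6, 6, 7, 8, 9, 10, 10)

# Range: difference between the largest and the smallest value
y_range <- diff(range(y))
y_range

# Quartiles: values that split the sorted data into four equal parts
q <- quantile(y, probs = c(0.25, 0.50, 0.75))
q

# Interquartile range: third quartile minus first quartile
y_iqr <- IQR(y)
y_iqr
```

Unlike the range, the IQR ignores the lowest and highest quarter of the data, so it is less sensitive to extreme values.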

• Variability: deviation (from the mean)

\[
y_i - \bar{y}
\]

• Variability: sum of squared deviations

\[
\sum_{i=1}^{n} (y_i - \bar{y})^2
\]

• Variability: average squared deviation, i.e., variance

\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2
\]
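
The three steps above can be followed by hand on a tiny, hypothetical sample of five values:

```r
# Hypothetical sample (illustrative values only)
y <- c(4, 6, 7, 8, 10)
n <- length(y)
y_bar <- mean(y)               # 7

# Step 1: deviations from the mean (they always sum to zero)
deviations <- y - y_bar        # -3 -1  0  1  3
sum(deviations)

# Step 2: sum of squared deviations
ss <- sum(deviations^2)        # 9 + 1 + 0 + 1 + 9 = 20

# Step 3: sample variance, dividing by n - 1
s2 <- ss / (n - 1)             # 20 / 4 = 5
s2
```

The result agrees with the built-in var(y).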

• Variability: standard deviation

\[
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

• Variability: coefficient of variation

\[
cv_{\%} = \frac{s}{\bar{y}} \cdot 100
\]
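
The coefficient of variation is useful when variables are measured in different units. In the sketch below, the helper cv() and both variables are hypothetical; R has no built-in function with this name.

```r
# Hypothetical helper: coefficient of variation in percent
cv <- function(y) sd(y) / mean(y) * 100

# Hypothetical measurements in different units (illustrative values only)
height  <- c(170, 175, 180, 185, 190)  # cm
glucose <- c(4, 5, 5, 6, 5)            # mmol/l

# Standard deviations are in the original units and not directly comparable
c(height = sd(height), glucose = sd(glucose))

# Coefficients of variation are unit-free and can be compared
c(height = cv(height), glucose = cv(glucose))
```

Here height has the larger standard deviation, but glucose varies more relative to its mean.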
Example: Managerial Economics
We randomly select 31 students and look at their results from the Managerial
Economics course (ME.csv):

Normal distribution
Properties of the normal distribution:
• Unimodal and symmetric
• 𝜇 = 𝑀𝑜 = 𝑀𝑒
• The interval from \(\mu_y - k\sigma_y\) to \(\mu_y + k\sigma_y\) contains a known
percentage of values (about 68% for k = 1, 95% for k = 2, and 99.7% for k = 3)
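
The percentage of values within k standard deviations of the mean follows from the normal distribution function. A short check with pnorm(), which by default refers to the standard normal distribution (mean 0, standard deviation 1):

```r
k <- 1:3

# Share of values between mu - k*sigma and mu + k*sigma
coverage <- pnorm(k) - pnorm(-k)
round(100 * coverage, 1)
# 68.3 95.4 99.7
```

Because the calculation is done in standardized units, the same percentages hold for any mean and standard deviation.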

GRAPHICAL ANALYSIS

How can we plot the data?


• Histogram.
• Boxplot.
• Scatter plot.

A chart should (Tufte, 2001):

Present and reveal the data, stimulate thinking about the data, present many
numbers with little ink (a high data-ink ratio), and encourage comparison of
different parts of the data.

We should avoid distorting the data.
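
The three chart types listed above are available in base R. The sketch below uses simulated data; the data frame d and its columns are assumptions made only for illustration.

```r
set.seed(3)
# Hypothetical data: one grouping variable and two numeric variables
d <- data.frame(
  group = factor(rep(c("A", "B"), each = 50)),
  x = rnorm(100, mean = 50, sd = 10)
)
d$y <- 2 * d$x + rnorm(100, mean = 0, sd = 15)

# Histogram: distribution of one numeric variable
h <- hist(d$x, main = "Histogram", xlab = "x")

# Boxplot: distribution of a numeric variable by group
boxplot(y ~ group, data = d, main = "Boxplot by group")

# Scatter plot: relationship between two numeric variables
plot(d$x, d$y, main = "Scatter plot", xlab = "x", ylab = "y")
```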

Example: Nervousness

Example: Movie

DATA CLEANING
Example: Managerial Economics

Example: Trust

PRACTICAL EXAMPLE

We collected some data for a group of 35 athletes between the ages of 18 and 25
(Maraton.csv).

a) Import the data into RStudio (read.table) and display it with the head function.
b) Define the unit of study, define the variables, and explain which measurement scales they belong to.
c) Estimate and explain the arithmetic mean and the standard deviation for the variable Height.
d) Change the variable Gender into a factor.
e) Estimate the descriptive statistics for the variable Glucose separately for each gender. Use the
describeBy {psych} function.
f) Use the stat.desc {pastecs} function to describe the variables and check that you know all the
estimated parameters. Which variable has the greatest variability?
g) For the variable Hematocrit, plot and describe the frequency distribution using the hist function.
h) Draw a boxplot for the variable Glucose, separated by gender. Use the function geom_boxplot
{ggplot2}.
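
One possible outline of the workflow is sketched below. It is not a reference solution: the layout of Maraton.csv is unknown here, so the read.table arguments are assumptions, and a simulated stand-in data frame (mydata) with assumed values and an assumed 1/2 coding of Gender is used so that the code runs without the file. Calls to psych, pastecs, and ggplot2 are executed only if the packages are installed.

```r
# (a) Import: header, sep and dec are assumptions about the file format
# mydata <- read.table("Maraton.csv", header = TRUE, sep = ";", dec = ",")

# Hypothetical stand-in for the real file (simulated values)
set.seed(10)
mydata <- data.frame(
  Gender     = rep(c(1, 2), times = c(18, 17)),
  Height     = round(rnorm(35, mean = 175, sd = 8)),
  Glucose    = round(rnorm(35, mean = 5, sd = 0.5), 1),
  Hematocrit = round(rnorm(35, mean = 43, sd = 3), 1)
)
head(mydata)

# (c) Arithmetic mean and standard deviation of Height
mean(mydata$Height)
sd(mydata$Height)

# (d) Gender as a factor (the labels are an assumed coding)
mydata$Gender <- factor(mydata$Gender, levels = c(1, 2), labels = c("M", "F"))

# (e) Descriptive statistics for Glucose by gender
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describeBy(mydata$Glucose, group = mydata$Gender))
}

# (f) Descriptive statistics for the numeric variables
if (requireNamespace("pastecs", quietly = TRUE)) {
  print(round(pastecs::stat.desc(mydata[, c("Height", "Glucose", "Hematocrit")]), 2))
}

# (g) Frequency distribution of Hematocrit
hist(mydata$Hematocrit, main = "Hematocrit", xlab = "Hematocrit")

# (h) Boxplot of Glucose by gender
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(mydata, aes(x = Gender, y = Glucose)) + geom_boxplot())
}
```

Task (b) and the interpretation of the results are left to the reader, since they depend on the actual data.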
