Lecture 2

Statistical and Mathematical Methods for Data Analysis

Dr. Raja Noshad Jamil
Department of Artificial Intelligence
School of Systems and Technology
University of Management and Technology
Basic concepts [1]
Statistics is defined as
“The mathematics of the collection, organization, and
interpretation of numerical data, especially the analysis
of population characteristics by inference from sampling”
OR
Statistics is a science which deals with collection, classification,
distribution and interpretation of data.
OR
Statistics is a science of uncertainty.
OR
Statistics is the science of collecting, organizing, analyzing, and
interpreting data in order to make decisions.
Data sets
Data consist of information coming from observations, counts,
measurements, or responses.

Statistics is the science of collecting, organizing, analyzing, and
interpreting data in order to make decisions.

There are two types of data sets you will use when studying
statistics. These data sets are called populations and samples.

A population is the collection of all outcomes, responses,
measurements, or counts that are of interest.

A sample is a subset, or part, of a population.
Identifying Data Sets
In a recent survey, 614 small business owners in the United
States were asked whether they thought their company’s
Facebook presence was valuable. Two hundred fifty-eight (258)
of the 614 respondents said yes. Identify the population and
the sample. Describe the sample data set.
Solution:
The population consists of the responses of all small
business owners in the United States, and the sample
consists of the responses of the 614 small business owners
in the survey.

Notice that the sample is a subset of the responses of all small
business owners in the United States. The sample data set consists
of 258 owners who said yes and 356 owners who said no.
Descriptive Statistics vs. Inferential Statistics

The study of statistics has two major branches: descriptive
statistics and inferential statistics.

Descriptive statistics is the branch of statistics that involves
the organization, summarization, and display of data.

Inferential statistics is the branch of statistics that involves
using a sample to draw conclusions about a population. A basic tool
in the study of inferential statistics is probability.
Descriptive and Inferential Statistics
Example :Determine which part of the study represents the
descriptive branch of statistics. What conclusions might be
drawn from the study using inferential statistics?
1. A large sample of men, aged 48, was studied for 18 years.
For unmarried men, approximately 70% were alive at age 65.
For married men, 90% were alive at age 65. (Source: The
Journal of Family Issues)
2. In a sample of Wall Street analysts, the percentage who
incorrectly forecasted high-tech earnings in a recent year was
44%. (Source: Bloomberg News)
Solution:
1.Descriptive statistics involves statements such as “For
unmarried men, approximately 70% were alive at age 65” and
“For married men, 90% were alive at age 65.” Also, the figure
represents the descriptive branch of statistics. A possible
inference drawn from the study is that being married is
associated with a longer life for men.

2. The part of this study that represents the descriptive branch of
statistics involves the statement “the percentage [of Wall Street
analysts] who incorrectly forecasted high-tech earnings in a recent
year was 44%.” A possible inference drawn from the study is that
the stock market is difficult to forecast, even for professionals.
Parameter vs. Statistic

A parameter is a numerical description of a population
characteristic.

A statistic is a numerical description of a sample characteristic.
Distinguishing Between a Parameter and a
Statistic
Example: Determine whether the numerical value describes a
population parameter or a sample statistic. Explain your
reasoning.
1. A recent survey of approximately 400,000 employers
reported that the average starting salary for marketing majors
is $53,400. (Source: National Association of Colleges and
Employers)
2. The freshman class at a university has an average SAT math
score of 514.
3. In a random check of 400 retail stores, the Food and Drug
Administration found that 34% of the stores were not storing
fish at the proper temperature.
Solution
1. Because the average of $53,400 is based on a subset of the
population, it is a sample statistic.
2. Because the average SAT math score of 514 is based on the
entire freshman class, it is a population parameter.
3. Because the percent, 34%, is based on a subset of the
population, it is a sample statistic.
Types of Data

Data sets can consist of two types of data: qualitative data and
quantitative data.

Qualitative data consist of attributes, labels, or nonnumerical
entries.

Quantitative data consist of numerical measurements or counts.
Classifying Data by Type
Example: The suggested retail prices of several Honda vehicles
are shown in the table. Which data are qualitative data and
which are quantitative data? Explain your reasoning. (Source:
American Honda Motor Company, Inc.)
Solution

The information shown in the table can be separated into two data
sets. One data set contains the names of vehicle models, and the
other contains the suggested retail prices of vehicle models.

The names are nonnumerical entries, so these are qualitative data.

The suggested retail prices are numerical entries, so these are
quantitative data.
Levels of Measurement

Another characteristic of data is its level of measurement. The
level of measurement determines which statistical calculations are
meaningful.

The four levels of measurement, in order from lowest to highest,
are nominal, ordinal, interval, and ratio.
Nominal vs Ordinal

Data at the nominal level of measurement are qualitative only. Data
at this level are categorized using names, labels, or qualities. No
mathematical computations can be made at this level.

Data at the ordinal level of measurement are qualitative or
quantitative. Data at this level can be arranged in order, or
ranked, but differences between data entries are not meaningful.
Example

Two data sets are shown. Which data set consists of data at
the nominal level? Which data set consists of data at the
ordinal level? Explain your reasoning. (Source: The Numbers)
Solution
The first data set lists the ranks of five movies. The data set
consists of the ranks 1, 2, 3, 4, and 5. Because the ranks can
be listed in order, these data are at the ordinal level. Note
that the difference between a rank of 1 and 5 has no
mathematical meaning.

The second data set consists of the names of movie genres. No
mathematical computations can be made with the names and the names
cannot be ranked, so these data are at the nominal level.
Interval vs. Ratio

Data at the interval level of measurement can be ordered, and
meaningful differences between data entries can be calculated. At
the interval level, a zero entry simply represents a position on a
scale; the entry is not an inherent zero.

Data at the ratio level of measurement are similar to data at the
interval level, with the added property that a zero entry is an
inherent zero. A ratio of two data entries can be formed so that
one data entry can be meaningfully expressed as a multiple of
another.
An inherent zero is a zero that implies “none.” For instance,
the amount of money you have in a savings account could be
zero dollars. In this case, the zero represents no money; it is
an inherent zero. On the other hand, a temperature of 0°C
does not represent a condition in which no heat is present.
The 0°C temperature is simply a position on the Celsius scale;
it is not an inherent zero.
To distinguish between data at the interval level and at the
ratio level, determine whether the expression “twice as
much” has any meaning in the context of the data.

For instance, $2 is twice as much as $1, so these data are at the
ratio level. On the other hand, 2°C is not twice as warm as 1°C, so
these data are at the interval level.
Classifying Data by Level
Example: Two data sets are shown at below. Which data set
consists of data at the interval level? Which data set consists
of data at the ratio level? Explain your reasoning. (Source:
Major League Baseball)
Solution
Both of these data sets contain quantitative data. Consider
the dates of the Yankees’ World Series victories. It makes
sense to find differences between specific dates. For instance,
the time between the Yankees’ first and last World Series
victories is 2009 - 1923 = 86 years. But it does not make sense
to say that one year is a multiple of another. So, these data
are at the interval level.

However, using the home run totals, you can find differences and
write ratios. From the data, you can see that Baltimore hit 39 more
home runs than Tampa Bay hit and that New York hit about 1.5 times
as many home runs as Detroit hit. So, these data are at the ratio
level.
The tables below summarize which operations are
meaningful at each of the four levels of
measurement.
When identifying a data set’s level of measurement, use the highest
level that applies.
Summary of Four Levels of Measurement
Key Terms for Data Types
Continuous
• Data that can take on any value in an interval.
• Synonyms: interval, float, numeric

Discrete
• Data that can only take on integer values, such as counts.
• Synonyms: integer, count
Key Terms for Data Types
Categorical
•Data that can only take on a specific set of
values.
•Example: Sex, type of chocolate, color
•Synonyms: enums, enumerated, factors, nominal,
polychotomous
Binary
•A special case of categorical with just two
categories (0/1, True, False).
•Synonyms: dichotomous, logical, indicator
Ordinal
•Categorical data that has an explicit ordering.
•Synonyms: ordered factor
Data Types
Binary data is an important special case of
categorical data that takes on only one of two
values, such as 0/1, yes/no or true/false.
Synonyms: dichotomous, logical, indicator
Ordinal
•Categorical data that has an explicit ordering.
•Synonyms: ordered factor
An example of this is a numerical rating (1, 2, 3, 4,
or 5)
Data Types
There are two basic types of structured data:
numeric and categorical.

Numeric data comes in two forms: continuous, such as wind speed or
time duration, and discrete, such as the count of the occurrence of
an event.

Categorical data takes only a fixed set of values, such as a type
of TV screen (plasma, LCD, LED, …) or a state name (Alabama,
Alaska, …).
Nominal scales
Nominal scales are used for labeling variables, without any
quantitative value.
“Nominal” scales could simply be called “labels.”
Here are some examples, below. Notice that all of these
scales are mutually exclusive (no overlap) and none of them
have any numerical significance.
A good way to remember all of this is that “nominal”
sounds a lot like “name” and nominal scales are kind of like
“names” or labels.
Nominal scale example
Type of chocolate

•Dark(1)
• Milk(2)
•White (3)
Sex
•Male(0)
• Female(1)
Color
• Red(1)
• Green(2)
• Blue(3)
• Yellow(4)
Ordinal scale
With ordinal scales, it is the order of the values that is
important and significant, but the differences between the values
are not really known.
Take a look at the example below. In each case, we know that option
4 is better than option 3 or option 2, but we don’t know, and
cannot quantify, how much better it is.
For example, is the difference between “OK” and “Unhappy”
the same as the difference between “Very Happy” and
“Happy” ? We can’t say.
Ordinal scales are typically measures of non-numeric
concepts like satisfaction, happiness, discomfort, etc.
Ordinal scale example

“Ordinal” is easy to remember because it sounds like “order”, and
that’s the key to remember with ordinal scales: it is the order
that matters, but that’s all you really get from these.

Advanced note: The best way to determine central tendency on a set
of ordinal data is to use the mode or median; the mean cannot be
defined from an ordinal set.
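As a small illustration added here (not part of the original
slide), the mode and the median of ordinal data can be computed in
Python with pandas ordered categoricals; the rating labels and the
responses below are made up:

import pandas as pd

# Hypothetical satisfaction ratings on an ordered (ordinal) scale
levels = ["Very Unhappy", "Unhappy", "OK", "Happy", "Very Happy"]
responses = pd.Series(
    ["OK", "Happy", "Happy", "Very Happy", "Unhappy", "Happy"],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True),
)

# The mode is always defined for ordinal data
print(responses.mode())          # Happy

# The median can be read off the middle rank of the ordered codes,
# but a mean of the labels themselves is not defined
median_code = int(responses.cat.codes.median())
print(levels[median_code])       # Happy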
Key Ideas

Data are typically classified in software by their type.

Data types include continuous, discrete, categorical (which
includes binary), and ordinal.

Data-typing in software acts as a signal to the software on how to
process the data (see the sketch below).
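A minimal sketch, using pandas, of how a declared data type signals
which operations are meaningful; the column names and values are
invented for illustration:

import pandas as pd

# Hypothetical data set illustrating the software data types above
df = pd.DataFrame({
    "wind_speed": [3.2, 7.5, 0.0, 12.1],                 # continuous
    "accident_count": [1, 0, 2, 5],                       # discrete
    "screen_type": ["plasma", "LCD", "LED", "LCD"],       # categorical
    "satisfaction": ["OK", "Happy", "Unhappy", "Happy"],  # ordinal
})

df["screen_type"] = df["screen_type"].astype("category")
df["satisfaction"] = df["satisfaction"].astype(
    pd.CategoricalDtype(["Unhappy", "OK", "Happy"], ordered=True)
)

# The dtype tells the library how to process each column:
print(df["wind_speed"].mean())           # means make sense for numeric data
print(df["screen_type"].value_counts())  # counts make sense for categorical data
print(df["satisfaction"].min())          # ordering makes sense for ordinal data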

PAKISTAN: ROAD TRAFFIC ACCIDENTS
• Deaths = 30,046
• % = 2.42 (of total deaths in Pakistan)
• Rate = 17.12
• World Rank = 95
• According to the latest WHO data published in 2018, road traffic
accident deaths in Pakistan reached 30,046, or 2.42% of total
deaths. The age-adjusted death rate of 17.12 per 100,000 of
population ranks Pakistan #95 in the world.
Reference: https://www.worldlifeexpectancy.com/pakistan-road-traffic-accidents

• Road injuries killed 1.4 million people in 2016, about
three-quarters (74%) of whom were men and boys.
Basic concepts

Probability can be defined as the mathematics of chance.

Statisticians use the word experiment to describe any process that
generates a set of data.

A probability experiment is a chance process that leads to
well-defined outcomes or results. For example, tossing a coin can
be considered a probability experiment since there are two
well-defined outcomes: heads and tails.
Basic concepts

In probability theory, an experiment or trial is any procedure that
can be infinitely repeated and has a well-defined set of possible
outcomes, known as the sample space.

An outcome of a probability experiment is the result of a single
trial of a probability experiment.
Basic concepts
The set of all possible outcomes of a statistical
experiment is called the sample space and is
represented by the symbol S.
OR
The set of all outcomes of a probability experiment is
called a sample space. Some sample spaces for various
probability experiments are shown here.
Basic concepts

Each outcome in a sample space is called an element or a member of
the sample space, or simply a sample point.

Each outcome of a probability experiment occurs at random.

Each outcome of the experiment is equally likely unless otherwise
stated.
Basic concepts
An event usually consists of one or more outcomes of the sample
space.
OR
An event is a subset of a sample space.

An event with one outcome is called a simple event. If an event
consists of two or more outcomes, it is called a compound event.
Example
A single die is rolled. List the outcomes in each event:

a. Getting an odd number

b. Getting a number greater than four

c. Getting a number less than one


Example cont.
Solution:
S = {1, 2, 3, 4, 5, 6}
a. Getting an odd number: {1, 3, 5}
b. Getting a number greater than four: {5, 6}
c. Getting a number less than one: { } (the empty set; this event
cannot occur)
Basic concepts
Classical Probability: The formula for determining the probability
of an event E is

P(E) = n(E) / n(S)

OR

P(E) = (Number of outcomes contained in the event E) / (Total
number of outcomes in the sample space)
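As a minimal sketch added here (not from the slides), the classical
formula can be evaluated in Python by counting outcomes; the die
sample space and events below are the running examples from this
lecture:

from fractions import Fraction

# Sample space for rolling a single die
S = {1, 2, 3, 4, 5, 6}

def classical_probability(event, sample_space):
    # P(E) = n(E) / n(S), assuming equally likely outcomes
    return Fraction(len(event & sample_space), len(sample_space))

odd = {1, 3, 5}
greater_than_four = {5, 6}

print(classical_probability(odd, S))                # 1/2
print(classical_probability(greater_than_four, S))  # 1/3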
Example:
Two coins are tossed; find the probability that both coins land
heads up.
Solution:
S = {HH, HT, TH, TT}, n(S) = 4
Let E be the event that both coins land heads up: E = {HH}, n(E) = 1
P(E) = 1/4 = 0.25 (or 25%)
Example:
A die is tossed; find the probability of each event:

a. Getting a two

b. Getting an even number

c. Getting a number less than 5


Example cont.
Solution:
S = {1, 2, 3, 4, 5, 6}
n(S) = 6

P(E) = (Number of outcomes contained in the event E) / (Total
number of outcomes in the sample space)

a. Let A be the event of getting a “two”


A = {2}
n(A) = 1
P (A) = 1/6 = 0.1667 (or 16.67%)
Example cont.

b. Let B be the event of getting an "even number"
B = {2, 4, 6}
n(B) = 3
P(B) = 3/6 = 0.5 (or 50%)

c. Let C be the event of getting a number "less than 5"
C = {1, 2, 3, 4}
n(C) = 4
P(C) = 4/6 = 0.6667 (or 66.67%)
Basic concepts
Rule 1: The probability of any event will always be a
number from zero to one. Probabilities cannot be
negative nor can they be greater than one.

Rule 2: When an event cannot occur, the probability will be zero.

Example: A die is rolled; find the probability of getting a 7.
(Here P(7) = 0, since 7 is not in the sample space.)
Basic concepts
Rule 3: When an event is certain to occur, the probability is 1.

Example: A die is rolled; find the probability of getting a number
less than 7. (Here the probability is 1, since every outcome is
less than 7.)

Rule 4: The sum of the probabilities of all of the outcomes in the
sample space is 1.

Example: P(H) = P(T) = 1/2; P(H) + P(T) = 1


Basic concepts
Complement: The complement of an event A with
respect to S is the subset of all elements of S that are
not in A. We denote the complement of A by the
symbol A'.

Rule 5: The probability that an event will not occur is equal to 1
minus the probability that the event will occur.

Example: P(H) = 1/2, then P(T) = 1 - P(H) = 1 - 1/2 = 1/2
Basic concepts
The probability of an event A is the sum of the weights of all sample
points in A.
Therefore,
I. 0 ≤ P(A) ≤ 1

II. P(φ) = 0

III. P(S) = 1.
Basic concepts
When the probability of an event is close to zero, the occurrence
of the event is relatively unlikely. For example, if the chances
that you will win a certain lottery are 0.001, or one in one
thousand, you probably won’t win, unless of course, you are very
‘‘lucky.’’

When the probability of an event is 0.5 or 1/2, there is a 50-50
chance that the event will happen; the chances of it happening and
not happening are the same.
Basic concepts
When the probability of an event is close to one,
the event is almost sure to occur. For example, if
the chance of it snowing tomorrow is 90%, more
than likely, you’ll see some snow.
Empirical Probability [1]
Probabilities can be computed for situations that do
not use sample spaces. In such cases, frequency
distributions are used and the probability is called
empirical probability.
Rank         Frequency
Freshmen     4
Sophomores   6
Juniors      8
Seniors      7
Total        25
Empirical Probability [2]

P(E) = (Frequency of E) / (Sum of the frequencies)

For example, if one student is selected at random from the group
above, the probability that the student is a freshman is
P(E) = 4/25.

Empirical probability is sometimes called relative frequency
probability.
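A minimal sketch (added here for illustration) of computing
empirical probabilities from the frequency table above:

# Frequencies from the class-rank table above
frequencies = {"Freshmen": 4, "Sophomores": 6, "Juniors": 8, "Seniors": 7}
total = sum(frequencies.values())  # 25

def empirical_probability(event):
    # P(E) = frequency of E / sum of all frequencies
    return frequencies[event] / total

print(empirical_probability("Freshmen"))  # 0.16  (= 4/25)
print(empirical_probability("Juniors"))   # 0.32  (= 8/25)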
Law of large numbers

In probability theory, the law of large numbers (LLN) is a theorem
that describes the result of performing the same experiment a large
number of times.

According to the law, the average of the results obtained from a
large number of trials should be close to the expected value, and
will tend to become closer as more trials are performed.
Law of large numbers

The LLN is important because it "guarantees" stable long-term
results for the averages of some random events.

For example, while a casino may lose money in a single spin of the
roulette wheel, its earnings will tend towards a predictable
percentage over a large number of spins.
Outcomes of a die: 1, 2, 3, 4, 5, 6
Sum of die outcomes = 21
Mean = 21/6 = 3.5

An illustration of the law of large numbers using a particular run
of rolls of a single die: as the number of rolls in this run
increases, the average of the values of all the results approaches
3.5 (see the simulation sketch below).
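A short simulation sketch (added here, not from the original
slides) that reproduces this behaviour; the roll counts at which
the running average is printed are arbitrary:

import random

random.seed(1)  # make this particular run reproducible

running_sum = 0
checkpoints = {10, 100, 1_000, 10_000, 100_000}

for n in range(1, 100_001):
    running_sum += random.randint(1, 6)  # one roll of a fair die
    if n in checkpoints:
        print(f"after {n:>6} rolls, running average = {running_sum / n:.3f}")

# The printed averages drift toward the expected value 3.5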
Law of Large Numbers

Questions:
What happens if we toss a coin 100 times? Will we get 50 heads?

What will happen if we toss a coin 1000 times? Will we get exactly
500 heads?
Law of Large Numbers

Solution: Probably not.

However, as the number of tosses increases, the ratio of the number
of heads to the total number of tosses will get closer to 1/2.

This phenomenon is known as the law of large numbers, as the
coin-toss sketch below illustrates.
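A minimal coin-toss sketch (again added for illustration) showing
the proportion of heads settling near 1/2:

import random

random.seed(7)  # reproducible illustration

heads = 0
for n in range(1, 1_000_001):
    heads += random.random() < 0.5  # one fair coin toss; True counts as 1
    if n in (100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} tosses: proportion of heads = {heads / n:.4f}")

# We rarely get exactly n/2 heads, but the proportion settles near 0.5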
