Unit-1-Introduction To Statistical Analysis
• Introduction
• Meaning of Statistics
• The Scientific Method : Basic Steps of the Research Process
• Experimental Data and Survey Data
• Populations and Samples
• Census and Sampling Methods
• Parameter and Statistic
• Independent and Dependent Variables
• Examining Relationships
• Introduction to SPSS Statistics.
Introduction - What is Statistics ?
• The word “statistics” has two meanings.
Applied statistics involves the applications of those theorems, formulas, rules, and laws
to solve real-world problems.
Broadly speaking, applied statistics can be divided into two areas: descriptive statistics
and inferential statistics.
Definition: Statistics is a collection of methods for planning experiments, obtaining data,
and then organizing, summarizing, presenting, analyzing, interpreting, and drawing
conclusions from the data.
Statistical analysis: the science of collecting, exploring and presenting large
amounts of data to discover underlying patterns and trends. Statistics are applied
every day, in research, industry and government, to make decisions on a more
scientific basis.
First, statisticians are guides for learning from data and navigating common problems
that can lead you to incorrect conclusions. Second, given the growing importance of
decisions and opinions based on data, it’s crucial that you can critically assess the
quality of analyses that others present to you.
• The State
• Economics
• Business Management and Industry
• Social and Natural Sciences
• Biology and Medicine
• Research
Types of Statistics
Types of Statistics - Descriptive Statistics
• Suppose we have information on test scores of students enrolled in a statistics class.
• The whole set of numbers that represents the scores of students is called a data set.
• The name of each student is called an element, and the score of each student is called
an observation.
• Rather than working from the raw data set, it is easier to draw conclusions from summary
tables and diagrams.
• So, we reduce data to a manageable size by constructing tables, drawing graphs, or
calculating summary measures such as averages.
• The portion of statistics that helps us do this type of statistical analysis is called
descriptive statistics.
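As a minimal sketch of this reduction, the snippet below summarizes a small set of test scores with Python's standard `statistics` module. The scores are made-up illustrative values, not data from the course.

```python
import statistics

# Hypothetical test scores for ten students (illustrative values only).
scores = [72, 85, 90, 66, 85, 78, 94, 58, 85, 70]

# Reduce the raw data set to a handful of summary measures.
summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "range": max(scores) - min(scores),
    "std_dev": statistics.stdev(scores),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

A table of these few numbers is usually easier to read than the full list of scores, which is the point of descriptive statistics.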
Types of Statistics - Inferential Statistics
• A major portion of statistics deals with making decisions, inferences, predictions, and
forecasts about populations based on results obtained from samples.
• We may want to find the starting salary of a typical college graduate. To do so, we may
select 2000 recent college graduates, find their starting salaries, and make a decision
based on this information.
• A survey that includes every element of the target population is called a census.
• We select a sample and collect the required information from the elements included in
that sample.
● Biased sample: one that is not truly representative of the population about which
the inference is drawn.
Inferencing about Population using Sample
Sampling with and without replacement
• A sample may be selected with or without replacement.
• In sampling with replacement, each time we select an element from the population,
we put it back in the population before we select the next element.
• As a result, we may select the same item more than once in such a sample.
• The experiment of rolling a die many times is an example of sampling with
replacement because every roll has the same six possible outcomes.
• Sampling without replacement occurs when the selected element is not replaced in
the population.
• Thus, we cannot select the same item more than once in this type of sampling.
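The two schemes can be sketched with Python's standard `random` module. The five-element population and the fixed seed are assumptions chosen only for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
population = ["A", "B", "C", "D", "E"]

# With replacement: each draw is made from the full population,
# so the same element can appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a selected element is not put back,
# so every element appears at most once.
without_replacement = rng.sample(population, k=5)

# Rolling a die repeatedly behaves like sampling with replacement:
# every roll has the same six possible outcomes.
die_rolls = [rng.randint(1, 6) for _ in range(10)]

print(with_replacement)
print(without_replacement)
print(die_rolls)
```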
Sample Survey
Exercise
Q1
Q2
Q1 - P S P P S
Q2 - P S P S P
Basic Terms
Exercise
Types of Variables
Quantitative Variables
Discrete Variables
Continuous Variables
• Data may be obtained from internal sources, external sources, or surveys and experiments.
• Sometimes the needed data may not be available from either internal or external sources.
• In such cases, the investigator may have to conduct a survey or experiment to obtain the
required data.
[Figure: data sources for a university: internal data sources, external data sources, gathered data]
Exercise
Population Parameters and Sample Statistics
• A numerical measure such as the mean, median, mode, range, variance, or standard
deviation calculated for a population data set is called a population parameter, or simply a
parameter.
• A summary measure calculated for a sample data set is called a sample statistic, or simply
a statistic.
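The distinction can be illustrated with a short sketch. The population of 1,000 starting salaries below is simulated (an assumption for illustration); the same summary measure is a parameter when computed on the whole population and a statistic when computed on a sample.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: starting salaries (in thousands) of 1,000 graduates.
population = [rng.gauss(50, 8) for _ in range(1000)]

# Parameters: computed from every element of the population.
population_mean = statistics.mean(population)
population_sd = statistics.pstdev(population)   # population standard deviation

# Statistics: computed from a sample drawn from that population.
sample = rng.sample(population, k=50)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)            # sample standard deviation (n - 1)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean will generally differ a little from the population mean, and a different sample would give a different value.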
[Figure: research process flowchart with the steps Sampling Method, Ethical Consideration, Data Entry, Check Errors, Explore Data, Plot Data, Identify Distribution, Test for Normality, Statistical Inference]
Scientific Steps in Statistical Analysis Research Process
Step 1 Establish The Research Question & Statistical Goal
• The goal of statistical data analysis is to answer a research question, which we state as a hypothesis.
• The hypothesis that we want to support through the study is called the alternative hypothesis.
• Example: time spent watching TV is related to marks obtained.
• In contrast, the null hypothesis is the one we want to disprove.
• Example: time spent watching TV is not related to marks obtained.
• If the data lead us to reject the null hypothesis, we take the alternative hypothesis to be supported.
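One way to sketch the TV-and-marks example is a permutation test on the correlation between the two variables. The data, the choice of a permutation test, the number of permutations and the 0.05 threshold are all assumptions made for illustration; the slides do not prescribe a particular test.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: daily TV hours and exam marks for eight students.
tv_hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0]
marks = [88, 85, 80, 78, 70, 65, 62, 50]

observed_r = pearson_r(tv_hours, marks)

# Under the null hypothesis (no relationship), shuffling the marks should
# produce correlations as extreme as the observed one fairly often.
rng = random.Random(1)
n_permutations = 2000
extreme = 0
shuffled = marks[:]
for _ in range(n_permutations):
    rng.shuffle(shuffled)
    if abs(pearson_r(tv_hours, shuffled)) >= abs(observed_r):
        extreme += 1

# Share of shuffles at least as extreme as the observed correlation.
p_value = (extreme + 1) / (n_permutations + 1)
reject_null = p_value < 0.05

print(f"observed r = {observed_r:.3f}, p-value = {p_value:.4f}")
```

A small p-value means the observed correlation would be unusual if the null hypothesis were true. It supports the alternative hypothesis without proving it.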
• A population is a class of entities that share some essential common characteristics and features.
• We must also decide which data collection method will be used.
• The pros and cons of these methods should be weighed in relation to your research
question, budget and logistical abilities.
Step 5 Data Category
• Data tags, themes or categories are the variables for the test statistic.
• We need to collect sample statistics and put them in relevant and meaningful variables.
• We should ask questions such as “How does a group of people feel?”, “How does a
company recruit people?”, “How do people feel about this practice?”. Here we collect data
that cannot be quantified and look for concepts relevant to our research question.
Step 6 Sampling Method
• We should decide what kind of sampling to use depending on the category of data collected.
• We need to determine the sample size needed and how many samples should be
observed/collected.
Step 7 Ethical consideration
Essential ethical considerations must be followed. Some important ones are highlighted:
• Invasion of privacy should be avoided.
• Causing participants to lose dignity must be avoided.
• Causing participants to think less of themselves must be avoided.
• Deception that causes resentment or hostility must be avoided.
• Unnecessary withholding of information must be avoided.
• Pain or discomfort must be avoided.
• Breaking local prohibitions (e.g. drinking alcohol, taking drugs, etc.) must be
avoided.
• Anything that may make participants feel uncomfortable must be avoided.
• As much information about the research as possible should be given without
compromising the integrity and accuracy of the research.
Step 8 Data Entry
• Variables should be defined and given meaningful names that relate to the research
question, aims and objectives of the study.
• Once we have established an organization of data tables we can start collecting data
considering
• Remember that statistical data in the real world are of two types: conceptual data (gender,
class etc.) and numerical data (age, height etc.).
Step 9 Check Errors
• We recheck variable names, actual values and units, and deal with values that are not available (missing values).
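A minimal sketch of such a check is shown below. The records, the variable names and the plausible ranges are hypothetical choices made only for illustration.

```python
# Hypothetical survey records; None marks a value that is not available.
records = [
    {"student_id": 1, "age": 19, "height_cm": 172.0},
    {"student_id": 2, "age": None, "height_cm": 165.5},
    {"student_id": 3, "age": 21, "height_cm": 1.80},    # likely entered in metres
    {"student_id": 4, "age": 190, "height_cm": 158.0},  # likely a typing error
]

# Assumed plausible range for each variable (illustrative choices).
valid_ranges = {"age": (15, 100), "height_cm": (100.0, 230.0)}

def find_problems(rows, ranges):
    """Return (student_id, variable, issue) tuples for missing or out-of-range values."""
    problems = []
    for row in rows:
        for variable, (low, high) in ranges.items():
            value = row.get(variable)
            if value is None:
                problems.append((row["student_id"], variable, "missing"))
            elif not low <= value <= high:
                problems.append((row["student_id"], variable, "out of range"))
    return problems

problems = find_problems(records, valid_ranges)
for problem in problems:
    print(problem)
```

Flagged values still need a human decision: correct the unit, re-enter the value, or treat it as missing.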
Step 10 Explore Data
• Statistics measure certain attributes about data which you can explore.
• You can understand patterns or trends and make decisions relevant to your research.
• A claim can be about a sample statistic such as the mean, or a more general
characteristic such as a hypothesis about the population.
Non-probability Sample
● Convenience Sampling
● Quota Sampling
Data Analysis : Data Sampling
• Sampling helps a lot in research. It is one of the most important factors that determine
the accuracy of your research/survey result.
• If anything goes wrong with the sample, it will be directly reflected in the final result.
A population is the collection of elements that have some characteristic in common.
The number of elements in the population is the size of the population.
A population is the entire group that you want to draw conclusions about.
There are many sampling techniques, which are grouped into two categories:
• Probability Sampling
• Non-Probability Sampling
• The difference between the two is whether or not sample selection is based on randomization.
• With randomization, every element gets an equal chance to be picked and be part of the sample.
Probability Sampling
This sampling technique uses randomization to make sure that every element of the
population gets an equal chance to be part of the selected sample.
Every element has an equal chance of being selected as part of the sample. It is used
when we don’t have any prior information about the target population.
For example: random selection of 20 students from a class of 50 students. Each student has
an equal chance of being selected. Here the probability of being picked on any single draw
from the full class is 1/50, and the probability of being included in the sample of 20 is
20/50 = 0.4.
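The class example can be sketched as follows. The roll numbers and the seed are assumptions. The simulation estimates how often one particular student ends up in a sample of 20 drawn without replacement from 50, which should be close to 20/50 = 0.4.

```python
import random

rng = random.Random(7)
students = list(range(1, 51))   # hypothetical roll numbers 1..50

# One simple random sample of 20 students, drawn without replacement.
sample = rng.sample(students, k=20)

# Estimate how often one particular student (roll number 1) is included.
trials = 10_000
hits = sum(1 in rng.sample(students, k=20) for _ in range(trials))
inclusion_rate = hits / trials   # should be close to 20/50 = 0.4

print(sorted(sample))
print(f"estimated inclusion probability: {inclusion_rate:.3f}")
```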
Data Analysis : Data Sampling - Probability Sampling
Stratified Sampling
• This technique divides the elements of the population into small subgroups (strata) based on
similarity.
• The elements are then randomly selected from each of these strata.
• We need to have prior information about the population to create the subgroups.
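A minimal sketch of stratified sampling is given below. The population, the stratifying variable (year of study) and the 10% sampling fraction are hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(3)

# Hypothetical population: (student_id, year_of_study) pairs.
population = [(i, "first_year" if i <= 60 else "second_year") for i in range(1, 101)]

# Step 1: split the population into strata using the prior information (year).
strata = defaultdict(list)
for student_id, year in population:
    strata[year].append(student_id)

# Step 2: draw a random sample from each stratum, here 10% of each.
sampling_fraction = 0.10
stratified_sample = {
    year: rng.sample(members, k=round(len(members) * sampling_fraction))
    for year, members in strata.items()
}

for year, chosen in stratified_sample.items():
    print(year, sorted(chosen))
```

Sampling the same fraction from every stratum keeps each subgroup's share in the sample equal to its share in the population.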
Data Analysis : Data Sampling - Probability Sampling
Cluster Sampling
The entire population is divided into clusters or sections, and then whole clusters are
randomly selected.
Clusters are identified using details such as age, sex, location, etc.
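Cluster sampling can be sketched as follows. The clusters by location, the resident identifiers and the choice of two clusters are assumptions for illustration.

```python
import random

rng = random.Random(11)

# Hypothetical population grouped into clusters by location (city -> resident ids).
clusters = {
    "city_a": [1, 2, 3, 4],
    "city_b": [5, 6, 7],
    "city_c": [8, 9, 10, 11, 12],
    "city_d": [13, 14],
}

# Randomly select whole clusters, then keep every element of the chosen clusters.
chosen_clusters = rng.sample(sorted(clusters), k=2)
cluster_sample = [person for name in chosen_clusters for person in clusters[name]]

print(chosen_clusters)
print(cluster_sample)
```

Unlike stratified sampling, the randomness here is in which groups are picked, not in which elements are picked within a group.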
Data Analysis : Data Sampling - Probability Sampling
Systematic Sampling
Here the selection of elements is systematic and not random, except for the first element.
Elements of the sample are chosen at regular intervals from the population. All the elements
are first put together in a sequence in which each element has an equal chance of being selected.
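A short sketch of selection at regular intervals: only the starting position is random, and every following element is taken at a fixed interval. The population of 100 ordered elements and the sample size of 10 are assumptions.

```python
import random

rng = random.Random(5)

population = list(range(1, 101))   # hypothetical ordered list of 100 elements
sample_size = 10
interval = len(population) // sample_size   # take every 10th element

# Only the starting point is random; the rest follow at a fixed interval.
start = rng.randrange(interval)
systematic_sample = population[start::interval]

print(systematic_sample)
```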
Data Analysis : Data Sampling - Non - Probability Sampling
Convenience Sampling
Purposive Sampling
Quota Sampling
Referral /Snowball Sampling
Data Analysis : Data Sampling - Non - Probability Sampling
Convenience Sampling
For example: Researchers prefer this during the initial stages of survey research, as
it’s quick and easy to deliver results.
Data Analysis : Data Sampling - Non - Probability Sampling
Purposive Sampling
• This is based on the intention or the purpose of study.
• Only those elements that best suit the purpose of our study are selected from the
population.
For example: if we want to understand the thought process of people who are
interested in pursuing a master’s degree, then the selection criterion would be “Are you
interested in a Masters in..?”
All the people who respond with a “No” will be excluded from our sample.
Data Analysis : Data Sampling - Non - Probability Sampling
Quota Sampling
• This type of sampling depends on some pre-set standard.
• It aims to select a sample that is representative of the population.
• The proportion of a characteristic/trait in the sample should be the same as in the population.
• Elements are selected until the exact proportions of certain types of data are obtained or
sufficient data in different categories are collected.
For example: If our population has 45% females and 55% males then our sample should
reflect the same percentage of males and females.
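The 45%/55% example can be sketched as below. The stream of respondents and the sample size of 20 are hypothetical. Respondents are taken in the order they become available (no randomization) until each quota is filled.

```python
sample_size = 20

# Quotas that mirror the population proportions: 45% female, 55% male.
quotas = {"female": round(0.45 * sample_size), "male": round(0.55 * sample_size)}

# Hypothetical stream of respondents in the order they become available.
respondents = [(f"r{i:02d}", "female" if i % 2 == 0 else "male") for i in range(1, 41)]

quota_sample = []
filled = {group: 0 for group in quotas}
for respondent_id, gender in respondents:
    # Accept a respondent only while that group's quota is still open.
    if filled[gender] < quotas[gender]:
        quota_sample.append((respondent_id, gender))
        filled[gender] += 1
    if filled == quotas:
        break

print(filled)
```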
Data Analysis : Data Sampling - Non - Probability Sampling
Referral /Snowball Sampling
• This technique is used in situations where the population is unknown or rare.
• Therefore, we take help from the first element that we select from the
population and ask them to recommend other elements who fit the description of
the sample needed.
• This referral technique goes on, increasing the size of the sample like a snowball.
• For example: it’s used for highly sensitive topics like HIV/AIDS, where people
will not openly discuss the topic or participate in surveys to share information about it.
• It helps in situations where we do not have access to sufficient people with the
characteristics we are seeking. It starts with finding people to study.
Types of Data / Variables
• Data that is expressed in numbers and summarized using statistics to give
meaningful information is referred to as quantitative data.
• Examples of quantitative data we could collect are
• Heights
• Weights
• Age
• Such data is usually collected solely for the research problem you will study.
• When you collect data that another researcher or agency initially gathered and then
made available, you are gathering secondary data.
Note that the difference between Elementary and High School is not the same as the difference
between High School and College.
Because of that, ordinal scales are usually used to measure non-numeric features like
happiness, customer satisfaction and so on.
Scales / Levels of Data
NUMERICAL / QUANTITATIVE DATA
DISCRETE DATA
• You can check whether you are dealing with discrete data by asking the following two
questions: Can you count it, and can it be divided up into smaller and smaller parts?
Discrete data can be counted but cannot be divided in this way.
Scales / Levels of Data
NUMERICAL / QUANTITATIVE DATA
CONTINUOUS DATA
• Continuous Data represents measurements and therefore their values can’t be counted
but they can be measured.
• An example would be the height of a person, which you can describe by using intervals
on the real number line.
INTERVAL SCALE DATA
• Interval values represent ordered units that have the same difference.
• Therefore we speak of interval data when we have a variable that contains
numeric values that are ordered and where we know the exact differences
between the values.
• An example would be a feature that contains the temperature of a given place.
• The problem with interval data is that they don’t have a “true zero”.
• That means, in regard to our example, that there is no such thing as no temperature.
• With interval data, we can add and subtract, but we cannot multiply, divide or
calculate ratios.
• Because there is no true zero, a lot of descriptive and inferential statistics can’t be
applied.
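A small worked example with temperature shows why differences are meaningful on an interval scale but ratios are not. The two temperature readings are made-up values; the Celsius-to-Fahrenheit conversion is the standard one.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Hypothetical readings for the same place at two times of day.
morning_c, afternoon_c = 10.0, 20.0
morning_f = celsius_to_fahrenheit(morning_c)      # 50.0
afternoon_f = celsius_to_fahrenheit(afternoon_c)  # 68.0

# Differences are meaningful: they convert consistently between scales.
diff_c = afternoon_c - morning_c                  # 10.0 degrees Celsius
diff_f = afternoon_f - morning_f                  # 18.0 degrees Fahrenheit (10 * 9/5)

# Ratios are not: "twice as hot" depends on where the arbitrary zero sits.
ratio_c = afternoon_c / morning_c                 # 2.0
ratio_f = afternoon_f / morning_f                 # 1.36

print(diff_c, diff_f, ratio_c, ratio_f)
```

The same two temperatures give a ratio of 2.0 in Celsius and 1.36 in Fahrenheit, so the ratio says nothing about the temperatures themselves.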
Scales / Levels of Data
NUMERICAL / QUANTITATIVE DATA
CONTINUOUS DATA
In data science, you can use one-hot encoding to transform nominal data
into numeric features.
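One-hot encoding can be sketched in plain Python as below. The colour feature is a hypothetical example, and the helper name `one_hot` is chosen here for illustration rather than taken from any library.

```python
# Hypothetical nominal feature: the colour recorded for each observation.
colours = ["red", "green", "blue", "green", "red"]

# One column per distinct category, in a fixed (sorted) order.
categories = sorted(set(colours))   # ['blue', 'green', 'red']

def one_hot(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

encoded = [one_hot(colour, categories) for colour in colours]

for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each category gets its own 0/1 column, so no artificial order is imposed on the nominal values.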
Analysis
Types of Data / Variables
Types of Data / Variables Revision
In statistical research, a variable is defined as an attribute of an object of study. Choosing
which variables to measure is central to good experimental design.
Example
• If you want to test whether some plant species are more salt-tolerant than others, some
key variables you might measure are:
• amount of salt added to water
• species of plants
• growth
• wilting.
• You need to know which types of variables you are working with in order to choose
appropriate statistical tests and interpret the results of your study.
Types of Data / Variables Revision
You can usually identify the type of variable by asking two questions:
What type of data does the variable contain?
What part of the experiment does the variable represent?
● Sample Surveys
● Interviewing Respondents
◦ In-person Interviewing
◦ Telephone Interviewing
◦ Online Interviewing
◦ Mailed Questionnaire
◦ Focus Groups
◦ Observational Data Collection
◦ Experiments
Examining Relationships Between Variables
• It takes the form Ra.b.c = +/-x, read as the multiple correlation of variables b and c with a.
Regression Analysis
• The extent to which any model or equation, such as a regression line, summarizes or
“fits” the data is referred to as the goodness of fit.
Y = a + b X
Regression coefficient
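As a minimal sketch, the line Y = a + b X can be fitted by ordinary least squares and its goodness of fit summarized with R². The data are hypothetical, and using least squares and R² as the fit measure is an assumption about what the slide intends.

```python
# Hypothetical paired observations (X = hours studied, Y = exam score).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 55.0, 61.0, 64.0, 68.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates for Y = a + b X.
# b is the regression coefficient (slope); a is the intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# Goodness of fit: R^2 = 1 - SS_residual / SS_total.
predicted = [a + b * x for x in xs]
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_total = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_residual / ss_total

print(f"Y = {a:.2f} + {b:.2f} X, R^2 = {r_squared:.3f}")
```

An R² close to 1 means the fitted line accounts for most of the variation in Y for these data.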
Advanced Relationship Analysis
• There are more complex multivariate analytic procedures that assess relationships among variables.
• Canonical correlation analysis (Rc) is a form of regression analysis used to examine
the relationship between multiple independent and dependent variables.
• Path analysis examines hypothesized relationships among multiple variables (usually
independent, mediating, and dependent) for the purpose of helping to establish causal
connections and inferences by showing the “paths” the causal influences take.
• Discriminant analysis is a form of regression analysis that classifies, or discriminates,
individuals on the basis of their scores on two or more ratio/interval independent
variables into the categories of a nominal dependent variable.
• Factor analysis examines whether a large number of variables can be reduced to a
smaller number of factors (a set of variables).
• Cluster analysis examines whether multiple variables or elements are similar enough to
be placed together into meaningful groups or clusters that have not been predetermined
by the researcher.
Introduction to SPSS
https://www.youtube.com/watch?v=TZPyOJ8tFcI
End