BDPPM II - Lecture 6

The document discusses data management, emphasizing its importance for ensuring data quality, documentation, and efficient processing in research. It categorizes data into qualitative (nominal and ordinal) and quantitative (discrete and continuous) types, detailing their characteristics and examples. Additionally, it covers data analysis techniques and statistical software tools, highlighting the significance of planning analysis methods early in the research process.

6. USE STATISTICAL SOFTWARE TO PREPARE AND ANALYZE DATA
Lecture 6
6.1 Data management
What is data management?
• Data management is the process of designing data collection instruments, looking after data sheets, entering data into computer files, checking the data for accuracy, maintaining records of the processing steps, and archiving the data for future access.
• It also includes data ownership and responsibility issues.
6.1.1 Importance of data management
• To ensure data quality. Since conclusions are based on data, accuracy is paramount; errors arising from wrong data entry, incorrect methods of conversion, and mistakes in combining numbers must be avoided.
• Documentation and archiving. Documenting (describing) data and archiving it are important so that anybody can make sense of the rows and columns of numbers in your numerous data files. This matters both for ongoing research and for future use.
• Efficient data processing. Scientists spend a great deal of time preparing data for analysis. This includes converting data to suitable formats, merging data sitting in different files, and summarising data from field measurements.
6.2 Data in Research
• Research data is any information that has been collected, observed, generated or created to validate original research findings.
• Research data may be arranged or formatted in such a way as to make it suitable for communication, interpretation and processing.
• Today data is everywhere, in every field. Whether you are a data scientist, marketer, businessperson, data analyst, researcher, or in any other profession, you need to work with raw or structured data.
6.3 Types of data
6.3.1 Qualitative or Categorical Data
• Qualitative or categorical data is a type of data that can't be measured or counted in the form of numbers.
• These data are sorted by category, not by number, which is why they are also known as categorical data. They consist of audio, images, symbols, or text. The gender of a person (male, female, or other) is an example of qualitative data.
• Qualitative data are further classified into two types, namely nominal data and ordinal data.
6.3.1a: Nominal Data
• Nominal data is used to label variables without any order or quantitative value.
• The color of hair can be considered nominal data, as one color can't be ranked against another.
• The name "nominal" comes from the Latin word "nomen," which means "name."
• With nominal data we can't perform any numerical operations, and there is no natural order by which to sort the data.
6.3.1a: Examples of nominal data
• Color of hair (blonde, red, brown, black, etc.)
• Marital status (single, widowed, married)
• Nationality (Indian, German, American)
• Gender (male, female)
• Eye color (black, brown, etc.)
6.3.1b: Ordinal Data
• Ordinal data have a natural ordering, in which each value holds a position on some scale.
• These data are used for observations such as customer satisfaction, happiness, etc., but we can't perform arithmetic on them.
• Ordinal data is qualitative data whose values have some kind of relative position.
• These kinds of data can be considered "in-between" qualitative and quantitative data.
• Ordinal data only show sequence; the values cannot be used for arithmetic computation. Compared to nominal data, ordinal data have a kind of order that is not present in nominal data.
6.3.1b: Examples of ordinal data
• Feedback, experience, or satisfaction ratings on a scale of 1 to 10
• Letter grades in an exam (A, B, C, D, etc.)
• Ranking of people in a competition (first, second, third, etc.)
• Economic status (high, medium, low)
• Education level (higher, secondary, primary)
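The nominal/ordinal distinction maps directly onto how statistical software stores categories. As a minimal sketch, assuming Python with pandas as the tool (the lecture does not prescribe one): an ordered categorical captures ordinal data, an unordered one captures nominal data, and neither permits arithmetic.

```python
import pandas as pd

# Ordinal: letter grades have a natural order but support no arithmetic.
grades = pd.Series(pd.Categorical(
    ["B", "A", "C", "A", "D"],
    categories=["D", "C", "B", "A"],  # lowest to highest
    ordered=True,
))
print(grades.min(), grades.max())  # D A  -> order-based operations work
print((grades >= "B").sum())       # 3   -> comparisons against the scale work

# Nominal: hair color has no order; only labelling and counting make sense.
hair = pd.Series(pd.Categorical(["Brown", "Black", "Brown", "Blonde"]))
print(hair.value_counts())         # frequencies are fine
# hair.min() would raise a TypeError because the categories are unordered.
```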
Common qualitative research methods
a. Ethnography / participant observation
• In many respects, the two terms refer to similar approaches to data collection, in which the researcher is immersed in a social setting for some time to observe and listen, with a view to gaining an appreciation of the culture of a social group.
Common qualitative research methods…
b. Qualitative interviewing
• This refers to a wide range of interviewing styles, as well as to the fact that qualitative researchers employing ethnography or participant observation typically engage in qualitative interviewing.
c. Focus groups
• An interview using open questions to ask interviewees about a specific situation or event, in which the interviewees discuss the issue in groups.
Common qualitative research methods…
d. Language-based approaches
• Approaches to the collection of qualitative data based on language, such as discourse and conversation analysis.
e. The collection and qualitative analysis of texts and documents
6.3.2: Quantitative Data
• Quantitative data is a type of data that can be expressed in numerical values, making it countable and amenable to statistical analysis.
• These kinds of data are also known as numerical data.
• It answers questions like "how much," "how many," and "how often."
• For example, the price of a phone, a computer's RAM, the height or weight of a person, etc., all fall under quantitative data.
6.3.2: Quantitative Data…
• Quantitative data can be used for statistical manipulation.
• These data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie charts, line graphs, etc. (sketched below).
• Examples of quantitative data:
• Height or weight of a person or object
• Room temperature
• Scores and marks (e.g., 59, 80, 60, etc.)
• Time
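As a minimal sketch of two of those chart types, assuming Python with matplotlib and numpy (the data are simulated, not from the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
heights = rng.normal(loc=170, scale=8, size=200)  # hypothetical heights in cm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=15)  # histogram: shape of a continuous distribution
ax1.set(title="Histogram", xlabel="Height (cm)", ylabel="Count")
ax2.boxplot(heights)        # boxplot: centre and spread at a glance
ax2.set(title="Boxplot", ylabel="Height (cm)")
fig.tight_layout()
plt.show()
```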
6.3.2: Quantitative Data…
• Quantitative data are further classified into two types, namely discrete data and continuous data.
6.3.2a: Discrete Data…
• The term discrete means distinct or separate.
• Discrete data contain values that are integers or whole numbers.
• The total number of students in a class is an example of discrete data.
• These data can't be broken into decimal or fractional values.
6.3.2a: Discrete Data…
• Discrete data are countable and take finite values; they cannot be subdivided.
• These data are represented mainly by bar graphs, number lines, or frequency tables.
• Examples of discrete data:
• Total number of students present in a class
• Cost of a cell phone
• Number of employees in a company
• Total number of players who participated in a competition
• Days in a week
6.3.2b: Continuous Data…
• Continuous data take the form of fractional numbers.
• Examples include the version of an Android phone, the height of a person, the length of an object, etc.
• Continuous data represent information that can be divided into ever smaller levels.
• A continuous variable can take any value within a range.
6.3.2b: Continuous Data…
• The key difference between discrete and continuous data is that discrete data contain integers or whole numbers, while continuous data store fractional numbers, recording quantities such as temperature, height, width, time, speed, etc.
6.3.2b: Continuous Data…
• Examples of continuous data:
• Height of a person
• Speed of a vehicle
• Time taken to finish the work
• Wi-Fi frequency
• Market share price
Quantitative data analysis and statistical software tools
• A common error is that people first collect their data and only then decide which method of analysis to employ. This happens because quantitative data analysis looks like a distinct phase that occurs after the data have been collected.
• Quantitative data analysis does indeed typically occur at a late stage in the overall process, and it is a distinct stage.
• Yet that does not mean you should wait until then to consider how you will analyse your data.
• In fact, you should be fully aware of what techniques you will apply at a fairly early stage, when you are designing a questionnaire, observation schedule, coding frame, or whatever. This is because:
• Techniques have to be appropriately matched to the types of variables you have created through your research. This means you must be fully conversant with the ways in which different types of variable are classified.
• The size and nature of your sample are likely to impose limitations on the kinds of techniques you can use.
• So you need to know that decisions you make at an early stage in the research process (the kind of data you collect, the size of your sample) will have implications for the sorts of analysis you will be able to conduct.
Content analysis
• Content analysis is an approach to the analysis of documents and texts (which may be printed or visual) that seeks to quantify content in terms of predetermined categories, in a systematic and replicable manner.
• It is a very flexible method that can be applied to a variety of different media.
• In a sense it is not a research method, in that it is an approach to the analysis of documents and texts rather than a means of generating data.
• Content analysis (Berelson, 1952) is a technique for the objective, systematic and quantitative description of the manifest content of communication.
• Content analysis can be contrasted with two other approaches to the analysis of the content of communication:
• Semiotics, the study (or science) of signs, is an approach to the analysis of documents and other phenomena that emphasizes the importance of seeking out their deeper meaning. A semiotic approach is concerned to uncover the process of meaning production and how signs are designed to have an effect upon actual and prospective consumers of those signs.
• Ethnographic content analysis refers to an approach to documents that emphasizes the role of the investigator in the construction of the meaning of and in texts. It is also referred to as qualitative content analysis, because there is an emphasis on allowing categories to emerge out of the data and on recognizing the significance, for understanding meaning, of the context in which an item is analyzed.
Dispositions
• A further level of interpretation is likely to be entailed when the researcher seeks to demonstrate a disposition in the texts being analysed. One way in which dispositions may be revealed in content analysis is through the coding of ideologies, beliefs or principles.
Coding
• Coding is a crucial stage in the process of doing a content analysis.
• There are two elements to a content analysis coding scheme: designing a coding schedule and designing a coding manual.
• Coding schedule: the coding schedule is a form onto which all the data relating to an item being coded will be entered. The schedule is kept simple in order to facilitate the discussion of the principles of coding in content analysis and of the construction of a coding schedule.
• Coding manual: the coding manual is a statement of instructions to coders that also includes all the possible categories for each dimension being coded. It provides a list of all the dimensions, the number codes that correspond to each category, and guidance on what each dimension is concerned with and any factors that should be taken into account in deciding how to allocate a particular code to each dimension. (A small sketch of this pairing follows.)
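The manual/schedule pairing maps naturally onto a lookup table plus one record per coded item. A minimal sketch in Python, with entirely hypothetical dimensions and category codes:

```python
# Hypothetical coding manual: each dimension lists its permitted category codes.
coding_manual = {
    "topic": {1: "politics", 2: "economy", 3: "sport"},
    "tone":  {1: "positive", 2: "neutral", 3: "negative"},
}

# One row of the coding schedule: the codes a coder entered for a single item.
schedule_row = {"item_id": 17, "topic": 2, "tone": 3}

# Validate the row against the manual before it enters the data file.
for dimension, categories in coding_manual.items():
    code = schedule_row[dimension]
    assert code in categories, f"invalid code {code} for dimension {dimension!r}"
    print(dimension, "->", categories[code])
```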
Missing data
• A general problem that arises in quantitative data analysis is how to handle "missing data".
• Missing data may arise when respondents fail to reply to a question, either by accident or because they do not want to answer it.
• When there are missing data, it is necessary to code them with a number that could not also be a true value.
• For instance, if missing data are coded as "9999", it is important to ensure that the computer software is notified of this fact, since it needs to be taken into account during the analysis (see the sketch below).
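A minimal sketch of "notifying the software", assuming Python with pandas (the lecture names no specific package); the income figures are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical survey column in which non-response was keyed in as 9999.
income = pd.Series([42000, 9999, 38500, 51000, 9999])

# Recoding 9999 to NaN tells pandas which value means "missing",
# so summaries automatically leave it out.
income = income.replace(9999, np.nan)
print(income.mean())        # averaged over the three real answers only
print(income.isna().sum())  # 2 missing responses

# When reading a file, the same effect: pd.read_csv(path, na_values=[9999])
```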
Types of variables
• When you look at a questionnaire, the kind of information you receive varies by question, so the answers differ between them.
• This leads to a classification of the different types of variable that are generated in the course of research.
• There are four main types of variable:
i. Interval/ratio variables
• These are variables where the distances between the categories are identical across the range of categories. For example, when one person spends 32 minutes on an activity and another person spends 33 minutes on the same activity, the distance between the categories is one minute.
• This is the highest level of measurement, and a very wide range of techniques of analysis can be applied to interval/ratio variables.
• Strictly, there is a distinction between interval and ratio variables, in that ratio variables are interval variables with a fixed zero point. The distinction matters little in social research, however, since most interval variables used there do exhibit this quality.
ii. Ordinal variables
• These are variables whose categories can be rank ordered (as with interval/ratio variables), but the distances between the categories are not equal across the range.
• Also, if you subsequently group an interval/ratio variable such as people's age into categories (e.g. 20 and under; 21–30; 31 and over), you are transforming it into an ordinal variable.
iii. Nominal variables
• These variables, also known as categorical variables, comprise categories that cannot be rank ordered.
iv. Dichotomous variables
• These variables contain data that have only two categories (e.g. gender). Their position in relation to the other types is slightly ambiguous, as they have only one interval. They look like nominal variables, but because they have only one interval they are sometimes treated as ordinal variables.
KEY MODELS FOR DATA ANALYSIS
Descriptive analysis model
• Employs statistical computations (a pandas sketch follows this list) such as:
• Frequency and percentage
• Mean
• Multiple response
• Likert scale
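As a minimal sketch, assuming Python with pandas as the statistical tool and hypothetical column names, the first two computations look like this:

```python
import pandas as pd

# Hypothetical responses: a 5-point Likert item and respondents' ages.
df = pd.DataFrame({
    "satisfaction": [4, 5, 3, 4, 2, 5, 4, 3],
    "age":          [23, 31, 27, 45, 38, 29, 52, 33],
})

# Frequency and percentage of each Likert response.
freq = df["satisfaction"].value_counts().sort_index()
pct = (df["satisfaction"].value_counts(normalize=True).sort_index() * 100).round(1)
print(pd.DataFrame({"frequency": freq, "percent": pct}))

# Mean of a numeric variable.
print("mean age:", df["age"].mean())
```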
Inferential analysis model
• Employs econometric models in the analysis of the data, such as:
• Regression models (linear and multiple linear regression), i.e. models of the relationship between one dependent variable and one or more independent variables
• Bivariate or multivariate models, which relate one or more dependent variables to one or more independent variables; such models include Pearson's r, the t test, logit and probit
Inferential Statistics
Introduction & Examples
• While descriptive statistics summarize the characteristics of a data set, inferential statistics help you come to conclusions and make predictions based on your data.
• When you have collected data from a sample, you can use inferential statistics to understand the larger population from which the sample is taken.
• Inferential statistics have two main uses:
i. making estimates about populations (for example, the mean SAT score of all 11th graders in the US);
ii. testing hypotheses to draw conclusions about populations (for example, the relationship between SAT scores and family income).
Descriptive versus inferential statistics
• Descriptive statistics allow you to describe a data set, while
• inferential statistics allow you to make inferences based on a data set.
Descriptive statistics
• Using descriptive statistics, you can report characteristics of your data:
• The distribution concerns the frequency of each value.
• The central tendency concerns the averages of the values.
• The variability concerns how spread out the values are.
Descriptive statistics
• In descriptive statistics, there is no uncertainty: the statistics precisely describe the data that you collected. If you collect data from an entire population, you can directly compare these descriptive statistics to those from other populations.
Descriptive statistics
Example: Descriptive statistics
• You collect data on the SAT scores of all 11th graders in a school for three years.
• You can use descriptive statistics to get a quick overview of the school's scores in those years. You can then directly compare the mean SAT score with the mean scores of other schools.
Inferential statistics
• Most of the time, you can only acquire data from samples, because it is too difficult or expensive to collect data from the whole population that you're interested in.
• While descriptive statistics can only summarize a sample's characteristics, inferential statistics use your sample to make reasonable guesses about the larger population.
• With inferential statistics, it's important to use random and unbiased sampling methods. If your sample isn't representative of your population, then you can't make valid statistical inferences or generalize.
Inferential statistics
Example: Inferential statistics
• You randomly select a sample of 11th graders in your state and collect data on their SAT scores and other characteristics.
• You can use inferential statistics to make estimates and test hypotheses about the whole population of 11th graders in the state based on your sample data.
Sampling error in inferential statistics
• Since the size of a sample is always smaller than the size of the population, some of the population isn't captured by sample data.
• This creates sampling error, which is the difference between the true population values (called parameters) and the measured sample values (called statistics).
• Sampling error arises any time you use a sample, even if your sample is random and unbiased. For this reason, there is always some uncertainty in inferential statistics. However, using probability sampling methods reduces this uncertainty.
Estimating population parameters from sample statistics
• The characteristics of samples and populations are described by numbers called statistics and parameters:
• A statistic is a measure that describes the sample (e.g., sample mean).
• A parameter is a measure that describes the whole population (e.g., population mean).
• Sampling error is the difference between a parameter and a corresponding statistic. Since in most cases you don't know the real population parameter, you can use inferential statistics to estimate these parameters in a way that takes sampling error into account.
• There are two important types of estimates you can make about the population: point estimates and interval estimates.
• A point estimate is a single-value estimate of a parameter. For instance, a sample mean is a point estimate of a population mean.
• An interval estimate gives you a range of values where the parameter is expected to lie. A confidence interval is the most common type of interval estimate.
• Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.
Confidence intervals
• A confidence interval uses the variability around a statistic to come up with an interval estimate for a parameter. Confidence intervals are useful for estimating parameters because they take sampling error into account.
• While a point estimate gives you a precise value for the parameter you are interested in, a confidence interval tells you the uncertainty of the point estimate. They are best used in combination with each other.
Confidence intervals
• Each confidence interval is associated with a confidence level. A confidence level tells you the probability (in percentage) of the interval containing the parameter estimate if you repeat the study again.
• A 95% confidence interval means that if you repeat your study with a new sample in exactly the same way 100 times, you can expect your estimate to lie within the specified range of values 95 times.
Confidence intervals
• Although you can say that your estimate will lie within the interval a certain percentage of the time, you cannot say for sure that the actual population parameter will. That's because you can't know the true value of the population parameter without collecting data from the full population.
• However, with random sampling and a suitable sample size, you can reasonably expect your confidence interval to contain the parameter a certain percentage of the time. (A computational sketch follows.)
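A minimal sketch of a point estimate with its 95% confidence interval, assuming Python with scipy; the scores are simulated, not real SAT data:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 30 SAT-style scores.
rng = np.random.default_rng(seed=42)
scores = rng.normal(loc=1050, scale=120, size=30)

point_estimate = scores.mean()   # point estimate of the population mean
sem = stats.sem(scores)          # standard error of the mean
low, high = stats.t.interval(0.95, df=len(scores) - 1,
                             loc=point_estimate, scale=sem)
print(f"point estimate: {point_estimate:.1f}")
print(f"95% CI: ({low:.1f}, {high:.1f})")
```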
Hypothesis testing
• Hypothesis testing is a formal process of statistical analysis using inferential statistics. The goal of hypothesis testing is to compare populations or assess relationships between variables using samples.
• Hypotheses, or predictions, are tested using statistical tests. Statistical tests also estimate sampling errors so that valid inferences can be made.
• Statistical tests can be parametric or non-parametric. Parametric tests are considered more statistically powerful because they are more likely to detect an effect if one exists.
Hypothesis testing
• Parametric tests make assumptions that include the following (a quick check is sketched after this list):
• the population that the sample comes from follows a normal distribution of scores;
• the sample size is large enough to represent the population;
• the variances, a measure of variability, of each group being compared are similar.
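A hedged sketch of checking the first and third assumptions, assuming Python with scipy.stats and simulated group scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
group_a = rng.normal(50, 10, size=40)  # hypothetical scores, group A
group_b = rng.normal(55, 10, size=40)  # hypothetical scores, group B

# Normality: Shapiro-Wilk tests H0 "sample comes from a normal distribution".
print(stats.shapiro(group_a))          # p > 0.05 -> no evidence against normality

# Equal variances: Levene's test compares the groups' variances.
print(stats.levene(group_a, group_b))  # p > 0.05 -> variances look similar
```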
Hypothesis testing
• When your data violate any of these assumptions, non-parametric tests are more suitable. Non-parametric tests are called "distribution-free tests" because they don't assume anything about the distribution of the population data.
• Statistical tests come in three forms:
i. tests of comparison
ii. correlation
iii. regression
Comparison tests
• Comparison tests assess whether there are differences in means, medians or rankings of scores of two or more groups.
• To decide which test suits your aim, consider whether your data meet the conditions necessary for parametric tests, the number of samples, and the levels of measurement of your variables.
• Means can only be found for interval or ratio data, while medians and rankings are more appropriate measures for ordinal data.
Comparison tests

Comparison test   Parametric?   What's being compared?   Samples
t test            Yes           Means                    2 samples
ANOVA             Yes           Means                    3+ samples
Mood's median     No            Medians                  2+ samples
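A minimal sketch of the three tests in the table, assuming Python with scipy.stats (an assumption; the lecture mandates no package) and simulated group scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
a = rng.normal(50, 10, size=30)  # three hypothetical groups of scores
b = rng.normal(55, 10, size=30)
c = rng.normal(53, 10, size=30)

# t test: compare the means of 2 samples.
t_stat, p_t = stats.ttest_ind(a, b)
print(f"t test: t={t_stat:.2f}, p={p_t:.3f}")

# ANOVA: compare the means of 3+ samples.
f_stat, p_f = stats.f_oneway(a, b, c)
print(f"ANOVA: F={f_stat:.2f}, p={p_f:.3f}")

# Mood's median test: compare medians without parametric assumptions.
stat, p_med, grand_median, table = stats.median_test(a, b, c)
print(f"Mood's median test: p={p_med:.3f}")
```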
Correlation tests
• Correlation tests determine the extent to which two variables are associated.
• Although Pearson's r is the most statistically powerful test, Spearman's r is appropriate for interval and ratio variables when the data doesn't follow a normal distribution.
• The chi square test of independence is the only test that can be used with nominal variables.
Correlation tests

Correlation test                  Parametric?   Variables
Pearson's r                       Yes           Interval/ratio variables
Spearman's r                      No            Ordinal/interval/ratio variables
Chi square test of independence   No            Nominal/ordinal variables
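A hedged sketch of all three correlation tests, again assuming Python with scipy.stats and simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.8, size=50)  # hypothetical related variable

# Pearson's r: parametric, interval/ratio data.
r, p = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.2f} (p={p:.3f})")

# Spearman's r: rank-based, so also valid for ordinal data.
rho, p = stats.spearmanr(x, y)
print(f"Spearman's rho = {rho:.2f} (p={p:.3f})")

# Chi square test of independence on a nominal-by-nominal table.
observed = np.array([[20, 15],
                     [10, 25]])               # hypothetical 2x2 counts
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi square = {chi2:.2f} (p={p:.3f})")
```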
Regression tests
• Regression tests demonstrate whether changes in predictor variables cause changes in an outcome variable. You can decide which regression test to use based on the number and types of variables you have as predictors and outcomes.
• Most of the commonly used regression tests are parametric. If your data are not normally distributed, you can perform data transformations.
• Data transformations help you make your data normally distributed using mathematical operations, like taking the square root of each value.
Regression tests

Regression test              Predictor                      Outcome
Simple linear regression     1 interval/ratio variable      1 interval/ratio variable
Multiple linear regression   2+ interval/ratio variable(s)  1 interval/ratio variable
Logistic regression          1+ any variable(s)             1 binary variable
Nominal regression           1+ any variable(s)             1 nominal variable
Ordinal regression           1+ any variable(s)             1 ordinal variable
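A minimal sketch of the first row of the table plus the square-root transformation mentioned above, assuming Python with scipy and simulated study-hours data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=9)
hours = rng.uniform(0, 10, size=60)                    # predictor (interval/ratio)
score = 40 + 5 * hours + rng.normal(scale=6, size=60)  # outcome (interval/ratio)

# Simple linear regression: one interval/ratio predictor, one outcome.
result = stats.linregress(hours, score)
print(f"slope={result.slope:.2f}, intercept={result.intercept:.2f}, "
      f"R^2={result.rvalue**2:.2f}, p={result.pvalue:.3g}")

# A square-root transformation can pull a skewed outcome closer to
# normality before a parametric model is fitted.
skewed = rng.exponential(scale=4.0, size=60)
transformed = np.sqrt(skewed)
```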
