

OMBA-235

R PROGRAMMING
FOR BUSINESS
PROGRAMME DESIGN COMMITTEE MBA (CBCS)

COURSE DESIGN COMMITTEE

COURSE PREPARATION TEAM

PRINT PRODUCTION
Copyright Reserved 2022
All rights reserved. No part of this publication, which is material protected by this
copyright notice, may be reproduced, transmitted, utilized or stored in any form or by
any means, now known or hereinafter invented, electronic, digital or mechanical,
including photocopying, scanning, recording, or by any information storage or
retrieval system, without prior permission from the publisher.
The information contained in this book has been obtained by its authors from sources
believed to be reliable and is correct to the best of their knowledge. However, the
publisher and its authors shall in no event be liable for any errors, omissions or
damages arising out of the use of this information and specifically disclaim any
implied warranties of merchantability or fitness for any particular use.

R PROGRAMMING FOR BUSINESS
Unit - 1 Basic Statistics with R
Unit - 2 Classification and Tabulation of Data
Unit - 3 Descriptive Statistics
Unit - 4 Moments
Unit - 5 Measures of Skewness


UNIT - 1 BASIC STATISTICS WITH R

STRUCTURE
1.0 Objectives
1.1 Introduction
1.2 Significance of Statistics
1.3 Primary and Secondary Data
1.3.1 Primary Data
1.3.2 Secondary Data
1.4 Data Collection Methods
1.4.1 Surveys
1.4.2 Interviews
1.4.3 Observations
1.4.4 Experiments
1.5 Presentation of Numerical and Categorical Data
1.5.1 Numerical Data
1.5.2 Categorical Data
1.6 Let Us Sum Up
1.7 Keywords
1.8 Some Useful Books
1.9 Answers to Check Your Progress
1.10 Terminal Questions

1.0 OBJECTIVES

After studying this unit, you will be able to:

 Recognize the pivotal role of statistics in providing a systematic
approach to collecting, analysing, and interpreting data for
informed decision-making.
 Understand the distinction between primary data, collected directly
from original sources, and secondary data, obtained from existing
sources like published reports and databases.

 Explore the concept of surveys as a method of gathering
information from a sample through standardized questionnaires
or interviews.
 Grasp the nuances of interviews, involving direct conversations
between researchers and respondents to obtain detailed
information on a particular topic.
 Interpret the presentation of numerical data, including the use of
measures such as mean, median, and standard deviation.
 Understand the presentation of categorical data through methods
like bar charts and frequency tables.
 Apply statistical methods to make predictions and inferences based
on sample data, enhancing your ability to draw meaningful insights
from complex datasets.

1.1 INTRODUCTION

Statistics plays a crucial role in various fields, helping us make sense of
complex information and draw meaningful conclusions. In essence, it
involves the collection, analysis, interpretation, presentation, and
organization of data. By employing statistical methods, we can gain
insights into patterns, trends, and relationships within datasets. This is
instrumental in decision-making processes across diverse domains, such
as economics, sociology, medicine, and more.
Data, the raw information used in statistical analysis, can be categorized
into primary and secondary types. Primary data is directly collected from
original sources, providing firsthand information. Secondary data, on the
other hand, is derived from existing sources, like books, articles, or
databases. Understanding these distinctions is crucial as it influences the
reliability and applicability of the data used in statistical studies. Primary
data is unique and tailored to the specific research at hand. It is obtained
through methods like surveys, interviews, or experiments, allowing
researchers to gather information directly from the source. This firsthand
data is often more accurate and relevant to the study being conducted.

Secondary data, although not directly collected for the current research,
serves as a valuable resource. It is obtained from already published or
recorded sources, making it convenient and cost-effective. However,
researchers must critically assess its relevance and reliability to ensure
accurate and meaningful results.
Various methods are employed to collect data, each suited to different
types of studies. These include surveys, interviews, observations, and
experiments. Surveys involve systematically collecting information from
a sample of individuals. This method is effective for gathering opinions,
preferences, and quantitative data on a large scale. Surveys often employ
questionnaires or interviews to elicit responses. Interviews provide a more
in-depth understanding of individual perspectives. Researchers engage
directly with participants, asking open-ended questions to explore nuances
and gather detailed qualitative data. Observational studies involve
systematically watching and recording events or behaviors. This method
is particularly useful when studying natural settings and behaviors without
interference. Experiments are designed to establish cause-and-effect
relationships. Researchers manipulate variables and observe the resulting
outcomes to draw conclusions about the factors influencing a particular
phenomenon.
Once data is collected, it needs to be presented in a meaningful manner.
Numerical and categorical data are two fundamental types of data
presentation. Numerical data involves measurable quantities and is often
presented using statistical measures such as mean, median, and standard
deviation. Graphs, charts, and tables are also commonly used to visually
represent numerical data. Categorical data consists of distinct categories
or groups. Bar graphs, pie charts, and frequency tables are commonly used
to represent categorical data, providing a clear visual depiction of
distribution and proportions within different categories.

1.2 SIGNIFICANCE OF STATISTICS

Statistics plays a vital role in data analysis by providing methods and tools
to describe, organize, interpret, and make inferences from data.

At its core, statistics is concerned with collecting, summarizing,
analyzing, and making conclusions from data in order to understand the
phenomena the data represent. Statistics allows researchers to analyze
data sets, test hypotheses, estimate parameters, and identify trends and
relationships. Common statistical methods include descriptive statistics
such as measures of central tendency (means, medians, modes) and
measures of variability (variance, standard deviation), visualization
techniques such as graphs and charts, probability distributions, statistical
inference methods such as confidence intervals and hypothesis tests,
correlation analysis, regression analysis, analysis of variance, and many
more advanced methods.
By applying appropriate statistical techniques, researchers can extract
meaningful information from raw data. Statistics provides standardized,
quantitative ways to organize, describe, and compare data sets. It also
enables researchers to assess the strength of apparent relationships and to
quantify the confidence in their conclusions. Interpreting data through a
statistical lens allows for more objective, precise, and rigorous analysis
than simply inspecting data visually.
Basics of Statistics:
Descriptive Statistics: Descriptive statistics provide simple summaries
about the sample and the measures. They form the basis of virtually
every quantitative analysis of data. In R, a variety of built-in functions
and visualization methods exist for descriptive statistics and exploratory
data analysis.
A vital early step when receiving a new dataset is to explore it
graphically and numerically to discern overall tendencies, variability, and
the scales involved before proceeding to formal modelling or hypothesis
testing. In R, functions like str(), summary(), dim(), head() and plot() quickly provide
initial impressions of the number of observations, variables, data types,
completeness etc.
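A minimal sketch of this first pass, using a small hypothetical data frame (the names sales_df, region and revenue are illustrative, not part of any particular dataset), could look like this:

    # Build a small illustrative data frame (made-up values)
    sales_df <- data.frame(
      region  = c("North", "South", "East", "West", "North"),
      revenue = c(120, 95, 130, 110, 150)
    )

    str(sales_df)      # structure: variable names, types and a preview
    dim(sales_df)      # number of rows and columns
    head(sales_df)     # first few observations
    summary(sales_df)  # quick numeric and categorical summaries
    plot(sales_df$revenue, type = "b",
         main = "Revenue by observation")   # simple first look at the values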
Numerical descriptive statistics summarize precise information from the
sample, including central tendency (mean, median, mode),
dispersion/variability (range, interquartile range, variance, standard
deviation), and shape of the distribution (skewness, kurtosis).

R has inbuilt functions for each - mean(), median(), sd(), var(), IQR() etc.
The summary() function also calculates several of these together. Graphical
approaches like hist(), boxplot() and scatterplots visualize the
distributional shape and outliers.
Quantiles like the 0.25, 0.5 and 0.75 quantiles aid comparison of centrality.
tapply() applies a function over subsets of data conditioned on factors,
aggregate() can summarize by groups, and table() produces summary tables
for reporting. The pastecs package has more descriptive functions, while dplyr
and data.table provide faster, cleaner data manipulation.
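The short sketch below applies these functions to a hypothetical vector of monthly sales figures; the numbers and the grouping factor are invented purely for illustration:

    monthly_sales <- c(23, 41, 37, 29, 52, 44, 31, 60, 48, 35, 39, 55)

    mean(monthly_sales)      # arithmetic mean
    median(monthly_sales)    # middle value
    sd(monthly_sales)        # standard deviation
    var(monthly_sales)       # variance
    IQR(monthly_sales)       # interquartile range
    quantile(monthly_sales, probs = c(0.25, 0.5, 0.75))   # quartiles
    summary(monthly_sales)   # several of the above at once

    # Group-wise summaries on a hypothetical grouping factor
    region <- rep(c("North", "South"), each = 6)
    tapply(monthly_sales, region, mean)                             # mean by region
    aggregate(monthly_sales, by = list(region = region), FUN = sd)  # sd by region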
Inferential Statistics: Inferential statistics consists of methods that allow
researchers to draw conclusions about a population from a sample.
It leverages probability theory and distributions to make estimates, test
hypotheses, and identify statistical relationships. R provides an extensive
range of built-in facilities and packages for inferential analysis.
Unlike descriptive statistics that summarize data directly, inferential
statistics employs mathematical models to infer population parameters
from samples and quantify the certainty or likelihood in those inferences.
This requires basic concepts from probability theory - random variables,
probability distributions, central limit theorem, standard errors that
estimate sampling distribution reliability.
In R, probability distributions can be simulated to build intuition and support
inference - rnorm(), rpois(), rbeta() etc. generate random draws whose plots visualize distributional forms.
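As a brief, hedged illustration (sample sizes and parameters chosen arbitrarily), simulated draws can be inspected with histograms:

    set.seed(42)                                # for reproducibility
    x_norm <- rnorm(1000, mean = 50, sd = 10)   # 1,000 normal draws
    x_pois <- rpois(1000, lambda = 4)           # 1,000 Poisson counts

    hist(x_norm, main = "Simulated normal distribution")
    hist(x_pois, main = "Simulated Poisson distribution")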
Statistical inference procedures test hypotheses and derive estimates for
unknown population quantities given sample statistics. Confidence
intervals provide ranges for unknown parameters. Hypothesis testing
checks claims about population means, proportions, ANOVA effects etc.
R's inference functions connect probability theory with the data to enable valid
statistical inferences from samples. t.test() and wilcox.test() check for
differences in means or medians, prop.test() compares proportions, and lm() builds linear
regression models. Sample statistics plug into these formulas to produce
p-values testing null hypotheses and confidence intervals estimating
effect sizes. Users need not manually calculate sampling distributions.
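A sketch of how sample data plug into these functions, using simulated values rather than real business data, might be:

    set.seed(1)
    group_a <- rnorm(30, mean = 100, sd = 15)   # e.g. sales under strategy A
    group_b <- rnorm(30, mean = 108, sd = 15)   # e.g. sales under strategy B
    t.test(group_a, group_b)                    # p-value and CI for the difference in means

    prop.test(x = c(45, 60), n = c(200, 210))   # compare two sample proportions

    x <- rnorm(50, mean = 10, sd = 2)           # hypothetical predictor
    y <- 3 + 2 * x + rnorm(50)                  # outcome with a known linear signal
    summary(lm(y ~ x))                          # coefficients, p-values, R-squared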


Importance of Statistics in R:

 Data Exploration and Visualization: Data visualization and
exploration comprise indispensable tools for obtaining insights
and communicating statistical findings from quantitative data
analysis. R furnishes a wide selection of versatile data
visualization, exploration, and graphical presentation methods
that integrate tightly with its statistical capabilities. R provides
base graphics like the traditional plots - scatterplots, dot charts,
bar charts, pie charts and histograms. pairs() generates a matrix of
scatterplots for visual correlation detection. The lattice package
offers multi-panel conditioning plots like xyplot and bwplot to
visualize how a response varies by levels of one or more factors.
ggplot2 implements the layered grammar of graphics approach,
allowing immense flexibility and customization for publication-
grade graphics through its qplot and ggplot functions and +
syntax. Other tidyverse packages like tidyr and dplyr feed cleanly
into such visualization. For data exploration, R delivers tools like
summary, str, head, plot for rapid initial investigations.
table() presents cross-tabulations. cor() and cov() estimate correlation
and covariance associations. Data can be subsetted flexibly for
groupwise comparisons. The apply family of functions iterates analyses
over data margins. Histogram and density plots visually examine
distributions.
 Hypothesis Testing: Hypothesis testing forms a central
application of statistical inference to quantify the evidence in data
relative to some claim about a phenomenon. R contains a wide
array of functions to facilitate hypothesis testing for data-driven
research. The hypothesis testing framework conventionally
involves formulating a null hypothesis and an alternative
hypothesis. The null typically corresponds to a "no difference" or
"no effect" baseline claim, while the alternative represents the
actual research question probing a population mean, proportion,
correlation etc. Appropriate data is collected relevant to these
hypotheses and a test statistic is chosen that measures
compatibility between the sample and null claim. R has a suite of
hypothesis tests built-in and via packages - t.test() and wilcox.test() for
comparing means, prop.test() for comparing proportions, chisq.test() for
count data, and cor.test() for
correlations. These tests output the sample statistics, test statistic
value, degrees of freedom, the p-value measuring probability of
observed (or more extreme) data under the null, and confidence
interval for the effect's magnitude. If the p-value falls below the
chosen significance level (often 0.05), the null hypothesis is
rejected - indicating insufficient compatibility between the
sample data distribution and the null claim. Then we conclude the
alternative is statistically supported. Failing to reject indicates
inadequate evidence against the null. Hypothesis testing
formalizes making data-based statistical inferences about effect
presence and generalizability.
 Regression Analysis: Regression analysis refers to a family of
statistical techniques investigating the relationships between a
dependent outcome variable and one or more independent
explanatory variables. Amongst the most widely-used statistical
methods, regression facilitates modelling and predicting
continuous, discrete, and categorical outcome variables from a set
of predictors. R contains versatile in-built regression modelling
functions. The basic linear regression model estimates the linear
relationship between a quantitative response variable like income,
height etc. and quantitative predictor variables like age, weight
through the model - outcome = b0 + b1x1 + b2x2 + ... This
estimates the intercept (b0) and slope coefficients (b1, b2)
mapping predictors to outcome. R's lm() function fits this model,
estimating coefficients and quantifying uncertainty, and abline()
visualizes the fit (see the sketch after this list). For binary categorical outcomes taking
0/1 values, logistic regression models the probability or odds of
"success" as explained by predictors through a logit
transformation handling the range constraint. R's glm allows this
logistic and other generalized linear models like Poisson or
Gamma regression suited to count/percentage and strictly positive
continuous outcomes respectively. Beyond these basic
regressions, R delivers support for a variety of advanced
modeling - multivariate responses, non-linear trends,
regularization for high-dimensional data, interactions, tree-based
ensemble methods like randomForest, and generalized additive
models via the mgcv package. Regression in R thus provides a
versatile modelling framework for explaining variation.
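The sketch below, referred to in the regression item above, fits a simple linear model and a logistic model on simulated data. The variable names (age, income, purchased) and parameter values are assumptions made only for illustration:

    set.seed(7)
    age    <- round(runif(100, 22, 60))                    # hypothetical predictor
    income <- 15000 + 900 * age + rnorm(100, sd = 8000)    # outcome = b0 + b1*x + noise

    lin_fit <- lm(income ~ age)    # linear regression
    summary(lin_fit)               # estimated intercept and slope
    plot(age, income)
    abline(lin_fit)                # draw the fitted line on the scatterplot

    # Binary 0/1 outcome for logistic regression
    purchased <- rbinom(100, size = 1, prob = plogis(-6 + 0.15 * age))
    log_fit <- glm(purchased ~ age, family = binomial)
    summary(log_fit)               # coefficients on the log-odds scale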

Real-world Applications:
Business and Finance: Statistics plays an integral role in financial
modelling, analysis, and informed decision-making. R offers a rich set of
tools and packages tailored to business and finance applications -
portfolio optimization, risk modelling, time series forecasting,
algorithmic trading, insurance analytics, and more.
Descriptive statistics in R like mean(), sd() and quantile() run basic summary
profiles on financial metrics like historical returns, price movements,
trading indicators etc. Correlation and regression analysis quantify
relationships between assets, indicators, and macro factors. plot()
visualizes trends over time. ggplot2 enables publishing-grade graphics of
market dynamics.
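A hedged example on simulated daily returns (not drawn from any real market feed) shows these summaries in practice:

    set.seed(99)
    daily_returns <- rnorm(250, mean = 0.0004, sd = 0.012)   # one simulated trading year

    mean(daily_returns)                       # average daily return
    sd(daily_returns)                         # daily volatility
    quantile(daily_returns, c(0.05, 0.95))    # tail quantiles
    plot(cumprod(1 + daily_returns), type = "l",
         main = "Simulated cumulative growth of 1 unit")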
Specialized R packages implement financial statistics and modelling
techniques. PerformanceAnalytics has portfolio risk and return analysis.
Time series packages support autocorrelation, stationarity tests and ARIMA
models for temporal data. quantmod downloads market data and estimates portfolio
metrics. rugarch builds GARCH models analyzing volatility clustering.
tidyquant manipulates financial data. caret trains machine learning models
for prediction.
Beyond statistical inference, R has computational finance tools for
trading strategy development, backtesting, and algorithmic trading -
TTR for technical analysis, quantstrat for strategy backtesting. The
Rmetrics suite includes the fPortfolio package for portfolio optimization
based on Markowitz allocation and risk budgeting.

Healthcare: Statistics is integral to evidence-based medicine and
healthcare research. From descriptive summaries of symptoms to
complex multivariate models, R furnishes state-of-the-art data analysis
tools tailored for medical sciences.
A first step in analyzing healthcare data is understanding distributions -
patient demographics, disease incidence, lab measurements etc.
R produces descriptive statistics through summary() and quantile(), and
visualizes data with hist(), scatterplots, bar plots and more. This profiles the
population distribution.
Statistical modelling quantifies relationships in health data. Logistic
regression predicts clinical binary outcomes from risk factors and
symptoms. Survival analysis in R handles time-to-event data through
packages like survival and rms. lme4 performs mixed effects
regression incorporating random effects along with fixed predictors.
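As an illustrative sketch, with simulated patients and invented risk factors rather than clinical data, a logistic model could be set up as follows:

    set.seed(3)
    age     <- round(runif(200, 30, 80))
    smoker  <- rbinom(200, 1, 0.3)
    disease <- rbinom(200, 1, plogis(-8 + 0.08 * age + 0.9 * smoker))  # simulated outcome

    model <- glm(disease ~ age + smoker, family = binomial)
    summary(model)       # log-odds coefficients and p-values
    exp(coef(model))     # odds ratios for each predictor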
Critically, much medical statistics concerns drawing inferences about
populations from samples. Hypothesis testing functions in R like t.test(),
wilcox.test() and prop.test() formally assess evidence for
effects in the data. Meta-analysis combines results across studies. Sample
size computations assist study planning.
Besides analysis, R has tools to simulate patient populations and
interventions. Probability distributions and random number generation
provide flexibility for modelling, and simulation packages can model
entire healthcare systems. R thus extends from statistics to a decision-
modelling environment.
Social Sciences: Statistics plays a foundational role across the quantitative
social sciences - psychology, sociology, political science, communications
etc. R furnishes these fields state-of-the-art capabilities for data analysis
and modelling tailored to social research contexts.
A common application is questionnaire and survey data summarization.
R produces descriptive statistics on response patterns and trends via
summary(), table() and cor(), and powerful visualizations through ggplot2, lattice
and base graphics.
This supports understanding data distributions and relationships.

Inferential statistics test formal hypotheses about psychological constructs
or societal phenomena. R has a variety of statistical tests like t.test(),
wilcox.test(), chi-square tests, ANOVA, correlation and regression models.
These assess group differences, identify predictive relationships, and aid
theory testing.
Specialized models provide further capabilities. Psychometric methods quantify
latent attributes from observed indicators through factor analysis and
IRT. Time series analysis models social trends. Survival analysis handles
event occurrence over time. SEM packages fit structural equation models
with latent factors. Text analysis facilitates qualitative data.
Beyond statistics, R has computational modelling capabilities relevant
for social domains - agent-based models simulate social systems from the
bottom-up. Cognitive architectures model the mind. Reinforcement
learning solves goal-based tasks.
Check Your Progress-1
1. How does statistics contribute to the reliability of research findings?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. What role do measures of central tendency play in data analysis?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…………………………………………………………………………….

1.3 PRIMARY AND SECONDARY DATA

Primary data refers to data that is collected first-hand by the researcher(s)
specifically for the analysis objectives at hand. In contrast, secondary data
signifies data that already exists, collected by other studies or surveillance
programs and accessible from repositories, surveys, reports etc.
while still retaining relevance to the current line of inquiry.

Primary data affords maximum control, flexibility and currency to
researchers in assessing their research questions. This can leverage
approaches like field surveys, experiments, interviews, focus group
studies etc. tailored for the precise phenomena of interest. However,
designing, sampling, executing, recording and processing primary data
from scratch demands considerable resource commitments of time,
access, human power and funding.
In comparison, secondary pre-existing data sources represent a valuable
opportunity to conduct rigorous studies feasibly by analyzing patterns
across prior large-scale or wide-ranging data assets instead of pursuing
elaborate primary collections. For instance, census datasets, electoral
results, disease registries, social media trends, student records, open
access repositories etc. offer treasure troves of real-world data facets
amenable to scientific investigation. This can answer novel questions
economically by applying thoughtful analysis methods without further
data engineering.
However, secondary data analysis presents limitations as well. The
available variables and cohorts may not fully capture the desired target
behaviors. Metadata, provenance and data collection methods need
careful review to assess fit and analysis suitability relative to the
contextual research aims. Appropriate interpretation warrants factoring
data generating processes. Overall, a combination of primary and
secondary analyses often yields optimal and multifaceted insights.

Fig. 1.1: Types of Data


1.3.1 Primary Data


Primary data refers to original data collected to directly address the
research problem at hand. In contrast to secondary data derived from
records for some prior purpose, primary data stands closest to the
phenomenon of research interest with the most up-to-date and targeted
relevance. It forms a crucial pillar of discovery and decision processes
across sectors.
By designing data collection protocols customized to research questions,
primary data promises high concept validity and analytical alignment. It
captures precise behaviors, attitudes, exposures etc. that the study aims to
measure without relying on approximations from external indexes. Data
integrity relies less on accuracies of prior coding systems. Fine-tuned
survey questionnaires, sensitive instruments, experimental interventions
etc. can target precise constructs.
However, primary data collection requires extensive conceptualization of
measurement operations and sampling schemes. Establishing reliability
and representativeness involves deliberate effort. Legal and ethical
diligences govern human subject protocols. Quality control safeguards
against bias. For instance, randomized controlled trials establish rigorous
causal claims but demand extensive investments.
In areas with sparse secondary sources, primary data provides the only
window into current realities. Even where prior data exists, primary
evidence will typically hold higher credibility for decision processes when
recent or emerging phenomena require investigation. It directly showcases
target populations and metrics. By bridging gaps in existing knowledge,
high-quality primary data fuels discovery and actionable insights.
Data Collection and Technologies associated with Primary Data:
Technology in Data Collection: Technological advances have expanded
and enhanced primary data collection capacities through tools like mobile
apps, web platforms, sensors, wearables, and IoT facilitating larger-scale,
higher-resolution, automated, and remote measurements.
However, adoption choices warrant careful validation.
For instance, computer-assisted techniques like drop-down responses,
chatbots etc. ease survey administration and synthesis - enabling versatile
question formats while automatic logging maintains data hygiene.
Streaming digital trace data opens new behavioral vistas. GPS, image
recognition and in-situ sensors enable direct environmental monitoring.
Telemedicine consults and wearables gather patient physiological data
unseen previously.
Such technical measurement channels promise multifaceted fine-grained
coverage of target phenomena once unfeasible or relying on coarse
secondary proxies earlier. Automated collection minimizes reporting
biases while boosting scale and compliance. Modern systems also embed
validation checks and audit trails improving transparency.
Advanced analytics draw holistic insights intersecting diverse data types.
However, digital tools carry challenges as well regarding usability,
access, and representation biases which must be addressed upfront, not as
an afterthought. Technology adoption metrics should ensure target
populations are not excluded disproportionately. Sensor monitoring
requires knowing meaningful thresholds and baselines beforehand
through primary calibration measurements. Logging complete contextual
metadata remains key for responsible interpretation.
Online Surveys and Data Collection Platforms: Online survey tools and
digital data collection platforms have streamlined primary data gathering
through convenient template interfaces, cloud storage, automated analysis,
and participant access at scale. However, researchers must address
challenges of data quality and representation biases before generalizing
insights derived.
Services like Qualtrics, SurveyMonkey, Google Forms, Amazon
Mechanical Turk etc. facilitate question creation, distribution, response
logging and analytics dashboards to summarize results. Embedded
quality checks like enforced validation, multiple choice formats,
randomization modules etc. aid reliability. Remote asynchronous
collection grants convenient access for subjects while enabling large,
diverse samples unconstrained by geography. Automatic recording
minimizes human error during data entry and collation. Export options
allow use with external tools for further modelling and inference tasks.

However, online interface and literacy barriers may skew participation
demographically across age, socioeconomic status and cultural gaps
limiting insights to narrow subgroups. Spam responses need filtering.
Careful survey construction is essential to ensure clarity. Surveys should
combine closed and open-ended responses to capture richer context. As
prevalent platforms standardize, over-reliance can constrain creativity in
questionnaire formats. Legal and privacy protections remain paramount
handling subject data.
Data Validation and Quality Assurance of Primary Data:
Ensuring Data Accuracy: Primary data fundamentally aims for high
fidelity capturing of target phenomena. Deliberate protocols during data
generation and post-hoc examinations ensure quality and credibility of
recorded evidence. These confirmation procedures uphold analytical
utility and explanatory power for decision-making.
Careful survey design considers respondent understanding, recall and
reporting feasibility. Detailed vignettes offer tangible context aiding
accurate responses over vague generalities. Randomization of options
limits order effects. Prefixing length, date or frequency ranges assists
memory and estimation before seeking precise quantities. Cross-validation
questions assess consistency. Real-time data snapshots via mobile apps
reduce reliance on human precision.
Reviewing distributions, summary statistics, metadata notes etc. post-
collection highlights peculiar values demanding closer inspection.
Outlier detection methods formally flag anomalies. External data lookups
verify selected cases. Testing and refinement of scoring algorithms and
categorical coding establishes stable interpretations. Analyst feedback
informs improvements for subsequent iterations.
Throughout, transparency in documentation facilitates reproducibility -
recording time stamps, data dictionary definitions, collection conditions
etc. alongside the readings themselves. Detailed logs aid auditing and
usability by secondary consumers.
Quality Assurance : Quality assurance for primary data collection
involves employing methods and safeguards throughout the data
gathering and processing pipeline to maximize fidelity to the phenomena
being captured and ensure resulting datasets can support credible
analysis and insights. This necessitates deliberate protocols and
validation checks before, periodically during, and post data compilation
to uphold analytical utility.
In instrument-based measures like surveys and sensor readings,
standardized calibrations assess accuracy against authoritative
references, specifying minimum precision and error thresholds
acceptable for use cases. Certifying respondent understanding via test
questions and previews limits noise from unclear items. Authentication
of identities and credentials fights fraudulent entries. Random sampling
orders combat biasing. Timestamping enables consistency audits by
factors like location, fatigue etc.
For human-generated data like experiments, treatment integrity demands
adherence confirmation to designated protocols. Reliability metrics
quantify marker proficiency and gaps prompting retraining. Replication
measurements affirm reproducibility across comparable subgroups.
Blinding investigators to intervention conditions limits perception biases.
Negative and positive controls check for false signals. Randomization
enables unbiased group assignments.
Post gathering, analytics examine completeness of expected records, data
distributions, outlier detection etc. Assessing subgroup variability gauges
distortions. Metadata captures provenance descriptions, question formats,
scoring logics etc., aiding reuse. Formal data validation rulesets fight bad
input. Versioning enables rollback of records.
1.3.2 Secondary Data
Secondary data refers to data originally collected and compiled for
purposes other than the current research questions at hand. In contrast
with primary data gathered first-hand tailoring measurement directly to
analytical needs, secondary sources represent pre-existing records
amenable to retrospective analysis towards new ends. By tapping prior
efforts, secondary data unlocks immense analytical potential efficiently.
As data gathering, storage and dissemination technologies progress,
accumulations of past endeavors to track societies, markets, ecosystems
etc. create vast repositories holding clues for discovery in historical
patterns. Sources like government records, publications, transaction logs,
surveillance initiatives, commercial analytics, open data policies etc.
harbour treasures of indicators spanning populations and contexts
potentially transferrable beyond original motivations.
Mining such resources can power investigations through readily available
measurements instead of embarking on elaborate primary data collection.
Inherently observational rather than interventional though, the findings
warrant careful interpretation. Analytical techniques factor in data
provenance, contextual shifts over time and representation issues when
relating insights to present-day decisions. Still secondary analyses
efficiently unlock immense knowledge otherwise dormant.
In domains where original data compilation is infeasible presently due to
access, cost, or ethics barriers, preserved historical evidence allows seed
analyses to guide future primary endeavours. More broadly, blending
primary purpose-built data with exploratory secondary investigations on
adjacencies provides multifaceted perspectives furthering discovery.
Sources of Secondary Data:
Published Sources: Published sources such as books, academic journals,
and reports can provide a wealth of valuable information across
disciplines. However, the credibility and reliability of these sources can
vary greatly depending on factors such as the reputation of the publisher,
the expertise of the author(s), the rigor of the editorial or peer review
process, and the citing of other credible sources.
When evaluating the credibility of a published book, aspects such as the
author's credentials and area of expertise, whether the information is well-
documented and supported by external sources, and the reputation of the
publisher can signal the expected accuracy and trustworthiness of the
content. Academic journals, especially prominent peer-reviewed
publications, are generally considered highly credible as their content
undergoes rigorous critique and examination by experts prior to
acceptance for publication. However, variance can still occur depending
on the strength of the peer review processes across different journals.
Reports published by government agencies, respected research
organizations, and corporations may contain very reliable data depending
on the reputation of the institution and their adherence to standards
regarding research methodology and transparency. However, potential
conflicts of interest or biases should also be considered.
Government and Institutional Data: Government agencies such as
federal and state departments alongside respected institutions like
universities and research organizations generate and aggregate immense
amounts of secondary data across disciplines. Tapping into the public
databases and data repositories made available by official sources
provides researchers with access to high-quality, credible information
that typically adheres to rigorous standards and processes in its collection
and reporting.
For example, census data published by national statistics agencies and
epidemiological reports from health departments contain validated
quantitative and qualitative insights about populations based on
methodical analysis by subject experts. These databases can yield
representative, unbiased snapshots of various societal aspects.
Researchers can avoid the difficulty of recruiting sample groups and
conducting lengthy surveys by simply accessing related public data
resources.
Likewise, universities and associations publish studies in their own
repositories that follow field-specific protocols around sampling
approaches, measurement tools, analytical assumptions, error tolerances,
and peer-review oversight before public release. The methodical nature
of the data collection and reporting process followed by institutional
research bodies promotes a high degree of accuracy and trust in the data
integrity.
Digital and Online Sources: The advent of digital platforms and the
internet has exponentially increased the availability of online data that
researchers can potentially draw insights from. Websites, forums, social
media channels, e-commerce portals, and digital databases now produce
vast amounts of user-generated data daily alongside datasets created
explicitly for public consumption.
Tapping into big data that captures consumer search trends, purchasing
behavior, feedback patterns, and content engagement on online platforms
allows researchers to analyze behavioral shifts and emerging consumption
patterns. For instance, scraping user reviews on travel websites can reveal
changing expectations around hospitality services. Social listening tools
can detect evolving viewer attitudes by extracting sentiment from Twitter
conversations.
However, the variability in data quality across online sources necessitates
careful evaluation of aspects such as the comprehensiveness, consistency
and accuracy of reporting; availability of contextual metadata; underlying
biases in sampling; privacy preservation around personal data; terms of
use constraints; ideological leanings; and commercial interests.
For instance, e-commerce sales data limited to a single portal provides
insights on its user base not the entire market. Sentiment algorithms
struggle to decode nuances within social conversations. Anonymized user
data risks privacy infringements through reverse engineering.
Analysis and Interpretation of Secondary Data:
Data Cleaning and Preprocessing: The integration of quality secondary
data is fundamental to research, however, innate issues within sourced
datasets can undermine analysis if unaddressed. Data cleaning and
preprocessing together serve as vital processes for resolving raw data
inconsistencies before proceeding with essential activities of merging,
analysing, and modelling information.
Common data quality issues recognize the pervasiveness of missing values
where no data populates certain records or attributes. This can occur due
to incomplete observations or record deletions over time. Preprocessing
techniques that impute variable means, medians or predictive estimates
enable retention of partially observed samples during analysis. Next,
detection and removal of outliers - data deviating extremely from typical
distributions - by standard deviation cutoffs, isolating and investigating
extreme values or replacing them with limits, prevents skewed analysis.
Furthermore, datasets often encompass discrepant variable formats,
redundant information, inaccurate mappings between fields - essentially
technical complexities needing reconciliation through formatting
standardization, merging identical variables, defining primary keys, and
validating field values. Finally, variables can demonstrate covariance or
collinear relationships with each other - collecting similar information.
Analyzing correlated variables degrades model predictive capacity,
necessitating removal of duplicated attributes.
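A small sketch of these cleaning steps on a made-up vector containing a missing value and a suspect outlier (the thresholds used are illustrative choices, not universal rules):

    values <- c(12, 15, NA, 14, 13, 95, 16, 15)   # NA = missing, 95 = suspect outlier

    # Impute the missing entry with the mean of the observed values
    values[is.na(values)] <- mean(values, na.rm = TRUE)

    # Flag values more than 2 standard deviations from the mean
    z_scores <- (values - mean(values)) / sd(values)
    which(abs(z_scores) > 2)

    # One option: cap extreme values at the 95th percentile
    cap <- quantile(values, 0.95)
    values[values > cap] <- cap
    values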
Data Analysis: Secondary datasets, whether from public repositories or
internal organizational records, require dedicated analytical approaches to
derive contextual insights. Established statistical techniques provide the
primary toolkit to examine patterns, model relationships, and interpret
signals within existing datasets.
Descriptive analytics, starting from data visualization, univariate analysis
around central tendencies and spread, bivariate identification of
correlations and associations etc. constitutes initial examination to
characterize dataset features. Following aggregation of variables into
composite measures and application of sampling weights allows
population-level projections. Statistical testing procedures further assist
in making probabilistic inferences around phenomenon observations in
the data.
With reliable, good quality datasets, researchers can leverage multivariate
methods like regression analysis, factor analysis etc. for modeling causal
connections between phenomena; clustering algorithms to detect segments
and personas; and classification approaches to categorize entities or
predict outcomes based on historical patterns. Time series analysis
specifically tracks trends and trajectories in temporal data.
The analytical workflow is facilitated through dedicated statistical
software and programming platforms like SAS, Stata, SPSS, R, Python
etc. These tools automate the supply of cleaned, integrated data into
analytical models and generate reports, projections and visualizations to
assist interpretation. Big data capabilities using Hadoop, Spark and cloud-
based warehousing facilitate information consolidation from dispersed
secondary sources while GUI-based solutions expedite analysis.
Check Your Progress-2
1. What are the advantages of using primary data in research?
……………………………………………………………………………
……………………………………………………………………………


……………………………………………………………………………
…………………………………………………………………………….
2. What challenges are associated with secondary data analysis?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
…………………………………………………………………………..

1.4 DATA COLLECTION METHODS

The systematic collection of information to answer research questions


constitutes the central evidence-gathering mechanism across scientific
domains. Data collection represents the starting point that enables
scholars to establish an empirical foothold around phenomena worth
exploring further through analytical interpretations. The research
methodology process encompasses various data gathering techniques,
both quantitative and qualitative, that impart unique evidentiary value
and insight generation capacity.
Quantitative techniques like surveys and structured interviews engage
with a representative sample of respondents and collect numerical data
across predefined indicators that, in aggregation, characterize
distributions, frequencies, correlations and causal explanations found
within study populations. In contrast, qualitative approaches through
methodologies such as focus groups, ethnographic observations and in-
depth interviews gather non-numerical data around subjective narratives,
cultural themes and contextual shades of meaning, difficult to capture
solely via quantifiable metrics.
The data collection methodology needs to align with the research
questions posed, disciplinary epistemic conventions, participant
accessibility constraints, investigator skill sets and project resource
availability. While the dichotomy between quantitative and qualitative
data appears stark, mixed methods approaches attempt to extract the
richness within both numeric information and experiential perspectives
around target phenomena. As platforms evolve, digital tools provide
mechanisms to gather, integrate and analyze multifaceted data at
unprecedented levels, opening new possibilities.
1.4.1 Surveys
Surveys constitute one of the most widely used instruments across
disciplines to systematically gather insights around prevalence,
distribution, interconnections and causal mechanisms associated with a
target phenomenon in a population. They involve structured interactions,
either through self-administered questionnaires or guided interviews,
with selected respondents that generate quantitative data amenable for
statistical analysis.
The standardized questionnaire format allows measurement of attitudes,
preferences, reported behaviors and demographic attributes across sample
groups generalizable to defined populations when scientific sampling
methods apply. Online platforms enable global reach while optimizations
in natural language processing promote question nuance.

Fig. 1.2: Survey Questionnaire (Example)


Surveys encompass versatility spanning contextual, exploratory,
descriptive, correlational and explanatory research.
Initial surveys can scan prevalence across populations. Descriptive
findings characterize phenomena distributions while correlational
analysis surfaces predictor relationships. Explanatory surveys establish
causal explanations for occurrences and behaviors. Iterative surveys also
allow trend evaluations across longitudinal studies.
This analytical depth coincides with logistical convenience and cost
savings. Online self-administration auto-populates responses into
databases compiled for analysis using statistical software like R and
SPSS. The proliferation of survey data opens diverse analytical
directions - predictive modelling, segment analysis, normative
comparisons etc. - while avoiding laborious primary data gathering.
Probability sampling lends generalizability, vital for policy decisions.
Insight consolidation across scattered studies occurs through meta-
analysis approaches.
Survey Instrument Design:
Questionnaire Construction: The questionnaire forms the channel for
investigational constructs and measures to gather insightful data from
sample groups. Thus, instrument creation requires systematic processes
to promote response quality and analytical relevance.
Survey questions require concise, simple phrasing that specifically
conveys one idea at a time. Technical terms and ambiguities confuse
respondents, lower participation and distort response patterns. Leading
questions projecting researcher biases also prompt validity issues and
assume respondent familiarity with topics. Neutral framing allows free
sharing of experiences.
Question flow should logically build context and themes.
Response categories need to be exhaustive and mutually exclusive - "Select all
that apply” covers all choice facets, while rating scales define interval
bounds. Balanced scale midpoints give equal propensity for
agreement/disagreement. Questions also arrange ordinal scale and
numeric responses before free-text boxes which risk respondent fatigue
effects.
Demographic measures often conclude questionnaires. Sensitive
attributes stay voluntary to prevent respondent data falsification.

Anonymity and informed consent statements promote participation rates
and safeguard ethical compliance. Questionnaire structure, length and
medium suit target groups and research objectives while pretesting
iterations refine instrument quality.
Response Scales and Formats: Survey answer types constitute a crucial
questionnaire design choice that impacts administration, analytics
pathways and insight potential. Categoric response options include
dichotomous binary capturing yes/no perspectives, nominal typed multi-
choice item selections and ordinal graded scales signaling rank ordered
opinions. Numeric options encompass discrete ordinal counts,
continuous ratio quantities and cardinal monetary values or temporal
durations.
Among ordinal types, the Likert scale offers gradient answer options
spanning disagree/agree attitudes around single declarative statements.
An odd number of choices allows a neutral midpoint. Variants specify
granularity - "strongly agree" to "strongly disagree" ranges or simply
"always" to "never" frequencies. Semantic differences help match
question styles - value, frequency etc. For multi-dimensional concepts, a
matrix table template assesses the same choices across factors facilitating
comparison.
Rank order and constant sum scale questions have respondents
numerically prioritize or allocate quantitative values across items - useful
for trait preferences or budget allocation exercises. Open-ended box
responses elicit qualitative explanations while comment boxes gather
verbatim feedback requiring coding before analysis. Choice set clicking
optimizes online self-administration convenience but demands
exhaustive, mutually exclusive options with a possible "none" catchall.
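In R, such ordinal responses are typically stored as ordered factors before analysis. A hypothetical five-point Likert item might be coded as:

    responses <- c("Agree", "Neutral", "Strongly agree", "Disagree", "Agree",
                   "Strongly disagree", "Neutral", "Agree")

    likert <- factor(responses,
                     levels = c("Strongly disagree", "Disagree", "Neutral",
                                "Agree", "Strongly agree"),
                     ordered = TRUE)

    table(likert)                  # frequency of each response category
    prop.table(table(likert))      # proportions
    barplot(table(likert), main = "Distribution of responses")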
Pilot Testing: The execution of a small-scale pilot survey constitutes a
vital quality assurance checkpoint within the phased process of nurturing
survey instruments before field deployment. Trial testing questionnaires
on groups with characteristics aligned to target samples unearths vital
refinements around question wording ambiguities, difficult skip patterns,
inadequate choices, length issues and general flow concerns.

Pilot testing espouses established principles including using naïve
respondents having similar traits to intended participants. Situational
administration mirrors eventual protocols while concurrent think-aloud
interviews capture real-time feedback plus observational notes on visible
confusion to identify problematic items. Specifically, researchers assess
completion times, gauge motivation levels, check information relevance
against objectives and determine statistical output ranges per variable.
Consolidated findings diagnose questions needing rephrasing for clarity,
response choice gaps that miss answer facets, flow disruptions needing
realignment and poor performing items that risk unreliable data despite
intents. Analysts compute inter-item correlations to remove redundant
questions. Review panels assess face validity around alignments with
research constructs. Repeated iterations refine and bolster survey quality.
Survey Implementation:
Administration Methods: The appropriate survey administration vehicle
constitutes a vital aspect governing participant reach, response rates plus
speed and economics of eventual data consolidation. Each channel offers
unique strengths but also limitations worth weighing.
Online surveys allow cost-effective dissemination to wider audiences
given proliferating digital access. Automation assists data aggregation for
analysis while embedding completeness checks during collection.
However, online modes risk sampling bias excluding technology-limited
groups. Email surveying enables access to verified contact lists but faces
rising threat filtering. Website pop-up surveys see high abandonment
issues. Personalized invitations help improve response rates.
Telephonic surveys enable random digit dialling to access representative
samples. They allow higher supervision during collection plus capturing
spontaneous responses using trained callers. However, increasing use of
call filters and mobile phone portability impacts completeness.
Telephone interviews endure rising costs and require advanced call
center infrastructure.
Postal surveys ensure comprehensive geographic reach and facilitate
anonymity promoting participation around sensitive topics. But postal
inefficiencies, costs and response waiting times affect viability.

Face-to-face surveys using home visits or public interceptions secure
higher quality data with the ability to collect collateral observations. But
custom exercises become expensive and data volume gets limited by
access constraints.
Ethical Considerations: Survey research activities directly interfacing
with human subjects demand prudent constructs to uphold participant
rights and welfare. Voluntary participation necessitates proper
disclosures on the study's purpose, sponsor, potential data usage,
anonymity mechanisms and withdrawal policies as part of informed
consent procedures. This enables respondents to rationally assess
involvement risks around roles possibly resulting in psychological,
economic, legal or social harm before agreeing to contribute views.
Further, collected personal information requires responsible data
stewardship. Confidentiality undergirds survey legitimacy - limiting
access to identifiable data, aggregating public reporting, securely storing
records with encryption protocols and regulating data sharing
conventions via agreements. Contact information gets maintained
separately from survey responses, only conjoined using reference codes
during active analysis by the core team before complete dataset
anonymization.
Relatedly, transparent policies must cover secondary usage norms for
archived survey data that respect original consent conditions. Ethical
standards also inform sound sampling protocols ensuring fair selection,
administering uniform collection instruments across strata, applying
consistent quality checks and avoiding leading communications that
prompt certain responses. Such mechanisms uphold credibility of the
resulting dataset and ensure representative voices get reflected in survey
findings without biases or coercion pressuring participation.
Data Analysis and Interpretation of Surveys:
Quantitative Analysis: Survey analysis leverages a repertoire of
statistical tools and testing approaches to derive descriptive summaries,
compare group responses and model variable relationships based on
collected response datasets.

Univariate examination through frequency distributions, central tendency
measures and dispersion indices characterize response patterns for survey
variables, assisted by data visualizations like histograms and box plots.
Cross tabulations contrast response clusters across demographic
segments providing bivariate analysis around subgroup attitudes,
awareness and behaviors. Statistical tests like Chi-square and ANOVA
help infer significant differences.
Factor analysis builds consolidated perspectives from patterns between
interrelated survey items, reducing variable dimensionality for broader
comparisons. Regression modelling examines predictive relationships
between indicator predictor variables and outcome measures using
correlation coefficients and explanatory variance metrics. Machine
learning techniques like cluster analysis can automatically segment
respondents into persona groups based on response similarities that
inform customized intervention needs.
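A brief sketch of such cross tabulation and testing on hypothetical survey responses (gender and purchase-channel preference are invented variables):

    set.seed(5)
    gender     <- sample(c("Female", "Male"), 200, replace = TRUE)
    preference <- sample(c("Online", "In store"), 200, replace = TRUE,
                         prob = c(0.6, 0.4))

    xtab <- table(gender, preference)   # cross tabulation of the two variables
    xtab
    chisq.test(xtab)                    # test for association between the variables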
Qualitative Analysis: While surveys largely employ quantitative
questioning, supplementary open-ended responses allow gathering
expansive perspectives through subjective participant expressions typed
freely into comment boxes and text fields. Analysis of these textual
narratives necessitates qualitative methodologies to systematically
extract underlying meanings.
Inductive coding represents the foundation - scrutinizing responses to tag
recurrent ideas, experiences and suggestions with descriptive codes across
batches. Code consolidation into an optimized framework captures key
facets as major categories, each with hierarchically grouped sub-themes.
Code applications get peer verified to validate consensus.
Thereon, thematic analysis identifies model narratives, theoretical
constructs or causal mechanisms binding coded extract patterns across
responses into emerging themes using analytical memos. Frequency counts
of code occurrences supplement rich text interpretations.
Case exemplars and participant quotes help retain raw insight textures
during reporting.
Contrast analysis examines differences in perceptions between
respondent groups flagged during disaggregation by key demographics
like age, locale and gender. Content analysis allows numerical
conversion of textual data into quantifiable manifest content categories
historically compared across periodic surveys. Sentiment analysis
through dictionary methods automates identification of emotional
expressions, criticisms and applause.
1.4.2 Interviews
Interviews constitute an intensive process of directed conversation
oriented to systematically gather first-hand descriptive insights around
lived experiences, attitudes, behaviors, expert opinions or eyewitness
accounts associated with a research phenomenon from selected individual
participants.
As instruments for eliciting data, interviews excel in capturing explanatory shades - the
contextual, relative, personal or situated facets challenging to quantify
concerning target themes under investigation. Through guided discussion,
question probes and narrative framing, investigators can motivate subjects
to intimately share nuanced reflections on beliefs, emotions, insider
knowledge, or explanatory rationales associated with their relationship to
the research focus area.
Capturing thick descriptive data around subjective vignettes, personal
histories or quotes facilitates documenting informal realities occurring
locally across incidents which elude questionnaire constraints. Analytical
frameworks like grounded theory then help collate common experiential
denominators into shared conceptual maps. Comparatively, interviews
allow customized targeting of niche experts or outliers holding specialized
experiences beyond behaviourally measured norms.
Purpose and Types of Interviews:
Purpose of Interviews:
Interviews serve as a vital data collection method in research, offering a
structured and interactive platform for obtaining in-depth information
from participants. The primary objectives of conducting interviews are
manifold, encompassing the exploration of complex phenomena, the
elucidation of participants' perspectives, and the generation of rich,
context-specific data.
One fundamental purpose of interviews is to delve into the intricacies of
participants' experiences, perceptions, and opinions. Through open-ended
questioning, researchers can unravel nuanced insights that may not be
readily apparent through quantitative methods alone. This qualitative
approach enables a more comprehensive understanding of the studied
phenomenon, fostering the identification of underlying motives, social
dynamics, and contextual factors.
Furthermore, interviews facilitate the establishment of rapport between
the interviewer and the participant, fostering a conducive environment
for candid and detailed responses. This interpersonal connection allows
researchers to probe deeper into participants' thoughts, motivations, and
emotions, yielding a more holistic portrayal of the subject under
investigation.
Interviews are particularly advantageous when seeking to explore topics
that necessitate a nuanced understanding or when studying social
phenomena characterized by intricate layers of meaning. For instance, in
the realm of social sciences, interviews are invaluable for investigating
subjective experiences, cultural practices, and individual perspectives,
where the depth of qualitative data is imperative for comprehensive
analysis.
Compared to other data collection methods, such as surveys or
experiments, interviews offer a distinctive advantage in capturing the
complexity and richness of human experiences. Surveys may be limited
by predefined response categories, potentially overlooking the diversity
of participants' viewpoints. Conversely, experiments often prioritize
control over contextual richness, potentially neglecting the subtle
nuances inherent in human behaviour and perception.
Types of Interviews:

 Structured Interviews:
 Advantages: Structured interviews are characterized by a
predetermined set of questions administered in a standardized
manner. This method offers several advantages, including
enhanced reliability and ease of analysis.
The standardized format ensures consistency across interviews,
facilitating the comparison of responses. Moreover, structured
interviews are efficient in terms of time and resource utilization,
making them suitable for large-scale studies. Researchers can
employ statistical techniques with greater confidence due to the
standardized nature of the data.
 Limitations: However, structured interviews may lack flexibility
in addressing unanticipated insights or probing deeper into
responses. The predetermined questions might not capture the
complexity of participants' experiences or viewpoints, limiting the
depth of qualitative data. Additionally, the rigid structure may
hinder the establishment of rapport between the interviewer and
the participant, potentially affecting the candor and richness of
responses.
 Semi-Structured Interviews:
 Advantages: Semi-structured interviews strike a balance between
structure and flexibility. This approach allows researchers to use
a predefined set of core questions while also permitting the
exploration of emergent themes. The semi-structured format
encourages in-depth responses, fostering a more comprehensive
understanding of the research topic. This method is particularly
valuable when investigating complex phenomena, providing the
researcher with the flexibility to adapt the interview to the
participant's context and responses.
 Limitations: Despite their flexibility, semi-structured interviews
require skilled interviewers who can navigate the balance
between adherence to the core questions and exploration of
additional topics. The variability in interviewer style and probing
techniques may introduce some degree of inconsistency in data
collection. Additionally, the analysis of semi-structured
interviews can be more time-consuming than that of structured
interviews due to the diverse and open-ended nature of the
responses.
 Unstructured Interviews:
 Advantages: Unstructured interviews are characterized by their
open-ended, exploratory nature. This method allows for
maximum flexibility, enabling the researcher to delve deeply into
participants' perspectives without predefined constraints.
Unstructured interviews are well-suited for exploring novel or
poorly understood topics, as they provide the freedom to follow
unexpected leads and capture rich, context-specific data.
 Limitations: While unstructured interviews offer unparalleled
flexibility, they present challenges in terms of standardization and
comparability. The absence of a predefined set of questions
makes it challenging to ensure consistency across interviews.
Moreover, the open-ended format may result in data that is more
difficult to analyze, as the richness and diversity of responses can
be overwhelming. Additionally, the rapport-building process may
be more critical in unstructured interviews, as the absence of a
predetermined structure requires a higher level of participant
comfort and engagement.

Interview Design and Planning:

 Define Clear Research Objectives: Before crafting interview
questions, clearly articulate the research objectives. Identify the
specific information required and the overarching goals of the
study. This foundational step ensures that interview questions
align with the research purpose and contribute meaningfully to
data collection.
 Establish a Logical Flow: Structure the interview protocol in a
logical sequence, starting with introductory and rapport-building
questions before progressing to more substantive inquiries. A
well-organized flow enhances participant comprehension and
engagement while facilitating a seamless transition between
topics.
 Balance Open-Ended and Closed-Ended Questions:
Incorporate a mix of open-ended and closed-ended questions.
Open-ended questions encourage participants to provide detailed
and nuanced responses, offering valuable qualitative data.
Closed-ended questions, on the other hand, can yield quantifiable
data and aid in the standardization of responses.
 Avoid Ambiguity and Jargon: Craft questions with precision,
avoiding ambiguous language or disciplinary jargon that may
confuse participants. Clarity in language ensures that respondents
interpret questions consistently, reducing the likelihood of
miscommunication and enhancing the reliability of data.
 Pilot Test the Protocol: Conduct a pilot test with a small sample
to refine the interview protocol. Assess participant
comprehension, identify potential ambiguities, and gauge the
overall effectiveness of the questions. Iterative refinement based
on pilot testing enhances the quality of the final interview
protocol.
 Include Probing Techniques: Integrate probing techniques to
elicit deeper responses. Probing involves follow-up questions that
encourage participants to expand on their initial answers,
providing richer insights. Examples of probing techniques include
asking for clarification, requesting examples, or exploring
alternative perspectives.
 Consider Cultural Sensitivity: Ensure that questions are
culturally sensitive and applicable to the diverse backgrounds of
participants. Avoid assumptions based on cultural stereotypes and
strive for inclusivity in language and content to enhance the
relevance of the interview protocol across different populations.
 Maintain Neutrality and Avoid Leading Questions: Formulate
questions in a neutral manner to prevent bias and leading
participants toward specific responses. Neutral phrasing fosters
an environment in which participants feel comfortable expressing
their genuine perspectives without feeling guided or influenced.
 Prioritize Conciseness: Craft questions with brevity and clarity.
Concise questions are easier for participants to comprehend and
answer accurately.
A succinct interview protocol not only enhances participant
engagement but also streamlines the data collection process and
facilitates efficient analysis.

1.4.3 Observations
Observational research, a methodological approach employed across
diverse disciplines, involves the systematic and unobtrusive observation
of phenomena in their natural settings. This method serves as a powerful
means of collecting data, capturing the complexity and richness of
behaviors, interactions, and contexts within real-world environments.
Observational research holds particular significance in fields such as
psychology, sociology, anthropology, education, and environmental
science, offering researchers a unique lens through which to explore and
understand the intricacies of human behavior, social dynamics, and
ecological systems. Unlike self-report measures or structured interviews,
direct observation allows for the examination of behaviors as they unfold
naturally, affording researchers unparalleled insights into the nuances,
patterns, and contextual factors that shape the phenomena under
investigation. This method's inherent capacity to provide a holistic and
contextually embedded understanding makes observational research an
invaluable tool for advancing knowledge and informing evidence-based
practices across various academic domains.
Purpose and Types of Observations:
Purpose of Observational Research:
Observational research serves as a pivotal methodology within the realm
of scientific inquiry, offering a nuanced and context-rich approach to data
collection. Its primary purpose lies in the systematic and unobtrusive
observation of phenomena in their natural settings, enabling researchers to
glean valuable insights into human behavior, social interactions, and
environmental dynamics. By immersing oneself in the authentic context
of the subject under investigation, observational research seeks to provide
a genuine portrayal of occurrences, devoid of the potential biases
introduced by controlled environments or self-reported data.
One of the paramount objectives of employing observational research is
the pursuit of a comprehensive understanding of behavior and events as
they naturally unfold. Through careful and objective observation,
researchers can capture the intricacies of human conduct, discern
patterns, and uncover underlying dynamics that may elude detection in
more artificial or contrived settings. This method allows for the
exploration of the intricate interplay between variables, facilitating a
holistic comprehension of complex phenomena within their ecological
niches.
Observational research proves particularly valuable in scenarios where
participants may be unable or unwilling to articulate their experiences
accurately, or where the phenomenon of interest manifests spontaneously
and unpredictably. In fields such as psychology, sociology, and
anthropology, where human behavior and social dynamics constitute
focal points of investigation, observational research offers an
unparalleled avenue for uncovering the subtle nuances that shape
individuals' actions and interactions. Additionally, in ecological studies,
naturalistic observation enables the examination of wildlife behavior and
environmental dynamics with minimal interference, preserving the
authenticity of the observed behaviors.
Furthermore, observational research contributes to the validation and
refinement of theoretical frameworks by grounding abstract concepts in
real-world contexts. It allows researchers to bridge the gap between
theoretical constructs and empirical realities, fostering a more robust
foundation for subsequent analyses and interpretations. This method also
facilitates the identification of novel research questions, paving the way
for further exploration and hypothesis generation.
Types of Observations:
Observational research manifests itself in various forms, each tailored to
specific research goals and the nature of the phenomena under
investigation. Three prominent types of observations—participant
observation, non-participant observation, and naturalistic observation—
serve as distinct methodological approaches, each offering unique
advantages based on the researcher's objectives.
 Participant Observation: Participant observation involves the
researcher actively engaging in the observed group or setting,
assuming a role within the environment under study.
This method is particularly appropriate when the aim is to gain an in-
depth understanding of social dynamics, cultural practices, or
immersive experiences. By becoming an integral part of the observed
context, the researcher can access nuanced insights that might be
elusive through more detached approaches. Participant observation is
invaluable in studies seeking to explore subjective perspectives,
social norms, and intricate interpersonal relationships. It is well-
suited for investigations in anthropology, sociology, and ethnography
where an insider's perspective is essential for capturing the richness
and complexity of a given social milieu.
 Non-participant Observation: In contrast, non-participant
observation involves the researcher maintaining a more distant and
objective stance, refraining from direct participation in the activities
or interactions being observed. This method is appropriate when the
research goal is to minimize potential biases introduced by the
researcher's involvement, allowing for a more impartial and external
assessment of the phenomena. Non-participant observation is
particularly suitable for studies focused on behavioral patterns,
spatial arrangements, or situations where an unbiased and
unobtrusive perspective is crucial. This approach is commonly
employed in fields such as psychology, where researchers aim to
observe behavior without influencing it, thus ensuring the ecological
validity of the findings.
 Naturalistic Observation: Naturalistic observation entails the study
of phenomena in their natural and unaltered environments,
emphasizing ecological validity. This approach is well-suited for
research goals that involve capturing spontaneous and genuine
behavior without artificial constraints. Naturalistic observation is
often applied in ecological studies, wildlife research, and
environmental psychology, where the goal is to understand how
organisms interact with their surroundings under natural conditions.
It is particularly advantageous when the aim is to observe behaviors
that may be influenced by the context, and when the researcher seeks
to avoid the potential biases introduced by laboratory settings.

Planning and Designing Observations:


Creating effective observation protocols is paramount to the success and
rigor of observational research endeavors. A well-constructed protocol
serves as the blueprint for systematic and reliable data collection, ensuring
that the observations align with the research objectives. To enhance the
effectiveness of observation protocols, two key components deserve
meticulous attention: the clarity of instructions and the establishment of
predefined criteria.

 Clarity of Instructions: Clear and unambiguous instructions are
fundamental to the success of an observation protocol.
Researchers must articulate the purpose of the observation,
delineate specific behaviors or phenomena of interest, and
provide explicit guidance on the observational process.
Ambiguities in instructions can lead to variability in data
collection, compromising the reliability and validity of the
findings. Therefore, it is imperative to communicate the research
goals concisely, specify the observational context, and outline
any relevant ethical considerations. Furthermore, the protocol
should elucidate the observer's role, emphasizing the importance
of maintaining objectivity and avoiding interference with the
observed setting. This clarity ensures consistency among
observers and facilitates the accurate interpretation of observed
behaviors.
 Predefined Criteria: Establishing predefined criteria for
observation is equally crucial. Researchers must define and
operationalize the variables or behaviors under scrutiny,
providing clear benchmarks for the observers to follow.
This precision minimizes subjectivity in data interpretation and
promotes inter-rater reliability. Criteria should encompass
observable and measurable indicators, enabling consistent
identification and classification of behaviors across different
observers. Additionally, researchers should consider
incorporating a coding system or checklist that aligns with the
research objectives, facilitating the systematic recording of
observed events. The development of predefined criteria not only
enhances the rigor of the observational process but also facilitates
comparisons across different studies, contributing to the
cumulative knowledge within a particular field.

1.4.4 Experiments
Experimental research, as a distinguished and methodologically rigorous
approach, occupies a paramount position in the arsenal of data collection
methods within the scientific domain. Defined by its systematic
manipulation of independent variables to observe their effects on
dependent variables, experimental research stands as an invaluable tool for
investigating causal relationships and discerning patterns in the intricate
fabric of phenomena. This methodological framework, characterized by
its structured design and controlled conditions, facilitates the isolation of
specific factors for meticulous examination, offering unparalleled insights
into the dynamics underlying diverse phenomena.
Purpose and Types of Experiments:
Purpose of Experimental Research:
Experimental research serves as a crucial methodological approach in the
realm of scientific inquiry, primarily aimed at achieving distinct
objectives. The fundamental purpose of conducting experiments is to
systematically investigate and understand phenomena by manipulating
variables in a controlled environment. This method allows researchers to
explore causal relationships between variables, shedding light on the
cause-and-effect dynamics inherent in the phenomena under
investigation.
The primary objective of experimental research is to contribute to the
advancement of knowledge by providing empirical evidence and testing
hypotheses. Through a carefully designed experimental setup,
researchers can manipulate independent variables while controlling for
potential confounding factors. This meticulous control enables them to
observe changes in the dependent variable and, consequently, discern
any causal relationships that may exist.
Crucially, experiments possess a unique ability to establish cause-and-
effect relationships, a feat not easily attainable through other research
designs. By systematically manipulating one or more variables and
observing their impact on the outcome, researchers can draw more
definitive conclusions about the factors influencing a particular
phenomenon. This cause-and-effect clarity is instrumental in building a
solid foundation for scientific theories and contributing to the cumulative
body of knowledge within a given field.
Types of Experiments:
Experimental designs play a pivotal role in shaping the structure and
implementation of research studies. Three prominent types of
experimental designs include between-subjects, within-subjects, and
factorial designs. The selection of a specific design depends on the
research goals, questions posed, and the nature of the phenomena under
investigation.

 Between-Subjects Design: In a between-subjects design,
distinct groups of participants are exposed to different
experimental conditions. Each group represents a different
level of the independent variable, allowing for a comparison
of the average responses or outcomes across these groups.
This design is appropriate when the researcher seeks to assess
the overall impact or differences between separate treatments,
interventions, or conditions. It is particularly useful when
minimizing the potential influence of individual differences
within the groups is essential to draw reliable conclusions.
 Within-Subjects Design: Conversely, a within-subjects design
involves the same group of participants experiencing all levels
of the independent variable. Each participant serves as their own
control, thus minimizing individual differences and enhancing
the sensitivity of the study. Within-subjects designs are suitable
when the aim is to examine individual changes over time,
compare participants' reactions to multiple conditions, or when
resources or participants are limited. This design is particularly
effective in capturing within-participant variations and is often
utilized in longitudinal studies.
 Factorial Designs: Factorial designs involve the simultaneous
manipulation of more than one independent variable,
providing a comprehensive understanding of the interplay
between various factors. The notation for factorial designs
indicates the number of levels for each independent variable
(e.g., a 2x2 factorial design involves two independent
variables, each with two levels). This design is valuable when
researchers aim to explore the combined effects of multiple
factors and their potential interactions. It allows for a nuanced
examination of how different variables interact to influence
the dependent variable.
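To make the 2x2 idea concrete, the sketch below simulates a hypothetical factorial experiment (ad format x discount offer, with invented effect sizes) and analyses it with R's built-in aov() function.

set.seed(1)
# 2x2 factorial design: 25 participants in each of the four cells
cells <- expand.grid(ad = c("image", "video"),
                     discount = c("none", "ten_percent"))
experiment <- cells[rep(1:4, each = 25), ]
experiment$sales <- 100 + 5 * (experiment$ad == "video") +
  8 * (experiment$discount == "ten_percent") + rnorm(100, sd = 10)

# Two-way ANOVA with main effects and their interaction
summary(aov(sales ~ ad * discount, data = experiment))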

Check Your Progress-3


1. How do quantitative data collection techniques differ from qualitative
approaches?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. Why is it important for the data collection methodology to align with
research questions and other contextual factors?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

1.5 PRESENTATION OF NUMERICAL AND CATEGORICAL DATA

The presentation of data is a pivotal aspect that demands precision and
clarity. The communication of findings often involves two fundamental
types of data: numerical and categorical. Numerical data, characterized by
quantitative values, encapsulates measurable quantities, while categorical
data, defined by distinct categories or labels, represents qualitative
information. Effectively presenting these two types of data is essential for
conveying research outcomes with accuracy and coherence.

Fig:1.3 Numerical and Categorical Data


1.5.1 Numerical Data
Numerical data, a cornerstone of empirical research, represents
quantitative information expressed in numerical terms. This form of data
is pervasive across diverse fields, ranging from the natural and social
sciences to economics and engineering. The significance of numerical
data lies in its capacity to provide precise, measurable information,
allowing researchers to conduct rigorous analyses, draw statistical
inferences, and derive meaningful conclusions.
Characterized by the inherent ability to be measured and expressed in
numerical values, numerical data manifests in a continuum, enabling
researchers to discern variations and trends within the dataset. This type
of data is often employed to quantify and compare different attributes,
facilitating a quantitative understanding of phenomena. From
experimental measurements to survey responses, numerical data serves
as the foundation for statistical analyses that underpin scientific
investigations and decision-making processes.
Types of Numerical Data:
Continuous vs. Discrete Data: Continuous and discrete data are two
distinct types of numerical data that are essential in statistical analysis and
research. Understanding the differences between these two categories is
crucial for proper data interpretation and analysis.
Continuous data refers to variables that can take any value within a given
range. These values are often measured and can be subdivided into smaller
intervals, potentially infinitely.
Continuous data is characterized by its fluid nature, as it can take on an
uncountable number of possible values. Common examples of continuous
data include measurements such as height, weight, temperature, and time.
For instance, when measuring an individual's height, one can observe
values like 5.6 feet, 5.65 feet, and so on, indicating the continuous and
unbroken nature of the data.
On the other hand, discrete data consists of distinct, separate values with
no intermediate values possible within the given range. These values are
often counted and are typically whole numbers. Discrete data is
characterized by its distinct and separate nature, allowing only specific,
separate values. Examples of discrete data include the number of students
in a classroom, the count of items sold, or the number of cars in a parking
lot. In contrast to continuous data, discrete data does not permit fractional
or in-between values, and each observation is a distinct entity.
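In R, continuous measurements are usually stored as numeric (double) vectors, while discrete counts are naturally integers; the small vectors below are made up purely for illustration.

heights <- c(5.60, 5.65, 5.72, 6.01)    # continuous: any value in a range
class(heights)                          # "numeric"

class_sizes <- c(28L, 31L, 25L, 30L)    # discrete: whole-number counts
class(class_sizes)                      # "integer"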
Interval vs. Ratio Data: Interval and ratio data are two types of numerical
data that share similarities but differ in terms of the presence or absence
of a true zero point and the nature of their measurements.
Interval data represents numerical values where the intervals between
consecutive points are meaningful and consistent, but the absence of a true
zero point is notable. In other words, the zero point in interval data does
not indicate the complete absence of the attribute being measured but
serves as a reference point. Common examples of interval data include
temperature measured in Celsius or Fahrenheit. On the Celsius scale, a
temperature of 0 degrees does not mean there is no heat; it simply marks
a reference point on the scale, and the complete absence of heat (absolute
zero) is not represented. Similarly, in Fahrenheit, a temperature of 20
degrees is not twice as warm as 10 degrees.
In contrast, ratio data possesses a true zero point, indicating the absence of
the measured attribute.
The zero point in ratio data is meaningful and represents a complete
absence of the characteristic being measured. Examples of ratio data
include height, weight, and income. For instance, a height of 0 cm implies
the absence of height, making the zero point meaningful. Similarly, a
weight of 0 kg signifies the absence of weight. Ratio data allows for
meaningful ratios and comparisons, as values can be compared in terms of
multiplication or division.
Descriptive Statistics for Numerical Data:
Measures of Central Tendency: Measures of central tendency are
statistical tools used to summarize and describe the central or average
value of a dataset. Three commonly employed measures in this regard
are the mean, median, and mode. Each of these measures provides a
different perspective on the central tendency of a dataset and is suitable
under specific circumstances.
Mean: The mean, often referred to as the average, is calculated by
summing up all values in a dataset and dividing the total by the number
of observations. This measure is appropriate when dealing with a dataset
that is normally distributed or follows a symmetrical pattern. For
instance, when examining the average income of a population or the
average test scores of students, the mean is a reliable measure. However,
the mean can be sensitive to extreme values or outliers, making it less
suitable for skewed distributions.
Median: The median is the middle value of a dataset when it is arranged
in ascending or descending order. If there is an even number of
observations, the median is the average of the two middle values. The
median is particularly useful in scenarios where the data is skewed or
contains outliers. For example, when analyzing income data, the median
provides a more robust measure than the mean because it is less
influenced by extreme values. It accurately represents the center of the
distribution without being skewed by outliers.
Mode: The mode represents the most frequently occurring value in a
dataset. Unlike the mean and median, the mode is not affected by
extreme values or the shape of the distribution. The mode is appropriate
for categorical data or datasets with distinct peaks. For instance, in a
survey asking people about their preferred mode of transportation, the
mode would indicate the most commonly chosen option. However, it is
important to note that a dataset may have one mode (unimodal), two
modes (bimodal), or more (multimodal).
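The sketch below computes all three measures in R for a small, made-up vector of scores. Base R's mode() reports an object's storage type rather than the statistical mode, so a short helper function is defined for the latter.

scores <- c(72, 85, 85, 90, 60, 85, 78, 95, 300)   # 300 is an outlier

mean(scores)      # pulled upward by the outlier
median(scores)    # robust middle value

# Helper: most frequently occurring value (statistical mode)
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[which.max(freq)])
}
stat_mode(scores)  # 85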
Measures of Dispersion: Measures of dispersion are statistical tools that
help to quantify the extent to which individual values in a dataset deviate
from the central tendency. Three common measures of dispersion are
range, variance, and standard deviation. Each of these measures provides
insights into the spread or variability of numerical data.
Range: The range is the simplest measure of dispersion and is calculated
as the difference between the maximum and minimum values in a dataset.
While easy to compute, the range is sensitive to outliers and extreme
values. It provides a basic understanding of the spread by indicating the
interval within which most values lie. However, it doesn't account for the
distribution of values within that range.
Variance: Variance is a more sophisticated measure of dispersion that
takes into account the deviation of each data point from the mean. It
involves squaring the difference between each data point and the mean,
summing up these squared differences, and then dividing by the number
of observations. Variance provides a comprehensive view of the spread
of values, but the squared units make interpretation challenging.
Standard Deviation: The standard deviation is the square root of the
variance. It is a widely used and more interpretable measure of dispersion.
By taking the square root, the standard deviation is returned to the original
units of the data, making it more intuitive. A smaller standard deviation
indicates that data points tend to be closer to the mean, while a larger
standard deviation suggests greater variability. Standard deviation is
particularly useful when the distribution of the data is approximately
normal, as it allows for a clear understanding of how much individual
values deviate from the central tendency.
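All three measures are available directly in base R, as the brief sketch below shows; note that var() and sd() use the sample (n - 1) denominator.

scores <- c(72, 85, 85, 90, 60, 85, 78, 95)

range(scores)          # minimum and maximum values
diff(range(scores))    # range expressed as a single number
var(scores)            # sample variance (in squared units)
sd(scores)             # standard deviation (back in original units)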
1.5.2 Categorical Data
Categorical data constitutes a fundamental element in the realm of
statistics and data analysis. It refers to information that can be divided
into distinct categories or groups based on qualitative attributes.
Unlike numerical data, which involves measurable quantities, categorical
data encompasses non-numeric information that falls into distinct classes
or labels. The prevalence of categorical data is ubiquitous across various
fields, playing a pivotal role in capturing and interpreting information
that is not inherently numerical.
In fields such as sociology, psychology, and market research, categorical
data is commonly employed to classify individuals based on
characteristics such as gender, marital status, or consumer preferences.
Medical studies utilize categorical data to categorize patients into
different diagnostic groups, while educational research may classify
students based on academic performance or learning styles. Categorical
variables are also integral in areas like linguistics, where language
elements are categorized into phonetic, syntactic, or semantic classes.
The nature of categories in categorical data is inherently discrete and
qualitative. These categories may represent nominal variables, where
there is no inherent order or ranking, such as colors or types of fruits.
Alternatively, they can represent ordinal variables, where there is a
meaningful order but the differences between categories are not uniform,
as seen in survey responses like "strongly agree," "agree," "neutral,"
"disagree," and "strongly disagree."
Types of Categorical Data:
Nominal vs. Ordinal Data: Researchers often encounter two
fundamental types: nominal and ordinal. These designations are pivotal
for understanding and interpreting data, as they convey distinct levels of
information hierarchy.
Nominal data is characterized by categories that are devoid of any
inherent order or ranking. In other words, the classification is merely
nominal or in name. These categories serve as labels for different groups
without implying any particular sequence or significance. For instance,
when analyzing the favorite colors of a group of individuals, the resulting
data – such as "red," "blue," or "green" – falls under the umbrella of
nominal categorization. In this context, the colors are merely labels and
do not possess a prescribed order.
On the contrary, ordinal data introduces a sense of order or hierarchy
among categories. While the categories still represent distinct groups,
they are arranged in a meaningful sequence that implies a certain level of
ranking or preference.
An illustrative example of ordinal data is the ranking of satisfaction levels
in a customer survey, where responses could be categorized as "very
dissatisfied," "dissatisfied," "neutral," "satisfied," and "very satisfied."
Here, the ordinal nature of the data is evident, as there is a clear order
reflecting varying degrees of satisfaction.
To further elucidate the difference between nominal and ordinal data,
consider a scenario where individuals are classified based on their
educational qualifications. If the categories are "high school diploma,"
"bachelor's degree," and "master's degree," the data is nominal, as there is
no inherent order among these qualifications. On the other hand, if the
categories are arranged hierarchically, such as "high school diploma,"
"bachelor's degree," and "master's degree" in ascending order of
educational attainment, the data becomes ordinal.
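R encodes this distinction through unordered and ordered factors; the values in the sketch below are hypothetical.

# Nominal: labels with no inherent order
fav_colour <- factor(c("red", "blue", "green", "blue"))
levels(fav_colour)

# Ordinal: levels given in a meaningful order, with ordered = TRUE
education <- factor(c("bachelor", "high school", "master", "bachelor"),
                    levels = c("high school", "bachelor", "master"),
                    ordered = TRUE)
education < "master"    # order-based comparisons are permitted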

Binary Data: Binary categorical data is a specific type of categorical data
that is characterized by two mutually exclusive and exhaustive categories.
These categories are often dichotomous in nature, meaning they represent
two distinct and opposite outcomes. Understanding the characteristics of
binary data is essential for researchers and analysts in various fields, as it
is a prevalent form of categorical information.
Characteristics of Binary Categorical Data:
 Two Categories: The primary feature of binary data is the
existence of only two possible categories. These categories are
typically labeled as 0 and 1, or more descriptively, as "success"
and "failure," "yes" and "no," or "positive" and "negative."
 Mutual Exclusivity: Each observation or case falls into one and
only one of the two categories. There is no overlap or ambiguity in
the classification of the data.
 Exhaustiveness: The two categories together encompass all
possible outcomes. Every observation must belong to either
category, leaving no room for unclassified or undefined instances.
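In R, binary data of this kind is typically stored as a logical vector or a two-level factor, as in the hypothetical example below.

purchased <- c(TRUE, FALSE, TRUE, TRUE, FALSE)   # logical coding
table(purchased)                                 # counts for each category

response <- factor(c("yes", "no", "yes", "no", "no"),
                   levels = c("no", "yes"))
prop.table(table(response))                      # proportion in each category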

Descriptive Statistics for Categorical Data:


Frequency Distributions: Frequency distributions serve as a
fundamental concept in statistics, particularly when dealing with
categorical data. They provide a systematic way of organizing and
summarizing the distribution of different categories within a dataset.
Understanding how to create and interpret frequency tables is crucial for
gaining insights into the patterns and characteristics of categorical data.
Concept of Frequency Distributions:
A frequency distribution is a tabular representation that displays the
number of occurrences or frequency of each category within a categorical
dataset. The purpose is to organize and summarize the data in a way that
facilitates a clear understanding of the distribution of values.
Creating a Frequency Table:

 Identify Categories: Begin by identifying the distinct categories
present in the dataset. These could be nominal or ordinal variables.
 Count Frequencies: For each category, count the number of
occurrences or frequency in the dataset.
 Construct the Table: Create a table with two columns: one for the
categories and another for their corresponding frequencies.
List each category and its frequency in the respective columns.

Interpreting Frequency Tables:

 Central Tendency: The category with the highest frequency is
often considered the mode of the distribution, representing the
most common occurrence in the dataset.
 Dispersion and Spread: Examining the range of frequencies
across different categories provides insights into the spread or
variability of the data.
 Patterns and Trends: Observing the distribution pattern can
reveal trends, clusters, or gaps in the data. This information is
valuable for identifying patterns or anomalies.
Example: Frequency Table for Favourite Colours:


Consider a dataset recording the favourite colours of a group of
individuals:

Colour    Frequency
Red       15
Blue      20
Green     12
Yellow    8
Orange    5

In this example, the frequency table shows the number of individuals
favoring each color. It is evident that blue is the most preferred color,
occurring 20 times, making it the mode. The table also provides a quick
overview of the distribution of preferences.
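The same table can be produced in R with table(); the vector below is constructed only so that it reproduces the counts shown above.

# Rebuild the favourite-colour responses from the counts above
colours <- rep(c("Red", "Blue", "Green", "Yellow", "Orange"),
               times = c(15, 20, 12, 8, 5))

freq <- table(colours)
freq                                 # frequency table
names(which.max(freq))               # the mode: "Blue"
prop.table(freq)                     # relative frequencies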
Usefulness of Frequency Distributions:

 Data Summary: Frequency distributions offer a concise summary
of categorical data, making it more manageable and interpretable.
 Comparison: By comparing frequencies across categories, one
can identify patterns, preferences, or variations within the dataset.
 Inform Decision-Making: Understanding the distribution of
categorical data is crucial for making informed decisions and
drawing meaningful conclusions.

Bar Charts:
Bar charts are effective graphical representations used to visually convey
the distribution of categorical data. They are particularly suitable for
illustrating the frequency or proportion of different categories within a
dataset. Bar charts provide a clear and accessible way to interpret
categorical information, making them a popular choice in data
visualization.
Bar Charts for Categorical Data:

 Basic Bar Chart:
 In a basic bar chart, each category is represented by a rectangular
bar, and the height of the bar corresponds to the frequency or
proportion of that category.
 The categories are displayed on the horizontal axis (x-axis), while
the frequencies or proportions are depicted on the vertical axis (y-
axis).

Fig: 1.4: Bar Chart

 Grouped Bar Charts:
 Grouped bar charts are employed when there is a need to
compare the distribution of multiple categorical variables side
by side.
 In this format, each category is represented by a group of bars,
and each group corresponds to a different variable. This allows
for easy visual comparison across categories.
 Stacked Bar Charts:
 Stacked bar charts are useful for illustrating the composition of
a whole in terms of subcategories.
 Instead of side-by-side bars, each bar is divided into segments,
with each segment representing a different subcategory. The
total height of the bar reflects the overall frequency or
proportion.
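The sketch below draws the favourite-colour frequencies with base R's barplot(); the gender split used for the grouped and stacked variants is invented purely for illustration.

freq <- c(Red = 15, Blue = 20, Green = 12, Yellow = 8, Orange = 5)

# Basic bar chart: one bar per category
barplot(freq, main = "Favourite Colours",
        xlab = "Colour", ylab = "Frequency")

# Grouped and stacked variants use a matrix of counts
by_gender <- rbind(Female = c(8, 12, 7, 5, 2),
                   Male   = c(7,  8, 5, 3, 3))
colnames(by_gender) <- names(freq)
barplot(by_gender, beside = TRUE,  legend.text = TRUE)   # grouped bars
barplot(by_gender, beside = FALSE, legend.text = TRUE)   # stacked bars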
Check Your Progress-4


1. How does numerical data contribute to scientific investigations?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
2. How is categorical data utilized in fields like market research and
sociology?
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………
……………………………………………………………………………

1.6 LET US SUM UP

 Statistics is pivotal in diverse fields like economics, sociology, and
medicine, facilitating decision-making by analyzing, interpreting,
and organizing data.
 Data is categorized into primary (directly collected) and secondary
(from existing sources), influencing reliability and applicability in
statistical studies.
 Surveys, interviews, and experiments gather firsthand information,
ensuring accuracy and relevance tailored to specific research
needs.
 Derived from existing sources, secondary data is cost-effective but
requires critical assessment of relevance and reliability for accurate
and meaningful results.
 Surveys capture opinions on a large scale, interviews provide in-
depth insights, observations study natural settings, and
experiments establish cause-and-effect relationships through
variable manipulation.
 Surveys, employing questionnaires or interviews, systematically
collect quantitative data, making them effective for large-scale
information gathering on preferences and opinions.
 Interviews allow researchers to explore nuances and gather
detailed qualitative data by engaging directly with participants
and asking open-ended questions.
 Observational studies systematically observe and record events or
behaviors, particularly useful when studying natural settings
without interference.
 Experiments manipulate variables to observe outcomes,
facilitating the establishment of cause-and-effect relationships in
understanding influencing factors.

1.7 KEYWORDS

 Statistics: Analysis and interpretation of data to uncover patterns,
trends, and relationships for informed decision-making.
 Data: Raw information used for analysis, providing insights into
various fields and supporting decision-making processes.
 Primary Data: Firsthand information directly collected from
original sources, tailored and specific to particular research.
 Secondary Data: Derived from existing sources, like books or
databases, offering convenient and cost-effective information for
research.
 Surveys: Systematic collection of information from a sample
through questionnaires or interviews for large-scale data gathering.
 Interviews: In-depth data collection method where researchers
engage directly with participants, exploring perspectives through
open-ended questions.
 Observations: Systematic watching and recording of events or
behaviors, useful for studying natural settings without interference.
 Experiments: Designed studies manipulating variables to
establish cause-and-effect relationships, observing outcomes for
drawing conclusions.
 Numerical Data: Measurable quantities presented using
statistical measures like mean, median, standard deviation, and
visual representations.
 Categorical Data: Consists of distinct categories or groups, often
represented visually through charts, graphs, and tables.

1.8 SOME USEFUL BOOKS

 Moore, D. S., McCabe, G. P., & Craig, B. A. (2014). Introduction
to the Practice of Statistics. Macmillan Higher Education.
 Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics
Using R. SAGE.
 Wickham, H., & Grolemund, G. (2016). R for Data Science.
“O’Reilly Media, Inc.”
 Lock, R. H., Lock, P. F., Morgan, K. L., Lock, E. F., & Lock, D.
F. (2020). Statistics. Wiley Global Education.
 Cody, R. P., & Smith, J. K. (1991). Applied Statistics and the SAS
Programming Language.

1.9 ANSWERS TO CHECK YOUR PROGRESS

Refer 1.2 for Answer to check your progress- 1 Q. 1
Statistics enhances the reliability of research findings by providing
rigorous methods, such as hypothesis testing and confidence intervals, to
assess the significance of results. This ensures that conclusions are not
merely based on chance and adds a level of objectivity to the interpretation
of data.
Refer 1.2 for Answer to check your progress- 1 Q. 2
Measures of central tendency, including means, medians, and modes, help
summarize the central or typical values in a dataset. They provide a clear
reference point for understanding the distribution of data, aiding
researchers in describing and comparing different sets of information with
a single, standardized value.
Refer 1.3 for Answer to check your progress- 2 Q. 1
Primary data provides researchers with maximum control over data
collection methods, allowing for tailored approaches like field surveys and
interviews. It offers flexibility to address specific research questions and
ensures currency, as data is collected firsthand for the current analysis
objectives.
Refer 1.3 for Answer to check your progress- 2 Q. 2
Secondary data analysis faces limitations such as incomplete capturing of
target behaviors and the need for careful review of metadata and data
collection methods. Researchers must assess the fit of available variables
and cohorts with their research aims and consider data-generating
processes for appropriate interpretation.
Refer 1.4 for Answer to check your progress- 3 Q. 1
Quantitative techniques, such as surveys, focus on numerical data gathered
from representative samples to characterize distributions, frequencies, and
correlations. In contrast, qualitative approaches, like focus groups and in-
depth interviews, collect non-numerical data, exploring subjective
narratives and contextual meanings not easily captured by quantifiable
metrics.
Refer 1.4 for Answer to check your progress- 3 Q. 2
Alignment with research questions, disciplinary conventions, participant
accessibility, investigator skills, and resource availability is crucial for
effective data collection. It ensures that the chosen methodology is well-
suited to extract relevant information and insights, enhancing the validity
and reliability of the research findings.
Refer 1.5 for Answer to check your progress- 4 Q. 1
Numerical data, being measurable and precise, enables researchers to
conduct rigorous analyses and draw statistical inferences, providing a
quantitative understanding of phenomena. It serves as the foundation for
various scientific investigations, from experimental measurements to
survey responses, allowing researchers to discern variations, identify
trends, and derive meaningful conclusions.
Refer 1.5 for Answer to check your progress- 4 Q. 2
In market research, categorical data is commonly employed to classify
consumers based on preferences, while in sociology, it helps categorize
individuals based on attributes like gender or marital status. These
qualitative classifications enable researchers to capture and interpret
information that goes beyond measurable quantities, providing insights
into various social and consumer behaviors.

1.10 TERMINAL QUESTIONS

1. How does the distinction between primary and secondary data
impact the reliability and applicability of statistical studies?
2. What are the advantages and disadvantages of using surveys as a
method for collecting data in statistical studies?
3. In what situations would observational studies be more appropriate
than experiments for collecting data, and vice versa?
4. Can you explain the significance of numerical data presentation,
including statistical measures like mean, median, and standard
deviation?
5. How does the use of interviews contribute to a more in-depth
understanding of data in statistical research, especially when
compared to other data collection methods?
