Module 2
Learning Outcomes:
1. Discuss the different methods used to collect data.
2. Choose an appropriate type of data representation to present data effectively.
Data Collection
Data collection is the process of gathering and measuring information on variables of interest, in
an established systematic fashion that enables one to answer stated research questions, test
hypotheses, and evaluate outcomes.
The most critical objective of data collection is ensuring that information-rich and reliable
data is collected for statistical analysis so that data-driven decisions can be made for research.
Inaccurate data collection can impact the results of a study and ultimately lead to invalid results.
Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or
transmitting in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document,
without the prior written permission of SLU, is strictly prohibited.
Observation is a way of collecting data through observing. The observation data collection
method is classified as a participatory study because the researcher has to immerse herself in the
setting where her respondents are while taking notes and/or recording.
Advantages of observation
1. direct access to research phenomena,
2. high levels of flexibility in terms of application, and
3. generating a permanent record of phenomena to be referred to later.
Disadvantages of observation
1. longer time requirements,
2. high levels of observer bias, and
3. impact of observer on primary data, in a way that the presence of an observer may
influence the behavior of sample group elements.
An interview is generally a qualitative research technique that involves asking open-ended
questions to converse with respondents and elicit data about a subject (QuestionPro, 2020).
The interviewer, in most cases, is the subject matter expert who intends to understand
respondent opinions through a well-planned and executed series of questions and answers. Interviews are
conducted with a sample from a population, and their key characteristic is their
conversational tone.
Interviews offer the researchers a platform to prompt their participants and obtain inputs in
the desired detail. There are three fundamental types of interviews in research: structured interviews,
semi-structured interviews, and unstructured interviews.
A structured interview is a research tool that is extremely rigid in its operation
and allows very little or no scope for prompting the participants to obtain and analyze results. It is
thus also known as a standardized interview and is significantly quantitative in its approach.
Questions in this interview are pre-decided according to the required detail of information. They can
be closed-ended as well as open-ended, according to the type of target population. Closed-ended
questions can be included to understand user preferences from a collection of answer options. In
contrast, open-ended questions can be included to gain details about a particular section in the interview.
5. As the scope of detail is already considered while designing the interview, better
information can be obtained. The researcher can analyze the research problem
comprehensively by asking accurate research questions.
6. Since the interview structure is fixed, it often generates reliable results and is quick to
execute.
7. The relationship between the researcher and the respondent is not formal. The
researcher can clearly understand the margin of error if the respondent either
declines to be a part of the survey or is just not interested in providing the correct
information.
Advantages of semi-structured interviews:
1. Questions of semi-structured interviews are prepared before the scheduled interview,
allowing the researcher to prepare and analyze the questions.
2. It is flexible to an extent while maintaining the research guidelines.
3. Researchers can express the interview questions in the format they prefer, unlike the
structured interview.
4. Reliable qualitative data can be collected via these interviews.
5. Flexible structure of the interview.
An unstructured interview, also called an in-depth interview, has the least number
of questions, as it leans more towards a normal conversation but with an underlying subject. There
are no guidelines for the researchers to follow, so they can approach the participants in any ethical way to
gain as much information as possible for their research topic. Since there are no guidelines for these
interviews, a researcher is expected to keep their approach in check so that the respondents do not
stray from the main research motive. For a researcher to obtain the desired outcome, he/she
must keep the following factors in mind:
1. Intent of the interview.
2. The interview should primarily take into consideration the participant’s interests and skills.
3. All the conversations should be conducted within permissible limits of research and the
researcher should try and stick by these limits.
4. The skills and knowledge of the researcher should match the purpose of the interview.
5. Researchers should understand the do’s and don’ts of unstructured interviews.
Advantages of Unstructured Interviews:
1. Due to the informal nature of unstructured interviews, it becomes extremely easy
for researchers to try and develop a friendly rapport with the participants. This leads
to gaining insights in extreme detail without much conscious effort.
2. The participants can clarify all their doubts about the questions and the researcher
can take each opportunity to explain his/her intention for better answers.
3. There are no questions which the researcher has to abide by and this usually
increases the flexibility of the entire research process.
There are three methods to conduct research interviews, each of which is distinct in its
application and can be used according to the research study requirement.
A personal interview, also called a face-to-face interview, is utilized when a specific target
population is involved. The purpose of conducting a personal interview survey is to explore
people's responses in order to gather more and deeper information.
Personal interviews are one of the most used types of interviews, where the questions are
asked directly and in person to the respondent. For this, a researcher can have online guide surveys to
take note of the answers. A researcher can design his/her survey so that he/she can take note of the
comments or points of view that stand out from the interviewee.
Advantages of the personal interview:
1. Higher response rate.
2. When the interviewer and the respondent are face-to-face, the questions can be
adapted if they are not understood.
3. More complete answers can be obtained if there is doubt on either side or if particularly
remarkable information is detected.
4. The researcher has an opportunity to detect and analyze the interviewee’s body
language at the time of asking the questions and taking notes about it.
A telephone interview is one in which the interviewer communicates with the respondent by telephone
in accordance with the prepared questionnaire. Usually, standardized questionnaires with
closed-ended questions are recommended for this kind of questioning.
Computer-Assisted Personal Interviewing (CAPI) is a face-to-face data collection method in
which the interviewer uses a tablet, mobile phone, or a computer to record answers given during
the interview (Beam, 2019).
The primary purpose of CAPI is to conduct large-scale continuous surveys for the
commercial sector and government. CAPI defies traditional paper questionnaires and adopts a
face-to-face stance, which has had enormous effects on the quality of data.
Advantages of CAPI:
1. Time – Because CAPI is purely electronic, it avoids the time-consuming step of
transferring answers from a paper questionnaire into a computer. CAPI software systems
also provide data entry, checking, and exportation all in one place.
2. Exposure – The program can be incorporated on to the internet, potentially attracting
a global audience.
3. Cost – With CAPI you can store data online and offline, eliminating any printing and
data-entry costs.
4. Accurate results – CAPI software systems provide analysis of results in real-time,
which are easily exportable to Excel or CSV, reducing the risk of human error in data entry.
Disadvantages of CAPI:
1. Due to the effectiveness of this market research tool, there may be additional time
spent on preparation e.g. programming and procurement.
2. Practicalities such as technical difficulties, internet access, and accessibility could
affect the development of research.
A questionnaire is a set of standardized questions, often called items, which follow a fixed scheme
to collect individual data about one or more specific topics (Reddy, 2019).
A questionnaire provides the fastest and simplest technique for gathering data about groups
of individuals scattered over a wide area. In this method, a questionnaire form is sent,
usually by post, to the persons concerned, with a request to answer the questions and return the
questionnaire.
Paper-pencil questionnaires can be sent to a large number of people and save the
researcher time and money. People are more truthful while responding to the questionnaires
regarding controversial issues because their responses are anonymous. But they also have
drawbacks. The majority of the people who receive questionnaires do not return them, and those
who do might not be representative of the originally selected sample.
Web-based questionnaires are a newer and steadily growing methodology for Internet-based
research. A respondent typically receives an e-mail containing a link that
leads to a secure website where the questionnaire is filled in. This type of research is often quicker
and less detailed. Some disadvantages of this method include the exclusion of people who do not
have a computer or cannot access one. Also, the validity of such surveys is in question, as
people might be in a hurry to complete them and so might not give accurate responses.
Advantages of Questionnaire:
1. Questionnaires are inexpensive when appropriately handled. They can be cheaper
than taking surveys, which requires a lot of time and money.
2. It is an effective method to get an opinion from a large number of people.
3. Unlike face-to-face surveys where the respondent has to answer within that moment
itself, questionnaires give time to the respondents to think carefully, before giving
the answers.
4. They are easy to administer and manage.
5. Questionnaires allow people to answer questions when they feel it is convenient.
Thus, it is more applicable than face-to-face surveys where people are expected to
reply to the question immediately.
6. If anonymous, more honest answers can be expected from the people being
surveyed.
7. Used for getting answers from a large group of people in a short space of time.
Disadvantages of Questionnaire:
1. The results for questionnaires are based only on the type of question being asked. If
the questions are poorly worded or are biased, then the result analyzed will also be
of the same nature.
2. The response rate may be poor in questionnaires if people do not have time or do
not feel any importance in answering them. This is one of the main disadvantages of
questionnaires.
3. Open-ended questions may take a long time and will produce a large amount of
data that will take time to analyze.
4. If there are any doubts about the answers, the analyst cannot trace them back to the
respondents, since most questionnaires are anonymous.
5. Questionnaires can also give the respondents freedom to lie, resulting in vague
answers or opinions distant from the main issue.
6. Questionnaires do not explain the questions to the respondents, which might lead to
misinterpreted answers and facts.
7. If ambiguous language is used, it might be confusing for the respondent
to answer such questions.
Experimentation is a controlled study in which the researcher attempts to understand
cause-and-effect relationships.
The study is "controlled" in the sense that the researcher controls (1) how subjects are
assigned to groups and (2) which treatments each group receives. It involves manipulating one
variable to determine if changes in one variable cause changes in another variable. The variables
that you manipulate are referred to as independent, while the variables that change due to
manipulation are dependent variables.
Example: Medical technologists would like to know the effect of a new brand of vitamins on
toddlers' growth. The new brand will be taken by one set of toddlers, while another set will be given
the existing brand. The growth of the toddlers will then be compared to determine which vitamins are
better.
Advantages of Experimentation:
1. The biggest advantage of the experimental method is its unique ability to isolate
causal factors since an experiment is highly controlled.
2. This method promises more accuracy in the study.
3. Reliable data can be collected.
4. This is more suitable for the problem with heterogeneous (varied) influencing factors.
Disadvantages of Experimentation
1. The disadvantage is that this control may distort the validity of the obtained results,
especially the ecological validity.
2. This is a very costly method.
3. This is suitable for simple problems with limited scope.
4. This is a time-consuming method.
Registration refers to the continuous, permanent, compulsory recording of the occurrence of vital
events together with certain identifying or descriptive characteristics concerning them. It is the
gathering of information enforced by law.
Examples are the number of registered professionals, which can be found at the Professional
Regulation Commission (PRC), and births and deaths, which are registered with the Philippine
Statistics Authority (PSA), formerly the National Statistics Office (NSO).
Advantages of Registration:
1. This method is the most reliable since laws enforce it. This method promises more
accuracy in the study.
Disadvantages of Registration
1. Data are limited to what is listed in the document.
Data Presentation
Data are gathered to provide a partial picture of reality. Regardless of the use they are intended to
serve, one must always consider what information the data are conveying and what
must be done to include more useful information. Since most data are available in a raw format,
they must be summarized and organized to derive such useful information from them. Furthermore,
each data set needs to be presented in a certain way depending on its use. Planning how the data
will be presented is essential before processing the raw data.
Data Visualization is a term to describe the use of graphical displays to summarize and
present information about a data set. Data become more comprehensible and more useful when
they are organized and presented using graphs, frequency distribution tables, charts, diagrams, and
the like to derive logical solutions and conclusions.
Data Patterns in Graphs
The data patterns are commonly described in terms of the center, spread, shape, and other
unusual features.
Center. The point in a graphic display where about half of the observations are on either
side.
Spread. This refers to the variability of the data. If the observations cover a wide range, then
the spread is larger. On the other hand, the spread is smaller when the observations are clustered
around a single value.
Shape. It is described by the following characteristics:
Symmetry. Graph can be divided at the center so that each half is a mirror image of
the other.
Number of peaks. A distribution with one peak is referred to as unimodal, while a
distribution with two peaks is bimodal.
Skewness. Some distributions have more observations on one side of the graph
than the other. A distribution with fewer observations on the right (toward higher
values) is said to be skewed to the right. On the other hand, distribution with fewer
observations on the left (toward lower values) is said to be skewed to the left.
Uniform. Data distribution is equally spread across the range of the distribution.
Unusual features.
Gaps. Areas of a distribution where there are no observations.
Outliers. The distribution of data is sometimes characterized by extreme values that
greatly differ from the other observations.
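As a rough illustration of center and spread, the following R sketch computes the median (the value with about half of the observations on either side) and the range of a small data set. The values in x are hypothetical and are not taken from this module.

```r
# Hypothetical observations, used only to illustrate center and spread.
x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 9, 20)

# Center: the median splits the ordered data into two halves.
center <- median(x)

# Spread: the distance between the smallest and largest observation.
spread <- diff(range(x))

print(center)
print(spread)
```

The value 20 lies far from the rest of the data, so it would be a candidate outlier when the distribution is graphed.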
FREQUENCY DISTRIBUTION TABLE (FDT). A frequency distribution is a table that shows how
often each value (or set of values) of the variable in question occurs in a data set. It is used to
summarize categorical (qualitative) or numerical (quantitative) data. Simply put, it is a tabular
summary of data showing the number or frequency of observations in each of several
non-overlapping categories or classes.
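The relative frequency of a class equals the fraction or proportion of the observations
belonging to that class or category. Thus, the relative frequency can be computed using
\[ \text{relative frequency of a class} = \frac{\text{frequency of the class}}{n}, \]
where n is the total number of observations.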
A relative frequency distribution gives a tabular summary of data showing the relative
frequency for each class. If the relative frequency is multiplied by 100, we get the percent frequency
of a class. A percent frequency distribution summarizes the percent frequency of the data for each
class.
The frequency distribution table for this data set can be constructed manually or by using the
PivotTable feature of Microsoft Excel. With some editing, below are the frequency, relative
frequency, and percent frequency tables generated:
RSTUDIO
Using RStudio, the task can be completed by running the following R code in the Console window.
We will use the “purchase.csv” file in our working directory.
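The original script is not reproduced in this copy, so the following is only a minimal sketch of how such a frequency table could be built in base R. The layout of “purchase.csv” is not shown here; the single column name Brand and the values "Brand A", "Brand B", and "Brand C" are assumptions, and a small stand-in file is written to a temporary folder so that the sketch runs on its own.

```r
# Stand-in for "purchase.csv": the column name Brand and its values
# are hypothetical, since the real file is not shown in this module.
csv_path <- file.path(tempdir(), "purchase.csv")
writeLines(c("Brand", "Brand A", "Brand B", "Brand A",
             "Brand C", "Brand A", "Brand B"), csv_path)

# Import the data. With the real file in the working directory,
# this line would read: purchase <- read.csv("purchase.csv")
purchase <- read.csv(csv_path, stringsAsFactors = FALSE)

# Count how often each category occurs.
freq <- table(purchase$Brand)
print(freq)
```

With the actual data file, only the read.csv() line and the column name would need to change.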
The same R code or script can also be written in the Source window or pane if you want to
keep a copy of the scripts you write in RStudio. First, we create a new R script file by clicking on the
File menu, then click on New File and select R Script. The same result can be obtained by using the
hot keys Ctrl+Shift+N.
Write the R code on the Source window. You should have something similar to
the figure below.
R script for the frequency distribution table for the soft drink purchase data.
Save the R script file. R script files are named with a .R extension. Click on the save icon on
the Source window and browse to your set working directory. Name the file purchase.R.
After saving the file, execute the script by highlighting all the lines on the Source window and then
clicking on the ‘Run’ icon on the upper right part of the Source window. As an alternative to the
‘Run’ icon, you can press Ctrl+Enter to run the script. Take note of this.
For the relative frequency table, we can run the following R script.
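The script itself is missing from this copy. One possible base R version, given here only as a sketch, uses prop.table() to divide each frequency by the total number of observations. The vector brand holds hypothetical stand-in values so that the snippet runs by itself; in a continuing RStudio session, the column imported earlier could be used instead.

```r
# Hypothetical stand-in data so the snippet runs on its own.
brand <- c("Brand A", "Brand B", "Brand A", "Brand C", "Brand A", "Brand B")

freq     <- table(brand)              # frequency of each category
rel_freq <- prop.table(freq)          # frequency divided by total count
pct_freq <- round(100 * rel_freq, 1)  # percent frequency

# Combine the three summaries into one data frame for display.
fdt <- data.frame(
  category  = names(freq),
  frequency = as.vector(freq),
  relative  = as.vector(rel_freq),
  percent   = as.vector(pct_freq)
)
print(fdt)
```

The relative frequencies always sum to 1, which is a quick way to check the table.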
Note that since the dataset was already imported in RStudio from the previous R script,
there is no need to import the data again. Also, since the packages were already installed and
loaded from the previous R script, there is no need to repeat these commands.
Example 7. An engineering school arranged a charity concert to raise funds for COVID-19
patients. The following data give the status of 40 randomly selected students who attended the
concert. The numbers 1, 2, 3, and 4 represent the categories freshman, sophomore, junior, and
senior, respectively:
The table below shows the frequency, relative frequency, and percent frequency for the data
in just one table. Note that in practice, it is customary only to include one such type of frequency.
In this example, the frequency table constructed is for ungrouped data, which means that
the individual values do not lose their identity in the table.
RSTUDIO
To do this in RStudio, let us consider a different approach: constructing a vector
representing the data values. Open a new R script file, then enter and run the following script.
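The original script and the 40 observed values are not reproduced in this copy. The sketch below shows the vector approach with a short, hypothetical set of year-level codes; the object names status and status_f are also assumptions.

```r
# Hypothetical year-level codes (not the 40 values from the example).
status <- c(1, 3, 2, 4, 1, 2, 2, 3, 1, 4, 2, 1)

# Attach the category labels to the numeric codes 1 to 4.
status_f <- factor(status, levels = 1:4,
                   labels = c("Freshman", "Sophomore", "Junior", "Senior"))

# Frequency and relative frequency of each year level.
freq     <- table(status_f)
rel_freq <- prop.table(freq)
print(freq)
print(rel_freq)
```

Converting the codes to a factor keeps the categories in their natural order and labels the table with names instead of numbers.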
Frequency distribution table for the number of cars registered in each household
BAR GRAPH. It represents the data by using vertical or horizontal bars whose heights or
lengths denote the frequencies of the data. It can be used to represent qualitative or categorical
data. For a vertical bar chart, the
horizontal (x) axis represents the categories; the vertical (y) axis represents a value (frequency,
relative frequency, or percent frequency) for those categories.
The figure below shows the bar chart of the data on soft drink purchases.
RSTUDIO
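The script for this part is not reproduced in this copy. As a stand-in, the sketch below draws a vertical and a horizontal bar chart with base R's barplot(), which may differ from the plotting functions used in the original script. The data values are hypothetical.

```r
# Hypothetical purchase data; the real values are not shown in this module.
brand <- c("Brand A", "Brand B", "Brand A", "Brand C", "Brand A", "Brand B")
freq  <- table(brand)

# Vertical bars: categories on the x-axis, frequencies on the y-axis.
barplot(freq,
        main = "Soft drink purchases (hypothetical data)",
        xlab = "Brand", ylab = "Frequency", col = "steelblue")

# Horizontal bars: the same frequencies drawn as bar lengths.
barplot(freq, horiz = TRUE,
        main = "Soft drink purchases (hypothetical data)",
        xlab = "Frequency", ylab = "Brand", col = "steelblue")
```

Passing prop.table(freq) instead of freq would plot relative frequencies on the value axis.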
Just a note: you need not assign the bar graphs to the objects bar1 and bar2. Removing
these assignments in the script would generate the bar charts right away. Also, the bars will be
shown in the Plots window of RStudio, where you have the options to “Save as Image”, “Save as
PDF”, or “Copy to Clipboard” once you click on the “Export” icon on the Plots window.
PIE CHART. A pie chart (also called a pie graph or circle graph) provides another graphical device for
presenting relative frequency and percent frequency distributions for qualitative data. The circle is
subdivided into sectors, and the numerical values shown for each sector can be frequencies,
relative frequencies, or percent frequencies.
A pie chart makes use of sectors (slices) in a circle. The angle of a sector is proportional to
the frequency of each of the categories of the variable that defines the data. The formula to
determine the angle of a sector in a circle graph is:
\[ \text{angle of sector} = \frac{\text{frequency of the category}}{n} \times 360^\circ, \]
where n is the total number of observations.
The figure below shows the pie chart of the data on soft drink purchases generated using
Microsoft Excel.
RSTUDIO
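The script for this part is not reproduced in this copy. The following base R sketch, using hypothetical data, computes each sector angle as the category's share of the observations times 360 degrees and then draws the chart with pie(). The original script may have used a different function.

```r
# Hypothetical purchase data; the real values are not shown in this module.
brand <- c("Brand A", "Brand B", "Brand A", "Brand C", "Brand A", "Brand B")
freq  <- table(brand)

# Sector angle = (frequency / total number of observations) * 360 degrees.
angle <- as.vector(freq) / sum(freq) * 360
names(angle) <- names(freq)
print(angle)

# Label each slice with its category and percent frequency.
pct <- round(100 * as.vector(freq) / sum(freq), 1)
pie(freq,
    labels = paste0(names(freq), " (", pct, "%)"),
    main = "Soft drink purchases (hypothetical data)")
```

The angles of all sectors add up to 360 degrees, just as the percent frequencies add up to 100.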
DOT PLOT. It is a graphical display of data using dots. It is similar to a bar graph because the
height of each “bar” of dots is equal to the number of items in a particular category. To draw a dot
plot, count the number of data points falling in each category and draw a stack of that many dots
for each category. A dot plot can be used as a graphical display of the frequency of qualitative
and quantitative (ungrouped) data.
Example 10. The figure that follows shows the dot plot for the number of students, classified
according to year, who went to the concert:
R Script
Here we present two ways by which a dot plot is constructed. First is by importing a .csv data
file from MS Excel, which is very useful especially if we have a large data set, and the other way is by
constructing the data vector in the RStudio environment. This is applicable if we would be dealing
with a small set of data. The following are the scripts. For the first method, we use the “concert.csv”
data from our directory.
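The original scripts are not reproduced in this copy. As a rough stand-in for the second method (a data vector), the sketch below stacks one dot per observation using base R's stripchart(). This is a swapped-in base R technique, not necessarily the function used in the original scripts, and it has no binwidth argument. The year-level codes are hypothetical.

```r
# Hypothetical year-level codes: 1 = freshman, 2 = sophomore,
# 3 = junior, 4 = senior (not the actual concert data).
status <- c(1, 3, 2, 4, 1, 2, 2, 3, 1, 4, 2, 1)

# method = "stack" piles repeated values on top of each other,
# so the height of each stack equals the frequency of that value.
stripchart(status, method = "stack", offset = 0.5, pch = 19,
           xlab = "Year level (1 = freshman, ..., 4 = senior)",
           main = "Students who attended the concert (hypothetical data)")
```

For the first method, the vector would instead come from a column of the data frame returned by read.csv("concert.csv"); the column name is not shown in this module.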
Notice the difference in dot sizes when you use different binwidths. You can further explore
RStudio functionality by varying the values of “arguments” in the syntax.
STEM-AND-LEAF PLOT. It is a graphical display for quantitative data that shows both the
rank order and shape of a data set. It is particularly useful when data are not too numerous.
Stem-and-leaf plots are a method for showing the frequency with which certain classes of values occur.
Step 1: Determine the smallest and largest number in the data.
Looking at the stats, we see the number of minutes played ranges from a low of 1 minute to
a high of 31 minutes.
Step 2: Identify the stems.
For any number, the digit or digits to the left of the right-most digit form the stem. For
example, the number 31 has a stem of 3, while the number 29 has a stem of 2. A one-digit
number like 4 has a stem of 0; think of 4 as “04”. Based on the range of 1 to 31, we need
stems of 0, 1, 2, and 3.
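For whole numbers, the stem rule above amounts to integer division by 10, with the remainder giving the right-most digit. A minimal sketch, using only the three numbers mentioned in the text (4, 29, and 31) rather than the full data set; the variable names are illustrative:

```r
# Numbers mentioned in the text; not the full minutes-played data set.
x <- c(4, 29, 31)

stem_part  <- x %/% 10  # digits to the left of the right-most digit
last_digit <- x %% 10   # the right-most digit

print(data.frame(value = x, stem = stem_part, last_digit = last_digit))

# Stems needed to cover the range 1 to 31.
needed_stems <- seq(1 %/% 10, 31 %/% 10)
print(needed_stems)
```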
Step 3: Draw a vertical line and list the stem numbers to the left of the line.
The place value of the leaf is called the leaf unit. In the example above, the leaf unit is 1.
Other leaf units may be 100, 10, 0.1, and so on. If the leaf unit is not 1, it should be displayed in
the stem-and-leaf plot.
R Script
For the same example, the stem-and-leaf plot can be generated in RStudio by using the stem()
function. The script is very short. Try this out in RStudio.
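A call to stem() might look like the sketch below. The minutes-played values here are hypothetical placeholders chosen only to span the stated range of 1 to 31; substitute the actual data from the example.

```r
# Hypothetical minutes-played values spanning 1 to 31 (placeholders only).
minutes <- c(1, 4, 7, 12, 15, 18, 21, 24, 29, 31)

# stem() prints the display to the console, together with a note on where
# the decimal point sits relative to the vertical bar.
stem(minutes)

# The scale argument (default 1) controls the length of the display;
# a larger value spreads the data over more stem lines.
stem(minutes, scale = 2)
```

Note that stem() chooses its own stems, so the printed display can be laid out differently from one drawn by hand.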
A histogram offers a visual representation of the distribution of data. It can display a large
amount of data and the frequency of the data values. A histogram can be used to approximate the
median and to see the shape of the distribution. In addition, it can show any outliers or gaps in
the data.
Example 12. Consider the following data set on the diameter (in mm.) for a sample of 70
machined hex bolts:
425 430 430 435 435 435 435 435 440 440 440 440 440 445 445
445 445 445 450 450 450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480 480 485 490 490 490
500 500 500 500 510 510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
A frequency table with 8 class intervals for this sample is shown below. In this case, the values are
grouped together in each class, and the individual values are no longer visible. Construct the
histogram corresponding to the frequency distribution table for the data on diameter (in mm) for
a sample of 70 machined hex bolts as shown below:
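A frequency table of this kind can be tabulated in R with cut() and table(). The sketch below uses the 70 values listed above; the class limits (8 classes of width 25 mm starting at 425 mm) are an assumption made for illustration and may differ from those in the module's table.

```r
# The 70 bolt diameters (in mm) from Example 12.
diameter <- c(
  425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445,
  445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
  465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490,
  500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
  575, 575, 580, 590, 600, 600, 600, 600, 615, 615
)

# Assumption: 8 classes of width 25 mm, starting at the minimum (425 mm).
breaks <- seq(425, 625, by = 25)

# right = FALSE makes each class include its lower limit:
# [425, 450), [450, 475), and so on.
classes <- cut(diameter, breaks = breaks, right = FALSE)
freq <- table(classes)

print(data.frame(class = names(freq), frequency = as.vector(freq)))
```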
R Script
To plot the histogram for the same example, again we use the “diameter.csv” file.
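A minimal sketch of such a script is given here. It is not the module's own code: a temporary file stands in for diameter.csv so that the example runs anywhere, the column name diameter is an assumption, and the class limits (width 25 mm starting at 425 mm) are chosen for illustration.

```r
# The 70 bolt diameters (in mm) from Example 12.
diameter <- c(
  425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445,
  445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
  465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490,
  500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
  575, 575, 580, 590, 600, 600, 600, 600, 615, 615
)

# A temporary file stands in for diameter.csv; the column name is an assumption.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(diameter = diameter), csv_path, row.names = FALSE)
bolts <- read.csv(csv_path)

# breaks sets the class limits; right = FALSE puts each lower limit
# inside its own class. hist() also returns the class counts it plotted.
h <- hist(bolts$diameter,
          breaks = seq(425, 625, by = 25), right = FALSE,
          main = "Diameter of machined hex bolts",
          xlab = "Diameter (mm)", ylab = "Frequency")
print(h$counts)
```

Changing the breaks argument changes the number and width of the bars, so it is worth trying a few values to see how the apparent shape of the distribution responds.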
Summarizing Qualitative and Quantitative Data for Two Variables
Tabular and graphical displays of data on two variables help us understand the
relationship, if any, between them.
A cross-tabulation, or contingency table, is a tabular summary of data for two variables. The
variables can both be qualitative, both quantitative, or a combination of one qualitative
and one quantitative variable. If either variable is quantitative, classes must be created for the values
of the quantitative variable. The labels shown in the margins of the table define the categories
(classes) for the two variables.
Example 13. As an example, consider the “salaries.csv” file. We construct a crosstabulation of
the rank and sex of the teachers using RStudio.
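In base R, a crosstabulation of two categorical variables can be produced with table(). The sketch below is illustrative only: the six records are hypothetical, and the column names rank and sex are assumptions about how salaries.csv is laid out.

```r
# Hypothetical records; salaries.csv itself is not reproduced here, and the
# column names "rank" and "sex" are assumptions.
teachers <- data.frame(
  rank = c("Professor", "Professor", "Professor",
           "Associate Professor", "Assistant Professor", "Assistant Professor"),
  sex  = c("Male", "Male", "Female", "Male", "Female", "Male")
)

# With two variables, table() counts each rank-by-sex combination.
crosstab <- table(teachers$rank, teachers$sex, dnn = c("Rank", "Sex"))
print(crosstab)

# addmargins() appends the row and column totals shown in the table margins.
print(addmargins(crosstab))
```

With the real file, the data frame would instead come from read.csv(), and the same table() call would apply.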
From the crosstabulation, we can see that the majority of the teachers have a rank of ‘Professor’.
There are more males than females in every rank, and male professors make up the largest
group. This could not have been easily observed by just looking at the raw data.
A scatter diagram, or scatter plot, is a graphical display of the relationship between two
quantitative variables. One variable (independent variable) is shown on the horizontal axis and the
other variable (dependent variable) is shown on the vertical axis. The general pattern of the plotted
points suggests the overall relationship between the variables. This relationship will be discussed
more in Correlation Analysis and Regression Analysis.
Example 14. Consider the hypothetical study on the age of trees where the simplest way of
determining the age of a tree is to use the relationship between a tree’s diameter at breast height
(in feet) and age. Available data on the age and diameter at breast height of 10 trees on record is
given below. We construct a scatter diagram using RStudio.
R Script
Here we present two scripts for generating the scatterplot for the same problem. The example data is
contained in the “advertising.csv” data file.
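A scatter diagram can be drawn in base R with plot(). The sketch below is a minimal illustration, assuming the data have already been read into two numeric vectors; the ten pairs of values are hypothetical placeholders, not the module's data. Diameter is placed on the horizontal axis because it is the variable used to determine age.

```r
# Hypothetical measurements for 10 trees (placeholders, not the module's data).
diameter <- c(0.8, 1.0, 1.3, 1.5, 1.9, 2.2, 2.6, 3.0, 3.3, 3.7)  # feet
age      <- c(5, 8, 10, 13, 17, 20, 26, 29, 34, 38)              # years

# Independent variable on the horizontal axis, dependent on the vertical axis.
plot(diameter, age,
     pch = 19,
     main = "Age versus diameter at breast height",
     xlab = "Diameter at breast height (ft)",
     ylab = "Age (years)")
```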