Business Analytics Unit 2
2.1 Data
2.2 Data and Data Quality
2.2.1 Data Quality
2.2.2 Selection of Statistical Method
2.3 Data Types
2.4 Measurement Scales
2.5 Data Sources
2.5.1 Internal and External Sources
2.5.2 Primary and Secondary Sources
Learning Outcomes
The main outcomes of this unit are that, once you have studied the unit, you
should be able to:
• Define the terms data and data quality.
• Understand the importance of data quality.
• Know the different types of data.
• Differentiate between the four measurement scales.
• Distinguish between internal and external sources.
• Distinguish between primary and secondary sources.
2.1 Data
Data are facts and figures collected, analyzed and summarized for presentation and
interpretation.
All data collected in a particular study are referred to as the data set of the study. Table 1
shows a data set summarizing information for the All Share Index of the Namibia Stock Exchange (NSX).
Elements are the entities on which data are collected (here, the listed stocks). A variable is a
characteristic of interest for the elements. The data set in Table 1 includes the following four
variables: Section-Stock (Code), Closing Price, Market cap (millions) and Weight (%).
The set of measurements obtained for a particular element is called an observation. In Table 1,
the measurements for the first observation (Anglo-American plc (ANM)) are 239.07; 321,053 and
36.57%. A data set with 10 elements contains 10 observations.
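The terms data set, variable and observation can be illustrated with a minimal Python sketch. It holds only the single observation quoted in the text; the class name and field names are illustrative assumptions and do not come from Table 1.

```python
from dataclasses import dataclass, fields


@dataclass
class Observation:
    """One element (a listed stock) and its four variables.

    Field names are hypothetical labels chosen for this sketch.
    """
    stock: str                  # Section-Stock (Code)
    closing_price: float        # Closing Price
    market_cap_millions: float  # Market cap (millions)
    weight_pct: float           # Weight (%)


# The data set is the collection of all observations. Only the first
# observation is quoted in the text, so only that one is entered here.
data_set = [
    Observation("Anglo-American plc (ANM)", 239.07, 321_053, 36.57),
]

variables = [f.name for f in fields(Observation)]
print(len(variables))  # number of variables
print(len(data_set))   # number of observations entered
```

Each additional stock in the index would be one more `Observation` in the list, so the number of observations always equals the number of elements.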
2.2 Data and Data Quality
An understanding of the nature of data is necessary for two reasons. It enables a user (i) to assess
data quality and (ii) to select the most appropriate statistical method to apply to the data. Both
factors affect the validity and reliability of statistical findings.
Data quality is influenced by four factors: the data type, data source, the method of data collection
and appropriate data preparation.
2.3 Data Types
The type of data available for analysis is determined by the nature of its random variable. A
random variable is either qualitative (categorical) or quantitative (numeric) in nature.
Qualitative random variables generate categorical (non-numerical) response data. The data is
represented by categories only.
The following are examples of qualitative random variables with categories as data:
Numbers are often assigned to represent the categories (e.g. 1 = male, 2 = female), but they are
only codes and have no numeric properties. Such categorical data can therefore only be counted
to determine how many responses belong to each category.
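As a rough illustration of why coded categories can only be counted, the following Python sketch tallies a set of coded responses. The response values are invented for illustration; only the coding scheme (1 = male, 2 = female) comes from the text.

```python
from collections import Counter

# Hypothetical coded responses (1 = male, 2 = female), invented for this sketch.
codes = [1, 2, 2, 1, 2, 2, 1, 2]
labels = {1: "male", 2: "female"}

# Counting the responses in each category is the meaningful operation.
counts = Counter(labels[c] for c in codes)
print(counts["male"], counts["female"])

# An average of the codes can be computed, but it has no interpretation:
# swapping the codes (1 = female, 2 = male) would change its value
# without changing the underlying responses.
meaningless_mean = sum(codes) / len(codes)
```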
Quantitative random variables generate numeric response data. These are real numbers that
can be manipulated using arithmetic operations (addition, subtraction, multiplication and
division).
The following are examples of quantitative random variables with real numbers as data:
Numeric data are either discrete or continuous.
Discrete data are whole numbers (integers).
For example, the number of students in a class (e.g. 24; 37; 41; 46), the number of cars sold by a
dealer in a month (e.g. 14; 27; 21; 16) and the number of machine breakdowns in a shift (e.g. 4;
0; 6; 2).
Continuous data are any numbers that can occur in an interval.
For example, the assembly time for a part can be between 27 minutes and 31 minutes (e.g.
assembly time = 28.4 min), a passenger’s hand luggage can have a mass between 0.5 kg and 10 kg
(e.g. 2.4 kg) and the volume of fuel in a car tank can be between 0 litres and 55 litres (e.g. 42.38
litres).
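The distinction between whole-number counts (discrete data) and measurements that can fall anywhere in an interval (continuous data) can be sketched in Python as follows. The function name is hypothetical and the check is only a heuristic: a continuous measurement may happen to be recorded as a whole number.

```python
def looks_discrete(values):
    """Return True if every value is a whole number (a rough heuristic only)."""
    return all(float(v).is_integer() for v in values)


students_per_class = [24, 37, 41, 46]     # counts quoted in the text
continuous_examples = [28.4, 2.4, 42.38]  # minutes, kg and litres quoted in the text

print(looks_discrete(students_per_class))
print(looks_discrete(continuous_examples))
```

In practice the classification should follow from how the variable is defined (a count versus a measurement), not from the recorded values alone.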
2.4 Measurement Scales
Nominal data
Nominal data are associated with categorical data. If all the categories of a qualitative random
variable are of equal importance, then this categorical data is termed ‘nominal-scaled’.
Nominal data are the weakest form of data to analyze since the codes assigned to the various
categories have no numerical properties. Nominal data can only be counted (or tabulated). This
limits the range of statistical methods that can be applied to nominal-scaled data to only a few
techniques.
Ordinal data
Ordinal data are also associated with categorical data, but have an implied ranking between the
different categories of the qualitative random variable. Each consecutive category possesses
either more or less than the previous category of a given characteristic.
Ordinal (rank) data are stronger than nominal data because they possess the numeric
property of order, although the distances between the ranks are not equal. They are therefore still
numerically weak, but they can be analyzed by more statistical methods (i.e. from the field of
non-parametric statistics) than nominal data.
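The property of order can be illustrated with a short Python sketch. The size categories and responses below are hypothetical, chosen only to show that sorting and finding a middle category are valid for ordinal data, while the gaps between ranks carry no meaning.

```python
# Hypothetical ordinal variable: size categories with an implied ranking.
rank = {"small": 1, "medium": 2, "large": 3}
responses = ["medium", "small", "large", "medium", "large", "medium", "small"]

# Order is meaningful, so the responses can be sorted by rank ...
ordered = sorted(responses, key=rank.get)

# ... and the middle (median) category of an odd-sized sample is well defined.
median_category = ordered[len(ordered) // 2]
print(median_category)

# The rank numbers are not distances: nothing says the gap between
# "small" and "medium" equals the gap between "medium" and "large".
```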
Interval data
Interval data are associated with numeric data and quantitative random variables. They are generated
mainly from rating scales, which are used in survey questionnaires to measure respondents’
attitudes, motivations, preferences and perceptions.
Examples of rating scale responses are shown in Table 2. Statements 1, 2 and 3 are illustrations
of semantic differential rating scales that use bipolar adjectives (e.g. very slow to extremely fast
service) while statement 4 illustrates a Likert rating scale that uses a scale that ranges from
strongly disagree to strongly agree with respect to a statement or an opinion.
Table 2 Examples of interval-scaled quantitative random variables
1. How would you rate your chances of promotion after the next performance?
Interval data possess the two properties of rank-order (as ordinal data do) and distance, in terms of
‘how much more or how much less’ an object possesses of a given characteristic.
However, interval data have no absolute zero point. It is therefore not meaningful to compare the
ratios of interval-scaled values. For example, it is not valid to conclude that a rating of 4 is twice
as important as a rating of 2, or that a rating of 1 is only one-third as important as a rating of 3.
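One way to see why ratios fail without a zero point is to relabel the scale. The Python sketch below assumes a 1-to-5 rating scale recoded as 0-to-4, which is an equally valid coding because the starting point of a rating scale is arbitrary; the variable names are illustrative.

```python
# Two ratings on a 1-to-5 scale, as in the example above.
a, b = 4, 2

# The same scale relabelled 0-to-4 (an assumed shift of -1).
shift = -1
a2, b2 = a + shift, b + shift

print(a - b, a2 - b2)  # differences are unchanged by the relabelling
print(a / b, a2 / b2)  # ratios change with the relabelling
```

Differences survive the change of origin, which is why distance is a property of interval data, whereas "twice as important" depends entirely on where the scale happens to start.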
Interval data (rating scales) possess sufficient numeric properties to be treated as numeric data
for the purpose of statistical analysis. A much wider range of statistical techniques can therefore
be applied to interval data compared with nominal and ordinal data.
Ratio data
Ratio data consist of all real numbers associated with quantitative random variables.
Examples of ratio-scaled data are: employee ages (years), customer income (R), distance travelled
(km), door height (cm), product mass (g), volume of liquid in a container (ml), machine speed
(rpm), tyre pressure (psi), product prices (N$), length of service (months) and number of
shopping trips per month (0; 1; 2; 3 etc.).
Ratio data have all the properties of numbers (order, distance and an absolute origin of zero) that
allow such data to be manipulated using all arithmetic operations (addition, subtraction,
multiplication and division). The zero origin property means that ratios can be computed (5 is
half of 10, 4 is one-quarter of 16, 36 is twice as great as 18, for example). Ratio data are the
strongest data for statistical analysis. Compared to the other data types (nominal, ordinal and
interval), the most statistical information can be extracted from them. Also, more statistical
methods can be applied to ratio data than to any other data type.
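The properties of the four measurement scales described above can be summarized in a small Python lookup table. This is a sketch only: the dictionary layout and the function names are assumptions made for illustration.

```python
# Numeric properties of each measurement scale, as described in the text.
SCALES = {
    "nominal":  {"order": False, "distance": False, "true_zero": False},
    "ordinal":  {"order": True,  "distance": False, "true_zero": False},
    "interval": {"order": True,  "distance": True,  "true_zero": False},
    "ratio":    {"order": True,  "distance": True,  "true_zero": True},
}


def ratios_meaningful(scale):
    """Ratios such as 'twice as great' require an absolute zero."""
    return SCALES[scale]["true_zero"]


def strength(scale):
    """Number of numeric properties; more properties means stronger data."""
    return sum(SCALES[scale].values())


weakest_to_strongest = sorted(SCALES, key=strength)
print(weakest_to_strongest)
print(36 / 18)  # ratio data: 36 is twice as great as 18
```

Sorting by the number of properties reproduces the ordering used in the text, from nominal (weakest) to ratio (strongest).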
[Figure: classification of random variables. Labels: Qualitative / Quantitative; Categorical / Numeric; Discrete / Continuous; Limited / Extensive.]
2.5 Data Sources
Data for statistical analysis is available from many different sources. A manager must decide how
reliable and accurate a set of data from a given source is before basing decisions on findings
derived from it. Unreliable data results in invalid findings.
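Data sources are typically classified as (i) internal or external; and (ii) primary or secondary.
2.5.1 Internal and External Sources
Internal data sources exist within an organization. Examples include: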
• sales vouchers, credit notes, accounts receivable, accounts payable and asset registers for
financial data
• production cost records, stock sheets and downtime records for production data
• timesheets, wages and salaries schedules and absenteeism records for human resource
data
• product sales records and advertising expenditure budgets for marketing data.
External data sources exist outside an organization. They are mainly business associations,
government agencies, universities and various research institutions. The cost and reliability of
external data is dependent on the source. A wide selection of external databases exists and, in
many cases, can be accessed via the internet, either free of charge or for a fee. A few examples
relevant to managers are: the Namibia Statistics Agency (NSA) (www.nsa.org.na) for macro-
economic data, Bank of Namibia (BON) (www.bon.com.na) for basic information on Treasury
Bills and Bonds, auction tender invitations and announcements of results, as well as Government
Debt Statistics, and the Namibia Stock Exchange (NSX) (www.nsx.com.na) for company-level
financial and performance data.
2.5.2 Primary and Secondary Sources
Primary data is data recorded for the first time at source and with a specific purpose in mind.
Primary data can be either internal (if it is recorded directly from an internal business process,
such as machine speed settings, sales invoices, stock sheets and employee attendance records) or
external (e.g. obtained through surveys such as human resource surveys, economic surveys and
consumer surveys [market research]).
The main advantage of primary-sourced data is its high quality (i.e. relevancy and accuracy). This
is due to generally greater control over its collection and the focus on only data that is directly
relevant to the management problem.
The main disadvantage of primary-sourced data is that it can be time-consuming and expensive to
collect, particularly if sourced using surveys. Internal company databases, however, are relatively
quick and cheap to access for primary data.
Secondary data is data that already exists in a processed format. It was previously collected and
processed by others for a purpose other than the problem at hand. It can be internally sourced
(e.g. a monthly stock report or a quarterly absenteeism report) or externally sourced (e.g.
economic time series on trade, exports, employment statistics from Stats SA or advertising
expenditure trends in South Africa or by sector from SAARF).
Secondary data has two main advantages. First, its access time is relatively short (especially if the
data is accessible through the internet), and second, it is generally less expensive to acquire than
primary data.
Its main disadvantages are that the data may not be problem specific (i.e. problem of its
relevancy), it may be out of date (i.e. not current), it may be difficult to assess data accuracy, it
may not be possible to manipulate data further (i.e. it may not be at the right level of aggregation),
and combining various secondary sources of data could lead to data distortion and introduce bias.
Despite such shortcomings, an analyst should always consider relevant secondary database
sources before resorting to primary data collection.
Tutorial Questions
Brand and model     Product rating   Price (£)   MP3      Mini disk   Cassette   CD       Output
                    (# of stars)                 player   player      player     player   (watts)
Technics SCEH790    1                320 – 400   Y        N           Y          Y        360
Yamaha M170         3                162 – 290   N        N           N          Y        50
Panasonic SCPM29    5                188         Y        N           Y          Y        70
Pure Digital DMX50  3                180 – 230   N        N           N          Y        80
Sony CMTNEZ3        5                60 – 100    Y        N           Y          Y        30
Philips FWM589      4                143 – 200   Y        N           N          Y        400
Philips MCM9        5                93 – 110    Y        N           Y          Y        100
Samsung MM-C6       5                100 – 130   Y        N           N          Y        40
Source: Kelkoo (http://audiovisual.kelkoo.co.uk)
1. Consider the data set for the sample of eight audio systems in the above table.
a) How many variables are in the data set?
b) Which of the variables are quantitative and which are categorical?
c) What percentage of the audio systems have a four-star rating or higher?
d) What percentage of the audio systems include an MP3 player?
2. For each of the following random variables, state the data type (categorical or numeric),
the measurement scale (nominal, ordinal, interval or ratio scaled) and
whether it is discrete or continuous.
a) Ages of athletes in a marathon
b) Floor area of Edgars stores
c) Marital status of employees
d) Types of child abuse (physical, sexual, emotional or verbal)
e) Different sectors of investments in unit trusts.
f) Responses to the following question:
How would you rate the service level of your bank?
Use the following semantic differential rating scale:
Extremely poor [1] Very poor [2] Poor [3] Unsure [4] Good [5] Very good [6] Excellent [7]
3. Explain the value of business statistics in management.
4. What is the difference between descriptive statistics and inferential statistics?
5. Explain the role of statistical modelling in business practice.
6. Name three factors that influence data quality.
7. Why is it important to know whether the data type is categorical or numerical in terms of the
choice of statistical analysis?
~THE END~