Business Analytics Unit 2

Unit 2 covers the concepts of data and data quality, emphasizing the importance of understanding data types, measurement scales, and sources for effective statistical analysis. It outlines various data types, including qualitative and quantitative, and their respective measurement scales: nominal, ordinal, interval, and ratio. Additionally, the unit distinguishes between internal and external data sources, as well as primary and secondary data, highlighting their significance in ensuring reliable statistical findings.

Uploaded by

tt840912

Unit 2: Data and Data Quality

2.1 Data
2.2 Data and Data Quality
2.2.1 Data Quality
2.2.2 Selection of Statistical Method
2.3 Data Types
2.4 Measurement Scales
2.5 Data Sources
2.5.1 Internal and External Sources
2.5.2 Primary and Secondary Sources

Learning Outcomes
Once you have studied this unit, you should be able to:
• Define the terms data and data quality.
• Understand the importance of data quality.
• Know the different types of data.
• Differentiate between the four scales of measurement.
• Distinguish between internal and external sources.
• Distinguish between primary and secondary sources.

2.1 Data
Data are facts and figures collected, analyzed and summarized for presentation and
interpretation.

All data collected in a particular study are referred to as the data set of the study. Table 1 below
shows a data set summarizing information for the All Share Index of the Namibia Stock Exchange (NSX).

Table 1: Namibian All Share Index (NSX) as at 30 September 2009.


Categories: Section-Stock (Code)         Close Price   Market cap (Million’s)   Weights (%)
Basic Materials
  Sector – Industrial Metals
    Anglo-American plc (ANM)                  239.07                  321,053        36.57%
  Sector – Mining
    Paladin Energy Limited (PDN)               28.56                   20,482         2.33%
    Trans Hex Group Limited Nm (THX)            3.40                      361         0.04%
  Sector – Chemical
    Afrox (AOX)                                20.50                    7,028         0.80%
Industrials
  Sector – General Industrial
    Barloworld Limited (BWL)                   49.00                   11,144         1.27%
Consumer Goods
  Sector – Beverages
    **Namibian Breweries (NBS)                  6.06                    1,252         0.14%
  Sector – Food Producers
    Oceana Group Limited (OCG)                 26.25                    3,113         0.35%
Consumer Services
  Sector – General Retailers
    **Nictus (NCT)                              0.86                       46         0.01%
    Truworths (TRW)                            42.50                   19,336         2.20%
  Sector – Food & Drug Retailers
    Shoprite Holdings (SRH)                    62.00                   33,696         3.84%

Source: Namibia Stock Exchange

Elements, variables and observations


Elements are the entities on which data are collected. For the data set in the table above each
individual stock is an element, and the element names appear in the first column. With 10 stocks,
the data set contains 10 elements.

A variable is a characteristic of interest for the elements. The data set in the above table includes
the following four variables:

Categories: Section-Stock (Code); Close Price; Market cap (Million’s); and Weights (%)

The set of measurements obtained for a particular element is called an observation. Referring to
the above table, we see that the set of measurements for the first observation (Anglo-American
plc (ANM)) is 239.07; 321,053 and 36.57%. A data set with 10 elements contains 10 observations.
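
To make the three terms concrete, the sketch below stores the first three rows of Table 1 as records. It is only an illustration: the key names (stock, close_price, market_cap_m, weight_pct) are labels chosen here, not names used by the NSX, and only an excerpt of the table is included.

```python
# Excerpt of Table 1: one record (dict) per element, i.e. per stock.
# Key names are illustrative assumptions, not official field names.
nsx_excerpt = [
    {"stock": "Anglo-American plc (ANM)", "close_price": 239.07,
     "market_cap_m": 321053, "weight_pct": 36.57},
    {"stock": "Paladin Energy Limited (PDN)", "close_price": 28.56,
     "market_cap_m": 20482, "weight_pct": 2.33},
    {"stock": "Trans Hex Group Limited Nm (THX)", "close_price": 3.40,
     "market_cap_m": 361, "weight_pct": 0.04},
]

# Elements: the entities on which data are collected (one per record).
n_elements = len(nsx_excerpt)

# Variables: the characteristics recorded for every element (the keys).
variables = list(nsx_excerpt[0].keys())

# Observation: the set of measurements for one particular element.
first_observation = nsx_excerpt[0]

print(n_elements, "elements,", len(variables), "variables")
print("First observation:", first_observation)
```

Because every element contributes exactly one record, the number of observations always equals the number of elements.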

2.2 Data and Data Quality
An understanding of the nature of data is necessary for two reasons. It enables a user (i) to assess
data quality and (ii) to select the most appropriate statistical method to apply to the data. Both
factors affect the validity and reliability of statistical findings.

2.2.1 Data Quality


Data are the raw material of statistical analysis. If the quality of data is poor, the quality of
information derived from statistical analysis of this data will also be poor. Consequently, user
confidence in the statistical findings will be low. A useful acronym to keep in mind is GIGO, which
stands for ‘garbage in, garbage out’. It is therefore necessary to understand what influences the
quality of data needed to produce meaningful and reliable statistical results.

Data quality is influenced by four factors: the data type, data source, the method of data collection
and appropriate data preparation.

2.2.2 Selection of Statistical Method


The choice of the most appropriate statistical method to use depends firstly on the management
problem to be addressed and secondly on the type of data available. Certain statistical methods
are valid for certain data types only. The incorrect choice of statistical method for a given data
type can again produce invalid statistical findings.

2.3 Data Types
The type of data available for analysis is determined by the nature of its random variable. A
random variable is either qualitative (categorical) or quantitative (numeric) in nature.

Qualitative random variables generate categorical (non-numerical) response data. The data is
represented by categories only.

The following are examples of qualitative random variables with categories as data:

• The gender of a consumer is either male or female.
• An employee’s highest qualification is either a matric, a diploma or a degree.
• A company operates in either the financial, retail, mining or industrial sector.
• A consumer’s choice of mobile phone service provider is either Vodacom, MTN, Virgin
Mobile, Cell C or 8ta.

Numbers are often assigned to represent the categories (e.g. 1 = male, 2 = female), but they are
only codes and have no numeric properties. Such categorical data can therefore only be counted
to determine how many responses belong to each category.
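
As a small illustration of this point, the sketch below counts coded gender responses. The eight responses are made up for the example; only the coding (1 = male, 2 = female) comes from the text above.

```python
from collections import Counter

# Coding taken from the text: the numbers are labels, not quantities.
labels = {1: "male", 2: "female"}

# Hypothetical coded responses from eight consumers.
responses = [1, 2, 2, 1, 2, 2, 1, 2]

# The only meaningful operation: count the responses in each category.
counts = Counter(labels[code] for code in responses)
print(counts["male"], "male and", counts["female"], "female responses")

# Arithmetic on the codes runs without error but has no interpretation:
# the result would change if the codes were swapped (1 = female, 2 = male).
meaningless_average = sum(responses) / len(responses)
print("Average code (not interpretable):", meaningless_average)
```

The average of the codes is printed only to show that the software will happily compute it; the analyst has to know that it means nothing.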

Quantitative random variables generate numeric response data. These are real numbers that
can be manipulated using arithmetic operations (addition, subtraction, multiplication and
division).

The following are examples of quantitative random variables with real numbers as data:

• the age of an employee (e.g. 46 years; 28 years; 32 years)
• machine downtime (e.g. 8 min; 32.4 min; 12.9 min)
• the price of a product in different stores (e.g. N$6.75; N$7.20; N$6.99)
• delivery distances travelled by a courier vehicle (e.g. 14.2 km; 20.1 km; 17.8 km).

Numeric data can be further classified as either discrete or continuous.

Discrete data are whole number (or integer) data.

For example, the number of students in a class (e.g. 24; 37; 41; 46), the number of cars sold by a
dealer in a month (e.g. 14; 27; 21; 16) and the number of machine breakdowns in a shift (e.g. 4;
0; 6; 2).

Continuous data are any numbers that can occur in an interval.

For example, the assembly time for a part can be between 27 minutes and 31 minutes (e.g.
assembly time = 28.4 min), a passenger’s hand luggage can have a mass between 0.5 kg and 10 kg
(e.g. 2.4 kg) and the volume of fuel in a car tank can be between 0 litres and 55 litres (e.g. 42.38
litres).

2.4 Measurement Scales


Data can also be classified in terms of its scale of measurement. This indicates the ‘strength’ of the
data in terms of how much arithmetic manipulation on the data is possible. There are four types
of measurement scales: nominal, ordinal, interval and ratio. The scale also determines which
statistical methods are appropriate to use on the data to produce valid statistical results.

Nominal data

Nominal data are associated with categorical data. If all the categories of a qualitative random
variable are of equal importance, then this categorical data is termed ‘nominal-scaled’.

Examples of nominal-scaled categorical data are:

• gender (1 = male; 2 = female)
• city of residence (1 = Okahandja; 2 = Gobabis; 3 = Otjiwarongo; 4 = Windhoek)
• home language (1 = Otjiherero; 2 = Oshiwambo; 3 = English; 4 = Afrikaans; 5 =
Damara/Nama)
• mode of commuter transport (1 = car; 2 = train; 3 = bus; 4 = taxi)
• engineering profession (1 = chemical; 2 = electrical; 3 = civil; 4 = mechanical)
• survey question: ‘Do you have an Instagram account?’ (1 = yes; 2 = no).

Nominal data are the weakest form of data to analyze since the codes assigned to the various
categories have no numerical properties. Nominal data can only be counted (or tabulated). This
limits the range of statistical methods that can be applied to nominal-scaled data to only a few
techniques.

Ordinal data

Ordinal data are also associated with categorical data, but have an implied ranking between the
different categories of the qualitative random variable. Each consecutive category possesses
either more or less than the previous category of a given characteristic.

Examples of ordinal-scaled categorical data are:

• size of clothing (1 = small; 2 = medium; 3 = large; 4 = extra large)
• product usage level (1 = light; 2 = moderate; 3 = heavy)
• income category (1 = lower; 2 = middle; 3 = upper)
• company size (1 = micro; 2 = small; 3 = medium; 4 = large)
• response to a survey question: ‘Rank your top three TV programmes in order of
preference’ (1 = first choice; 2 = second choice; 3 = third choice).

Rank (ordinal) data are stronger than nominal data because the data possess the numeric
property of order (but the distances between the ranks are not equal). They are therefore still
numerically weak data, but they can be analyzed by more statistical methods (i.e. those from the
field of non-parametric statistics) than nominal data.
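
The property of order can be illustrated with a short sketch. The clothing-size coding comes from the list above; the seven sales and the alternative coding are hypothetical. Because only the order of the codes carries information, the middle category stays the same under any recoding that preserves the order, whereas an average of the codes does not.

```python
from statistics import mean, median_low

# Coding from the text (ordinal: small < medium < large < extra large).
sizes = {1: "small", 2: "medium", 3: "large", 4: "extra large"}

# Hypothetical sizes of seven garments sold.
sold = [2, 1, 3, 2, 4, 2, 3]

# A rank-based statistic: the middle category when the data are ordered.
middle_code = median_low(sold)
print("Median size:", sizes[middle_code])

# An alternative, equally valid coding that keeps the same order
# but uses unequal gaps between consecutive categories.
recode = {1: 10, 2: 20, 3: 50, 4: 90}
sizes_recoded = {recode[code]: name for code, name in sizes.items()}
sold_recoded = [recode[code] for code in sold]

middle_recoded = median_low(sold_recoded)
print("Median size after recoding:", sizes_recoded[middle_recoded])

# The averages of the two codings point to different places on the scale,
# which is why averaging ordinal codes is not meaningful.
print("Mean of original codes:", mean(sold))
print("Mean of recoded values:", mean(sold_recoded))
```

The function `median_low` is used so that the result is always one of the observed codes, which keeps the lookup of the category name valid.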

Interval data

Interval data are associated with numeric data and quantitative random variables. It is generated
mainly from rating scales, which are used in survey questionnaires to measure respondents’
attitudes, motivations, preferences and perceptions.

Examples of rating scale responses are shown in Table 2. Statements 1, 2 and 3 are illustrations
of semantic differential rating scales that use bipolar adjectives (e.g. very slow to extremely fast
service) while statement 4 illustrates a Likert rating scale that uses a scale that ranges from
strongly disagree to strongly agree with respect to a statement or an opinion.

Table 2: Examples of interval-scaled quantitative random variables

1. How would you rate your chances of promotion after the next performance appraisal?

   Very poor    Poor    Unsure    Good    Very good
       1          2        3        4         5

2. How satisfied are you with your current job description?

   Very dissatisfied    Dissatisfied    Satisfied    Very satisfied
           1                  2             3              4

3. What is your opinion of the latest Idols TV series?

   Very boring    Dull    OK    Exciting    Fantastic
        1           2      3       4            5

4. The performance appraisal system is biased in favour of technically oriented employees.

   Strongly disagree    Disagree    Unsure    Agree    Strongly agree
           1                2          3        4            5

Interval data possess the two properties of rank-order (same as ordinal) and distance in terms of
‘how much more or how much less’ an object possesses of a given characteristic.

However, interval data have no absolute zero point. Therefore, it is not meaningful to compare the
ratio of interval-scaled values with one another. For example, it is not valid to conclude that a
rating of 4 is twice as important as a rating of 2, or that a rating of 1 is only one-third as
important as a rating of 3.

Interval data (rating scales) possess sufficient numeric properties to be treated as numeric data
for the purpose of statistical analysis. A much wider range of statistical techniques can therefore
be applied to interval data compared with nominal and ordinal data.

Ratio data

Ratio data consist of all real numbers associated with quantitative random variables.

Examples of ratio-scaled data are: employee ages (years), customer income (R), distance travelled
(km), door height (cm), product mass (g), volume of liquid in a container (ml), machine speed
(rpm), tyre pressure (psi), product prices (N$), length of service (months) and number of
shopping trips per month (0; 1; 2; 3 etc.).

Ratio data have all the properties of numbers (order, distance and an absolute origin of zero) that
allow such data to be manipulated using all arithmetic operations (addition, subtraction,
multiplication and division). The zero origin property means that ratios can be computed (5 is
half of 10, 4 is one-quarter of 16, 36 is twice as great as 18, for example). Ratio data are the
strongest data for statistical analysis. Compared to the other data types (nominal, ordinal and
interval), the greatest amount of statistical information can be extracted from ratio data. Also,
more statistical methods can be applied to ratio data than to any other data type.
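
The zero origin property can be checked with a small sketch. The two distances are hypothetical; the point is that the ratio between two ratio-scaled values does not depend on the unit of measurement, because zero means 'none' in every unit.

```python
# Two hypothetical delivery distances in kilometres.
long_trip_km, short_trip_km = 36, 18

# The same distances expressed in metres (a change of unit only).
METRES_PER_KM = 1000
long_trip_m = long_trip_km * METRES_PER_KM
short_trip_m = short_trip_km * METRES_PER_KM

# The ratio is the same in both units: 36 km is twice 18 km,
# and 36 000 m is twice 18 000 m.
ratio_km = long_trip_km / short_trip_km
ratio_m = long_trip_m / short_trip_m
print("Ratio in km:", ratio_km)
print("Ratio in m: ", ratio_m)
```

Contrast this with rating scales, where relabelling the categories changes the ratio between two ratings.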

Figure 1 diagrammatically summarizes the classification of data.

Random variable
    Qualitative  -> categorical data -> nominal or ordinal scale
    Quantitative -> numeric data (discrete or continuous) -> interval or ratio scale

Choice of suitable statistical methods: limited (nominal) to extensive (ratio)

Figure 1: Classification of data types and influence on statistical analyses
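
The link between measurement scale and choice of method can be sketched in code. The rule used below (counts and mode for every scale, the median from ordinal upwards, the mean from interval upwards) is a common textbook convention and an assumption of this sketch, not a rule stated in this unit. The function name `summarize` and the sample values are hypothetical.

```python
from statistics import mean, median, mode

# Scales ordered from weakest to strongest, as in Figure 1.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

def summarize(values, scale):
    """Return only the summary measures conventionally allowed for a scale."""
    level = SCALES.index(scale)  # raises ValueError for an unknown scale
    summary = {"mode": mode(values)}          # counting works for all scales
    if level >= SCALES.index("ordinal"):
        summary["median"] = median(values)    # needs the property of order
    if level >= SCALES.index("interval"):
        summary["mean"] = mean(values)        # needs equal distances
    return summary

# Hypothetical coded data: the same numbers, interpreted on two scales.
codes = [1, 2, 2, 3, 5]
print(summarize(codes, "nominal"))
print(summarize(codes, "ratio"))
```

Passing the scale explicitly forces the analyst to decide what the numbers represent before any statistic is computed, which is the decision Figure 1 describes.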

2.5 Data Sources
Data for statistical analysis is available from many different sources. A manager must decide how
reliable and accurate a set of data from a given source is before basing decisions on findings
derived from it. Unreliable data results in invalid findings.

Data sources are typically classified as (i) internal or external; and (ii) primary or secondary.

2.5.1 Internal and External Sources


In a business context, internal data is sourced from within a company. It is data that is generated
during the normal course of business activities. As such, it is relatively inexpensive to gather,
readily available from company databases and potentially of good quality (since it is recorded
using internal business systems). Examples of internal data sources are:

• sales vouchers, credit notes, accounts receivable, accounts payable and asset registers for
financial data
• production cost records, stock sheets and downtime records for production data
• timesheets, wages and salaries schedules and absenteeism records for human resource
data
• product sales records and advertising expenditure budgets for marketing data.

External data sources exist outside an organization. They are mainly business associations,
government agencies, universities and various research institutions. The cost and reliability of
external data depend on the source. A wide selection of external databases exists and, in
many cases, can be accessed via the internet, either free of charge or for a fee. A few examples
relevant to managers are:

• the Namibia Statistics Agency (NSA) (www.nsa.org.na) for macroeconomic data
• the Bank of Namibia (BON) (www.bon.com.na) for basic information on Treasury Bills and
Bonds, auction tender invitations and announcements of results, as well as Government Debt
Statistics
• the Namibia Stock Exchange (NSX) (www.nsx.com.na) for company-level financial and
performance data.

2.5.2 Primary and Secondary Sources
Primary data is data recorded for the first time at source and with a specific purpose in mind.
Primary data can be either internal (if it is recorded directly from an internal business process,
such as machine speed settings, sales invoices, stock sheets and employee attendance records) or
external (e.g. obtained through surveys such as human resource surveys, economic surveys and
consumer surveys [market research]).

The main advantage of primary-sourced data is its high quality (i.e. relevancy and accuracy). This
is due to generally greater control over its collection and the focus on only data that is directly
relevant to the management problem.

The main disadvantage of primary-sourced data is that it can be time-consuming and expensive to
collect, particularly if sourced using surveys. Internal company databases, however, are relatively
quick and cheap to access for primary data.

Secondary data is data that already exists in a processed format. It was previously collected and
processed by others for a purpose other than the problem at hand. It can be internally sourced
(e.g. a monthly stock report or a quarterly absenteeism report) or externally sourced (e.g.
economic time series on trade, exports, employment statistics from Stats SA or advertising
expenditure trends in South Africa or by sector from SAARF).

Secondary data has two main advantages. First, its access time is relatively short (especially if the
data is accessible through the internet), and second, it is generally less expensive to acquire than
primary data.

Its main disadvantages are that the data may not be problem specific (i.e. it may lack relevance),
it may be out of date (i.e. not current), it may be difficult to assess its accuracy, it
may not be possible to manipulate the data further (i.e. it may not be at the right level of
aggregation), and combining various secondary sources of data could lead to data distortion and
introduce bias.

Despite such shortcomings, an analyst should always consider relevant secondary database
sources before resorting to primary data collection.

Tutorial Questions

1. A sample of eight audio systems

Brand and model    Product rating   Price (£)    MP3      Mini disk   Cassette   CD       Output
                   (# of stars)                  player   player      player     player   (watts)
Technics SCEH790   1                320 – 400    Y        N           Y          Y        360
Yamaha M170        3                162 – 290    N        N           N          Y        50
Panasonic SCPM29   5                188          Y        N           Y          Y        70
Pure Digital DMX50 3                180 – 230    N        N           N          Y        80
Sony CMTNEZ3       5                60 – 100     Y        N           Y          Y        30
Philips FWM589     4                143 – 200    Y        N           N          Y        400
Philips MCM9       5                93 – 110     Y        N           Y          Y        100
Samsung MM-C6      5                100 – 130    Y        N           N          Y        40

Source: Kelkoo (http://audiovisual.kelkoo.co.uk)

Consider the data set for the sample of eight audio systems in the above table.
a) How many variables are in the data set?
b) Which of the variables are quantitative and which are categorical?
c) What percentage of the audio systems have a four-star rating or higher?
d) What percentage of the audio systems include an MP3 player?
2. For each of the following random variables, state the data type (categorical or numeric), the
measurement scale (nominal, ordinal, interval or ratio scaled) and whether it is discrete or
continuous.
a) Ages of athletes in a marathon
b) Floor area of Edgars stores
c) Marital status of employees
d) Types of child abuse (physical, sexual, emotional or verbal)
e) Different sectors of investments in unit trusts
f) Responses to the following question:
How would you rate the service level of your bank?
Use the following semantic differential rating scale:
Extremely poor [1] Very poor [2] Poor [3] Unsure [4] Good [5] Very good [6] Excellent [7]
3. Explain the value of business statistics in management.
4. What is the difference between descriptive statistics and inferential statistics?
5. Explain the role of statistical modelling in business practice.
6. Name three factors that influence data quality.
7. Why is it important to know whether the data type is categorical or numerical in terms of the
choice of statistical analysis?
~THE END~
