DBB1202
Ordinal data – it involves categories that have a meaningful order or ranking, but
the intervals between categories are not necessarily consistent. Examples include
education level (high school, college, graduate) or customer satisfaction ratings
(poor, average, excellent).
2. QUANTITATIVE DATA
It consists of numerical values and can be measured and quantified. It is typically used
for mathematical analysis and statistical modeling. This type can be split into:
Discrete data – this data is countable and consists of distinct, separate values.
For example, the number of students in a class, the numbers of cars in a parking
lot, or the number of defects in a product.
Continuous data – it can take any value within a given range and is measurable.
For instance, height, weight, temperature, or time. These values are often
measured with precision, and there can be infinite possibilities within a range.
3. STRUCTURED DATA
It is highly organized and easily searchable because it is stored in a predefined format,
such as databases or spreadsheets. Each data element is stored in a fixed field within a
record. Examples include customer information (name, address, phone number) or
financial data (sales figures, profit margins).
4. UNSTRUCTURED DATA
It is not organized in a predefined manner, making it harder to search or analyze. It
typically includes text-heavy formats or multimedia. Examples include emails, social
media posts, videos, images, and audio files.
5. SEMI-STRUCTURED DATA
It lies between structured and unstructured data. It does not fit neatly into tables, but it
still contains some level of organization through tags or markers. Examples include XML
files, JSON files, or NoSQL databases.
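As a small illustration, a JSON record carries its organization in self-describing tags (the field names) without requiring a fixed table schema; the snippet below, a hypothetical customer record, can be parsed with Python's standard json module:

```python
import json

# A hypothetical semi-structured record: field names act as tags,
# but different records need not share an identical, fixed schema.
record = '''
{
  "name": "A. Customer",
  "contact": {"email": "a@example.com"},
  "tags": ["retail", "newsletter"]
}
'''

data = json.loads(record)
print(data["name"])   # access a value by its tag
print(data["tags"])   # nested and list values are allowed
```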
6. TIME SERIES DATA
This type of data is collected at regular intervals over time and is useful for tracking
trends or changes over a period. Examples include stock prices, weather data, or traffic
data.
7. SPATIAL DATA
It represents information about physical locations and their attributes. It includes
geographic data, such as coordinates, maps and satellite images.
ANSWER 3 (a): To calculate the mean of a frequency distribution, we use the formula:

Mean = Σ(f × X) / Σf

Where:
f is the frequency,
X is the value of the mark,
Σ(f × X) is the sum of the products of frequency and marks,
Σf is the total frequency.
Sum of the products of frequency and marks:

Marks (X)   Frequency (f)   Product (f × X)
10          8               80
20          12              240
30          20              600
40          10              400
50          7               350
60          3               180
Total       60              1850

Mean = 1850 / 60 = 30.83

So, the mean of the given frequency distribution is 30.83.
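The calculation above can be checked with a short Python sketch (values taken from the table; the function name is illustrative):

```python
def frequency_mean(marks, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(marks, freqs))
    total_f = sum(freqs)
    return total_fx / total_f

marks = [10, 20, 30, 40, 50, 60]
freqs = [8, 12, 20, 10, 7, 3]
print(round(frequency_mean(marks, freqs), 2))  # 30.83
```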
b) Calculating the cumulative frequency from the given data:

Size (X)   Frequency (f)   Cumulative frequency (CF)
4          10              10
4.5        18              28
5          22              50
5.5        25              75
6          40              115
6.5        15              130
7          10              140
7.5        8               148
8          7               155
The total frequency is N = 155, so Q1 corresponds to the position N/4 = 155/4 = 38.75 in the cumulative frequency distribution.
To find it, we look for the size where the cumulative frequency first reaches or exceeds 38.75.
From the cumulative frequency table
The cumulative frequency for size 4 is 10.
The cumulative frequency for size 4.5 is 28.
The cumulative frequency for size 5 is 50.
Since 38.75 lies between 28 and 50, the quartile class for Q1 is the class at size 5 (covering 5 to 5.5).
Using linear interpolation to find the exact value of Q1:

Q1 = L + ((N/4 − CF) / f) × h

Where:
L is the lower boundary of the quartile class (5),
CF is the cumulative frequency before the quartile class (28),
f is the frequency of the quartile class (22),
h is the class width (0.5).

Q1 = 5 + ((38.75 − 28) / 22) × 0.5
Q1 = 5 + (10.75 / 22) × 0.5
Q1 = 5 + 0.4886 × 0.5
Q1 = 5 + 0.2443 = 5.2443

Q1 ≈ 5.2443
Similarly, Q3 corresponds to the position 3N/4 = (3 × 155)/4 = 116.25 in the cumulative frequency distribution.
To find it, we look for the size where the cumulative frequency first reaches or exceeds 116.25.

From the cumulative frequency table, the cumulative frequency for size 6 is 115 and for size 6.5 it is 130.
Since 116.25 lies between 115 and 130, the quartile class for Q3 is the class at size 6.5, with L = 6.5, CF = 115, f = 15, and h = 0.5.

Q3 = 6.5 + ((116.25 − 115) / 15) × 0.5
Q3 = 6.5 + (1.25 / 15) × 0.5
Q3 = 6.5 + 0.0833 × 0.5
Q3 = 6.5 + 0.0417 = 6.5417

Q3 ≈ 6.5417

So, the first quartile (Q1) is approximately 5.2443 and the third quartile (Q3) is approximately 6.5417.
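The interpolation steps can be sketched in Python (a minimal helper, assuming each listed size is the lower boundary of a class of width 0.5):

```python
def quartile(sizes, freqs, q, width=0.5):
    """Grouped-data quartile by linear interpolation.

    Finds the class whose cumulative frequency first reaches the
    position q * N, then interpolates within that class.
    """
    n = sum(freqs)
    position = q * n
    cum = 0
    for size, f in zip(sizes, freqs):
        if cum + f >= position:
            # size = lower boundary L, cum = CF before the class
            return size + (position - cum) / f * width
        cum += f
    raise ValueError("position beyond the data")

sizes = [4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8]
freqs = [10, 18, 22, 25, 40, 15, 10, 8, 7]
print(round(quartile(sizes, freqs, 0.25), 4))  # Q1
print(round(quartile(sizes, freqs, 0.75), 4))  # Q3
```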
SET-2
ANSWER 1: The coefficient of correlation is a statistical measure that quantifies the strength
and direction of the relationship between two variables. It ranges from −1 to +1, where +1
indicates a perfect positive correlation, −1 indicates a perfect negative correlation, and 0 indicates
no relationship. The most common type is Pearson’s correlation coefficient, which measures
linear relationships in continuous data. Other methods, like Spearman’s rank and Kendall’s tau, are
used for ordinal or non-linear relationships. The coefficient helps in understanding how changes
in one variable might predict changes in another.
There are several methods to calculate the correlation coefficient. The most common ones are –
1. Pearson’s Correlation Coefficient
Pearson’s Correlation Coefficient (r) measures the linear relationship between two
continuous variables. It is the most commonly used method and assumes that the
relationship between the variables is linear and that the data follows a normal
distribution.
The formula of Pearson’s correlation coefficient is –
r = [n Σ(XY) − (ΣX)(ΣY)] / √([n ΣX² − (ΣX)²][n ΣY² − (ΣY)²])
Where:
n is the number of data points,
Σ(XY) is the sum of the products of the paired scores,
ΣX and ΣY are the sums of the values of the variables X and Y respectively,
ΣX² and ΣY² are the sums of the squares of X and Y, respectively.
This formula calculates the degree to which the values of X and Y are related linearly. The closer
the result is to +1 or -1, the stronger the linear relationship.
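The formula can be checked with a short Python sketch (the function name is illustrative; no external libraries are needed):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from the raw-score formula:
    r = [n*Sxy - Sx*Sy] / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

# A perfectly linear positive relationship gives r = +1
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0
```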
ANSWER 2: Time series analysis is a statistical technique used to analyze data points collected
or recorded at specific time intervals. This type of analysis is useful for forecasting, identifying
trends, and understanding patterns in data that change over time. Time series data can be
collected on anything that evolves over time, such as stock prices, weather data, or economic
indicators. The key components of time series analysis are:
1. Trend – The trend component represents the long-term movement or direction in the data
over a significant period. It indicates whether the data points are increasing, decreasing or
remaining constant. For example, the gradual increase in global temperatures over the
past few decades would be considered a trend. Trends can be linear (a steady upward or
downward movement) or nonlinear (such as exponential growth).
2. Seasonality – This component reflects periodic fluctuations in the data that occur at
regular intervals, often related to specific time periods like months, quarters, or seasons.
These patterns are typically driven by factors like climate, holidays, and business cycles.
For example, retail sales might experience higher demand during the holiday season each
year. Identifying seasonal patterns helps in forecasting and planning.
3. Cyclic patterns – Cyclical variations occur over irregular intervals and are generally
linked to economic or business cycles. Unlike seasonality, cycles do not have a fixed
period and can be influenced by broader factors like economic booms or recessions. For
example, the growth and decline of a country’s economy could lead to cyclical changes
in employment rates or consumer spending.
4. Noise (irregular component) – Also known as the residual or error component, it refers
to random variations or irregularities in the data that cannot be explained by trend,
seasonality, or cycles. These are often short-term fluctuations or anomalies that arise from
unpredictable factors such as accidents, natural disasters, or sudden market shocks. Noise
typically represents the “background” variability in the data.
5. Level – The level is the baseline value around which the time series fluctuates. It
indicates the overall magnitude of the data series after removing trends, seasonality and
cycles. It’s important in understanding the average value of the series and in detecting
whether the time series is generally high, low, centered at some value.
6. Stationarity – A stationary time series has statistical properties (like mean, variance, and
autocorrelation) that do not change over time. Many time series analysis techniques, such
as ARIMA models, assume that the data is stationary. If a series is not stationary, it may
need to be transformed, often by differencing or detrending, before analysis.
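The components and the differencing transformation can be illustrated with a small sketch: a synthetic series built from level + linear trend + seasonality (all numbers are illustrative, not real data), where first differencing removes the linear trend:

```python
import math

# Synthetic monthly series: level + linear trend + seasonal cycle
level, slope, period = 100.0, 2.0, 12
series = [
    level + slope * t + 10 * math.sin(2 * math.pi * t / period)
    for t in range(48)
]

# First differencing (y[t] - y[t-1]) removes a linear trend,
# a common transformation toward stationarity before fitting ARIMA-style models.
diffs = [b - a for a, b in zip(series, series[1:])]

mean_diff = sum(diffs) / len(diffs)
print(round(mean_diff, 2))  # roughly equal to the removed slope
```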
ANSWER 3 (a): To construct an index number for 2015 with 2014 as the base year, we will use
the Laspeyres Index formula:

I = (Σ(P2015 × Q2014) / Σ(P2014 × Q2014)) × 100
Where:
𝑃2014 is the price in 2014,
𝑃2015 is the price in 2015,
𝑄2014 is the quantity (or weight) of the commodity in 2014
Since no quantity is provided, we can assume equal quantities for each commodity.
The price relative for each commodity is calculated as:

Price relative = (P2015 / P2014) × 100

Commodity A = 105.56
Commodity B = (60 / 40) × 100 = 150
Commodity C = (110 / 90) × 100 = 122.22
Commodity D = (35 / 30) × 100 = 116.67
Now, we take the average of these price relatives to get the overall index:

Index = (105.56 + 150 + 122.22 + 116.67) / 4 = 494.45 / 4 = 123.61

The index number for 2015, taking 2014 as the base year, is 123.61.
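The arithmetic can be verified with a few lines of Python (using the price relatives computed above):

```python
# Price relatives for commodities A-D, as computed above
relatives = [105.56, 150.0, 122.22, 116.67]

# Unweighted average of price relatives = index for 2015 (base 2014 = 100)
index_2015 = sum(relatives) / len(relatives)
print(round(index_2015, 2))  # 123.61
```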
ii) ESTIMATOR: An estimator is a statistical tool used to estimate the value of an unknown
population parameter based on sample data. It is a function or formula that provides an
approximation of parameters such as the population mean, variance, or proportion. For instance,
the sample mean can be used as an estimator for the population mean. Estimators are categorized
as either point estimators, which provide a single value estimate, or interval estimators, which
give a range of possible values. A good estimator should be unbiased, meaning its expected value
equals the true parameter, and consistent, meaning it produces more accurate estimates as the
sample size increases. Estimators are central to statistical inference, aiding in decision-making
and predictions based on data. Their reliability and accuracy depend on the sample size and
method used.
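The sample mean as a point estimator, and its consistency, can be illustrated with a short simulation (synthetic data; the population parameters and function name are illustrative):

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population with a known mean of 50 and standard deviation of 10
population_mean = 50.0

def sample_mean(n):
    """Point estimator: the mean of n random draws from the population."""
    draws = [random.gauss(population_mean, 10.0) for _ in range(n)]
    return sum(draws) / n

# Consistency: a larger sample usually lands closer to the true mean
print(abs(sample_mean(10) - population_mean))      # typically a few units off
print(abs(sample_mean(10_000) - population_mean))  # typically close to zero
```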