Unit 2 Final Ids
Data: It is how the data objects and their attributes are stored.
A data object represents an entity—in a sales database, the objects may
be customers, store items, and sales; in a medical database, the objects
may be patients; in a university database, the objects may be students,
professors, and courses. Data objects are typically described by attributes.
Types of attributes:
1. Nominal Attributes: The values of a nominal attribute are names or
categories that have no meaningful order.
Example: Suppose that hair color and marital status
are two attributes describing person objects. In our application,
possible values for hair color are black, brown, blond, red,
auburn, gray, and white. The attribute marital status can take on
the values single, married, divorced, and widowed. Both hair
color and marital status are nominal attributes. Another example
of a nominal attribute is occupation, with the values teacher,
dentist, programmer, farmer, and so on.
2. Ordinal Attributes: An ordinal attribute has values with a meaningful
sequence or ranking (order) among them, but the magnitude of the difference
between successive values is not known. The order shows which value ranks
higher, but not by how much.
Suppose that drink size corresponds to the size of drinks available at a
fast-food restaurant. This ordinal attribute has three possible values:
small, medium, and large. The values have a meaningful sequence (which
corresponds to increasing drink size).
Another example is professional rank: assistant, associate, and full for professors.
Customer satisfaction had the following ordinal categories: 0: very
dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very
satisfied.
Quantitative Attributes:
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of
equal-size units. The values of interval-scaled
attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of
values, such attributes allow us to compare and
quantify the difference between values.
Example:
In statistics, the mean, median, and mode are the three most common
measures of central tendency.
Each one calculates the central point using a different method. Choosing
the best measure of central tendency depends on the type of data you
have.
Measures of central tendency are summary statistics that represent the
center point or typical value of a dataset.
Examples of these measures include the mean, median, and mode.
These statistics indicate where most values in a distribution fall and are
also referred to as the central location of a distribution.
Month Salary
January $105
February $95
March $105
April $105
May $100
Suppose we want to express the salary of the employee using a single
value rather than 5 different values for 5 months. The value used to
represent the salaries for the 5 months is called a measure of central
tendency. The three possible ways to find the measure of central
tendency for the above data are:
Mean: The mean of the given salaries can be used as one of the
measures of central tendency, i.e., x̄ = (105 + 95 + 105 + 105 + 100)/5 =
510/5 = $102.
Mode: If we use the most frequently occurring value to represent the
above data, i.e., $105, the measure of central tendency is the mode.
Median: If we use the central value of the ordered set of salaries,
$95, $100, $105, $105, $105, i.e., $105, the measure of central tendency
is the median.
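The three measures for the salary table can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative and not part of the notes.

```python
import statistics

# Monthly salaries from the table above (January to May), in dollars.
salaries = [105, 95, 105, 105, 100]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # middle value of the sorted list
mode_salary = statistics.mode(salaries)      # most frequently occurring value

print("mean:", mean_salary)
print("median:", median_salary)
print("mode:", mode_salary)
```

Here the median and the mode coincide at $105, while the mean is pulled slightly lower by the $95 month.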
Example: Find the mean of 2, 4, 6, 8, and 10.
Solution:
2 + 4 + 6 + 8 + 10 = 30
x̄ = (x1 + x2 + x3 + … + xn)/n = 30/5 = 6
To calculate the arithmetic mean of a set of data, we first add up (sum) all of
the data values (x) and then divide the result by the number of values (n). Since ∑
is the symbol used to indicate that values are to be summed, we obtain the
following formula for the mean (x̄):
x̄ = (∑x)/n
Example:
In a class of 20 students, the percentages secured are 88, 82, 88, 85, 84, 80, 81, 82,
83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83. Find the mean percentage.
Solution:
Mean = [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89 + 80 + 82 +
83]/20
= 1660/20
= 83
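The formula x̄ = (∑x)/n translates directly into code. The sketch below is only an illustration; the helper name `arithmetic_mean` is an assumption, not something defined in the notes.

```python
# Percentages of the 20 students from the example above.
percentages = [88, 82, 88, 85, 84, 80, 81, 82, 83, 85,
               84, 74, 75, 76, 89, 90, 89, 80, 82, 83]


def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    return sum(values) / len(values)


total = sum(percentages)
mean_percentage = arithmetic_mean(percentages)

print("sum:", total)
print("mean:", mean_percentage)
```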
Median
The median is the middle value of a data set. The data set must
be sorted in either ascending or descending order before the median is read off.
Like the mean, the median is a measure of central tendency, but the two are
computed differently and need not be equal.
To find the median of a data set with an odd number of values, sort the
data and then use the following formula:
Median = [(n + 1)/2]th term
Example 1: Find the median of 23, 2, 12, 33, 65, 45, and 9.
Solution:
Sorted in ascending order, the values are 2, 9, 12, 23, 33, 45, 65.
There is a total of 7 values, so the mid-value (4th) will be the median, i.e., 23.
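To find the median of a data set that contains an even number of values,
sort the data, add the two middle values, and divide the sum by 2.
The value that we get on dividing is the median of the given data set.
We can also write the above steps in terms of the formula:
Median = [(n/2)th term + {(n/2) + 1}th term] / 2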
Solution:
The middle pair of terms in the list are the 7th and 8th, with values 32 and 34, respectively,
so the median is (32 + 34)/2 = 33.
Note that 33 is not in the list. It indicates that half of the values
in the list are less than 33 and half are greater than 33.
Let's find the median using the formula we have learned above.
Step 1: Arrange the data in ascending order: 10, 20, 30, 40, 50.
Step 2: Check whether n (the number of terms in the data set) is even or odd, and find the
median of the data with the respective 'n' value.
Step 3: Here, n = 5 (odd), so Median = [(n + 1)/2]th term.
The median of the data is the [(5 + 1)/2]th = 3rd term, which is 30.
Example 2:
25, 12, 5, 24, 15, 22, 23, 25
Step 1: Arrange the data in ascending order: 5, 12, 15, 22, 23, 24, 25, 25.
Step 2: Check whether n (the number of terms in the data set) is even or odd, and find the
median of the data with the respective 'n' value.
Step 3: Here, n = 8 (even), so
Median = [(n/2)th term + {(n/2) + 1}th term] / 2
Median = [(8/2)th term + {(8/2) + 1}th term] / 2 = (4th term + 5th term) / 2 = (22 + 23) / 2 = 22.5
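The odd and even rules can be combined into one small function. This is a minimal sketch under the assumption that the input is a non-empty list of numbers; the function name `median_by_formula` is illustrative.

```python
def median_by_formula(values):
    """Median using the [(n + 1)/2]th-term rule (odd n) or the middle-pair average (even n)."""
    ordered = sorted(values)   # the data must be sorted first
    n = len(ordered)           # number of terms
    if n % 2 == 1:
        # Odd n: the [(n + 1)/2]th term; subtract 1 because lists are 0-indexed.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)th and (n/2 + 1)th terms.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_result = median_by_formula([10, 20, 30, 40, 50])
even_result = median_by_formula([25, 12, 5, 24, 15, 22, 23, 25])

print("odd n:", odd_result)
print("even n:", even_result)
```

The only subtlety is the shift from the 1-based "th term" in the formula to 0-based list indices.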
Mode
A mode is the most frequent value or item of the data set. A data set can
have one or more than one mode. If the data set has one mode, it is
called "unimodal". Similarly, if the data set contains 2 modes, it is called
"bimodal", and if it contains 3 modes, it is known as "trimodal". More generally,
a data set with more than one mode is known as "multimodal" (which includes
bimodal and trimodal). There is no mode for a data set if every number appears
only once.
Example 1:
If the data set is {1, 2, 2, 3, 3, 4, 5} then it has 2 modes i.e, 2 and 3 (bi-modal). Since,
both the values 2 and 3 are repeating twice in the data set.
Example 2:
If the data set is {15, 42, 65, 65, 95} then the mode is 65 (uni-modal). Since 65 is the
only repeating value in the data set.
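A hedged sketch of how the mode(s) can be found by counting occurrences. The helper `modes` is a hypothetical name; it follows the convention stated above that a data set has no mode when every value appears only once.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


bimodal = modes([1, 2, 2, 3, 3, 4, 5])
unimodal = modes([15, 42, 65, 65, 95])

print("bimodal example:", bimodal)
print("unimodal example:", unimodal)
```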
Range
The range is the difference between the highest value and the lowest value. It is a way to
understand how the numbers are spread in a data set. The formula to find the range is:
Range = Highest value − Lowest value
Example:
If the data set is {12, 19, 6, 2, 15, 4} then the lowest value is 2 and the highest
value is 19.
So the range is 19 − 2 = 17.
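The same calculation as a short sketch; `value_range` is an illustrative name, and the function assumes a non-empty list.

```python
def value_range(values):
    """Return the highest value minus the lowest value."""
    return max(values) - min(values)


data = [12, 19, 6, 2, 15, 4]
result = value_range(data)

print("lowest:", min(data), "highest:", max(data), "range:", result)
```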
Reading Bar Charts: Putting it Together with Central Tendency
Question 1. Find the mean for the above bar chart.
Mean = (sum of all data values) / (number of values)
Mean = (5 + 7 + 9 + 6) / 4 = 27 / 4 = 6.75
Range = highest value − lowest value = 9 − 5 = 4
Measures of Dispersion
Statistical methods that help to know about the distribution or the
spread of the data points in the datasets are known as Measures of
Dispersion.
Range
EX:
Problem Statement: Find the range of the following grouped data.
Class interval    Frequency
0-10              5
10-20             8
20-30             15
30-40             9
Solution:
For the largest value – Take higher limit of the highest class = 40
For the smallest value – Take lower limit of the lowest class = 0
Range = 40 – 0
Range = 40
Interquartile Range
Before defining the interquartile range, let's discuss the quartiles and the five-number
summary.
There are three quartiles, Q1, Q2 and Q3, where Q2 is the median of the distribution.
The five-number summary consists of:
Lowest value
Q1: 25th percentile
Q2: Median (50th percentile)
Q3: 75th percentile
Highest value
Interquartile Range: The interquartile range is defined as the range between the 75th percentile (Q3) and the 25th
percentile (Q1).
IQR = Q3 – Q1
Let’s understand Q1, Q2, Q3 and the Interquartile range by an example.
Problem Statement:
Suppose there are 8 numbers between 10 and 90 that are evenly distributed.
Lowest value : 10
Q1 (25 percentile) : 25
Q2 (50 percentile) : 50
Q3 (75 percentile) : 75
Highest value : 90
Interquartile Range(IQR) = Q3 – Q1 = 75 – 25 = 50
Interquartile Range = 50
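The five-number summary and IQR can be sketched with Python's standard `statistics.quantiles`. The data list below is made up for illustration and is not the data of the example above. Quartile conventions also differ between textbooks and tools, so the cut points produced by the default "exclusive" method may not match a hand calculation exactly.

```python
import statistics

# Illustrative data only (an assumption, not taken from the notes).
data = [1, 2, 3, 4, 5, 6, 7, 8]

# n=4 splits the data into four equal-probability groups,
# returning the three cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

five_number_summary = (min(data), q1, q2, q3, max(data))

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
```

Whatever convention is used, Q2 equals the median and the IQR is always Q3 − Q1.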
Variance
Definition
Variance is a measure of how data points differ from the mean. In layman's terms,
variance is a measure of how far a set of data (numbers) is spread out from its mean
(average) value.
Formally, variance is the average of the squared deviations from the mean. The
standard deviation is the square root of the variance.
Ex: Steps to calculate the variance
1. Find the mean of the given data set, i.e., the average of the given values.
If x1, x2, x3, x4, …, xn are the given values, then
x̄ = (x1 + x2 + x3 + … + xn)/n
2. Subtract the mean from each value of the data set and square the result.
3. Find the average of these squared values; the result is the variance.
Example of Variance
Let’s say the heights (in mm) are 610, 450, 160, 420, 310.
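Mean and variance are interrelated. The first step is finding the mean, which is done as follows:
Mean = (610 + 450 + 160 + 420 + 310)/5 = 1950/5 = 390
To calculate the variance, compute the difference of each value from the mean, square it, and then find
the average once again:
Variance = (220² + 60² + (−230)² + 30² + (−80)²)/5 = 112200/5 = 22440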
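Given the data set 3, 8, 6, 10, 12, 9, 11, 10, 12, 7:
Step 1: Find the mean. The sum of the 10 values is 88, so the mean is 88/10 = 8.8.
Step 2: Make a table with three columns: one for the X values, the second for the deviations, and the
third for the squared deviations. As the data is not given as sample data, we use the formula for
population variance. Thus, the mean is denoted by μ.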
Value X    X − μ    (X − μ)²
3          -5.8     33.64
8          -0.8     0.64
6          -2.8     7.84
10         1.2      1.44
12         3.2      10.24
9          0.2      0.04
11         2.2      4.84
10         1.2      1.44
12         3.2      10.24
7          -1.8     3.24
Total      0        73.6
Step 3: Divide the sum of the squared deviations by the number of values N.
σ² = ∑(X − μ)² / N
= 73.6 / 10
= 7.36
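The table and Step 3 can be reproduced with a few lines of code. This is a sketch of the population-variance formula σ² = ∑(X − μ)² / N; the variable names are illustrative, and the final line compares against `statistics.pvariance` from the standard library as a cross-check.

```python
import math
import statistics

# X values from the table above.
values = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]

mu = sum(values) / len(values)                       # population mean
squared_deviations = [(x - mu) ** 2 for x in values]  # (X - mu)^2 column
variance = sum(squared_deviations) / len(values)      # divide by N

print("mean:", mu)
print("sum of squared deviations:", round(sum(squared_deviations), 2))
print("population variance:", round(variance, 2))
print("matches statistics.pvariance:",
      math.isclose(variance, statistics.pvariance(values)))
```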
Standard Deviation
Standard deviation is a metric that represents the amount by which the
values of a statistical series tend to fluctuate or disperse from their mean.
It describes how the values are distributed over the data sample
and is computed as the square root of the variance.
Ex: 1
Consider the data set: 2, 1, 3, 2, 4. The mean and the sum of squares of deviations of the observations
from the mean will be 2.4 and 5.2, respectively. Thus, the standard deviation will be √(5.2/5) = √1.04 ≈ 1.02.
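Ex: 2
The same calculation for the data set 2, 1, 3, 2, 4, step by step:
1. Find the mean:
2 + 1 + 3 + 2 + 4 = 12
12 ÷ 5 = 2.4 (mean)
2. Subtract the mean from each value:
2 - 2.4 = -0.4
1 - 2.4 = -1.4
3 - 2.4 = 0.6
2 - 2.4 = -0.4
4 - 2.4 = 1.6
3. Square each difference: 0.16, 1.96, 0.36, 0.16, 2.56
4. Average the squares: 5.2 ÷ 5 = 1.04 (variance)
5. Take the square root of the variance: √1.04 ≈ 1.02 (standard deviation)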
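The step-by-step procedure for the data set 2, 1, 3, 2, 4 can be written out as a sketch, treating the data as a whole population (dividing by n). The variable names are assumptions made for illustration.

```python
import math

data = [2, 1, 3, 2, 4]

mean = sum(data) / len(data)             # step 1: find the mean
deviations = [x - mean for x in data]    # step 2: subtract the mean from each value
squares = [d ** 2 for d in deviations]   # step 3: square each difference
variance = sum(squares) / len(data)      # step 4: average the squares
std_dev = math.sqrt(variance)            # step 5: take the square root

print("mean:", mean)
print("sum of squares:", round(sum(squares), 2))
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```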
EX: 3
A class of students took a math test. Their teacher wants to know whether
most students are performing at the same level, or if there is a high standard
deviation.
1. The scores for the test were 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81,
87, and 89. When the teacher adds them together, she gets 1279. She divides
by the number of scores (15) to get the mean score.
1279 ÷ 15 ≈ 85.27, taken as 85.2 in the steps below (mean)
2. 85.2 is a high score, but is everyone performing at that level? To find out,
the teacher subtracts the mean from every test score.
85 - 85.2 = -0.2
86 - 85.2 = 0.8
100 - 85.2 = 14.8
76 - 85.2 = -9.2
81 - 85.2 = -4.2
93 - 85.2 = 7.8
84 - 85.2 = -1.2
99 - 85.2 = 13.8
71 - 85.2 = -14.2
69 - 85.2 = -16.2
93 - 85.2 = 7.8
85 - 85.2 = -0.2
81 - 85.2 = -4.2
87 - 85.2 = 1.8
89 - 85.2 = 3.8
3. The teacher squares each difference.
4. The teacher finds the variance, which is the average of the squares:
0.04 + 0.64 + 219.04 + 84.64 + 17.64 + 60.84 + 1.44 + 190.44 + 201.64 + 262.44
+ 60.84 + 0.04 + 17.64 + 3.24 + 14.44 = 1135
1135 ÷ 15 ≈ 75.67 (variance)
5. Last, the teacher finds the square root of the variance: √75.67 ≈ 8.7 (standard deviation)
The standard deviation of these tests is 8.7 points out of 100. Since the
standard deviation is somewhat low, the teacher knows that most students are
performing around the same level.
EX:4
Because this is a sample, the researcher needs to subtract 1 from the total
number of values in step 4.
1. The scores for the survey are 9, 7, 10, 8, 9, 7, 8, and 9. The mean is 67 ÷ 8 ≈ 8.4.
2. The researcher subtracts the mean from every score (differences: 0.6, -1.4, 1.6,
-0.4, 0.6, -1.4, -0.4, 0.6).
3. He squares each number (0.36, 1.96, 2.56, 0.16, 0.36, 1.96, 0.16, 0.36).
4. Because this is a sample of responses, the researcher divides the sum of the squares (7.88)
by one less than the number of values (8 values - 1 = 7) to find the
variance: 7.88 ÷ 7 ≈ 1.13 (variance)
5. Last, the researcher finds the square root of the variance: √1.13 ≈ 1.06 (standard
deviation)
The standard deviation is 1.06, which is somewhat low. The researcher now
knows that the results of the sample are probably reliable.
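The difference between dividing by n − 1 (sample) and by n (population) can be seen with the standard `statistics` module. This is a sketch on the survey scores above. The library works with the unrounded mean (8.375), so its variance comes out slightly different from a hand calculation that uses the rounded mean 8.4.

```python
import statistics

# Survey scores from the example above.
scores = [9, 7, 10, 8, 9, 7, 8, 9]

sample_variance = statistics.variance(scores)  # divides by n - 1
sample_std = statistics.stdev(scores)          # square root of the sample variance
population_std = statistics.pstdev(scores)     # divides by n instead

print("sample variance:", round(sample_variance, 3))
print("sample standard deviation:", round(sample_std, 2))
print("population standard deviation:", round(population_std, 2))
```

Dividing by n − 1 always gives a slightly larger result than dividing by n, and the gap shrinks as the sample grows.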
Tables
Pictorial Representation through Graphs
They say, "A picture is worth a thousand words". It is often easier to
understand data when it is represented in graphical format.
Common graphic displays include:
quantile plots,
quantile-quantile (Q-Q) plots,
histograms,
and scatter plots.
Such graphs are helpful for the visual inspection of data, which is useful for data
preprocessing.
Scatter Plot
Scatter plots are also known by many other names, such as scatter graphs,
scatter charts, scattergrams, scatter diagrams, and XY graphs.
There are mainly five components in a Scatter Plot Chart, as listed below:
o Plot Area: A graphical form/area within the sheet where the data is drawn
is called the Plot Area.
o Chart Title: A chart title represents the subject of the plotted chart that
primarily helps determine the chart's topic or motive. The text in the chart
title can be edited, and the position can be arranged accordingly.
o Vertical Axis: An axis that lies vertically in the chart window is called the
vertical axis, and it is located on the left side of the plot area. Since
the vertical axis typically represents the measurement values, it is also
known as the Y-axis.
o Horizontal Axis: An axis that lies horizontally in the chart window is
called the horizontal axis, and it is located at the bottom of the plot area.
It is also known as the X-axis. We can group series data on the
horizontal axis.
o Legend: The legend is another useful component of the chart that helps
list and distinguish various data groups. We can move the legend or
change the legend's position accordingly, and it can be placed on any side
in the chart window.
Advantages of using Scatter Plots
o The scatter charts help determine the relationship between two or more variables.
Histogram
300 – 400 14
400 – 500 56
500 – 600 60
600 – 700 86
700 – 800 74
800 – 900 62
900 – 1000 48
Example:
Present the following information in the form of a Histogram:
Number of students 16 36 70 50 28
Solution
The given data has equal class intervals; i.e., the
difference between the upper limit and the lower limit of each class interval is
10. So, drawing a Histogram is feasible.
The X-axis represents the marks (class intervals), and the Y-axis represents the
number of students (frequency).
"Histos" means pole or mast, and "gram" means chart, so a histogram is a chart
of poles. Plotting histograms is a graphical method for summarizing the
distribution of a given attribute.
A histogram is similar to a bar graph, but it is based on the frequency of numerical values rather
than their actual values. The data is organized into intervals, and the bars
represent the frequency of the values in each interval. That is, it counts how many
values of the data lie in a particular range.
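The counting step behind a histogram can be sketched without any plotting library. The marks below are made-up illustrative values, and the helper `histogram_counts` is a hypothetical name. It assumes equal-width class intervals of the form [lower, upper), so a value on a boundary goes into the higher class.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall into each equal-width class interval."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)  # which interval the value belongs to
        if 0 <= index < bins:              # ignore values outside the covered range
            counts[index] += 1
    return counts


# Illustrative marks (an assumption, not data from the notes).
marks = [3, 12, 15, 27, 21, 29, 35, 8, 22, 38]
counts = histogram_counts(marks, start=0, width=10, bins=4)

# Print a simple text histogram: one '#' per value in the interval.
for i, count in enumerate(counts):
    lower = i * 10
    print(f"{lower:>2}-{lower + 10:<3} {'#' * count} ({count})")
```

The bar heights are the counts, not the marks themselves, which is what distinguishes a histogram from a bar graph of raw values.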
Box and Whisker Plot
These plots divide the data into four parts (quartiles) to show a summary of its distribution. They
highlight the median and the spread of the data rather than individual values.
Quantile-Quantile (Q-Q) Plot
The purpose of a Q-Q plot is to find out whether two sets of data come from the
same distribution. The quantiles of one data set are plotted against the quantiles
of the other, and a 45-degree reference line is drawn on the plot; if the two
data sets come from a common distribution, the points will fall approximately on that
reference line.
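The points of a Q-Q plot are simply pairs of matching quantiles. The sketch below computes those pairs with `statistics.quantiles` and leaves the drawing aside. The two samples and the helper name `qq_pairs` are assumptions made for illustration.

```python
import statistics


def qq_pairs(sample_a, sample_b, n=4):
    """Pair up the matching quantile cut points of two samples."""
    quantiles_a = statistics.quantiles(sample_a, n=n)
    quantiles_b = statistics.quantiles(sample_b, n=n)
    return list(zip(quantiles_a, quantiles_b))


# Illustrative samples (assumptions, not data from the notes).
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [2, 4, 6, 8, 10, 12, 14, 16]  # every value of `a` doubled

pairs_same = qq_pairs(a, a)    # identical samples: points lie on the line y = x
pairs_scaled = qq_pairs(a, b)  # scaled sample: points lie on a straight line off y = x

print("same distribution:", pairs_same)
print("scaled distribution:", pairs_scaled)
```

Points that stay on the 45-degree line y = x suggest a common distribution; points on a different straight line suggest the same shape with a different scale or location.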