IDS Unit-2
TYPES OF DATA
A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are
described by a number of attributes that capture the basic characteristics of an object, such as the
mass of a physical object or the time at which an event occurred. Other names for an attribute are
variable, characteristic, field, feature, or dimension.
Example: Student Information. Often, a data set is a file, in which the objects are records (or
rows) in the file and each field (or column) corresponds to an attribute. For example, Table shows
a data set that consists of student information. Each row corresponds to a student and each column
is an attribute that describes some aspect of a student, such as grade point average (GPA) or
identification number (ID).
ATTRIBUTE
• In data science, an attribute refers to a characteristic or feature that describes a data point
or an object.
• It can be seen as a data field that represents the characteristics or features of a data
object. For a customer object, attributes can be customer ID, address, etc. A set of
attributes used to describe a given object is known as an attribute
vector or feature vector.
• An attribute is a data field, representing a characteristic or feature of a data object. The
nouns attribute, dimension, feature, and variable are often used interchangeably.
• The term dimension is commonly used in data warehousing. Machine learning literature
tends to use the term feature, while statisticians prefer the term variable. Data mining and
database professionals commonly use the term attribute.
• Attributes describing a customer object can include, for example, customerID, name, and
address. Observed values for a given attribute are known as observations. A set of
attributes used to describe a given object is called an attribute vector (or feature vector).
• In R, attributes are additional metadata associated with objects that provide extra
information about the object. These attributes can include information such as names,
dimensions, class, comments, and more. Attributes enhance the functionality and
interpretability of objects in R.
• Names Attribute: The names attribute assigns names to the elements of vectors, matrices,
or arrays.
# Creating a named vector
> my_vector <- c(apple = 3, banana = 2, orange = 5)
# Checking the names attribute
> names(my_vector)
[1] "apple" "banana" "orange"
• Dim Attribute: The dim attribute specifies the dimensions of matrices and arrays.
# Creating a matrix with dimensions
> my_matrix <- matrix(1:6, nrow = 2, ncol = 3)
# Checking the dim attribute
> dim(my_matrix)
[1] 2 3
• Factor Levels Attribute: The levels attribute of factors defines the unique values or
categories.
# Creating a factor with levels
> my_factor <- factor(c("low", "medium", "high"), levels = c("low", "medium", "high"))
# Checking the levels attribute
> levels(my_factor)
[1] "low" "medium" "high"
• Class Attribute: The class attribute indicates the type of R object.
# Checking the class of an object
> class(my_factor)
[1] "factor"
• Column Names Attribute (Data Frame): Data frames have column names as an
attribute.
# Creating a data frame
> my_data_frame <- data.frame(Name = c("John", "Jane", "Bob"), Age = c(25, 30, 22))
# Checking the column names attribute
> colnames(my_data_frame)
[1] "Name" "Age"
Each attribute type possesses all of the properties and operations of the attribute types above it. In
other words, the definition of the attribute types is cumulative. However, this does not mean that
the operations appropriate for one attribute type are appropriate for the attribute types above it.
Qualitative Attributes:
1. Nominal Attributes : Nominal attributes, as related to names, refer to categorical data
where the values represent different categories or labels without any inherent order or
ranking. These attributes are often used to represent names or labels associated with
objects, entities, or concepts. The values of a nominal attribute are just different names;
i.e., nominal values provide only enough information to distinguish one object from
another (=, ≠). Nominal attribute operations are mode, entropy, contingency correlation, and the χ²
test.
Example for Nominal attributes: Suppose that hair color and marital status are two
attributes describing person objects. In our application, possible values for hair color are
black, brown, blond, red, auburn, gray, and white. The attribute marital status can take on
the values single, married, divorced, and widowed. Both hair color and marital status are
nominal attributes. Another example of a nominal attribute is occupation, with the values
teacher, dentist, programmer, farmer, and so on.
# Creating a nominal attribute (factor)
> nominal_attribute <- factor(c("Red", "Green", "Blue", "Red", "Blue"))
# Display the nominal attribute
> print(nominal_attribute)
[1] Red Green Blue Red Blue
Levels: Blue Green Red
# Checking the levels of the nominal attribute
> levels(nominal_attribute)
[1] "Blue" "Green" "Red"
# Summary statistics for the nominal attribute
> summary(nominal_attribute)
Blue Green Red
2 1 2
In this example, nominal_attribute is a factor representing a nominal attribute with three
categories: "Red," "Green," and "Blue." Factors in R are often used to represent nominal
attributes because they can store categorical data with distinct levels.
# Creating a nominal attribute using a character vector
> nominal_attribute_char <- c("Small", "Medium", "Large", "Medium", "Small")
# Display the nominal attribute
> print(nominal_attribute_char)
[1] "Small" "Medium" "Large" "Medium" "Small"
# Converting the character vector to a factor
> nominal_attribute_factor <- factor(nominal_attribute_char)
# Display the nominal attribute
> print(nominal_attribute_factor)
[1] Small Medium Large Medium Small
Levels: Large Medium Small
# Checking the levels of the nominal attribute
> levels(nominal_attribute_factor)
[1] "Large" "Medium" "Small"
2. Ordinal Attributes : Ordinal attributes are a type of qualitative attribute where the
values possess a meaningful order or ranking, but the magnitude between values is not
precisely quantified. In other words, while the order of values indicates their relative
importance or precedence, the numerical difference between them is not standardized or
known. The values of an ordinal attribute provide enough information to order objects.
(<, >). Ordinal Attribute operations are median, percentiles, rank correlation, run tests,
sign tests.
Unlike nominal attributes, ordinal attributes have an inherent order, but the intervals
between categories are not necessarily equal or known. Ordinal attributes are often used
when the categories have a natural order, but the differences between them are not
precisely defined.
Example 1 for Ordinal attributes. Suppose that drink size corresponds to the size of drinks
available at a restaurant. This ordinal attribute has three possible values: small, medium,
and large. The values have a meaningful sequence (which corresponds to increasing drink
size). However, we cannot tell from the values how much bigger, say, a large is than a
medium. Other examples of ordinal attributes include grade and professional rank.
Professional ranks can be enumerated in a sequential order: for example, assistant,
associate, and full for professors, and private, private first class, specialist, corporal, and
sergeant for army ranks.
Ordinal attributes are useful for registering subjective assessments of qualities that cannot
be measured objectively, thus ordinal attributes are often used in surveys for ratings. In one
survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat
dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied.
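The satisfaction scale above can be sketched in R as an ordered factor. This is a minimal illustration: the level labels come from the survey categories in the text, while the five responses are made-up values, not real survey data.

```r
# Levels listed from lowest to highest, as in the survey categories above
satisfaction_levels <- c("very dissatisfied", "somewhat dissatisfied",
                         "neutral", "satisfied", "very satisfied")

# Illustrative responses (assumed for this sketch)
responses <- c("satisfied", "neutral", "very satisfied",
               "somewhat dissatisfied", "satisfied")

# ordered = TRUE tells R that the levels have a meaningful order
satisfaction <- factor(responses, levels = satisfaction_levels, ordered = TRUE)

# Order comparisons (<, >) are allowed for ordinal attributes
is_higher <- satisfaction[1] > satisfaction[2]   # satisfied vs. neutral
print(is_higher)

# The median is found through the integer codes of the ordered levels
median_code <- median(as.integer(satisfaction))
median_label <- levels(satisfaction)[median_code]
print(median_label)
```

An unordered factor would reject the `>` comparison, which is the practical difference between nominal and ordinal attributes in R.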
Another attribute type is:
• Asymmetric: An asymmetric attribute indicates that the two values or states are not
equally important or interchangeable. For asymmetric attributes, only presence (a non-zero
attribute value) is regarded as important. For instance, in the attribute “Result” with values
“Pass” and “Fail,” the states are not of equal importance, passing may hold greater
significance than failing in certain contexts, such as academic grading or certification
exams. Consider a data set where each object is a student and each attribute records whether
or not a student took a particular course at a university. For a specific student, an attribute
has a value of 1 if the student took the course associated with that attribute and a value of
0 otherwise. Because students take only a small fraction of all available courses, most of
the values in such a data set would be 0. Therefore, it is more meaningful and more efficient
to focus on the non-zero values.
To illustrate, if students are compared on the basis of the courses they don't take, then most
students would seem very similar, at least if the number of courses is large. Binary
attributes where only non-zero values are important are called asymmetric binary
attributes. This type of attribute is particularly important for association analysis. It is also
possible to have discrete or continuous asymmetric features. For instance, if the number of
credits associated with each course is recorded, then the resulting data set will consist of
asymmetric discrete or continuous attributes.
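The point about comparing students by the courses they do not take can be checked with a small sketch. The two students and the ten courses below are made-up data, and the two similarity measures are written out by hand for illustration rather than taken from any package.

```r
# Rows are students, columns are courses; 1 = took the course (illustrative data)
courses <- rbind(
  student_a = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0),
  student_b = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1)
)
a <- courses["student_a", ]
b <- courses["student_b", ]

# Counting shared 0s as agreement: share of all courses where the rows match
match_all <- mean(a == b)

# Ignoring shared 0s: of the courses taken by either student, the share taken by both
taken_by_either <- (a == 1) | (b == 1)
match_nonzero <- sum(a == 1 & b == 1) / sum(taken_by_either)

print(match_all)       # 0.6
print(match_nonzero)   # 0
```

The two students share no course at all, yet they look 60% similar when the shared zeros are counted. Focusing on the non-zero values removes that effect.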
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented
in integer or real values. Numeric attributes in R refer to variables that represent quantitative data
with meaningful numerical values. These values can be either discrete or continuous. Numeric
attributes are used to store information that can be measured or counted and are amenable to
arithmetic operations.
# Creating a numeric attribute (numeric vector)
> numeric_attribute <- c(25, 30, 22, 18, 35)
# Display the numeric attribute
> print(numeric_attribute)
[1] 25 30 22 18 35
# Checking the class of the attribute
> class(numeric_attribute)
[1] "numeric"
# Summary statistics for the numeric attribute
> summary(numeric_attribute)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18 22 25 26 30 35
In this example, numeric_attribute is a numeric vector representing a numeric attribute with
values such as ages. The class() function confirms that it is a numeric vector, and the summary()
function provides summary statistics like mean, median, minimum, and maximum.
# Creating a data frame with a numeric attribute
> my_data_frame <- data.frame(Name = c("John", "Jane", "Bob"), Age = c(25, 30, 22))
# Display the data frame
> print(my_data_frame)
Name Age
1 John 25
2 Jane 30
3 Bob 22
# Checking the class of the character attribute in the data frame
> class(my_data_frame$Name)
[1] "character"
# Checking the class of the numeric attribute in the data frame
> class(my_data_frame$Age)
[1] "numeric"
# Summary statistics for the numeric attribute in the data frame
> summary(my_data_frame$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
22.00 23.50 25.00 25.67 27.50 30.00
Numeric attributes are of two types: interval-scaled and ratio-scaled.
• An interval-scaled attribute has values whose differences are interpretable, but it
has no true reference point (zero point). An
interval scale is one where there is order and the difference between two values is
meaningful. Interval-scaled attributes are measured on a scale of equal-size units. The values
of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition
to providing a ranking of values, such attributes allow us to compare and quantify the
difference between values. For interval attributes, the differences between values are
meaningful, i.e., a unit of measurement exists (+, −). Data on an interval scale can be added
and subtracted but cannot be meaningfully multiplied or divided. Interval-scaled attribute
operations include mean, standard deviation, Pearson's correlation, and t and F tests. Examples
are calendar dates and temperature in Celsius or Fahrenheit. If the temperature on one day is
numerically twice that of another day, we cannot say that the first day is twice as hot.
• A ratio-scaled attribute is a numeric attribute with a fixed zero point. Ratio-scaled attributes
allow you to categorize and rank your data along equal intervals. If a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are
ordered, and we can also compute the difference between values, as well as the mean, median,
mode, quantile range, and five-number summary. Ratio-scaled attribute
operations include geometric mean, harmonic mean, and percent variation. For ratio variables, both
differences and ratios are meaningful (*, /). An example is temperature in Kelvin.
Other examples of ratio-scaled attributes include count attributes such as years of experience
(e.g., the objects are employees) and number of words (e.g., the objects are documents).
Additional examples include attributes to measure age, weight, height, counts, mass, length,
electrical current, latitude and longitude coordinates (e.g., when clustering houses), and
monetary quantities.
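The difference between the two scales can be sketched in R with temperature. The two Celsius readings are illustrative values, and the conversion adds 273.15 to obtain Kelvin.

```r
# Two daily temperatures in degrees Celsius (illustrative values)
celsius <- c(day1 = 10, day2 = 20)
kelvin <- celsius + 273.15   # Kelvin has a true zero point

# Differences are meaningful on both scales, and they agree
diff_celsius <- unname(diff(celsius))   # 10
diff_kelvin <- unname(diff(kelvin))     # 10

# Ratios are only meaningful on the ratio scale
ratio_celsius <- celsius[["day2"]] / celsius[["day1"]]   # 2, but day2 is not "twice as hot"
ratio_kelvin <- kelvin[["day2"]] / kelvin[["day1"]]      # about 1.035

print(c(celsius = ratio_celsius, kelvin = ratio_kelvin))
```

The Celsius ratio of 2 depends on where the arbitrary zero was placed, while the Kelvin ratio of about 1.035 reflects the physical quantity.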
2. Discrete: A discrete attribute has a finite or countably infinite set of values, which may or may
not be represented as integers. Discrete data refer to information that can take on specific, separate
values rather than a continuous range. These values are often distinct and separate from one
another, and they can be either numerical or categorical in nature. Discrete attributes are often
represented using integer variables. Binary attributes are a special case of discrete attributes and
assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often
represented as Boolean variables, or as integer variables that only take the values 0 or 1.
Classification algorithms developed from the field of machine learning often talk of attributes as
being either discrete or continuous. Each type may be processed differently.
The attributes hair color, smoker, medical test, and drink size each have a finite number of values,
and so are discrete. Note that discrete attributes may have numeric values, such as 0 and 1 for
binary attributes or the values 0 to 110 for the attribute age. An attribute is countably infinite if
the set of possible values is infinite but the values can be put in a one-to-one correspondence with
natural numbers. For example, the attribute customer ID is countably infinite. The number of
customers can grow to infinity, but in reality, the actual set of values is countable (where the values
can be put in one-to-one correspondence with the set of integers). Zip codes are another example.
Example:
In statistics, the mean, median, and mode are the three most common measures of central
tendency.
• Each one calculates the central point using a different method. Choosing the best measure
of central tendency depends on the type of data you have.
• Measures of central tendency are summary statistics that represent the center point or typical
value of a dataset.
• Examples of these measures include the mean, median, and mode. These statistics indicate
where most values in a distribution fall and are also referred to as the central location of a
distribution.
Measures of Central Tendency Example
Example. The monthly salary of an employee for the 5 months is given in the table below,
Month Salary
January $105
February $95
March $105
April $105
May $100
Suppose, we want to express the salary of the employee using a single value and not 5 different
values for 5 months. This value that can be used to represent the data for salaries for 5 months
here can be referred to as the measure of central tendency. The three possible ways to find the
central measure of the tendency for the above data are,
• Mean: The mean of the given salaries can be used as one of the measures of central
tendency, i.e., x̄ = (105 + 95 + 105 + 105 + 100)/5 = $102.
• Mode: If we use the most frequently occurring value to represent the above data, i.e., $105,
the measure of central tendency would be mode.
• Median: If we use the central value, i.e., $105 for the ordered set of salaries, given as, $95,
$100, $105, $105, $105, then the measure of central tendency here would be median.
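The three measures for the salary table can be reproduced in R with a short sketch. Base R has no function for the statistical mode (the built-in `mode()` reports the storage type), so `stat_mode` below is a helper name introduced here for illustration.

```r
# Monthly salaries from the table above (January to May)
salary <- c(105, 95, 105, 105, 100)

mean_salary <- mean(salary)       # 102
median_salary <- median(salary)   # 105

# Count each distinct value with table() and keep the most frequent one(s)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_salary <- stat_mode(salary)  # 105

print(c(mean = mean_salary, median = median_salary, mode = mode_salary))
```

If several values tie for the highest count, `stat_mode` returns all of them.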
MEAN
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers
by the total number of numbers.
• The mean, also known as the average, is calculated by summing up all the values in a
dataset and dividing by the number of observations.
Mean = (Sum of all the observations/Total number of observations)
Example:
What is the mean of 2, 4, 6, 8 and 10?
Solution:
First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations).
Mean = 30/5 = 6
Mean Symbol (x̄):
The mean is usually denoted by the symbol ‘x̄’. The bar above the letter x represents the
mean of the n values of x.
x̄ = (Sum of values ÷ Number of values)
x̄ = (x₁ + x₂ + x₃ + … + xₙ)/n
Mean = Sum of the Given Data/Total number of Data
To calculate the arithmetic mean of a set of data we must first add up (sum) all of the data values
(x) and then divide the result by the number of values (n). Since Σ is the symbol used to indicate
that values are to be summed, we obtain the following formula for the mean (x̄): x̄ = Σx / n
Example:
In a class there are 20 students and they have secured a percentage of 88, 82, 88, 85, 84, 80, 81,
82, 83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83.
Find the mean percentage obtained by the class.
Solution:
Mean = Total of percentage obtained by 20 students in class/Total number of students
= [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89 + 80 + 82
+ 83] / 20
= 1660/20
= 83
Hence, the mean percentage obtained by the class is 83%.
Example code for Mean using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the mean
> mean_value <- mean(data)
# Displaying the mean
> print(mean_value)
[1] 20
MEDIAN
• The median is the middle value of a sorted data set. The data set must first be
sorted, either in ascending order or in descending order.
• Like the mean, the median is a kind of average, but it is found by position in the
sorted data rather than by summing the values.
• A median divides the data into two halves. The formula for median:
✓ If the number of values (n value) in the data set is odd then the formula to calculate
median is:
Median = {(n + 1)/2}th term
✓ If the number of values (n value) in the data set is even then the formula to calculate
median is:
Median = [(n/2)th term + {(n/2) + 1}th term] / 2
• When n is even, the median need not be a value in the list. A median of 33, for example,
indicates that half the values in the list are less than 33 and half are greater than 33.
• Let’s find the median through the formula which we have learned above.
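As a minimal sketch in R, the median can be computed by position: the single middle term when n is odd, and the mean of the two middle terms when n is even. The helper name `median_by_position` is introduced here for illustration. The test vectors reuse the salary example and the four values from the bar chart questions in this unit.

```r
# Median by position in the sorted data
median_by_position <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                    # odd n: the single middle term
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2     # even n: mean of the two middle terms
  }
}

odd_case <- c(105, 95, 105, 105, 100)   # salaries, n = 5
even_case <- c(5, 7, 9, 6)              # bar chart values, n = 4

odd_median <- median_by_position(odd_case)     # 105
even_median <- median_by_position(even_case)   # 6.5
print(c(odd = odd_median, even = even_median))

# Both results agree with the built-in median()
print(c(odd = median(odd_case), even = median(even_case)))
```

In practice the built-in `median()` is the function to use. The helper only makes the two cases of the rule visible.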
Range: It is the difference between the highest value and the lowest value. It is a way to understand
how the numbers are spread in a data set. Formula to find Range is:
Range = Highest value – Lowest Value
Example: If the data set is {12, 19, 6, 2, 15, 4} then the lowest value is 2 and
the highest value is 19.
So, the range is 19 − 2 = 17.
Reading Bar Charts: Putting it Together with Central Tendency
Question 1. Finding Mean for the above bar chart.
Mean = (sum of all data values) / (number of values)
Mean = (5 + 7 + 9 + 6) / 4 = 27 / 4 = 6.75
Question 2. Finding the Median for the above bar chart:
Order the given data in ascending order as: 5, 6, 7, 9
Here, n = 4 (number of students which is even)
Median = [(n/2)th term + {(n/2) + 1}th term] / 2
Median = (6 + 7) / 2 = 6.5
Question 3. Finding Mode for the above bar chart:
Mode = category with the highest frequency = the tallest bar, which has the value 9
Question 4. Finding the range for the above bar chart:
Range = highest value – lowest value
Range = 9 – 5 = 4
These measures of central tendency provide insights into the typical or central value of a dataset.
The choice of which measure to use depends on the nature of the data and the specific
characteristics of the distribution. The mean is sensitive to outliers, while the median is robust
against extreme values. The mode is particularly useful for categorical data or discrete
distributions.
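The effect of an outlier on the mean and the median can be seen in a short R sketch. The first five values match the vector used in the other R examples in this unit. The extra value 300 is an assumed outlier added for illustration.

```r
scores <- c(10, 15, 20, 25, 30)
scores_with_outlier <- c(scores, 300)   # one extreme value added

# Without the outlier the two measures agree
mean_before <- mean(scores)       # 20
median_before <- median(scores)   # 20

# With the outlier the mean is pulled upward, the median barely moves
mean_after <- mean(scores_with_outlier)       # about 66.67
median_after <- median(scores_with_outlier)   # 22.5

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```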
BASIC STATISTICAL DESCRIPTIONS OF DATA
• Statistical methods that help to know about the distribution or the spread of the data points
in the datasets are known as Measures of Dispersion.
• Measuring the dispersion of data involves assessing how spread out or clustered the values
in a dataset are.
• Common measures of dispersion include Range, Quartiles, Interquartile range (IQR),
Variance, Standard Deviation.
RANGE
• The range is the easiest measure of dispersion. It is simply calculated by
subtracting the lowest value from the highest value.
• The range is the difference between the maximum and minimum values in a dataset. It
provides a simple measure of the spread of the data.
Range = Highest value – Lowest Value
• The range is a simple and straightforward measure of dispersion that provides an
indication of the difference between the largest and smallest values in a dataset. However,
it can be affected by outliers, or extreme values, in the data and does not provide
information about the distribution of values within the range.
• In combination with measures of central tendency, such as mean or median, range can
provide a quick summary of the distribution of the data. However, other measures of
dispersion, such as variance or standard deviation, are often used to provide a more
comprehensive picture of the spread of the data.
Ex: Problem Statement: Let there be 5 students in the class having heights of
150 cm, 160 cm, 175 cm, 190 cm and 200 cm. Calculate the range of heights.
Range = 200cm – 150cm
Hence, Range = 50cm
Range for ungrouped data:
Question 1: Find out the range for the following observations 20, 24, 31, 17, 45, 39, 51, 61
Solution: The largest value in the given observations is 61 and the smallest value is 17.
The Range is 61 – 17 = 44
Range for grouped data:
Question 2: Find out the range for the following frequency distribution table for the marks scored
by class 10 students.
Marks Intervals Number of Students
0-10 5
10-20 8
20-30 15
30-40 9
Solution:
For the largest value – Take higher limit of the highest class = 40
For the smallest value – Take lower limit of the lowest class = 0
Range = 40 – 0
Range = 40
Sample code using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the range
> range_value <- max(data) - min(data)
# Displaying the range
> print(range_value)
[1] 20
QUARTILES
• Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that
we can pick certain data points so as to split the data distribution into equal-size consecutive
sets. These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
• Quartiles divide the set into 4 equal parts.
• There are three quartiles Q1, Q2 and Q3, where Q2 is the median of the distribution.
• Five number summaries: every dataset can be described using these 5 numbers
✓ Lowest value
✓ Q1: 25th percentile
✓ Q2: Median (50th percentile)
✓ Q3: 75th percentile
✓ Highest Value
• The interquartile range is the range of the middle 50% of the data. It is calculated as the
difference between the third quartile (Q3) and the first quartile (Q1).
IQR = Q3 – Q1
Let’s understand Q1, Q2, Q3 and the Interquartile range by an example.
Problem Statement:
Suppose there are 8 numbers between 10 and 90 that are equally distributed.
Give the five-number summary and find the interquartile range.
✓ Lowest value : 10
✓ Q1 (25th percentile) : 25
✓ Q2 (50th percentile) : 50
✓ Q3 (75th percentile) : 75
✓ Highest value : 90
✓ Interquartile Range(IQR) = Q3 – Q1 = 75 – 25 = 50
Interquartile Range = 50
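In R, the five-number summary and the interquartile range are available through the base functions `fivenum()`, `quantile()` and `IQR()`. The sketch below uses the same small vector as the other R examples in this unit. Note that `quantile()` supports several interpolation conventions (its default is type 7), so quartiles computed by hand with another convention can differ slightly.

```r
data <- c(10, 15, 20, 25, 30)

# Five-number summary: minimum, Q1, median, Q3, maximum
five_numbers <- fivenum(data)
print(five_numbers)

# Quartiles as the 25th, 50th and 75th percentiles
quartiles <- quantile(data, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# Interquartile range, IQR = Q3 - Q1
iqr_value <- IQR(data)
print(iqr_value)
```

For this vector the quartiles are 15, 20 and 25, so the IQR is 25 − 15 = 10.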
Ex: 2
Step 3: σ² = Σ(X − μ)² / N
= 73.6 / 10
= 7.36
Sample code using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the sample variance (divides by n - 1) using the 'var' function
> variance_value <- var(data)
# Displaying the variance
> print(variance_value)
[1] 62.5
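R's `var()` returns the sample variance, which divides by n − 1, whereas the formula σ² = Σ(X − μ)² / N divides by N. A minimal sketch of both on the same vector:

```r
data <- c(10, 15, 20, 25, 30)
n <- length(data)

# Sample variance: sum of squared deviations divided by (n - 1)
sample_variance <- var(data)
print(sample_variance)        # 62.5

# Population variance: sum of squared deviations divided by n
population_variance <- sum((data - mean(data))^2) / n
print(population_variance)    # 50

# The two differ only by the factor (n - 1) / n
print(all.equal(population_variance, sample_variance * (n - 1) / n))
```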
STANDARD DEVIATION
● Standard deviation is the square root of the variance. It provides a measure of the
average deviation of data points from the mean.
● Standard deviation is a metric that represents the amount to which various values of a
statistical series tend to fluctuate or disperse from its mean or median. It describes how
the values are distributed over the data sample and is a measure of the data points’
deviation from the mean.
● The square root of the variance of a sample, statistical population, random variable, data
collection, or probability distribution is its standard deviation.
Steps to Calculate Standard Deviation
✓ Find the mean, which is the arithmetic mean of the observations.
✓ Find the squared differences from the mean.
(data value − mean)²
✓ Find the average of the squared differences.
Variance = The sum of squared differences ÷ the number of observations
✓ Find the square root of variance.
Standard deviation = √Variance
• The standard deviation provides a summary of the spread of values in a dataset and can be
used to determine how far each value is from the mean. A low standard deviation indicates
that the values in the dataset are close to the mean, while a high standard deviation indicates
that the values are spread out.
• The standard deviation is widely used in statistical analysis and is a useful measure of
dispersion for datasets with a normal or symmetrical distribution. However, it can be
affected by outliers or extreme values in the dataset and may not provide a good summary
of the spread of the data for datasets with skewed or non-normal distributions.
• In summary, the standard deviation is a useful measure of dispersion that quantifies the
amount of variation or dispersion of a set of values around the mean, and provides a
summary of the spread of values in a dataset.
Ex: 1
Consider the data set: 2, 1, 3, 2, 4. The mean and the sum of squares of deviations of the
observations from the mean will be 2.4 and 5.2, respectively. Thus, the standard deviation will be
√(5.2/5) = 1.01.
Ex: 2
For example: Take the values 2, 1, 3, 2 and 4.
1. Determine the mean (average):
2 + 1 +3 + 2 + 4 = 12
12 ÷ 5 = 2.4 (mean)
2. Subtract the mean from each value:
2 - 2.4 = -0.4
1 - 2.4 = -1.4
3 - 2.4 = 0.6
2 - 2.4 = -0.4
4 - 2.4 = 1.6
3. Square each of those differences:
-0.4 x -0.4 = 0.16
-1.4 x -1.4 = 1.96
0.6 x 0.6 = 0.36
-0.4 x -0.4 = 0.16
1.6 x 1.6 = 2.56
4. Determine the average of those squared numbers to get the variance.
0.16 + 1.96 + 0.36 + 0.16 + 2.56 = 5.2
5.2 ÷ 5 = 1.04 (variance)
5. Find the square root of the variance.
Square root of 1.04 ≈ 1.02
The standard deviation of the values 2, 1, 3, 2 and 4 is 1.02.
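The five steps of this example can be followed line by line in R. This sketch divides by the number of values, as the worked example does, so it gives the population standard deviation (about 1.0198). The built-in `sd()` divides by n − 1 and would give a larger number.

```r
values <- c(2, 1, 3, 2, 4)

# Step 1: the mean
mean_value <- mean(values)              # 2.4

# Step 2: subtract the mean from each value
deviations <- values - mean_value

# Step 3: square each difference
squared_deviations <- deviations^2

# Step 4: average the squared differences to get the variance
variance_value <- mean(squared_deviations)   # 1.04

# Step 5: the square root of the variance
sd_value <- sqrt(variance_value)
print(round(sd_value, 2))
```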
EX: 3
A class of students took a math test. Their teacher wants to know whether most students are
performing at the same level, or if there is a high standard deviation.
1. The scores for the test were 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, and 89. When
the teacher adds them together, she gets 1279. She divides by the number of scores (15) to get the
mean score.
1279 ÷ 15 ≈ 85.2 (mean)
2. 85.2 is a high score, but is everyone performing at that level? To find out, the teacher subtracts
the mean from every test score.
85 - 85.2 = -0.2
86 - 85.2 = 0.8
100 - 85.2 = 14.8
76 - 85.2 = -9.2
81 - 85.2 = -4.2
93 - 85.2 = 7.8
84 - 85.2 = -1.2
99 - 85.2 = 13.8
71 - 85.2 = -14.2
69 - 85.2 = -16.2
93 - 85.2 = 7.8
85 - 85.2 = -0.2
81 - 85.2 = -4.2
87 - 85.2 = 1.8
89 - 85.2 = 3.8
3. She squares each difference:
-0.2 x -0.2 = 0.04
0.8 x 0.8 = 0.64
14.8 x 14.8 = 219.04
-9.2 x -9.2 = 84.64
-4.2 x -4.2 = 17.64
7.8 x 7.8 = 60.84
-1.2 x -1.2 = 1.44
13.8 x 13.8 = 190.44
-14.2 x -14.2 = 201.64
-16.2 x -16.2 = 262.44
7.8 x 7.8 = 60.84
-0.2 x -0.2 = 0.04
-4.2 x -4.2 = 17.64
1.8 x 1.8 = 3.24
3.8 x 3.8 = 14.44
4. The teacher finds the variance, which is the average of the squares:
0.04 + 0.64 + 219.04 + 84.64 + 17.64 + 60.84 +1.44 +190.44 +201.64 +262.44 + 60.84 +
0.04 + 17.64 + 3.24 + 14.44 = 1135
1135 ÷ 15 ≈ 75.7 (variance)
5. Last, the teacher finds the square root of the variance:
Square root of 75.7 ≈ 8.7 (standard deviation)
The standard deviation of these tests is 8.7 points out of 100. Since the standard deviation is
somewhat low, the teacher knows that most students are performing around the same level.
EX:4
A market researcher is analyzing the results of a recent customer survey that ranks a product from
1 to 10. He wants to have some measure of the reliability of the answers received in the survey in
order to predict how a larger group of people might answer the same questions.
Because this is a sample, the researcher needs to subtract 1 from the total number of values in
step 4.
1. The scores for the survey are 9, 7, 10, 8, 9, 7, 8, and 9. The mean is 8.4.
2. The researcher subtracts the mean from every score.
(differences: 0.6, -1.4, 1.6, -0.4, 0.6, -1.4, -0.4, 0.6).
3. He squares each number (0.36, 1.96, 2.56, 0.16, 0.36, 1.96, 0.16, 0.36).
4. Because this is a sample of responses, the researcher subtracts one from the number of
values (8 values -1 = 7) to average squares and find the variance: 1.12 (variance)
5. Last, the researcher finds the square root of the variance: 1.06 (standard deviation)
The standard deviation is 1.06, which is somewhat low. The researcher now knows that the results
of the sample are probably reliable.
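The sample calculation for the survey scores can be checked in R. The sketch computes the variance with the n − 1 divisor by hand and compares it with the built-in `var()` and `sd()`, which use the same divisor.

```r
survey <- c(9, 7, 10, 8, 9, 7, 8, 9)
n <- length(survey)

# Mean of the scores (8.375, shown as 8.4 when rounded to one decimal)
survey_mean <- mean(survey)

# Sample variance: sum of squared deviations divided by (n - 1)
sample_var <- sum((survey - survey_mean)^2) / (n - 1)   # 1.125

# Sample standard deviation
sample_sd <- sqrt(sample_var)
print(round(sample_sd, 2))

# The built-in functions give the same results
print(all.equal(sample_var, var(survey)))
print(all.equal(sample_sd, sd(survey)))
```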
Sample code for standard deviation using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the sample standard deviation (divides by n - 1) using the 'sd' function
> sd_value <- sd(data)
# Displaying the standard deviation
> print(sd_value)
[1] 7.905694
GRAPHIC DISPLAYS OF BASIC STATISTICAL DESCRIPTIONS OF DATA
• Graphic displays are a powerful tool for visualizing and summarizing basic statistical
descriptions of data.
• Graphic displays of basic statistical descriptions of data are essential for visualizing the
distribution, central tendency, and dispersion of the data. Here are some common graphical
representations.
• In today’s world of the internet and connectivity, a great deal of data is available, and some
method is needed for examining large data sets and the patterns and trends in them.
• Statistics is the branch of mathematics dedicated to collecting, analyzing,
interpreting, and presenting numerical data, including in visual form, so that the data
becomes easy to understand and easy to compare.
✓ There are two ways of representing data: tables and pictorial representation
through graphs.
✓ As the saying goes, “A picture is worth a thousand words”. It is often better to represent data in
graphical format.
✓ Study the graphic displays of basic statistical descriptions.
✓ Some of the most commonly used graphic displays for basic statistical descriptions of data
include:
Histograms:
• A histogram shows the frequency of values within a set of intervals or "bins". It is used
to display the distribution of continuous data.
• A histogram is a graphical representation of the frequency distribution of continuous
series using rectangles.
• The x-axis of the graph represents the class interval, and the y-axis shows the various
frequencies corresponding to different class intervals.
• A histogram is a two-dimensional diagram in which the width of the rectangles shows
the width of the class intervals, and the length of the rectangles depicts the corresponding
frequency.
• There are no gaps between two consecutive rectangles based on the fact that histograms
can be drawn when data are in the form of the frequency distribution of continuous series.
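A histogram with equal class intervals can be drawn in R with the base function `hist()`. As a sketch, the marks below are rebuilt from the frequency table in the range example of this unit (0-10: 5, 10-20: 8, 20-30: 15, 30-40: 9) by placing every student at the midpoint of the class interval. The midpoints are an assumption, since the individual marks are not given.

```r
# Class midpoints repeated by their frequencies (assumed individual marks)
midpoints <- c(5, 15, 25, 35)
frequencies <- c(5, 8, 15, 9)
marks <- rep(midpoints, times = frequencies)

# Class boundaries 0, 10, 20, 30, 40 give four bins of width 10
class_breaks <- seq(0, 40, by = 10)

# x-axis: class intervals, y-axis: frequency; bars touch with no gaps
h <- hist(marks, breaks = class_breaks,
          main = "Marks scored by students",
          xlab = "Marks (class intervals)",
          ylab = "Number of students")

# The bar heights are the frequencies of the class intervals
print(h$counts)
```

Passing `plot = FALSE` to `hist()` returns the same counts without drawing anything, which is handy for checking the binning.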
Example: The following table gives the lifetime of 400 neon lamps. Draw the histogram for the
below data.
Example: Present the following information in the form of a Histogram:
Solution:
• It is visible that the set of data given is of the equal class interval; i.e., the difference
between the upper limit and the lower limit of each class interval is 10. So, drawing a
Histogram is feasible.
• The X-axis represents the marks (class intervals), and Y-axis represents the
number of students (frequency distribution).
STEM-AND-LEAF PLOTS:
A stem-and-leaf plot shows the distribution of values in a dataset by dividing each value into a "stem" and a "leaf".
It is used to display the distribution of continuous data in a compact format.
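Base R can print a stem-and-leaf plot as text with the `stem()` function. The sketch below reuses the 15 test scores from the standard deviation example in this unit. The `scale` argument is optional and controls how many stem lines are printed.

```r
# Test scores from the standard deviation example
scores <- c(85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89)

# Leading digits form the stems and trailing digits form the leaves
stem(scores)

# A larger scale value spreads the same data over more stem lines
stem(scores, scale = 2)
```

`stem()` prints to the console and returns nothing, so it suits a quick look at small datasets rather than a finished graphic.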