IDS Unit-2

The document discusses the types of data and their attributes, explaining how data objects are characterized by various attributes that capture their properties. It differentiates between attributes and measurements, detailing the significance of understanding these distinctions for data analysis and processing. Additionally, it categorizes attributes into qualitative and quantitative types, providing examples and explaining their relevance in data science.

Uploaded by

upender
UNIT-2

TYPES OF DATA
A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are
described by a number of attributes that capture the basic characteristics of an object, such as the
mass of a physical object or the time at which an event occurred. Other names for an attribute are
variable, characteristic, field, feature, or dimension.
Example (Student Information). Often, a data set is a file, in which the objects are records (or
rows) in the file and each field (or column) corresponds to an attribute. For example, the table
below shows a data set that consists of student information. Each row corresponds to a student and
each column is an attribute that describes some aspect of a student, such as grade point average
(GPA) or identification number (ID).

Student ID    Name     Grade Point Average (GPA)    ...
...           ...      ...                          ...
103255        Jack     7.8                          ...
103256        Harry    8.5                          ...
103257        James    8.2                          ...
...           ...      ...                          ...
Table: A sample data set containing student information.
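As a minimal sketch of this record-based view, the three visible rows of the student table can be held in an R data frame. The object name students and the column names StudentID, Name, and GPA are illustrative choices, not names taken from the notes.

```r
# Each row is a data object (a student); each column is an attribute.
students <- data.frame(
  StudentID = c(103255, 103256, 103257),
  Name      = c("Jack", "Harry", "James"),
  GPA       = c(7.8, 8.5, 8.2)
)

n_objects       <- nrow(students)   # number of data objects (rows)
n_attributes    <- ncol(students)   # number of attributes (columns)
attribute_names <- names(students)  # the attribute (column) names

print(students)
```

Here nrow() counts the data objects and ncol() counts the attributes, which mirrors the row/column reading of the table.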
Although record-based data sets are common, either in flat files or relational database systems,
there are other important types of data sets and systems for storing data.
Data: Data refers to how the data objects and their attributes are stored.
• A data object represents an entity. In a sales database, the objects may be customers,
store items, and sales; in a medical database, the objects may be patients; in a university
database, the objects may be students, professors, and courses. Data objects are typically
described by attributes.
• An attribute is a property or characteristic of an object, for example a person's hair
colour or the humidity of the air.
• A set of attributes defines an object. The object is also referred to as a record, an
instance, or an entity.
• For example, a sales data object may represent customers, sales, or purchases. When data
objects are stored in a database, they are called data tuples.
ATTRIBUTES AND MEASUREMENT
In data science, attributes and measurements are terms used to describe the variables and their
corresponding values in a dataset.
Attributes:
• In a dataset, an attribute is a variable that describes a characteristic of the data. For
example, in a customer dataset, the attributes could include age, gender, income, and
location.
• An attribute is a property or characteristic of an object that may vary, either from one
object to another or from one time to another. For example, eye color varies from person
to person, while the temperature of an object varies over time. Note that eye color is a
symbolic attribute with a small number of possible values {brown, black, blue, green,
hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number
of values. At the most basic level, attributes are not about numbers or symbols.
Measurements:
• In R, measurement typically refers to the process of assigning numerical values to objects
or observations in a dataset. The measurements could represent various attributes or
characteristics of the data, and they play a crucial role in statistical analysis, modeling,
and data visualization.
• Measurements refer to the values or observations that are recorded for each attribute. For
example, in a customer dataset, the measurement for the "age" attribute could be 35, and
the measurement for the "income" attribute could be $50,000.
• A measurement scale is a rule (function) that associates a numerical or symbolic value
with an attribute of an object.
• Measurements in R can be quantitative (numeric) or qualitative (categorical). Quantitative
measurements involve numerical values, while qualitative measurements involve
categories or labels.
# Numeric measurements
> heights <- c(65, 72, 68, 60, 71)
# Categorical measurements
> grades <- factor(c("A", "B", "C", "A", "B"))
# Summary statistics for numeric measurements
> summary(heights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   60.0    65.0    68.0    67.2    71.0    72.0
# Frequency table for categorical measurements
> table(grades)
grades
A B C
2 2 1
It's important to differentiate between attributes and measurements, as this distinction affects
how the data is processed and analyzed. The type of attribute and measurement determines
what types of analysis and visualization can be performed and also affects the interpretation
of the results.

ATTRIBUTE
• In data science, an attribute refers to a characteristic or feature that describes a data point
or an object.
• It can be seen as a data field that represents a characteristic or feature of a data
object. For a customer object, attributes can be customer ID, address, etc. A set of
attributes used to describe a given object is known as an attribute vector or feature
vector.
• The nouns attribute, dimension, feature, and variable are often used interchangeably.
• The term dimension is commonly used in data warehousing. Machine learning literature
tends to use the term feature, while statisticians prefer the term variable. Data mining and
database professionals commonly use the term attribute.
• Attributes describing a customer object can include, for example, customer ID, name, and
address. Observed values for a given attribute are known as observations.
• In R, attributes are additional metadata associated with objects that provide extra
information about the object. These attributes can include information such as names,
dimensions, class, comments, and more. Attributes enhance the functionality and
interpretability of objects in R.
• Names Attribute: The names attribute assigns names to the elements of vectors, matrices,
or arrays.
# Creating a named vector
> my_vector <- c(apple = 3, banana = 2, orange = 5)
# Checking the names attribute
> names(my_vector)
[1] "apple" "banana" "orange"
• Dim Attribute: The dim attribute specifies the dimensions of matrices and arrays.
# Creating a matrix with dimensions
> my_matrix <- matrix(1:6, nrow = 2, ncol = 3)
# Checking the dim attribute
> dim(my_matrix)
[1] 2 3
• Factor Levels Attribute: The levels attribute of factors defines the unique values or
categories.
# Creating a factor with levels
> my_factor <- factor(c("low", "medium", "high"), levels = c("low", "medium", "high"))
# Checking the levels attribute
> levels(my_factor)
[1] "low" "medium" "high"
• Class Attribute: The class attribute indicates the type of R object.
# Checking the class of an object
> class(my_factor)
[1] "factor"
• Column Names Attribute (Data Frame): Data frames have column names as an
attribute.
# Creating a data frame
> my_data_frame <- data.frame(Name = c("John", "Jane", "Bob"), Age = c(25, 30, 22))
# Checking the column names attribute
> colnames(my_data_frame)
[1] "Name" "Age"
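The individual accessors above (names, dim, levels, class, colnames) each read one attribute. As a small sketch, base R's attributes() lists everything attached to an object, and attr() reads or sets a single attribute by name. The attribute name "units" and the values below are hypothetical, chosen only for illustration.

```r
# A matrix is a vector that carries a dim attribute.
m <- matrix(1:6, nrow = 2, ncol = 3)
m_attrs <- attributes(m)            # a list holding all attributes of m

# Attach a custom attribute; "units" is a hypothetical name.
heights <- c(165, 172, 158)
attr(heights, "units") <- "cm"
heights_units <- attr(heights, "units")

print(m_attrs)
print(heights_units)
```

Custom attributes like this are plain metadata: R stores them but does not interpret them.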

THE TYPE OF AN ATTRIBUTE


The properties of an attribute need not be the same as the properties of the values used to
measure it. In other words, the values used to represent an attribute may have properties that are
not properties of the attribute itself, and vice versa. This is illustrated with two examples.
• Example (Employee Age and ID Number). Two attributes that might be associated with
an employee are ID and age (in years). Both of these attributes can be represented as
integers. However, while it is reasonable to talk about the average age of an employee, it
makes no sense to talk about the average employee ID.
• Consider the figure below, which shows some objects (line segments) and how the length
attribute of these objects can be mapped to numbers in two different ways. Each successive
line segment, going from the top to the bottom, is formed by appending the topmost line
segment to itself. Thus, the second line segment from the top is formed by appending the
topmost line segment to itself twice, the third line segment from the top is formed by
appending the topmost line segment to itself three times, and so forth. In a very real
(physical) sense, all the line segments are multiples of the first. This fact is captured by the
measurements on the right-hand side of the figure, but not by those on the left-hand side.
• More specifically, the measurement scale on the left-hand side captures only the ordering
of the length attribute, while the scale on the right-hand side captures both the ordering and
additivity properties. Thus, an attribute can be measured in a way that does not capture all
the properties of the attribute. The type of an attribute should tell us what properties of the
attribute are reflected in the values used to measure it. Knowing the type of an attribute is
important because it tells us which properties of the measured values are consistent with
the underlying properties of the attribute, and therefore, it allows us to avoid foolish
actions, such as computing the average employee ID.
Note that it is common to refer to the type of an attribute as the type of a measurement scale.
Figure: The measurement of the length of line segments on two different scales of measurement.
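The employee example can be sketched in R. The data frame employees and its values are hypothetical. The point is that R will happily average any numeric column, so it is up to the analyst to know which averages mean something.

```r
# Hypothetical employee records: both attributes are stored as numbers.
employees <- data.frame(
  id  = c(1001, 1002, 1003, 1004),
  age = c(25, 32, 41, 38)
)

mean_age <- mean(employees$age)  # meaningful: age is a quantity
mean_id  <- mean(employees$id)   # computable, but the result means nothing

# Storing the ID as a factor records that it is only a label.
employees$id <- factor(employees$id)
n_distinct_ids <- nlevels(employees$id)
```

Once the ID is a factor, mean(employees$id) no longer returns a number (R warns and returns NA), which is one way to guard against the mistake.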
The type of an attribute refers to the data type that is used to represent the attribute in a dataset.
Some common types of attributes include:
1. Numeric: Numeric data type is used to represent numbers, such as integers and floating-point
numbers.
2. Character: Character data type is used to represent text, such as words and sentences.
3. Logical: Logical data type is used to represent binary values, such as TRUE and FALSE.
4. Factor: Factor data type is used to represent categorical variables, such as gender (male,
female), and is often used in statistical analysis.
5. Date/time: Date/time data type is used to represent time and date information, such as the date
a customer made a purchase.
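A short sketch of these five data types in R follows. The list name values and the example entries (including the purchase date) are made up for illustration; class() reports how R classifies each one.

```r
values <- list(
  numeric   = c(3.5, 10),
  character = c("data", "science"),
  logical   = c(TRUE, FALSE),
  factor    = factor(c("male", "female")),
  date      = as.Date("2024-01-15")   # hypothetical purchase date
)

# First class label of each element.
type_names <- sapply(values, function(v) class(v)[1])
print(type_names)
```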

THE DIFFERENT TYPES OF ATTRIBUTES


In R, attributes are metadata associated with objects that provide additional information about the
structure or properties of the object. The different types of attributes in R depend on the type of
object they are associated with.
Identifying attribute types is the first step of data preprocessing: we differentiate between the
different types of attributes and then preprocess the data. A useful (and simple) way to specify
the type of an attribute is to identify the properties of numbers that correspond to underlying
properties of the attribute. For example, an attribute such as length has many of the properties of
numbers. It makes sense to compare and order objects by length, as well as to talk about the
differences and ratios of length.
The following properties (operations) of numbers are typically used to describe attributes.
• Distinctness = and ≠
• Order <, <=, >, and >=
• Addition + and -
• Multiplication * and /
Given these properties, attributes can be broadly classified into two main types:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative (Numeric, Discrete, Continuous)

On the measurement scale, the attribute types are ordered as nominal, ordinal, interval, and ratio.
Each attribute type possesses all of the properties and operations of the attribute types above it
(earlier in this order). In other words, the definition of the attribute types is cumulative.
However, this does not mean that the operations appropriate for one attribute type are appropriate
for the attribute types above it.
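The four kinds of operations can be illustrated with a small R sketch. All variable names and values here are hypothetical; factors stand in for nominal attributes and ordered factors for ordinal ones.

```r
# Nominal: only distinctness (==, !=) is meaningful.
hair <- factor(c("black", "brown", "black"))
same_hair <- hair[1] == hair[3]

# Ordinal: order comparisons (<, >) are meaningful as well.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
smaller <- size[1] < size[2]

# Interval: differences are meaningful (temperatures in Celsius).
temp_diff <- 30 - 20

# Ratio: ratios are meaningful too (lengths in metres).
length_ratio <- 6 / 3
```

R enforces part of this: order comparisons on an unordered factor produce a warning and NA, and arithmetic on factors is not meaningful.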
Qualitative Attributes:
1. Nominal Attributes: Nominal means "relating to names". Nominal attributes hold
categorical data whose values represent different categories or labels without any inherent
order or ranking. These attributes are often used to represent names or labels associated
with objects, entities, or concepts. The values of a nominal attribute are just different
names; i.e., nominal values provide only enough information to distinguish one object from
another (=, ≠). Operations appropriate for nominal attributes are the mode, entropy,
contingency correlation, and the χ² test.
Example for Nominal attributes: Suppose that hair color and marital status are two
attributes describing person objects. In our application, possible values for hair color are
black, brown, blond, red, auburn, gray, and white. The attribute marital status can take on
the values single, married, divorced, and widowed. Both hair color and marital status are
nominal attributes. Another example of a nominal attribute is occupation, with the values
teacher, dentist, programmer, farmer, and so on.
# Creating a nominal attribute (factor)
> nominal_attribute <- factor(c("Red", "Green", "Blue", "Red", "Blue"))
# Display the nominal attribute
> print(nominal_attribute)
[1] Red Green Blue Red Blue
Levels: Blue Green Red
# Checking the levels of the nominal attribute
> levels(nominal_attribute)
[1] "Blue" "Green" "Red"
# Summary statistics for the nominal attribute
> summary(nominal_attribute)
Blue Green Red
2 1 2
In this example, nominal_attribute is a factor representing a nominal attribute with three
categories: "Red," "Green," and "Blue." Factors in R are often used to represent nominal
attributes because they can store categorical data with distinct levels.
# Creating a nominal attribute using a character vector
> nominal_attribute_char <- c("Small", "Medium", "Large", "Medium", "Small")
# Display the nominal attribute
> print(nominal_attribute_char)
[1] "Small" "Medium" "Large" "Medium" "Small"
# Converting the character vector to a factor
> nominal_attribute_factor <-factor(nominal_attribute_char)
# Display the nominal attribute
> print(nominal_attribute_factor)
[1] Small Medium Large Medium Small
Levels: Large Medium Small
# Checking the levels of the nominal attribute
> levels(nominal_attribute_factor)
[1] "Large" "Medium" "Small"
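Two of the nominal operations named earlier, the mode and entropy, can be computed from a frequency table. This is a sketch on made-up data; base R has no built-in function for the statistical mode (its mode() function reports the storage mode of an object), so the most frequent category is taken from table().

```r
colors <- factor(c("Red", "Green", "Blue", "Red", "Red"))  # hypothetical data

counts <- table(colors)
mode_value <- names(counts)[which.max(counts)]   # most frequent category

p <- as.numeric(counts) / sum(counts)            # category proportions
entropy <- -sum(p * log2(p))                     # in bits; 0 if only one category occurs
```

Note that which.max() returns the first maximum, so when several categories tie, only one of them is reported.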

2. Ordinal Attributes: Ordinal attributes are a type of qualitative attribute where the
values possess a meaningful order or ranking, but the magnitude between values is not
precisely quantified. In other words, while the order of values indicates their relative
importance or precedence, the numerical difference between them is not standardized or
known. The values of an ordinal attribute provide enough information to order objects
(<, >). Operations appropriate for ordinal attributes are the median, percentiles, rank
correlation, run tests, and sign tests.
Unlike nominal attributes, ordinal attributes have an inherent order, but the intervals
between categories are not necessarily equal or known. Ordinal attributes are often used
when the categories have a natural order, but the differences between them are not
precisely defined.
Example 1 for Ordinal attributes. Suppose that drink size corresponds to the size of drinks
available at a restaurant. This ordinal attribute has three possible values: small, medium,
and large. The values have a meaningful sequence (which corresponds to increasing drink
size). However, we cannot tell from the values how much bigger, say, a large is than a
medium. Other examples of ordinal attributes include grade and professional rank.
Professional ranks can be enumerated in a sequential order: for example, assistant,
associate, and full for professors, and private, private first class, specialist, corporal, and
sergeant for army ranks.
Ordinal attributes are useful for registering subjective assessments of qualities that cannot
be measured objectively, thus ordinal attributes are often used in surveys for ratings. In one
survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat
dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied.
Further examples, in R:

# Creating an ordinal attribute (ordered factor)
> ordinal_attribute <- ordered(c("Low", "Medium", "High", "Low", "High"),
    levels = c("Low", "Medium", "High"))
# Display the ordinal attribute
> print(ordinal_attribute)
[1] Low Medium High Low High
Levels: Low < Medium < High
# Checking the levels of the ordinal attribute
> levels(ordinal_attribute)
[1] "Low" "Medium" "High"
# Summary statistics for the ordinal attribute
> summary(ordinal_attribute)
Low Medium High
2 1 2
In this example, ordinal_attribute is an ordered factor representing an ordinal attribute with three
categories: "Low," "Medium," and "High." The levels of the ordered factor are specified to
indicate the order of the categories.
# Creating an ordinal attribute using a numeric vector
> ordinal_attribute_numeric <- c(2, 3, 1, 3, 2)
# Converting the numeric vector to an ordered factor
> ordinal_attribute_ordered <- ordered(ordinal_attribute_numeric, levels = c(1, 2, 3),
labels = c("Low", "Medium", "High"))
# Display the ordinal attribute
> print(ordinal_attribute_ordered)
[1] Medium High Low High Medium
Levels: Low < Medium < High
# Checking the levels of the ordinal attribute
> levels(ordinal_attribute_ordered)
[1] "Low" "Medium" "High"
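The median is one of the operations that becomes meaningful for ordinal data. A minimal sketch, assuming an odd number of observations so that the median falls on a single level: the integer codes of the ordered factor give each value's position in the level order, and the median position is mapped back to its label. The name ratings is illustrative.

```r
ratings <- ordered(c("Low", "Medium", "High", "Low", "High"),
                   levels = c("Low", "Medium", "High"))

codes <- as.integer(ratings)                     # positions in the level order: 1, 2, 3
median_level <- levels(ratings)[median(codes)]   # odd length, so a whole position
highest <- as.character(max(ratings))            # max() is defined for ordered factors
```

With an even number of observations the median position may fall between two levels, in which case no single label represents it.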
3. Binary Attributes: Binary attributes are a type of qualitative attribute where the data
can take on only two distinct values or states. These attributes are often used to represent
0/1, yes/no, presence/absence, or true/false conditions within a dataset. They are
particularly useful for representing categorical data where there are only two possible
outcomes. For instance, in a medical study, a binary attribute could represent whether a
patient is affected or unaffected by a particular condition.
Example:
# Creating a binary attribute
> binary_attribute <- c(0, 1, 1, 0, 1, 0, 1, 0, 1, 1)
# Display the binary attribute
> print(binary_attribute)
[1] 0 1 1 0 1 0 1 0 1 1
# Checking the class of the attribute
> class(binary_attribute)
[1] "numeric"
# Summarizing the binary attribute
> summary(binary_attribute)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 1.0 0.6 1.0 1.0
In this example, binary_attribute is a vector representing a binary attribute. You can see that the
values are 0 or 1. The class() function confirms that it is a numeric vector, but it is often used as
a binary indicator.
Binary attributes are frequently used in various applications, such as representing the outcome
of a binary event (e.g., success or failure, true or false) or encoding categorical variables with
two levels.
R's logical vectors (TRUE/FALSE) are also commonly used to represent binary attributes,
where TRUE might represent the presence or positive state, and FALSE represents the absence
or negative state.
# Creating a binary attribute using logical values
> logical_binary_attribute <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
# Display the logical binary attribute
> print(logical_binary_attribute)
[1] TRUE FALSE TRUE TRUE FALSE
# Checking the class of the attribute
> class(logical_binary_attribute)
[1] "logical"
# Summarizing the logical binary attribute
> summary(logical_binary_attribute)
Mode FALSE TRUE
logical 2 3
• Symmetric: In a symmetric attribute, both values or states are considered equally
important or interchangeable. For example, in the attribute “Gender” with values “Male”
and “Female,” neither value holds precedence over the other, and they are considered
equally significant for analysis purposes.

• Asymmetric: An asymmetric attribute indicates that the two values or states are not
equally important or interchangeable. For asymmetric attributes, only presence (a non-zero
attribute value) is regarded as important. For instance, in the attribute “Result” with values
“Pass” and “Fail,” the states are not of equal importance; passing may hold greater
significance than failing in certain contexts, such as academic grading or certification
exams. Consider a data set where each object is a student and each attribute records whether
or not a student took a particular course at a university. For a specific student, an attribute
has a value of 1 if the student took the course associated with that attribute and a value of
0 otherwise. Because students take only a small fraction of all available courses, most of
the values in such a data set would be 0. Therefore, it is more meaningful and more efficient
to focus on the non-zero values.

To illustrate, if students are compared on the basis of the courses they don't take, then most
students would seem very similar, at least if the number of courses is large. Binary
attributes where only non-zero values are important are called asymmetric binary
attributes. This type of attribute is particularly important for association analysis. It is also
possible to have discrete or continuous asymmetric features. For instance, if the number of
credits associated with each course is recorded, then the resulting data set will consist of
asymmetric discrete or continuous attributes.
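The course example can be made concrete with two similarity measures that the notes do not name and that are brought in here purely for illustration: the simple matching coefficient, which counts shared 0s as agreement, and the Jaccard coefficient, which ignores them. The enrolment vectors are hypothetical.

```r
# Hypothetical course-enrolment vectors for two students over 10 courses
# (1 = took the course, 0 = did not).
student_a <- c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0)
student_b <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

both_one   <- sum(student_a == 1 & student_b == 1)  # courses both took
either_one <- sum(student_a == 1 | student_b == 1)  # courses at least one took

# Simple matching: counts 0-0 agreements as evidence of similarity.
smc <- mean(student_a == student_b)

# Jaccard: ignores 0-0 agreements, so only courses actually taken matter.
jaccard <- both_one / either_one
```

The two students share one course out of the three that either took, yet simple matching rates them as highly similar because of the many courses neither took. This is the effect described above.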
Quantitative Attributes:
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented
in integer or real values. Numeric attributes in R refer to variables that represent quantitative
data with meaningful numerical values. These values can be either discrete or continuous. Numeric
attributes are used to store information that can be measured or counted and are amenable to
arithmetic operations.
# Creating a numeric attribute (numeric vector)
> numeric_attribute <- c(25, 30, 22, 18, 35)
# Display the numeric attribute
> print(numeric_attribute)
[1] 25 30 22 18 35
# Checking the class of the attribute
> class(numeric_attribute)
[1] "numeric"
# Summary statistics for the numeric attribute
> summary(numeric_attribute)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18 22 25 26 30 35
In this example, numeric_attribute is a numeric vector representing a numeric attribute with
values such as ages. The class() function confirms that it is a numeric vector, and the summary()
function provides summary statistics like mean, median, minimum, and maximum.
# Creating a data frame with a numeric attribute
> my_data_frame <- data.frame(Name = c("John", "Jane", "Bob"), Age =c(25, 30, 22))
# Display the data frame
> print(my_data_frame)
Name Age
1 John 25
2 Jane 30
3 Bob 22
# Checking the class of the character attribute in the data frame
> class(my_data_frame$Name)
[1] "character"
# Checking the class of the numeric attribute in the data frame
> class(my_data_frame$Age)
[1] "numeric"
# Summary statistics for the numeric attribute in the data frame
> summary(my_data_frame$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
22.00 23.50 25.00 25.67 27.50 30.00
Numeric attributes are of two types: interval-scaled and ratio-scaled.
• An interval-scaled attribute has values whose differences are interpretable, but it has no
true zero point (no natural reference point). An interval scale is one where there is order and
the difference between two values is meaningful. Interval-scaled attributes are measured on a
scale of equal-size units. The values of interval-scaled attributes have order and can be
positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values. For interval attributes, the
differences between values are meaningful, i.e., a unit of measurement exists (+, -). Values
on an interval scale can be added and subtracted but cannot meaningfully be multiplied or
divided. Operations appropriate for interval-scaled attributes are the mean, standard
deviation, Pearson's correlation, and t and F tests. Examples are calendar dates and
temperature in Celsius or Fahrenheit. If the temperature reading on one day is twice the
reading on another day, we still cannot say that the first day is twice as hot as the other.
• A ratio-scaled attribute is a numeric attribute with a fixed (true) zero point. Ratio-scaled
attributes allow you to categorize and rank your data along equal intervals. If a measurement
is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The
values are ordered, and we can also compute the difference between values, and the mean,
median, mode, interquartile range, and five-number summary can be given. Operations
appropriate for ratio-scaled attributes also include the geometric mean, harmonic mean, and
percent variation. For ratio variables, both differences and ratios are meaningful (*, /).
Consider an example of temperature in Kelvin. Other examples of ratio-scaled attributes
include count attributes such as years of experience (e.g., the objects are employees) and
number of words (e.g., the objects are documents). Additional examples include attributes to
measure age, weight, height, counts, mass, length, electrical current, latitude and longitude
coordinates (e.g., when clustering houses), and monetary quantities.
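The difference between the two scales can be checked numerically. In this sketch the two temperature readings are hypothetical; converting from Celsius (interval) to Kelvin (ratio) leaves the difference unchanged but changes the ratio, which shows that the ratio computed in Celsius was an artefact of where 0 °C happens to sit.

```r
celsius <- c(day1 = 10, day2 = 20)   # hypothetical readings
kelvin  <- celsius + 273.15

diff_c <- celsius[["day2"]] - celsius[["day1"]]
diff_k <- kelvin[["day2"]]  - kelvin[["day1"]]    # same difference as in Celsius

ratio_c <- celsius[["day2"]] / celsius[["day1"]]  # 2, but not physically meaningful
ratio_k <- kelvin[["day2"]]  / kelvin[["day1"]]   # about 1.035
```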
2. Discrete: A discrete attribute has a finite or countably infinite set of values, which may or may
not be represented as integers. Discrete data refer to information that can take on specific, separate
values rather than a continuous range. These values are often distinct and separate from one
another, and they can be either numerical or categorical in nature. Discrete attributes are often
represented using integer variables. Binary attributes are a special case of discrete attributes and
assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often
represented as Boolean variables, or as integer variables that only take the values 0 or 1.
Classification algorithms developed from the field of machine learning often talk of attributes as
being either discrete or continuous. Each type may be processed differently.
The attributes hair color, smoker, medical test, and drink size each have a finite number of values,
and so are discrete. Note that discrete attributes may have numeric values, such as 0 and 1 for
binary attributes, or the values 0 to 110 for the attribute age. An attribute is countably infinite
if the set of possible values is infinite but the values can be put in a one-to-one correspondence
with the natural numbers. For example, the attribute customer ID is countably infinite: the number
of customers can grow without bound, but the set of values remains countable. Zip codes are
another example of a discrete attribute.
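A brief sketch of discrete attributes in R, using made-up values: counts are naturally stored as integers (the L suffix), and categorical attributes as factors. In both cases the set of distinct values is small and can be listed.

```r
courses_taken <- c(4L, 6L, 5L, 4L, 7L)                 # counts, stored as integers
drink_size    <- factor(c("small", "large", "small"))  # categorical, also discrete

count_type        <- typeof(courses_taken)             # "integer"
n_distinct_counts <- length(unique(courses_taken))     # number of distinct counts
n_sizes           <- nlevels(drink_size)               # number of distinct categories
```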


3. Continuous: If an attribute is not discrete, it is continuous. Continuous data, unlike discrete
data, can take on an infinite number of possible values within a given range. A continuous
attribute is characterized by being able to assume any value within a specified interval, often
including fractional or decimal values. Continuous attributes are typically represented as
floating-point variables. Not every combination of measurement scale and number of values makes
sense; for instance, it is difficult to think of a realistic data set that contains a continuous
binary attribute.
The terms numeric attribute and continuous attribute are often used interchangeably in the
literature. (This can be confusing because, in the classic sense, continuous values are real numbers,
whereas numeric values can be either integers or real numbers.) In practice, real values are
represented using a finite number of digits.
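A small sketch of what floating-point representation means in practice, with hypothetical height values. Because real values are stored with finite precision, exact equality tests on computed doubles can fail, and all.equal() (which compares within a tolerance) is the usual alternative in R.

```r
heights_m   <- c(1.62, 1.755, 1.8)                  # hypothetical heights in metres
height_type <- typeof(heights_m)                    # "double"

exact_equal  <- (0.1 + 0.2) == 0.3                  # FALSE: finite precision
approx_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within tolerance
```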

BASIC STATISTICAL DESCRIPTIONS OF DATA


For data pre-processing to be successful, it is essential to have an overall picture of your data.
Basic statistical descriptions can be used to identify properties of the data and highlight which data
values should be treated as noise or outliers.
We start with measures of central tendency, which measure the location of the middle or center of
a data distribution. In particular, we discuss the mean, median, mode, and midrange. In addition to
assessing the central tendency of our data set, we also would like to have an idea of the dispersion
of the data. That is, how are the data spread out? The most common data dispersion measures are
the range, quartiles, and interquartile range, the five-number summary and boxplots, and the
variance and standard deviation of the data.
Finally, we can use many graphic displays of basic statistical descriptions to visually inspect our
data. Most statistical or graphical data presentation software packages include bar charts, pie
charts, and line graphs.
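The dispersion measures listed above are all available in base R. The sketch below applies them to a small hypothetical sample; quantile() and IQR() use R's default quantile method, which can differ slightly from textbook hand calculations on other data.

```r
x <- c(2, 4, 6, 8, 10)   # hypothetical sample

x_range <- range(x)                                # minimum and maximum
x_quart <- quantile(x, probs = c(0.25, 0.5, 0.75)) # quartiles
x_iqr   <- IQR(x)                                  # interquartile range (Q3 - Q1)
x_five  <- fivenum(x)                              # five-number summary
x_var   <- var(x)                                  # sample variance (divides by n - 1)
x_sd    <- sd(x)                                   # sample standard deviation

# boxplot(x) would draw the five-number summary as a boxplot.
```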
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to
identify patterns and trends. Alternatively, it is referred to as quantitative analysis.
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify
the main characteristics of that data. Graphs or numbers summarize the data. Average,
Mode, SD (Standard Deviation), and Correlation are some of the commonly used
descriptive statistical methods.
• Inferential Statistics: Inferential statistics is the process of drawing conclusions from
data using probability theory and generalizing them beyond the sample. By analyzing sample
statistics, you can infer parameters of populations and build models of relationships within data.

MEASURE OF CENTRAL TENDENCY


• A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data.
• As such, measures of central tendency are sometimes called measures of central location.
• They are also classed as summary statistics. The mean (often called the average) is most likely
the measure of central tendency that you are most familiar with, but there are others, such as
the median and the mode.
• Central tendency is a statistical measure that represents an entire distribution or
dataset using a single value, called a measure of central tendency.
The mean, median, and mode are all valid measures of central tendency.

In statistics, the mean, median, and mode are the three most common measures of central
tendency.
• Each one calculates the central point using a different method. Choosing the best measure
of central tendency depends on the type of data you have.
• Measures of central tendency are summary statistics that represent the center point or typical
value of a dataset.
• Examples of these measures include the mean, median, and mode. These statistics indicate
where most values in a distribution fall and are also referred to as the central location of a
distribution.
Measures of Central Tendency Example
Example. The monthly salary of an employee for 5 months is given in the table below.

Month Salary
January $105
February $95
March $105
April $105
May $100

Suppose we want to express the salary of the employee using a single value rather than 5 different
values for 5 months. A value that represents the salaries for the 5 months in this way is a
measure of central tendency. The three possible ways to find the measure of central tendency for
the above data are:
• Mean: The mean of the given salaries can be used as one of the measures of central
tendency, i.e., x̄ = (105 + 95 + 105 + 105 + 100)/5 = $102.
• Mode: If we use the most frequently occurring value to represent the above data, i.e., $105,
the measure of central tendency would be the mode.
• Median: If we use the central value, i.e., $105 for the ordered set of salaries, given as $95,
$100, $105, $105, $105, then the measure of central tendency would be the median.
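As a quick check, the three measures for the salary table can be computed in R. This is a minimal sketch: the vector name `salary` and the helper `most_frequent` are illustrative names chosen here, not part of the original notes.

```r
# Monthly salaries from the table above (January to May)
salary <- c(105, 95, 105, 105, 100)

# Mean: sum of the values divided by their count
mean_salary <- mean(salary)

# Median: middle value of the sorted salaries
median_salary <- median(salary)

# Mode: base R has no function for the statistical mode, so this
# hypothetical helper returns the value with the highest frequency
most_frequent <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
mode_salary <- most_frequent(salary)

print(c(mean = mean_salary, median = median_salary, mode = mode_salary))
```

The mean (102) differs from the median and mode (both 105) because the single low salary of $95 pulls the mean down.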
MEAN
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers
by the total number of numbers.
• The mean, also known as the average, is calculated by summing up all the values in a
dataset and dividing by the number of observations.
Mean = (Sum of all the observations/Total number of observations)

Example:
What is the mean of 2, 4, 6, 8 and 10?
Solution:
First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations).
Mean = 30/5 = 6
Mean Symbol (x̄):
The mean is usually denoted by the symbol ‘x̄’. The bar above the letter x indicates the
mean of the n values of x.
x̄ = (Sum of values ÷ Number of values)
x̄ = (x1 + x2 + x3 + … + xn)/n
Mean = Sum of the Given Data/Total number of Data
To calculate the arithmetic mean of a set of data we must first add up (sum) all of the data values
(x) and then divide the result by the number of values (n). Since Σ is the symbol used to indicate
that values are to be summed, we obtain the following formula for the mean (x̄ ): x̄ =Σ x/n
Example:
In a class there are 20 students and they have secured a percentage of 88, 82, 88, 85, 84, 80, 81,
82, 83, 85, 84, 74, 75, 76, 89, 90, 89, 80, 82, and 83.
Find the mean percentage obtained by the class.
Solution:
Mean = Total of percentage obtained by 20 students in class/Total number of students
= [88 + 82 + 88 + 85 + 84 + 80 + 81 + 82 + 83 + 85 + 84 + 74 + 75 + 76 + 89 + 90 + 89 + 80 + 82
+ 83] / 20
= 1660/20
= 83
Hence, the mean percentage obtained by the class is 83%.
Example code for Mean using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the mean
> mean_value <- mean(data)
# Displaying the mean
> print(mean_value)
[1] 20
MEDIAN
• The median is the middle value of a sorted data set. The data set must first be sorted,
in either ascending or descending order.
• Like the mean, the median is a kind of average: it describes the center of the data, but it
is found by position rather than by summing the values.
• A median divides the data into two halves. The formula for median:
✓ If the number of values (n value) in the data set is odd then the formula to calculate
median is:

Median = ((n + 1)/2)th term


✓ If the number of values (n value) in the data set is even then the formula to
calculate median is:

Median = [(n/2)th term + {(n/2) + 1}th term] / 2


How to Find Median:
To determine the median of a data set, the values of the data set must be sorted or arranged in
either ascending or descending order. The data may be in two formats:
o Ungrouped Frequency Distribution
o Grouped Frequency Distribution
In an ungrouped frequency distribution, the data may be of two types:
o When the number of observations is odd
o When the number of observations is even
When the number of observations is odd:
To find the median of a data set with an odd number of observations, follow the steps given
below. Remember that the data must be sorted. After sorting the data, use the following formula:

Median = ((n + 1)/2)th term

Where n is the total number of items in the data set.


Another quick method to find the median is:
o First, sort the values or items.
o Pick the middle value as the median.
Example 1: Find the median of 23, 2, 12, 33, 65, 45, and 9.
Solution:
First, we sort the given data set.
2, 9, 12, 23, 33, 45, 65
There is a total of 7 values, so the mid-value (4th) will be median, i.e. 23.
Similarly, we can find the median using the formula:
Median = ((n + 1)/2)th term
Putting the value of n in the formula, we get:
Median = ((7 + 1)/2)th term = 4th term
The 4th item or value will be median, i.e. 23.


Hence, the median of the given data set is 23.
When the number of observations is even:
To find the median of a data set that contains an even number of observations, we must
follow the steps given below:
• Sort the values of the data set.
• Find the middle pair and its values.
• Sum up the values and divide the sum by 2.
• The value that we get on dividing is the median of the given data set.
• We can also write the above steps in terms of the formula:

Median = [(N/2)th term + {(N/2) + 1}th term] / 2

Where N is the total number of items in the data set.


Let’s work through an example with an even number of observations.
Example 2: Find the median of the following list: 1, 5, 77, 32, 65, 12, 44, 21, 90, 34, 8, 56, 4, 99
Solution:
Step 1: Sort the given list.
1, 4, 5, 8, 12, 21, 32, 34, 44, 56, 65, 77, 90, 99
There are 14 values in total in the list.
Step 2: Find the middle pair and its values.
The middle pair of terms in the list are the 7th and 8th, and their values are 32 and 34,
respectively.
Step 3: Sum up the values and divide the sum by 2.
(32 + 34) / 2 = 66 / 2 = 33
• A point to be noticed here is that 33 is not in the list. But it indicates that half the values in
the list are less than 33, and half the values are greater than 33.
• Let’s find the median through the formula which we have learned above.
Median = [(14/2)th term + {(14/2) + 1}th term] / 2 = (7th term + 8th term) / 2 = (32 + 34) / 2 = 33
Hence, the median of the given list is 33.


Example 3: Find the median of the data 30, 40, 10, 20, 50.
Step 1: Order the given data in ascending order as:
10, 20, 30, 40, 50
Step 2: Check whether n (number of terms of data set) is even or odd and find the median of
the data with the respective ‘n’ value.
Step 3: Here, n = 5 (odd), so Median = [(n + 1)/2]th term
The median of the data is the [(5 + 1)/2]th term = 3rd term, which is 30.
Example 4: Find the median of the data 25, 12, 5, 24, 15, 22, 23, 25.
Step 1: Order the given data in ascending order as:
5, 12, 15, 22, 23, 24, 25, 25
Step 2: Check whether n (number of terms of data set) is even or odd and find the median of
the data with the respective ‘n’ value.
Step 3: Here, n = 8 (even), so
Median = [(n/2)th term + {(n/2) + 1}th term] / 2
Median = [(8/2)th term + {(8/2) + 1}th term] / 2 = (22 + 23) / 2 = 22.5
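The two positional formulas can be sketched directly in R. The helper name `median_by_position` is hypothetical and introduced only for illustration; in practice the built-in `median()` gives the same result.

```r
# Median by the positional formulas; 'median_by_position' is a hypothetical helper
median_by_position <- function(x) {
  s <- sort(x)                        # the data must be sorted first
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the ((n + 1)/2)th term
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: mean of the two middle terms
  }
}

odd_data  <- c(30, 40, 10, 20, 50)
even_data <- c(25, 12, 5, 24, 15, 22, 23, 25)

print(median_by_position(odd_data))   # 3rd term of the sorted data
print(median_by_position(even_data))  # mean of the 4th and 5th terms
```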

Example code for Median Using R:


# Creating a numeric vector
> data<-c(30, 40, 10, 20, 50 )
# Calculating the median
> median_value<-median(data)
# Displaying the median
> print(median_value)
[1] 30
# Creating a numeric vector
> data<-c(25, 12, 5, 24, 15, 22, 23, 25 )
# Calculating the median
> median_value<-(median(data))
# Displaying the median
> print(median_value)
[1] 22.5
MODE
• A mode is the value or item that occurs most frequently in a dataset. A data set can
have one or more than one mode. If the data set has one mode, it is called “Uni-modal”.
Similarly, if the data set contains 2 modes, it is called “Bimodal”, and if the data set
contains 3 modes, it is known as “Trimodal”.
• If the data set has more than one mode, it is known as “multi-modal” (for example,
bimodal or trimodal). There is no mode for a data set if every number appears only once.
Example 1: If the data set is {1, 2, 2, 3, 3, 4, 5} then it has 2 modes, i.e., 2 and 3 (bi-modal),
since both the values 2 and 3 occur twice in the data set.
Example 2: If the data set is {15, 42, 65, 65, 95} then the mode is 65 (uni-modal). Since 65 is the
only repeating value in the data set.
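Because a data set can have several modes or none, a function that reports every mode can be useful. The following is a minimal sketch; `all_modes` is a hypothetical helper name, and it follows the convention stated above that a data set in which every value appears once has no mode.

```r
# Return every mode of a numeric vector; 'all_modes' is a hypothetical helper
all_modes <- function(x) {
  counts <- table(x)                  # frequency of each distinct value
  if (max(counts) == 1) {
    return(numeric(0))                # every value appears once: no mode
  }
  # keep the values whose frequency equals the highest frequency
  as.numeric(names(counts)[counts == max(counts)])
}

print(all_modes(c(1, 2, 2, 3, 3, 4, 5)))   # bimodal data from Example 1
print(all_modes(c(15, 42, 65, 65, 95)))    # unimodal data from Example 2
print(all_modes(c(1, 2, 3)))               # no repeated value
```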
Example code for mode using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 20, 25, 30, 30)
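# Defining a 'Mode' function (base R has no built-in function for the statistical mode)
# Note: if several values are tied, this returns only the first one encountered
> Mode <- function(x) {
+ ux <- unique(x)
+ ux[which.max(tabulate(match(x, ux)))]
+ }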
# Displaying the mode
> mode_value <- Mode(data)
> print(mode_value)
[1] 20

Range: It is the difference between the highest value and the lowest value. It is a way to understand
how the numbers are spread in a data set. Formula to find Range is:
Range = Highest value – Lowest Value
Example: If the data set is {12, 19, 6, 2, 15, 4} then the lowest value is 2 and
the highest value is 19.
So, the range is 19 − 2 = 17.
Reading Bar Charts: Putting it Together with Central Tendency
Question 1. Finding Mean for the above bar chart.
Mean = (sum of all data values) / (number of values)
Mean = (5 + 7 + 9 + 6) / 4 = 27 / 4 = 6.75
Question 2. Finding the Median for the above bar chart:
Order the given data in ascending order as: 5, 6, 7, 9
Here, n = 4 (number of students which is even)
Median = [(n/2)th term + {(n/2) + 1}th term] / 2
Median = (6 + 7) / 2 = 6.5
Question 3. Finding Mode for the above bar chart:
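Mode = most frequent value. Each of the values 5, 6, 7 and 9 occurs only once, so this data has
no mode (the highest value, 9, is the maximum, not the mode).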
Question 4. Finding the range for the above bar chart:
Range = highest value – lowest value
Range = 9 – 5 = 4
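The arithmetic in these answers can be checked in R. This sketch assumes the bar heights read from the chart are 5, 7, 9 and 6, as used in the answers above; the vector name `bar_values` is illustrative.

```r
# Bar heights as used in the answers above (assumed readings from the chart)
bar_values <- c(5, 7, 9, 6)

mean_value   <- mean(bar_values)                    # 27 / 4
median_value <- median(bar_values)                  # mean of the two middle values, 6 and 7
range_value  <- max(bar_values) - min(bar_values)   # highest minus lowest

print(c(mean = mean_value, median = median_value, range = range_value))
```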
These measures of central tendency provide insights into the typical or central value of a dataset.
The choice of which measure to use depends on the nature of the data and the specific
characteristics of the distribution. The mean is sensitive to outliers, while the median is robust
against extreme values. The mode is particularly useful for categorical data or discrete
distributions.
BASIC STATISTICAL DESCRIPTIONS OF DATA
• Statistical methods that help to know about the distribution or the spread of the data points
in the datasets are known as Measures of Dispersion.
• Measuring the dispersion of data involves assessing how spread out or clustered the values
in a dataset are.
• Common measures of dispersion include Range, Quartiles, Interquartile range (IQR),
Variance, Standard Deviation.
RANGE
• The range is the simplest measure of dispersion. It is calculated by
subtracting the lowest value from the highest value.
• The range is the difference between the maximum and minimum values in a dataset. It
provides a simple measure of the spread of the data.
Range = Highest value – Lowest Value
• The range is a simple and straightforward measure of dispersion that provides an
indication of the difference between the largest and smallest values in a dataset. However,
it can be affected by outliers, or extreme values, in the data and does not provide
information about the distribution of values within the range.
• In combination with measures of central tendency, such as mean or median, range can
provide a quick summary of the distribution of the data. However, other measures of
dispersion, such as variance or standard deviation, are often used to provide a more
comprehensive picture of the spread of the data.
Ex: Problem Statement: Let there be 5 students in the class having heights of
150 cm, 160 cm, 175 cm, 190 cm and 200 cm. Calculate the range of heights.
Range = 200 cm – 150 cm
Hence, Range = 50 cm
Range for ungrouped data:
Question 1: Find out the range for the following observations 20, 24, 31, 17, 45, 39, 51, 61
Solution: The largest value in the given observations is 61 and the smallest value is 17.
The Range is 61 – 17 = 44
Range for grouped data:
Question 2: Find out the range for the following frequency distribution table for the marks scored
by class 10 students.
Marks Intervals Number of Students
0-10 5
10-20 8
20-30 15
30-40 9
Solution:
For the largest value – Take higher limit of the highest class = 40
For the smallest value – Take lower limit of the lowest class = 0
Range = 40 – 0
Range = 40
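For grouped data the same rule can be sketched in R by storing the class limits in a data frame. The column names `lower`, `upper` and `students` are illustrative choices, not from the original notes.

```r
# Frequency distribution from the table above; column names are illustrative
marks <- data.frame(
  lower    = c(0, 10, 20, 30),    # lower limit of each class interval
  upper    = c(10, 20, 30, 40),   # upper limit of each class interval
  students = c(5, 8, 15, 9)       # frequency of each class
)

# Range = upper limit of the highest class - lower limit of the lowest class
grouped_range <- max(marks$upper) - min(marks$lower)
print(grouped_range)
```

Note that the frequencies play no part in the range; only the outermost class limits matter.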
Sample code using R:
# Creating a numeric vector
> data<- c(10, 15, 20, 25, 30)
# Calculating the range
> range_value <- max(data) - min(data)
# Displaying the range
> print(range_value)
[1] 20

QUARTILES
• Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that
we can pick certain data points so as to split the data distribution into equal-size consecutive
sets. These data points are called quantiles.
• Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
• Quartiles are the quantiles that divide the sorted data into 4 equal parts.
• There are three quartiles Q1, Q2 and Q3, where Q2 is the median of the distribution.
• Five-number summary: every dataset can be described using these 5 numbers
✓ Lowest value
✓ Q1: 25th percentile
✓ Q2: Median (50th percentile)
✓ Q3: 75th percentile
✓ Highest Value
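A five-number summary can be obtained in R with the base functions `fivenum()` and `quantile()`. This is a small sketch on an illustrative vector; for small data sets, different quartile conventions (Tukey's hinges in `fivenum()`, interpolation in `quantile()`) can give slightly different values for Q1 and Q3, although they agree for this vector.

```r
# Illustrative data vector
data <- c(10, 15, 20, 25, 30)

# fivenum() returns: minimum, lower hinge (Q1), median, upper hinge (Q3), maximum
five_num <- fivenum(data)
print(five_num)

# quantile() returns the requested percentiles (default interpolation method)
quartiles <- quantile(data, probs = c(0.25, 0.50, 0.75))
print(quartiles)
```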

INTERQUARTILE RANGE (IQR)


• The interquartile range is the range of the middle 50% of the data. It is calculated as the
difference between the third quartile (Q3) and the first quartile (Q1).
• Equivalently, it is the distance between the 75th percentile (Q3) and the 25th percentile
(Q1), so it builds on the quartiles and the five-number summary.

IQR = Q3 – Q1
Let’s understand Q1, Q2, Q3 and the Interquartile range by an example.
Problem Statement:
Suppose there are 8 numbers between 10 and 90 that are equally distributed.
Give the five-number summary and find the interquartile range.
✓ Lowest value : 10
✓ Q1 (25th percentile) : 25
✓ Q2 (50th percentile) : 50
✓ Q3 (75th percentile) : 75
✓ Highest value : 90
✓ Interquartile Range (IQR) = Q3 – Q1 = 75 – 25 = 50
Interquartile Range = 50


Sample code using R:


# Creating a numeric vector
> data<- c(10, 15, 20, 25, 30)
# Calculating the interquartile range using the 'IQR' function
> iqr_value <- IQR(data)
# Displaying the interquartile range
> print(iqr_value)
[1] 10
VARIANCE
• Variance is a measure of how data points differ from the mean. In layman’s terms,
variance is a measure of how far a set of data (numbers) are spread out from their mean
(average) value.
• Variance is the expected squared deviation of the values from their mean. The standard
deviation is the square root of the variance, so the two are directly related.
• Variance measures the average squared deviation of each data point from the
mean. It gives a sense of how much individual data points differ from the
mean.

How to Calculate Variance:


Variance can be calculated easily by following the steps given below:
• Find the mean of the given data set. Calculate the average of a given set of values
• Now subtract the mean from each value and square them
• Find the average of these squared values, that will result in variance
Say x1, x2, x3, x4, …, xn are the given values.
Then the mean of all these values is:
x̄ = (x1 + x2 + x3 + … + xn)/n
Now subtract the mean value from each value of the given data set and square the results.
(x1 − x̄)², (x2 − x̄)², (x3 − x̄)², …, (xn − x̄)²
Find the average of the above values to get the variance.
Var(X) = [(x1 − x̄)² + (x2 − x̄)² + (x3 − x̄)² + … + (xn − x̄)²]/n
Hence, the variance is calculated.
Example of Variance:
Let’s say the heights (in mm) are 610, 450, 160, 420, 310.
The mean and the variance are interrelated. The first step is finding the mean, which is done as follows:
Mean = (610 + 450 + 160 + 420 + 310)/5 = 1950/5 = 390
So, the mean is 390 mm.
To calculate the variance, compute the difference of each value from the mean, square it, and
then find the average of the squares.
So, for this particular case the variance is:
= ((220)² + (60)² + (-230)² + (30)² + (-80)²)/5
= (48400 + 3600 + 52900 + 900 + 6400)/5
= 112200/5
Final answer: Variance = 22440
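The steps of this example can be reproduced in R. This is a minimal sketch with illustrative variable names; it divides by n, as the worked example does, and also shows that R's built-in `var()` gives a different number because it divides by n − 1.

```r
# Heights in mm from the example above
heights <- c(610, 450, 160, 420, 310)
n <- length(heights)

mean_height <- mean(heights)                   # step 1: the mean
squared_dev <- (heights - mean_height)^2       # step 2: squared deviations
pop_variance <- sum(squared_dev) / n           # step 3: average of the squares (divisor n)

sample_variance <- var(heights)                # built-in var() uses divisor n - 1

print(pop_variance)
print(sample_variance)
```

The two results are related by pop_variance = sample_variance × (n − 1)/n.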
Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7.
Solution: Given, 3, 8, 6, 10, 12, 9, 11, 10, 12, 7
Step 1: Compute the mean of the 10 values given.
Mean = (3+8+6+10+12+9+11+10+12+7) / 10 = 88 / 10 = 8.8
Step 2: Make a table with three columns, one for the X values, the second for the deviations and
the third for squared deviations. Since the data is not described as a sample, we use the
formula for population variance. Thus, the mean is denoted by μ.
Value X    X – μ    (X – μ)²
3          -5.8     33.64
8          -0.8     0.64
6          -2.8     7.84
10         1.2      1.44
12         3.2      10.24
9          0.2      0.04
11         2.2      4.84
10         1.2      1.44
12         3.2      10.24
7          -1.8     3.24
Total      0        73.6

Step 3: σ² = Σ(X − μ)²/N
= 73.6 / 10
= 7.36
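The worked examples in this section divide by the number of values, which is the population variance. When the data are a sample from a larger population, the sum of squared deviations is usually divided by n − 1 instead. The two standard formulas, written side by side for comparison (symbols as in the example, with s² denoting the sample variance):

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```

For the ten numbers above, the population variance is 73.6/10 = 7.36, whereas the sample variance would be 73.6/9 ≈ 8.18. R's `var()` function computes the sample version.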
Sample code using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the variance using the 'var' function
# (var() returns the sample variance, i.e. it divides by n - 1, not n)
> variance_value <- var(data)
# Displaying the variance
> print(variance_value)
[1] 62.5
STANDARD DEVIATION
● Standard deviation is the square root of the variance. It provides a measure of the
average deviation of data points from the mean.
● Standard deviation is a metric that represents the amount by which the values of a
statistical series tend to fluctuate or disperse from their mean. It describes how
the values are distributed over the data sample and is a measure of the data points’
deviation from the mean.
● The square root of the variance of a sample, statistical population, random variable, data
collection, or probability distribution is its standard deviation.
Steps to Calculate Standard Deviation
✓ Find the mean, which is the arithmetic mean of the observations.
✓ Find the squared differences from the mean.
(data value − mean)²
✓ Find the average of the squared differences.
Variance = The sum of squared differences ÷ the number of observations
✓ Find the square root of variance.
Standard deviation = √Variance
• The standard deviation provides a summary of the spread of values in a dataset and can be
used to determine how far each value is from the mean. A low standard deviation indicates
that the values in the dataset are close to the mean, while a high standard deviation indicates
that the values are spread out.
• The standard deviation is widely used in statistical analysis and is a useful measure of
dispersion for datasets with a normal or symmetrical distribution. However, it can be
affected by outliers or extreme values in the dataset and may not provide a good summary
of the spread of the data for datasets with skewed or non-normal distributions.
• In summary, the standard deviation is a useful measure of dispersion that quantifies the
amount of variation or dispersion of a set of values around the mean, and provides a
summary of the spread of values in a dataset.
Ex: 1
Consider the data set: 2, 1, 3, 2, 4. The mean and the sum of squares of deviations of the
observations from the mean will be 2.4 and 5.2, respectively. Thus, the standard deviation will be
√(5.2/5) = √1.04 ≈ 1.02.
Ex: 2
The same calculation, step by step: take the values 2, 1, 3, 2 and 4.
1. Determine the mean (average):
2 + 1 +3 + 2 + 4 = 12
12 ÷ 5 = 2.4 (mean)
2. Subtract the mean from each value:
2 - 2.4 = -0.4
1 - 2.4 = -1.4
3 - 2.4 = 0.6
2 - 2.4 = -0.4
4 - 2.4 = 1.6
3. Square each of those differences:
-0.4 x -0.4 = 0.16
-1.4 x -1.4 = 1.96
0.6 x 0.6 = 0.36
-0.4 x -0.4 = 0.16
1.6 x 1.6 = 2.56
4. Determine the average of those squared numbers to get the variance.
0.16 + 1.96 + 0.36 + 0.16 + 2.56 = 5.2
5.2 ÷ 5 = 1.04 (variance)
5. Find the square root of the variance.
Square root of 1.04 ≈ 1.02
The standard deviation of the values 2, 1, 3, 2 and 4 is approximately 1.02.
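The five steps above can be sketched in R as follows. Variable names are illustrative; the calculation divides by the number of values, as the steps do, so it is the population standard deviation rather than what `sd()` returns.

```r
values <- c(2, 1, 3, 2, 4)

mean_value   <- mean(values)          # step 1: the mean, 12 / 5
deviations   <- values - mean_value   # step 2: subtract the mean from each value
squared      <- deviations^2          # step 3: square each difference
variance_pop <- mean(squared)         # step 4: average of the squares, 5.2 / 5
sd_pop       <- sqrt(variance_pop)    # step 5: square root of the variance

print(round(sd_pop, 2))
```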
EX: 3
A class of students took a math test. Their teacher wants to know whether most students are
performing at the same level, or if there is a high standard deviation.
1. The scores for the test were 85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, and 89. When
the teacher adds them together, she gets 1279. She divides by the number of scores (15) to get the
mean score.
1279 ÷ 15 ≈ 85.27, taken as 85.2 in the steps below (mean)
2. 85.2 is a high score, but is everyone performing at that level? To find out, the teacher subtracts
the mean from every test score.
85 - 85.2 = -0.2
86 - 85.2 = 0.8
100 - 85.2 = 14.8
76 - 85.2 = -9.2
81 - 85.2 = -4.2
93 - 85.2 = 7.8
84 - 85.2 = -1.2
99 - 85.2 = 13.8
71 - 85.2 = -14.2
69 - 85.2 = -16.2
93 - 85.2 = 7.8
85 - 85.2 = -0.2
81 - 85.2 = -4.2
87 - 85.2 = 1.8
89 - 85.2 = 3.8
3. She squares each difference:
-0.2 x -0.2 = 0.04
0.8 x 0.8 = 0.64
14.8 x 14.8 = 219.04
-9.2 x -9.2 = 84.64
-4.2 x -4.2 = 17.64
7.8 x 7.8 = 60.84
-1.2 x -1.2 = 1.44
13.8 x 13.8 = 190.44
-14.2 x -14.2 = 201.64
-16.2 x -16.2 = 262.44
7.8 x 7.8 = 60.84
-0.2 x -0.2 = 0.04
-4.2 x -4.2 = 17.64
1.8 x 1.8 = 3.24
3.8 x 3.8 = 14.44
4. The teacher finds the variance, which is the average of the squares:
0.04 + 0.64 + 219.04 + 84.64 + 17.64 + 60.84 + 1.44 + 190.44 + 201.64 + 262.44 + 60.84 +
0.04 + 17.64 + 3.24 + 14.44 = 1135
1135 ÷ 15 ≈ 75.7 (variance)
5. Last, the teacher finds the square root of the variance:
Square root of 75.7 ≈ 8.7 (standard deviation)
The standard deviation of these tests is 8.7 points out of 100. Since the standard deviation is
fairly low, the teacher knows that most students are performing around the same level.
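The teacher's calculation can be repeated in R without rounding the mean. This is a minimal sketch with illustrative variable names; it treats the 15 scores as the whole population (divisor 15), as the example does.

```r
# Test scores from the example above
scores <- c(85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89)

mean_score   <- mean(scores)                     # 1279 / 15
pop_variance <- mean((scores - mean_score)^2)    # average squared deviation (divisor 15)
pop_sd       <- sqrt(pop_variance)               # population standard deviation

print(round(c(mean = mean_score, variance = pop_variance, sd = pop_sd), 1))
```

Using the unrounded mean changes the intermediate values slightly, but the standard deviation still rounds to 8.7.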
EX:4
A market researcher is analyzing the results of a recent customer survey that ranks a product from
1 to 10. He wants to have some measure of the reliability of the answers received in the survey in
order to predict how a larger group of people might answer the same questions.
Because the responses are a sample, the researcher needs to subtract 1 from the total number of
values in step 4.
1. The scores for the survey are 9, 7, 10, 8, 9, 7, 8, and 9. The mean is 8.4 (67 ÷ 8 = 8.375, rounded).
2. The researcher subtracts the mean from every score
(differences: 0.6, -1.4, 1.6, -0.4, 0.6, -1.4, -0.4, 0.6).
3. He squares each number (0.36, 1.96, 2.56, 0.16, 0.36, 1.96, 0.16, 0.36).
4. Because this is a sample of responses, the researcher divides the sum of the squares (7.88)
by one less than the number of values (8 values - 1 = 7) to find the variance: 7.88 ÷ 7 ≈ 1.13 (variance)
5. Last, the researcher finds the square root of the variance: 1.06 (standard deviation)
The standard deviation is 1.06, which is somewhat low. The researcher now knows that the results
of the sample are probably reliable.
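Since this example uses the sample formula (divisor n − 1), it matches what R's built-in `var()` and `sd()` compute. A short sketch with illustrative variable names:

```r
# Survey scores from the example above
survey <- c(9, 7, 10, 8, 9, 7, 8, 9)

# Manual sample variance: sum of squared deviations divided by (n - 1)
manual_variance <- sum((survey - mean(survey))^2) / (length(survey) - 1)

# Built-in functions use the same n - 1 divisor
sample_variance <- var(survey)
sample_sd <- sd(survey)

print(c(variance = sample_variance, sd = round(sample_sd, 2)))
```

With the unrounded mean of 8.375 the variance is exactly 1.125, and the standard deviation rounds to 1.06 as in the example.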
Sample code for standard deviation using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30)
# Calculating the standard deviation using the 'sd' function
# (sd() returns the sample standard deviation, i.e. it uses the divisor n - 1)
> sd_value <- sd(data)
# Displaying the standard deviation
> print(sd_value)
[1] 7.905694
GRAPHIC DISPLAYS OF BASIC STATISTICAL DESCRIPTIONS OF DATA
• Graphic displays are a powerful tool for visualizing and summarizing basic statistical
descriptions of data.
• Graphic displays of basic statistical descriptions of data are essential for visualizing the
distribution, central tendency, and dispersion of the data. Here are some common graphical
representations.
• In today’s world of the internet and connectivity, a great deal of data is available, and
some method is needed for examining large data sets and the patterns and trends in them.
• Statistics is the branch of mathematics dedicated to collecting, analyzing, interpreting,
and presenting numerical data, including in visual form, so that the data becomes easy to
understand and easy to compare.
✓ There are two ways of representing data: tables and pictorial representation
through graphs.
✓ As the saying goes, “A picture is worth a thousand words”. Data is often easier to
understand in graphical format.
✓ Study the graphic displays of basic statistical descriptions.
✓ Some of the most commonly used graphic displays for basic statistical descriptions of data
include:
Histograms:
• A histogram shows the frequency of values within a set of intervals or "bins". Histograms
are used to display the distribution of continuous data.
• A histogram is a graphical representation of the frequency distribution of a continuous
series using rectangles.
• The x-axis of the graph represents the class intervals, and the y-axis shows the
frequencies corresponding to the different class intervals.
• A histogram is a two-dimensional diagram in which the width of each rectangle shows
the width of the class interval, and the height of the rectangle depicts the corresponding
frequency.
• There are no gaps between two consecutive rectangles, because histograms are drawn
when the data are in the form of the frequency distribution of a continuous series.
Example: The following table gives the lifetime of 400 neon lamps. Draw the histogram for the
below data.
Example: Present the following information in the form of a Histogram:

Solution:
• The given data has equal class intervals; i.e., the difference between the upper limit
and the lower limit of each class interval is 10. So, drawing a
Histogram is feasible.
• The X-axis represents the marks (class intervals), and Y-axis represents the
number of students (frequency distribution).

Sample code using R:


# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30, 25, 20, 15, 10)
# Creating a histogram
> hist(data, main = "Histogram of Data", xlab = "Values", ylab = "Frequency", col =
"lightblue")
Bar Graphs:
• A bar graph is a type of graphical representation of the data in which bars of uniform width
are drawn with equal spacing between them on one axis (x-axis usually), depicting the
variable. The values of the variables are represented by the height of the bars.
• A bar graph shows the frequency or count of values for a categorical variable. Bar graphs
are used to display the distribution of categorical data.
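A bar graph of categorical data can be sketched in R with `table()` and `barplot()`. The fruit names below are hypothetical data invented purely for illustration; they do not come from the original notes.

```r
# Hypothetical categorical data (invented for illustration)
fruit <- c("apple", "banana", "apple", "cherry", "banana", "apple")

# Count how often each category occurs
counts <- table(fruit)
print(counts)

# Draw one bar per category; the bar height is the category's frequency
barplot(counts, main = "Bar Graph of Fruit Counts",
        xlab = "Fruit", ylab = "Frequency", col = "orange")
```

Unlike a histogram, the bars are separated by gaps because the categories have no numeric ordering or width.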

BOX PLOTS (BOX-AND-WHISKER PLOTS):


• Box plots, also known as box-and-whisker plots, show the median, quartiles, and range of a
dataset in a compact format. They are used to display the distribution of continuous data,
often side by side for the groups of a categorical variable.
• These plots divide the data into four parts to show their summary. They focus on the
spread and the median of the data.

Sample code using R:


# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30, 25, 20, 15, 10)
# Creating a boxplot
> boxplot(data, main = "Boxplot of Data", ylab = "Values", col = "lightgreen")
Scatter Plots:
• A scatter plot shows the relationship between two variables by plotting individual points in
a two-dimensional graph. Scatter plots are used to display the relationship between two
continuous variables.
• A scatter plot is a two-dimensional chart that visually represents the supplied data.
• Generally, the scatter plot visualizes two sets of data on the X and Y axes that are correlated.
• This type of chart displays many points on vertical and horizontal axes as per the supplied
data sets, and it is mainly used to show relationships between two variables.
• A scatter plot works by placing one variable on the vertical axis and a different variable on
the horizontal axis.
• Each piece of data is then plotted as a discrete point on the chart. Both the X and Y axis
display values in a scatter plot, which means that the scatter chart has no category axis.
• By convention, the X-axis represents values that do not depend on another
variable, called the independent variable. The Y values are placed on the vertical axis
and represent the dependent variable.
• These charts are also known by many other names, such as scatter graphs, scatter charts,
scattergrams, scatter diagrams, and XY graphs.
Components of Scatter Plot Chart:
There are mainly five components in a Scatter Plot Chart, as listed below:
• Plot Area: A graphical form/area within the sheet where the data is drawn is called the
Plot Area.
• Chart Title: A chart title represents the subject of the plotted chart that primarily helps
determine the chart’s topic or motive. The text in the chart title can be edited, and the
position can be arranged accordingly.
• Vertical Axis: An axis that lies vertically in the chart window is called the vertical axis,
and it is located on the left side of the plot area. Since the vertical axis typically
represents the measured values of the dependent variable, it is known as the Y-axis.
• Horizontal Axis: An axis that lies horizontally in the chart window is called the horizontal
axis, and it is located on the bottom area of the plot area. In a scatter plot the horizontal
axis represents the values of the independent variable, and it is also known as the X-axis.
Series data can be grouped along the horizontal axis.
• Legend: The legend is another useful component of the chart that helps list and distinguish
various data groups. We can move the legend or change the legend’s position accordingly,
and it can be placed on any side in the chart window.

Advantages of using Scatter Plots:


• The scatter charts help determine the relationship between two or more variables, and
they mainly showcase the relationship of one variable concerning another.
• The scatter plots can show correlations visually.
• It is easy to read off the maximum and minimum values (high and low) of the data
in a scatter chart.
• Scatter charts are used for various scientific analyses because plotting them is
moderately easy and they can be read accurately.
Sample code using R:
# Creating numeric vectors
> x <- c(1, 2, 3, 4, 5)
> y <- c(10, 15, 20, 25, 30)
# Creating a scatter plot
> plot(x, y, main = "Scatter Plot", xlab = "X values", ylab = "Y values", col = "blue",
pch = 16)

QUANTILE-QUANTILE (Q-Q) PLOT


• The quantile-quantile plot is a graphical method for determining whether two samples of
data came from the same population or not. A q-q plot is a plot of the quantiles of the first
data set against the quantiles of the second data set. By a quantile, we mean the fraction (or
percent) of points below the given value.
• Q-Q plots (quantile-quantile plots) are plots of two sets of quantiles against each other. A
quantile is a value below which a given fraction of the data falls. For example, the median is a
quantile where 50% of the data fall below that point and 50% lie above it.
• The purpose of Q-Q plots is to find out if two sets of data come from the same distribution.
A 45-degree reference line is drawn on the Q-Q plot; if the two data sets come from a common
distribution, the points will fall approximately on that reference line.
A normal distribution, sometimes called the bell curve, is a distribution that occurs
naturally in many situations. For example, the bell curve is seen in tests like the SAT and
GRE. The bulk of students will score the average (C), while smaller numbers of students
will score a B or D. An even smaller percentage of students score an F or an A. This creates
a distribution that resembles a bell (hence the nickname). The bell curve is symmetrical.
Half of the data will fall to the left of the mean; half will fall to the right.
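A two-sample Q-Q plot with the 45-degree reference line can be sketched in R using `qqplot()` and `abline()`. The two samples below are simulated with `rnorm()` purely for illustration (the seed, sample sizes and parameters are arbitrary assumptions), so both really do come from the same normal distribution.

```r
# Simulated samples (illustrative assumption: both drawn from the same normal distribution)
set.seed(42)
sample_a <- rnorm(100, mean = 50, sd = 10)
sample_b <- rnorm(100, mean = 50, sd = 10)

# Plot the quantiles of one sample against the quantiles of the other
qq <- qqplot(sample_a, sample_b, main = "Q-Q Plot of Two Samples",
             xlab = "Quantiles of sample A", ylab = "Quantiles of sample B")

# Add the 45-degree reference line y = x
abline(a = 0, b = 1, col = "red")
```

Points lying close to the red line suggest the two samples share a common distribution; systematic curvature away from it suggests they do not.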
Sample code using R:
# Creating a numeric vector
> data <- c(10, 15, 20, 25, 30, 25, 20, 15, 10)
# Creating a QQ plot
> qqnorm(data, main = "QQ Plot of Data", col = "purple")

STEM-AND-LEAF PLOTS:
A stem-and-leaf plot shows the distribution of values in a dataset by dividing each value into a
"stem" and a "leaf". Stem-and-leaf plots are used to display the distribution of numeric data in
a compact format.
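Base R can print a stem-and-leaf plot in the console with the `stem()` function. As a small sketch, the test scores from the standard deviation example are reused here; how the values are split into stems is decided by R and can be adjusted with the `scale` argument.

```r
# Test scores reused from the standard deviation example
scores <- c(85, 86, 100, 76, 81, 93, 84, 99, 71, 69, 93, 85, 81, 87, 89)

# Print a text-based stem-and-leaf plot to the console
stem(scores)
```

Each row shows a stem (the leading digits) followed by the leaves (the final digit of each value), so the original values can still be read off the display.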
