EDS Unit 2
Definition of Dataset
A dataset is a collection of data objects organized in a structured format,
typically in rows and columns. Each row represents an instance (record), and
each column represents an attribute (feature or variable). Datasets are used for
analysis, prediction, and decision-making.
In short, the rows of a dataset correspond to the data objects, and the
columns correspond to the attributes of those data objects.
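The row/column structure can be sketched with a small toy dataset. The names and values below are made up purely for illustration:

```python
# Hypothetical toy dataset: each dict is one row (a data object),
# and each key is one column (an attribute).
dataset = [
    {"name": "Asha", "age": 21, "grade": "A"},
    {"name": "Ravi", "age": 23, "grade": "B"},
    {"name": "Meera", "age": 22, "grade": "A"},
]

n_rows = len(dataset)                   # number of data objects
columns = list(dataset[0].keys())       # attribute names
ages = [row["age"] for row in dataset]  # one attribute across all objects
```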
Types of Attributes
1. Qualitative
2. Quantitative
https://medium.com/@netrajpatil12mati/data-objects-and-attribute-types-704d7d9ea8a8
Qualitative Attributes
These attributes are descriptive and non-numerical, and are used to describe
characteristics that can't be easily measured.
Nominal Attributes
Definition: Represent categories or labels without a meaningful order or
ranking.
Binary Attributes
Definition: Attributes with only two possible values.
Asymmetric Attributes:
Ordinal Attributes
Definition: Represent categories with a meaningful order or ranking, but the
intervals between values are not defined.
Examples: Education levels (High School < College < Graduate), Likert
scale (Poor, Average, Good).
Key Point: Arithmetic operations are not applicable, but comparisons are.
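A minimal sketch of the key point, using the education-level example: the categories are mapped to ranks only so they can be compared and sorted. The names EDUCATION_ORDER, RANK, and is_lower are illustrative, and the ranks are not meant for arithmetic.

```python
# Ordering taken from the education-level example; ranks are positions only.
EDUCATION_ORDER = ["High School", "College", "Graduate"]
RANK = {level: position for position, level in enumerate(EDUCATION_ORDER)}

def is_lower(a: str, b: str) -> bool:
    """Return True if ordinal level a ranks below level b."""
    return RANK[a] < RANK[b]

levels = ["Graduate", "High School", "College"]
sorted_levels = sorted(levels, key=lambda level: RANK[level])
```

Comparisons and sorting are meaningful here, but something like RANK["Graduate"] - RANK["College"] has no defined meaning, because the intervals between levels are not defined.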
Quantitative Attributes
These attributes are numerical and quantifiable, and are used to measure
values or counts.
Discrete Attributes
Definition: Numeric attributes with a finite or countable number of values.
Continuous Attributes
Numeric Attributes
Numeric attributes represent measurable quantities and can be classified as:
Interval-Scaled:
Ratio-Scaled:
3. Detecting Anomalies: Statistical tools like box plots and inter-quartile range
(IQR) help identify outliers, which could indicate errors, special cases, or
significant trends worth investigating.
These basic descriptions are essential for interpreting data accurately and
making informed, data-driven decisions.
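As a rough illustration of IQR-based outlier detection, the sketch below flags values lying more than 1.5 × IQR outside the quartiles. The 1.5 multiplier is a common convention assumed here, not something stated in these notes, and the sample data is made up. Quartile values also depend on the method used, so other tools may give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

data = [10, 12, 13, 14, 15, 16, 18, 50]  # hypothetical sample
outliers = iqr_outliers(data)
```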
Mean
The average value, calculated by summing all data points and dividing by the
number of data points.
Formula:
$$\text{Mean} = \frac{\sum x}{n}$$
Example: For the data series [3, 5, 7, 9], the mean would be:
$$\text{Mean} = \frac{3 + 5 + 7 + 9}{4} = \frac{24}{4} = 6$$
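The same calculation as a short sketch (the function name mean is just illustrative):

```python
def mean(values):
    """Arithmetic mean: sum of the data points divided by their count."""
    if not values:
        raise ValueError("mean of an empty series is undefined")
    return sum(values) / len(values)

result = mean([3, 5, 7, 9])
```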
2. Mean of a Discrete Series:
Formula:
$$\text{Mean} = \frac{\sum (f \cdot x)}{\sum f}$$
Where:
$f$ is the frequency of each value and $x$ is the value itself.
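A minimal sketch of the discrete-series mean, where each value x is weighted by its frequency f. The sample values and frequencies are only illustrative:

```python
def discrete_mean(values, frequencies):
    """Mean of a discrete series: sum(f * x) / sum(f)."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total_frequency = sum(frequencies)
    if total_frequency == 0:
        raise ValueError("total frequency must be positive")
    weighted_sum = sum(f * x for f, x in zip(frequencies, values))
    return weighted_sum / total_frequency

# Illustrative data: values 2, 4, 6 occurring 3, 5, and 2 times.
discrete_result = discrete_mean([2, 4, 6], [3, 5, 2])  # (6 + 20 + 12) / 10
```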
A continuous series deals with data that can take any value within a
given range, often represented in intervals or class groups.
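Formula:
$$\text{Mean} = \frac{\sum (f \cdot m)}{\sum f}$$
Where:
$f$ is the frequency of each class interval and $m$ is the midpoint of that interval.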
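A sketch of the continuous-series mean, assuming m is the midpoint of each class interval, i.e. (lower + upper) / 2. The intervals and frequencies below are hypothetical:

```python
def continuous_mean(intervals, frequencies):
    """Mean of a grouped series: sum(f * m) / sum(f), with m the class midpoint."""
    if len(intervals) != len(frequencies):
        raise ValueError("intervals and frequencies must have the same length")
    total_frequency = sum(frequencies)
    if total_frequency == 0:
        raise ValueError("total frequency must be positive")
    midpoints = [(lower + upper) / 2 for lower, upper in intervals]
    return sum(f * m for f, m in zip(frequencies, midpoints)) / total_frequency

# Hypothetical class intervals 0-10, 10-20, 20-30 with frequencies 2, 5, 3.
grouped_result = continuous_mean([(0, 10), (10, 20), (20, 30)], [2, 5, 3])
```

Here the midpoints are 5, 15, and 25, so the weighted sum is 10 + 75 + 75 = 160 and the mean is 160 / 10 = 16.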
Mode
The most frequently occurring value in the dataset.
The mode is the value (or class interval) that has the highest frequency.
Formula:
Mode = Value with highest frequency
Example: For the dataset values [2, 4, 6] with frequencies [3, 5, 2], the
mode is 4 because it has the highest frequency (5).
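The example can be checked with a short sketch. If several values tie for the highest frequency, this version returns the first one, which is a simplifying assumption:

```python
def discrete_mode(values, frequencies):
    """Return the value with the highest frequency (first one on ties)."""
    if not values or len(values) != len(frequencies):
        raise ValueError("need equally long, non-empty values and frequencies")
    highest = max(frequencies)
    return values[frequencies.index(highest)]

mode_value = discrete_mode([2, 4, 6], [3, 5, 2])
```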
The mode for a continuous series can be found using the following
formula:
$$\text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) \times h$$
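Where:
$L$ is the lower boundary of the modal class, $f_1$ is the frequency of the modal class, $f_0$ is the frequency of the class before it, $f_2$ is the frequency of the class after it, and $h$ is the class width.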
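A hedged sketch of the grouped mode formula, assuming equal-width classes, with L as the lower boundary of the modal class, f1 its frequency, f0 and f2 the frequencies of the neighbouring classes, and h the class width. Treating a missing neighbour as frequency 0 is an assumption made here for the edge classes, and the data is hypothetical:

```python
def grouped_mode(lower_bounds, frequencies, width):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h for equal-width classes."""
    i = frequencies.index(max(frequencies))  # first modal class on ties
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0  # assumed 0 if no class before
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0  # assumed 0 if none after
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula undefined when neighbours tie with the modal class")
    return lower_bounds[i] + (f1 - f0) / denominator * width

# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 5, 3.
grouped_mode_result = grouped_mode([0, 10, 20], [2, 5, 3], 10)
```

With these numbers the modal class is 10-20, so the mode is 10 + (5 - 2) / (10 - 2 - 3) × 10 = 16.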
Median
The middle value when data points are arranged in ascending order. If there is
an even number of values, the median is the average of the two middle values.
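A minimal sketch of the median rule described above, covering both the odd and the even case:

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    if not values:
        raise ValueError("median of an empty series is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median([7, 3, 5])      # sorted: 3, 5, 7
even_result = median([9, 3, 7, 5])  # sorted: 3, 5, 7, 9
```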
1. Absolute Measure
2. Relative Measure
1. Range:
2. Quartiles:
3. Variance:
A measure of how much the values in the dataset deviate from the
mean.
Where $x_i$ is each value, $\bar{x}$ is the mean, and $n$ is the number of data points.
4. Standard Deviation:
The square root of the variance. It shows how spread out the numbers
are.
5. Interquartile Range (IQR):
The range between Q1 and Q3, representing the middle 50% of the
data.
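A short sketch of the range, variance, and standard deviation. The variance formula itself is missing from these notes, so the population form (dividing by n) is assumed here; the sample form would divide by n - 1 instead.

```python
import math

def value_range(values):
    """Range: difference between the largest and smallest value."""
    return max(values) - min(values)

def variance(values):
    """Population variance (assumed form): mean of squared deviations from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty series is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

data = [3, 5, 7, 9]
data_range = value_range(data)
data_variance = variance(data)  # squared deviations 9, 1, 1, 9 sum to 20
data_std = standard_deviation(data)
```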
Coefficients of Dispersion
1. Coefficient of Range:
Formula:
$$\text{Coefficient of Range} = \frac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}}$$
2. Coefficient of Variation (CV):
Formula:
$$\text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$
Example: If mean = 20, standard deviation = 4:
$$\text{CV} = \frac{4}{20} \times 100 = 20\%$$
3. Coefficient of Mean Deviation:
Formula:
$$\text{Coefficient of Mean Deviation} = \frac{\text{Mean Deviation}}{\text{Mean}}$$
4. Coefficient of Quartile Deviation:
Formula:
$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
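The coefficients of dispersion can be sketched as small functions. The function names are illustrative, and the mean deviation is assumed to be taken about the mean, since the notes do not say which centre is used:

```python
def coefficient_of_range(values):
    """(Max - Min) / (Max + Min)."""
    highest, lowest = max(values), min(values)
    return (highest - lowest) / (highest + lowest)

def coefficient_of_variation(std_dev, mean):
    """Standard deviation / mean * 100, as a percentage."""
    return std_dev / mean * 100

def coefficient_of_mean_deviation(values):
    """Mean absolute deviation about the mean (assumed centre), divided by the mean."""
    mean = sum(values) / len(values)
    mean_deviation = sum(abs(x - mean) for x in values) / len(values)
    return mean_deviation / mean

def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)

cv = coefficient_of_variation(4, 20)           # the worked CV example: mean 20, SD 4
range_coeff = coefficient_of_range([3, 5, 7, 9])  # (9 - 3) / (9 + 3)
```

All of these are relative measures: the units cancel, so they can be used to compare the spread of datasets measured on different scales.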
2. Pie Chart
Represents data as a circular chart divided into slices, where each slice is
proportional to the percentage of a category.
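To make "proportional to the percentage" concrete, the sketch below converts category counts into percentages and slice angles out of 360 degrees. The category counts are hypothetical, and no plotting library is assumed:

```python
def pie_slices(counts):
    """Map each category to (percentage, angle in degrees) of the full circle."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total count must be positive")
    return {
        category: (count * 100 / total, count * 360 / total)
        for category, count in counts.items()
    }

# Hypothetical category counts.
slices = pie_slices({"A": 5, "B": 3, "C": 2})
```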
3. Histogram
Displays frequency distribution of continuous data, with adjacent bars to
indicate intervals.
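The frequency distribution behind a histogram can be sketched by counting how many values fall into each equal-width interval. The bin start, width, and data below are hypothetical, and each interval is treated as [lower, upper), which is one common convention:

```python
def histogram_counts(values, start, width, n_bins):
    """Count values per equal-width interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts

# Hypothetical continuous data, binned into 0-5 and 5-10.
bin_counts = histogram_counts([1.5, 2.0, 4.2, 5.1, 7.8, 9.9], start=0, width=5, n_bins=2)
```

Each count becomes the height of one bar, and the bars are drawn adjacent to each other because the intervals cover a continuous range.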
4. Bar Charts
Vertical Bar Chart:
Bars are upright, and their height represents the value of each category.