Asha Karegowda DataAnalytics Unit1 Part 1 Notes
=============================================================== 2
Unit I – Data Understanding VI sem BE OE41 Data Analytics
Teacher : Dr. Asha Gowda Karegowda
===================================================================================
Chapter 4
Topic to be covered in Data understanding
4 .Data Understanding
4.1 Attribute Understanding
4.2 Data Quality
4.3 Data Visualization Methods for One and Two Attributes
I. Data understanding:
In most cases, we assume that the data can be described in terms of a table or data
matrix whose rows contain the instances, records, or data objects and whose
columns represent the attributes, features, or variables.
The data might not be stored directly in one table but in different tables from which the
attributes of interest need to be extracted and joined into a single table.
The domain of an attribute is the set of possible values for the attribute.
One of the most basic properties of an attribute is its scale type: an attribute can be
categorical (nominal or ordinal) or numeric (interval or ratio).
1. Categorical Data:
a) Nominal data
An attribute is called nominal if its domain is a finite set (usually text data).
There is no order among the values.
The possible values of a categorical attribute are often considered as classes or
categories (e.g. COVID test result: tested positive, tested negative).
Two nominal values or categories are either equal or different, but not more or
less similar.
Nominal data do not support arithmetic operations such as addition, subtraction,
multiplication and division.
b) Ordinal data:
An attribute is called ordinal if its domain is a finite set of categories (usually text)
with a linear ordering imposed on the domain.
Supports the comparisons =, !=, > and <, but does not support arithmetic operations.
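The difference between nominal and ordinal comparisons can be sketched as below. This is only an illustration: the attribute values ("small", "medium", "large") and the helper names are assumptions, not taken from the notes.

```python
# Hypothetical ordinal attribute; the ordering is an assumption for illustration.
SIZE_ORDER = ["small", "medium", "large"]
RANK = {value: position for position, value in enumerate(SIZE_ORDER)}

def same_category(a, b):
    """Nominal comparison: two values are either equal or different."""
    return a == b

def is_less(a, b):
    """Ordinal comparison: a < b according to the imposed linear order."""
    return RANK[a] < RANK[b]

print(same_category("tested +ve", "tested -ve"))   # nominal: only = and != make sense
print(is_less("small", "large"))                   # ordinal: < and > also make sense
print(sorted(["large", "small", "medium"], key=RANK.get))
```

Note that neither helper adds or multiplies values; the ranks are used only for comparison and sorting.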
2. Numerical data:
The domain of a numerical attribute consists of numbers.
Numerical data are categorized as discrete, continuous, interval, and ratio.
Continuous:
Takes real values (not necessarily integers; values may have a decimal point).
Continuous data represent measurements: their values cannot be
counted, but they can be measured.
Example:
Temperature and the height and weight of a person can be described using intervals on the
real number line.
Drastic round-off errors or truncations of continuous data can lead to problems in later
steps of the analysis, in particular in scientific applications.
But if you said, “It is twice as hot outside as inside,” you would be incorrect.
By stating that the temperature outside is twice the temperature inside, you are using
0 degrees as the reference point to compare the two temperatures. Since it is possible to
measure temperatures below 0 degrees, you cannot use it as a reference point for
comparison. You must use an actual number (such as 16 degrees) instead.
Examples of ratio scaled data are age, money, height, distance, and duration.
For example, if you are 50 years old and your child is 25 years old, you can
accurately claim that you are twice the age of your child.
Whereas you cannot say that it is twice as warm outside, because temperature is measured
on an interval scale, you can say that you are twice another person's age, because age is
a ratio variable.
Distance can be measured in different units like meters, kilometers, or miles.
o But no matter which unit we choose, a distance of zero will always have
the same meaning.
o Especially ratios, which do not make sense for interval scales, are often
useful for ratio scales: the quotient of distances is independent of the
measurement unit, so that the distance 20 km is always twice as long as
the distance 10 km, even if we change the unit kilometers to meters or
miles.
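The unit argument above can be checked numerically. The sketch below assumes two hypothetical temperature readings (20 and 10 degrees Celsius) alongside the 20 km and 10 km distances from the text; the function names are illustrative.

```python
KM_PER_MILE = 1.609344

def km_to_miles(km):
    """Change of unit on a ratio scale: pure rescaling, zero stays zero."""
    return km / KM_PER_MILE

def celsius_to_fahrenheit(c):
    """Change of unit on an interval scale: rescaling plus a shifted zero point."""
    return c * 9 / 5 + 32

# Ratio scale: the quotient of two distances does not depend on the unit.
ratio_km = 20 / 10
ratio_miles = km_to_miles(20) / km_to_miles(10)

# Interval scale: the quotient of two temperatures changes with the unit,
# so "twice as hot" has no unit-independent meaning.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

print(ratio_km, ratio_miles)
print(ratio_celsius, ratio_fahrenheit)
```

The distance ratio stays at 2 in both units, while the temperature ratio is 2 in Celsius but 68/50 in Fahrenheit.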
Absolute scaled data: For a ratio scale, only the value zero has a canonical meaning, and
the meaning of other values depends on the choice of the measurement unit. For an
absolute scale, there is a unique measurement unit. A typical example of an absolute
scale is any kind of counting procedure.
For categorical attributes, problems with accuracy can result from misspellings like
“fmale” for a value of the attribute gender, and also from erroneous entries.
For a categorical attribute like gender, for which only the values in the domain
{male, female} are admitted, “fmale” violates syntactic accuracy.
For numerical attributes, syntactic accuracy does not only mean that the value
must be a number and not a string or text.
Certain numerical values can also lie outside the admissible range.
Example: Let the admissible range for the percentage of votes for a candidate be [0, 100].
Negative values and values larger than 100 should not occur.
Attributes like weight or duration will admit only positive values, and
therefore negative values would violate syntactic accuracy.
For integer-valued attributes like the number of items a customer has bought,
floating-point values should be excluded.
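The syntactic checks described above can be sketched as simple validation functions. The function names and the exact rules (for example, allowing a count of zero) are assumptions made for illustration.

```python
ALLOWED_GENDER = {"male", "female"}   # domain of the categorical attribute

def is_number(value):
    """True for int or float, but not for bool or text."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def gender_ok(value):
    """Categorical attribute: the value must be in the domain."""
    return value in ALLOWED_GENDER

def vote_percentage_ok(value):
    """Numerical attribute restricted to the interval [0, 100]."""
    return is_number(value) and 0 <= value <= 100

def weight_ok(value):
    """Attributes such as weight or duration admit only positive values."""
    return is_number(value) and value > 0

def item_count_ok(value):
    """Integer-valued attribute: floating-point values are excluded."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0

print(gender_ok("fmale"), vote_percentage_ok(120), item_count_ok(2.5))
```

Checks like these detect only syntactic problems; a value such as "female" for the customer John Smith passes them all.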
Semantic accuracy means that a value might be in the domain of the corresponding
attribute, but it is not correct.
Example
When the attribute gender has the value female for the customer John Smith, then
this is not a question of syntactic accuracy, since female is a possible value of the
attribute gender. But it is obviously a wrong value for a person named “John”.
The true value of educational qualification for a person is BE and by mistake it is
entered as MTech.
The verification of semantic accuracy is much more difficult or often even impossible.
Completeness with respect to records means that the data set contains the necessary
information that is required for the analysis.
Reasons for incomplete data :
Some records might simply be missing for some technical reasons.
Data might have been lost because a few years ago the underlying database system was
changed and only those data records were transferred to the new database that were
considered to be important at that point in time.
(e.g. Consider a bank that provides loans to private customers. If the aim
of the analysis is to predict for future loan applicants whether they will repay the
loan, we must take into account that the sample is biased in the sense that we only have
information about those customers who have been granted a loan. We should also have
records related to customers who are defaulters.)
c) Unbalanced data: As an example, consider a production line for goods for which an
automatic quality control is to be installed. Based on suitable measurements, a classifier
is to be constructed that sorts out parts with flaws or faults. The scrap rate in production is
usually very small, so that our data might contain far fewer than 1% examples of parts
with flaws or faults. (e.g. for a medical study, we should have data for both healthy and
diseased patients.)
d) Timeliness refers to whether the available data are too old to provide up-to-date
information or cannot be considered representative for predictions of future data.
Timeliness is often a problem in dynamically changing domains, where only recently
collected data provide relevant information, while older data can be misleading and can
indicate trends that have vanished or even reversed.
Semantic accuracy : Entry is in the domain but not correct. Needs more
information to be checked (e.g. “business rules”).
Example: John Smith is female
Unbalanced data: The data set might be biased extremely to one type of records.
Example: Defective goods are a very small fraction of all.
a) Bar chart: A bar chart is a simple way to depict the frequencies of the values of a
categorical attribute. It’s a one dimensional plot.
A simple example for a categorical attribute with six values a, b, c, d, e, and f is
shown on the left in Fig. 4.2.
Plot a bar chart for the following:
Grade    Frequency
a        38
b        80
c        20
d        62
e        94
f        41
Plot a histogram for the following:
Temperature    Frequency
-3             10
-2             40
-1             135
0              180
1              115
2              128
3              180
4              148
5              48
6              5
b) Histogram:
o A histogram shows the frequency distribution for a numerical attribute.
o The range of the numerical attribute is discretized into a fixed number of intervals
(called bins), usually of equal width.
o There is no generally best choice for the number of bins, but there are certain
recommendations.
(i) Sturges’ rule proposes to choose the number k of bins according to the following
formula:
$k = \lceil \log_2(n) + 1 \rceil$
where n is the sample size.
Although Sturges’ rule is still very often used as a default in various statistics
software packages, it is tailored to data from normal distributions and data sets
of moderate size.
(iii) The number of bins can also be determined based on the width h of each bin:
$k = \left\lceil \frac{\max_i\{x_i\} - \min_i\{x_i\}}{h} \right\rceil$ (4.2)
where x1,...,xn is the sample to be displayed and $\lceil \cdot \rceil$ denotes the ceiling function.
Reasonable values for h are
$h = \frac{3.5 \cdot s}{n^{1/3}}$ (4.3)
where s is the sample standard deviation, and
$h = \frac{2 \cdot \mathrm{IQR}(x)}{n^{1/3}}$ (4.4)
where IQR(x) is the interquartile range of the sample,
that is, the length of the interval which covers the middle 50% of the data.
o The histogram can be misleading when the number of bins is chosen too small.
o Choosing the number of bins too high usually leads to a very scattered
histogram in which it is difficult to distinguish true peaks from random
peaks.
o All of these methods (that is, (4.2), (4.3), and (4.4)) for determining the number of
bins or the width of the bins are highly sensitive to outliers, since they divide
the range between the smallest and the largest value of the sample into bins of
equal size.
o A single outlier can make this range extremely large, so that for a smaller
number of bins, the bins themselves become very large, and for a larger
number of bins, most of the bins can be empty.
o To avoid this problem, one can either leave out extreme values from the
sample (for instance, the 3% smallest and the 3% largest values) for
calculating and displaying the histogram, or one can deviate from the
principle of bins of equal length.
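A minimal sketch of the bin-count rules follows. It assumes the commonly cited forms of the formulas (Sturges: k = ceil(log2(n) + 1); widths h = 3.5*s/n^(1/3) and h = 2*IQR/n^(1/3)); the sample is hypothetical, and the quartiles come from the default method of Python's statistics.quantiles, which may differ slightly from other quartile conventions.

```python
import math
import statistics

def sturges_bins(n):
    """Number of bins by Sturges' rule: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def width_from_std(sample):
    """Bin width h = 3.5 * s / n^(1/3), with s the sample standard deviation."""
    return 3.5 * statistics.stdev(sample) / len(sample) ** (1 / 3)

def width_from_iqr(sample):
    """Bin width h = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return 2 * (q3 - q1) / len(sample) ** (1 / 3)

def bins_from_width(sample, h):
    """Number of bins k = ceil((max - min) / h)."""
    return math.ceil((max(sample) - min(sample)) / h)

sample = list(range(1, 21))      # hypothetical sample: 1, 2, ..., 20
with_outlier = sample + [200]    # the same sample plus one extreme value

print(sturges_bins(len(sample)))
print(bins_from_width(sample, width_from_std(sample)))
print(bins_from_width(sample, width_from_iqr(sample)))
# With a fixed bin width, a single outlier inflates the number of bins:
print(bins_from_width(sample, 5), bins_from_width(with_outlier, 5))
```

With the fixed width of 5, the outlier raises the bin count from 4 to 40, and most of those bins would be empty, which is the sensitivity described above.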
Eg. The depth of clarity of Lake Tahoe was measured at several different places with the results
in inches as follows:
15.4, 16.7, 16.9, 17.0, 20.2, 25.3, 28.8, 29.1, 30.4, 34.5,
36.7, 39.1, 39.4, 39.6, 39.8, 40.1, 42.3, 43.5, 45.6, 45.9,
48.3, 48.5, 48.7, 49.0, 49.1, 49.3, 49.5, 50.1, 50.2, 52.3
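One possible way to tabulate these 30 measurements for a histogram is sketched below. It assumes k = 6 equal-width bins, which is what ceil(log2(30) + 1) gives for n = 30; the function name and the text rendering are illustrative only.

```python
depths = [
    15.4, 16.7, 16.9, 17.0, 20.2, 25.3, 28.8, 29.1, 30.4, 34.5,
    36.7, 39.1, 39.4, 39.6, 39.8, 40.1, 42.3, 43.5, 45.6, 45.9,
    48.3, 48.5, 48.7, 49.0, 49.1, 49.3, 49.5, 50.1, 50.2, 52.3,
]

def histogram_counts(sample, k):
    """Split [min, max] into k equal-width bins and count values per bin."""
    lowest, highest = min(sample), max(sample)
    width = (highest - lowest) / k
    counts = [0] * k
    for x in sample:
        # The maximum falls on the right edge, so clamp it into the last bin.
        index = min(int((x - lowest) / width), k - 1)
        counts[index] += 1
    return lowest, width, counts

lowest, width, counts = histogram_counts(depths, 6)
for i, count in enumerate(counts):
    left = lowest + i * width
    print(f"{left:5.2f} to {left + width:5.2f} | {'#' * count}")
```

The longest row of marks is the last bin, so most measurements lie at the deep end of the range.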
Bar Charts
A bar chart is made up of columns, each of which represents one value of a categorical variable.
The height of the column indicates the size of the group defined by the column
label.
The bar chart below shows average per capita income for the four "New" states - New
Jersey, New York, New Hampshire, and New Mexico.
The chart shows that per capita income is highest in New Jersey and lowest in New Mexico.
Histograms
A histogram is made up of columns that represent a continuous, quantitative variable.
The column label can be a single value or a range of values.
The height of the column indicates the size of the group defined by the column
label.
Example : The histogram below shows per capita income for five age groups.
You can see from the chart that per capita income is greatest in the 45 to 54 age group.
[ Note : The Shape of a Histogram. A histogram is unimodal if there is one peak, bimodal if there
are two peaks and multimodal if there are many peaks. A histogram that is not symmetric is called
skewed. If the upper tail is longer than the lower tail, it is positively skewed. If the
upper tail is shorter than the lower tail, it is negatively skewed. ]
Histogram
In principle, a histogram looks like a bar chart, with the only difference that the
domain of the underlying attribute is metric (numerical). As a consequence, it is
usually impossible to simply enumerate the frequencies of the individual attribute
values (because there are usually too many different values), but one has to form
counting intervals, which are usually called bins or buckets. The width (or, if the
domain is fixed, equivalently the number) of these bins has to be chosen by a user.
All bins should have the same width, since histograms with varying bin widths are
usually more difficult to read—for the same reasons why area charts are more
difficult to interpret than bar charts (see above). In addition, a histogram provides
a good impression of the data only if an appropriate bin width has been chosen;
the impression also depends on which values the borders of the bins fall on.
c) Boxplot :
Boxplots are a very compact way to visualize and summarize main
characteristics of a sample from a numerical attribute.
The box plot is a standardized way of displaying the distribution of data based
on the five number summary: minimum, first quartile, median, third quartile,
and maximum.
In the simplest box plot the central rectangle spans the first quartile to the
third quartile (the interquartile range or IQR).
The box itself corresponds to the interquartile range covering the middle 50%
of the data.
-------------------Q1-----------------------------Q2-------------------------Q3-------------------
Eg) For the given data construct box plot for ungrouped data
2. Find the median, Q2, i.e. the middle data value when the scores are put in order:
the value at position ½(n+1) in the sorted data.
5. Find the minimum, or smallest, data value, and the maximum, or largest, data value.
7. Multiply the IQR by 1.5. This is the maximum whisker length, denoted MWL.
8. Subtract the MWL from Q1. This is the Lower Fence. Reasonable data values should
be at or above the Lower Fence. Lower limit = Q1- ( 1.5*IQR)
9. Add the MWL to Q3. This is the Upper Fence. Reasonable data values should be at or
below the Upper Fence. Upper limit = Q3 + ( 1.5*IQR)
10. Mark any data values below the Lower Fence or above the Upper Fence as possible
outliers.
11. If the minimum is a possible outlier, replace it by the smallest data value that is not a
possible outlier. Call this the new minimum.
12. If the maximum is a possible outlier, replace it by the largest data value that is not a
possible outlier. Call this the new maximum.
13. Draw a number line that extends from the original minimum data value to the original
maximum data value.
14. Mark the new minimum, Q1, the median, Q3, and the new maximum as short vertical
lines above their corresponding position on the number line. Use the minimum if
there is no new minimum. Use the maximum if there is no new maximum.
15. Connect the segments for Q1, the median, and Q3 with horizontal lines through their
top points and their bottom points.
16. Draw a line from the middle of the segment for Q1 to the middle of the segment for
the new minimum (if you have one) or otherwise to the segment for the minimum.
17. Draw a line from the middle of the segment for Q3 to the middle of the segment for
the new maximum (if you have one) or otherwise to the segment for the maximum.
18. Mark the upper fence and the lower fence. Mark the locations of all possible outliers
(values less than the lower fence or greater than the upper fence) with asterisks (*).
Example: We will use the following data representing tornadoes per year in
Oklahoma from 1995 until 2004 (Sullivan, 2nd edition, p. 167) to construct a
modified box plot. n = 10
79 47 55 83 145 44 61 18 78 62
Step 1: The data is put in order from smallest to largest.
18 44 47 55 61 62 78 79 83 145
Step 2: The median is the average of the middle two scores: (61 + 62)/2 = 61.5
Step 3: Compute Q1. Its position is ¼(n+1) = 2.75, so take the average of the 2nd and 3rd values: (44+47)/2 = 45.5
Step 4: Compute Q3. Its position is ¾(n+1) = 8.25, so take the average of the 8th and 9th values: (79+83)/2 = 81
Step 5: The minimum is 18 and the maximum is 145.
Step 6: Now find the interquartile range: IQR = Q3 - Q1 = 81 - 45.5 = 35.5
Step 7: Next we find the upper limit = Q3 + (1.5 x IQR) = 81 + (1.5 x 35.5) = 134.25
Step 8: Next we find the lower limit = Q1 - (1.5 x IQR) = 45.5 - (1.5 x 35.5) = -7.75
Step 9: No data < lower limit hence no outliers on the left hand side
Step 10: Data 145 > upper limit i.e. 145> 134.25, hence 145 is a possible outlier.
Step 11: Since there are no data values below the lower bound, i.e no outliers on left ,
hence we leave the minimum unchanged.
Step 12: The original maximum was a possible outlier, so we use the maximum of the
remaining data, 83, as the new maximum.
Step 13: Draw a number line with a uniform scale that extends at least from the original
minimum to the original maximum, but not much farther.
Step 14: Mark the locations of the following five values with vertical line segments all
having the same length: the minimum, the first quartile, the median, the third quartile,
and the new maximum.
Step 15: Connect the tops of the line segments for the median and the other quartiles, and
then connect the bottoms of the same line segments to make the box.
Step 16 and 17: Draw a line from the first quartile to the minimum and another from the
third quartile to the new maximum to make the whiskers.
Step 18: Mark the location of Upper fence , lower fence and of the possible outlier at 145
with an asterisk.
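The numerical part of the steps above can be sketched in code. The sketch follows the convention of this worked example, where a fractional position is resolved by averaging the two neighbouring values; other quartile conventions (for example linear interpolation) can give slightly different results. The function names are illustrative, and very small samples are not handled.

```python
import math

def value_at_position(sorted_data, position):
    """Value at a 1-based position; a fractional position averages its two neighbours."""
    lower = sorted_data[math.floor(position) - 1]
    upper = sorted_data[math.ceil(position) - 1]
    return (lower + upper) / 2

def modified_boxplot_summary(data):
    xs = sorted(data)
    n = len(xs)
    q1 = value_at_position(xs, (n + 1) / 4)
    median = value_at_position(xs, (n + 1) / 2)
    q3 = value_at_position(xs, 3 * (n + 1) / 4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_min": inside[0],    # new minimum (smallest non-outlier)
        "whisker_max": inside[-1],   # new maximum (largest non-outlier)
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

tornadoes = [79, 47, 55, 83, 145, 44, 61, 18, 78, 62]
summary = modified_boxplot_summary(tornadoes)
for name, value in summary.items():
    print(name, value)
```

For the tornado data this reproduces Q1 = 45.5, Q3 = 81, the fences -7.75 and 134.25, the single possible outlier 145, and whiskers ending at 18 and 83.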
[Figure: modified box plot of the tornado data, drawn above a number line with the sorted values 18 44 47 55 61 62 78 79 83 145]
Example
Let the data be 160, 170, 236, 269, 271, 278, 283, 291, 301, 303, and 400.
Therefore n =11
Hence any value above 398.5 or below 171 is an outlier. In the data
series 160, 170, 236, 269, 271, 278, 283, 291, 301, 303, 400, the outliers are therefore 160, 170, and 400.
These 3 values, which lie at the extremes, can be considered abnormal and
should be discarded from the series so that any analysis made on it is not
influenced by these extreme values.
So the data series that should be considered for further observation or study after
discarding the outliers are as below.
Note : Suppose we had data as 175 instead of 170 for the above problem the boxplot will
be as depicted below with new min = 175 and left whisker
Note: In the first boxplot there is a large range of values between the median and the
minimum value, while the region to the right of the median is densely populated (i.e. less
variation); hence the data are negatively skewed. The greater variation on the left can also
be observed from the left whisker being longer than the right whisker.
In the third boxplot, there is a large range of values between the median and the maximum
value, while the region to the left of the median is densely populated (i.e. less variation);
hence the data are positively skewed. The greater variation on the right can also be observed
from the right whisker being longer than the left whisker.
In the second boxplot, the data are equally distributed on either side of the median (i.e.
Q1 and Q3 are equidistant from Q2). Further, the left and right whiskers have the same
length, indicating a symmetric distribution of the data (as for a normal distribution).
d) Pie chart
A pie chart is a one dimensional circular chart divided into sectors,
illustrating numerical proportion.
In a pie chart, the arc length of each sector (and consequently its central angle and
area), is proportional to the quantity it represents.
( http://www.mathsisfun.com/data/pie-charts.html )
Example: Imagine you just did a survey of your friends to find which kind of
movie they liked best. Plot the pie chart.
Here are the results of the survey:
Solution:
First, divide each value( i.e frequency ) by the total and multiply by 100 to get a percent.
Further we need to figure out how many degrees for each "pie slice" (correctly called a
sector) using Angle = (frequency /Total samples) *360
Country    Frequency    Relative frequency % = (Freq/total)*100
US         6            0.3*100
Japan      7            0.35*100
Europe     2            0.1*100
Korea      1            0.05*100
None       4            0.2*100
Total      20
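The percentage and sector angle for each row of the table above can be computed as in the following sketch, which applies Percent = (frequency / total) * 100 and Angle = (frequency / total) * 360. The variable and function names are illustrative.

```python
frequencies = {"US": 6, "Japan": 7, "Europe": 2, "Korea": 1, "None": 4}

def pie_sectors(freqs):
    """Return {category: (percent, angle in degrees)} for a pie chart."""
    total = sum(freqs.values())
    return {
        category: (100 * count / total, 360 * count / total)
        for category, count in freqs.items()
    }

sectors = pie_sectors(frequencies)
for category, (percent, angle) in sectors.items():
    print(f"{category:7s} {percent:5.1f} %  {angle:6.1f} degrees")
```

The percentages add up to 100 and the angles to 360, which is a quick check before drawing the sectors.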
Construct Stem and leaf plot for the given data result in % scored by Section A students :
56, 78, 82, 82, 90, 94, 93, 67, 67, 69, 74, 77, 92, 88, 81, 83, 84, 77, 72
Step1: sort data : 56, 67, 67, 69,72, 74, 77, 77, 78, 81, 82, 82, 83, 84, 88, 90, 92, 93, 94
( not compulsory)
Step 2: Create the plot with the stems as the tens and the leaves as the ones.
Now we are ready to add the ones place from each of the values in the list we made.
Step 3: Add a key to the bottom of the stem and leaf plot.
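The three steps can be sketched for the Section A scores as follows, with the tens digit as the stem and the ones digit as the leaf. The function name and the printed layout are illustrative.

```python
def stem_and_leaf(values):
    """Group two-digit values by tens (stem) and collect the ones (leaf) in sorted order."""
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot.setdefault(stem, []).append(leaf)
    return plot

section_a = [56, 78, 82, 82, 90, 94, 93, 67, 67, 69,
             74, 77, 92, 88, 81, 83, 84, 77, 72]

plot = stem_and_leaf(section_a)
print("Stem | Leaf")
for stem, leaves in plot.items():
    print(f"{stem:4d} | {''.join(str(leaf) for leaf in leaves)}")
print("Key: 5|6 means 56")
```

Because the values are sorted before grouping, both the stems and the leaves within each stem come out in increasing order.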
31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65,
50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42
1. Prepare an ordered stem and leaf plot for the data and briefly describe
what it shows.
2. Are there any outliers? If so, which scores?
3. Look at the stem and leaf plot from the side. Describe the distribution's
main features such as:
a. number of peaks
b. symmetry
c. value at the centre of the distribution (i.e median)
Answers
1. The lowest value is 4 and the highest is 67. Therefore, the stem and leaf
plot that covers this range of values looks like this:
DATA: 31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43,
65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42
Table 10. Math scores of 41 students (leaves in order of appearance)
Stem    Leaf
0       4
1       98
2       436
3       1257495
4       958031452
5       01547053064721
6       20573
Final Table 10. Math scores of 41 students (ordered leaves)
Stem    Leaf
0       4
1       89
2       346
3       1245579
4       012345589
5       00011234455677
6       02357
Key: 3|4 represents a score of 34, with stem 3 and leaf 4
Although there are only 41 observations, the left tail extends farther
from the data centre than the right tail. Therefore, the distribution is
skewed to the left, or negatively skewed.
23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09
If we try to use the last digit, the hundredths digit, for these numbers, the stem-and-leaf
plot will be enormously long, because these values are so spread out (with the numbers'
first three digits ranging from 232 to 270).
So instead of working with the given numbers, we will round each of the numbers to the
nearest tenth, and then use those new values for the plot. Rounding gives the
following list:
23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1
Stem Leaf
23 3
24 188
25 0369
26 33
27 1
Key “ 23|3” means 23.3
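The rounding step and the resulting plot can be reproduced with the sketch below. It uses the decimal module with explicit round-half-up, on the assumption that 23.25 is meant to round to 23.3 as in the list above; Python's built-in round() uses round-half-to-even and may not give that result. The names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

raw = ["23.25", "24.13", "24.76", "24.81", "24.98", "25.31",
       "25.57", "25.89", "26.28", "26.34", "27.09"]

def round_to_tenth(text):
    """Round a decimal string to one decimal place, with halves rounded up."""
    return Decimal(text).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

rows = {}
for text in raw:
    tenths = int(round_to_tenth(text) * 10)   # e.g. 23.3 becomes 233
    stem, leaf = divmod(tenths, 10)           # stem 23, leaf 3
    rows.setdefault(stem, []).append(leaf)

print("Stem | Leaf")
for stem in sorted(rows):
    print(f"{stem:4d} | {''.join(str(leaf) for leaf in sorted(rows[stem]))}")
print("Key: 23|3 means 23.3")
```

Parsing the values from strings keeps them exact, so the rounding is not affected by binary floating-point representation.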
f) Scatter plot
A scatter plot displays a two-dimensional data set of metric attributes by interpreting
the sample values as coordinates of a point in a metric space .
A scatter plot is very well suited if one wants to see whether the two represented
quantities depend on each other or vary independently.
Scatter plot indicating no correlation (the computed correlation will be close to zero) for the above figure
Scatter plot below indicating a positive correlation between ice-cream sales and
temperature
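The impression given by a scatter plot can be checked by computing the Pearson correlation coefficient. The temperature and sales figures in this sketch are hypothetical, made up only to illustrate a positive relationship; they do not come from the notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical data: daily temperature (degrees Celsius) and ice-cream sales.
temperature = [14, 16, 19, 22, 25, 28]
sales = [215, 325, 410, 445, 520, 610]

r = pearson(temperature, sales)
print(round(r, 3))   # close to +1: sales rise with temperature
```

A value near +1 indicates a positive correlation, a value near -1 a negative correlation, and a value near 0 no linear relationship between the two attributes.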
As the slope of a hill increases, the amount of speed a walker reaches may decrease.
As more employees are laid off, satisfaction among remaining employees decreases.
If a train increases speed, the length of time to get to the final point decreases.
Example of no correlation
The frequency of a particular event is the number of times that the event occurs. The relative
frequency is the proportion of observed responses in the category. We can represent these data as a
circle graph, often called a pie chart, by placing wedges in the circle whose sizes are proportional
to the frequencies.
Graphical representations serve the purpose to make tabular data more easily
comprehensible. The main tool to achieve this is to use geometric quantities—like
lengths, areas, and angles—to represent numbers, since such geometric properties
are more quickly interpretable for humans than abstract numbers. The most
important types of graphical representations are:
g) Pole/stick/bar chart
Numbers, which may be, for instance, the frequencies of different attribute values in
a sample, are represented by the lengths of poles, sticks, or bars. In this way a good
impression especially of ratios can be achieved (see Figs. A.1a and b, in which the
frequencies of Table A.2 are displayed).
Table A.2
Feature1 Value
1 1
2 6
3 9
4 5
5 4
Contingency Table (columns A1 to A4: marks; rows B1 to B3: section)
      A1    A2    A3    A4    ∑
B1    8     3     5     2     18
B2    2     6     1     3     12
B3    4     1     2     7     14
∑     14    10    8     12    44
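The marginal sums (the ∑ row and column) of the contingency table can be recomputed from the cell counts, as in this short sketch. The variable names are illustrative.

```python
columns = ["A1", "A2", "A3", "A4"]
table = {
    "B1": [8, 3, 5, 2],
    "B2": [2, 6, 1, 3],
    "B3": [4, 1, 2, 7],
}

# Row sums: total count for each value of attribute B.
row_totals = {row: sum(counts) for row, counts in table.items()}
# Column sums: total count for each value of attribute A.
column_totals = [sum(cells) for cells in zip(*table.values())]
grand_total = sum(row_totals.values())

print(row_totals)
print(dict(zip(columns, column_totals)))
print(grand_total)
```

The row sums and the column sums must both add up to the same grand total, which is a useful check when filling in such a table by hand.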
Questions:
1. Types of data
2. Data quality measures
3. Data visualization methods
4. Plot boxplot, histogram, stem and leaf, piechart, mosaic plot etc
5. Problems on frequency distribution table, ogive plot
6. Problems on location, dispersion and shape measures for grouped and ungrouped data