Asha Karegowda DataAnalytics Unit1 Part 1 Notes


Unit I – Data Understanding VI sem BE OE41 Data Analytics

Teacher : Dr. Asha Gowda Karegowda


===================================================================================

Reference - Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn,
Guide to Intelligent Data Analysis, Springer

Difference between data and knowledge:

Data                                                Knowledge
refer to single instances                           refers to classes of instances
describe individual properties                      describes general patterns, structures, laws, principles, etc.
are often available in large amounts                consists of as few statements as possible
are often easy to collect or to obtain              is often difficult and time-consuming to find or to obtain
do not allow us to make predictions or forecasts    allows us to make predictions and forecasts

Intelligent Data Analysis:


 Any kind of data analysis can be associated with statistics.
 Statistics can be divided into descriptive and inferential statistics.
 Descriptive statistics summarizes data without making specific
assumptions about the data, often by characteristic values like the
(empirical) mean or by diagrams like histograms. It covers measures of
location, dispersion, and shape, as well as visualization.
 Inferential statistics provides more rigorous methods than descriptive
statistics that are based on certain assumptions about the data
generating random process. The conclusions drawn in inferential
statistics are only valid if these assumptions are satisfied.
 Data collection: experimental and observational studies.
 In an experimental study one can control and manipulate the data
generating process. For instance, if we are interested in the effects of
certain diets on the health status of a person, we might ask different
groups of people to stick to different diets. Thus we have a certain
control over the data generating process. In this experimental study,
we can decide which and how many people should be assigned to a
certain diet.
 In an observational study one cannot control the data generating
process. For the same dietary study as above, we might simply ask
people on the street what they normally eat. Then we have no control
over which kinds of diets appear in the data or how many people we
have for each diet.

Difference between data mining and Data analytics

Data mining is one of the steps of knowledge discovery in databases (KDD), as
illustrated in the figure below.


Chapter 4
Topics to be covered in Data Understanding

4. Data Understanding
4.1 Attribute Understanding
4.2 Data Quality
4.3 Data Visualization Methods for One and Two Attributes

I. Data understanding:
In most cases, we assume that the data can be described in terms of a table or data
matrix whose rows contain the instances, records, or data objects and whose
columns represent the attributes, features, or variables.
The data might not be stored directly in one table but in different tables from which the
attributes of interest need to be extracted and joined into a single table.

ID  Name       Age  Salary
1   Abhi       34   13000
2   Alex       28   15000
3   Parikshit  20   18000
4   Ram        42   19020
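
As a rough illustration of joining attributes from several tables into one data matrix, the sketch below splits the example table into two source tables and joins them on ID. The table names `persons` and `salaries` and the function `join_on_id` are hypothetical, chosen only for this example; the values are those of the table above.

```python
# Two hypothetical source tables keyed by ID (values from the example table).
persons = {1: ("Abhi", 34), 2: ("Alex", 28), 3: ("Parikshit", 20), 4: ("Ram", 42)}
salaries = {1: 13000, 2: 15000, 3: 18000, 4: 19020}


def join_on_id(persons, salaries):
    """Return one row (instance) per ID that appears in both tables."""
    rows = []
    for pid in sorted(persons.keys() & salaries.keys()):
        name, age = persons[pid]
        rows.append({"ID": pid, "Name": name, "Age": age, "Salary": salaries[pid]})
    return rows


table = join_on_id(persons, salaries)
for row in table:
    print(row)
```

Each dictionary in `table` is one instance (row), and each key is one attribute (column).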

The domain of an attribute is the set of possible values for the attribute.

One of the most basic properties of an attribute is its scale type: an attribute can be
categorical (nominal or ordinal) or numerical (interval or ratio).


1. Categorical Data:
a) Nominal data
 An attribute is called nominal if its domain is a finite set (usually text data).
 There is no order among the values.
 The possible values for a categorical attribute are often considered as classes or
categories (e.g. COVID test result: positive, negative).
 Two nominal values or categories are either equal or different, but not more or
less similar.
 Nominal data do not support arithmetic operations like addition, subtraction,
multiplication and division.

Examples of nominal attributes:

o Binominal categorical attribute:
 Gender with the values F (female) and M (male) in the customer
data set
 Purchased product: yes or no
 Result: pass or fail
 Registered for a course? Yes or No

o Multinominal categorical attribute:
 Car color can take a set of discrete values: red, green, blue, ...
 Marital status: married, divorced, single
 State names in India: Karnataka, Maharashtra, UP, HP, ...
 Blood group: A-positive, A-negative, B-positive, B-negative,
AB-positive, AB-negative, O-positive, O-negative
 Languages: Hindi, English, French, German, Spanish, etc.


b) Ordinal data:
 The domain is a finite set of categories (usually text) with a linear ordering
imposed on it.
 Supports the comparison operations =, !=, >, <. Does not support arithmetic operations.

Examples of ordinal attributes:


 An attribute for university degrees with the values none, B.Sc., M.Sc., and
Ph.D. represents an ordinal attribute. A Ph.D. is a higher degree than an M.Sc.,
and an M.Sc. is a higher degree than a B.Sc. However, the ordering does not say
that the difference between a Ph.D. and M.Sc. is the same as the difference
between an M.Sc. and a B.Sc.
 Income of person – low , moderate, high income.
 Grades scored by student – S,A, B, C, D ,E, F.
 Course Ratings: Poor, Fair, Good, Very Good, Excellent.
 Taste Preferences: First Choice, Second Choice, Last Choice
 TV show satisfaction rating: “extremely dislike”, “dislike”, “neutral”, “like”,
“extremely like”
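
The operations allowed on ordinal data can be sketched in Python as follows. This is a minimal illustration using the university degree example; the names `DEGREE_ORDER`, `RANK`, `is_higher` and `highest` are assumptions made for this sketch. The rank numbers encode order only, so differences between ranks carry no meaning.

```python
# Ordinal domain listed from lowest to highest; the index is only a rank.
DEGREE_ORDER = ["none", "B.Sc.", "M.Sc.", "Ph.D."]
RANK = {degree: i for i, degree in enumerate(DEGREE_ORDER)}


def is_higher(a, b):
    """True if degree a ranks above degree b (comparison is allowed)."""
    return RANK[a] > RANK[b]


def highest(degrees):
    """Return the highest degree in a list (uses ordering only, no arithmetic)."""
    return max(degrees, key=RANK.__getitem__)


print(is_higher("Ph.D.", "M.Sc."))
print(highest(["B.Sc.", "Ph.D.", "none"]))
```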

2. Numerical data:
 The domain of a numerical attribute consists of numbers.
 Numerical data can be discrete or continuous, and are measured on an interval or ratio scale.

Discrete Numerical attributes:


 They take integer values.
 This type of data can’t be measured but it can be counted.
 Example: Discrete numerical attributes often result from counting processes
o Number of students enrolled for a course
o Number of customers who have ordered from an on-line shop in the last
twelve months.
o Number of BE branches in an engineering institute
o Number of employees with experience more than 15 years

Continuous:
 Takes real values (possibly non-integer, with a decimal point).
 Continuous data represent measurements; their values cannot be counted, but they
can be measured.
Example:
Temperature, height, and weight of a person can be described using intervals on the
real number line.
Drastic round-off errors or truncations can lead to problems in later steps of the
analysis of continuous data, in particular for scientific applications.


Numerical attributes can have an interval, a ratio, or an absolute scale.


a)Interval scaled data :
 Interval-scaled data show the order and the exact difference between values.
 It has no true zero (zero has a different meaning).
o On an interval scale, zero does not mean the absence of value; it is simply
another point on the scale, like 0 degrees Celsius.
o Without a true zero, it is impossible to compute ratios.

 Negative numbers also have meaning.


 With interval data, we can add and subtract, but cannot multiply or divide.
 Interval scales hold no true zero and can represent values below zero. For
example, you can measure temperature below 0 degrees Celsius, such as -10
degrees.
 The intervals on the scale are equal in an interval variable. The scale is
equidistant.
 Any measurement of interval scale can be ranked, counted, subtracted, or added,
and equal intervals separate each number on the scale. However, these
measurements don’t provide any sense of ratio between one another.

Examples of interval scaled data:


 IQ Test: An individual cannot have a zero IQ, therefore satisfying the no
zero property of an interval variable. The level of an individual's IQ will be
determined, depending on which interval the score falls in.
 Time: Time is a good example of interval data if measured during the
day or using a 12-hour clock. The numbers on a wall clock are on an interval
scale since they are equidistant and measurable. For example, the difference
between 1 o’clock and 2 o’clock is the same as that between 2 o’clock and 3
o’clock.
 Temperature: on scales like Fahrenheit and Celsius, zero refers to different
temperatures.
Example: 10 degrees C + 10 degrees C = 20 degrees C.
However, 20 degrees C is not twice as hot as 10 degrees C, because
there is no such thing as “no temperature” on the Celsius scale.
Converting to Fahrenheit makes this clear: 10C = 50F and 20C = 68F, and
68F is clearly not twice 50F. It does not make sense to say that
a temperature of 68F is twice as hot as one of 50F.
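
The temperature example can be checked with a few lines of Python. This is a small sketch; the function name `c_to_f` is chosen here for illustration. It shows that the ratio of two temperatures changes with the unit, while the difference converts consistently (18 Fahrenheit degrees for every 10 Celsius degrees).

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


cold_c, warm_c = 10, 20
cold_f, warm_f = c_to_f(cold_c), c_to_f(warm_c)

ratio_c = warm_c / cold_c   # ratio of the readings on the Celsius scale
ratio_f = warm_f / cold_f   # ratio of the same two temperatures in Fahrenheit

diff_c = warm_c - cold_c    # difference in Celsius degrees
diff_f = warm_f - cold_f    # difference in Fahrenheit degrees

print(f"{cold_c}C = {cold_f}F, {warm_c}C = {warm_f}F")
print(f"ratio in C: {ratio_c}, ratio in F: {ratio_f:.2f}")
print(f"difference in C: {diff_c}, difference in F: {diff_f}")
```

Because the two ratios disagree, "twice as hot" has no unit-independent meaning on an interval scale.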


Note: Measuring temperature is an excellent example of interval scales. The
temperature in an air-conditioned room is 16 degrees Celsius, while the
temperature outside the room is 32 degrees Celsius. You can conclude the
temperature outside is 16 degrees higher than inside the room.

But if you said, “It is twice as hot outside as inside,” you would be incorrect.
By stating the temperature outside is twice that inside, you are using 0 degrees
as the reference point to compare the two temperatures. Since it is possible to
measure temperature below 0 degrees, you cannot use it as a reference point for
comparison. You must use an actual number (such as 16 degrees) instead.

b) Ratio scaled numerical data:


 Ratio-scaled data are numerical data with order; they tell us the exact difference
between values.
 They have a clear meaning of the zero value and thus allow us to compute
meaningful ratios.
 The value zero has the same meaning regardless of the measurement unit.
 Ratio variables never fall below zero. Height and weight
measure from 0 and above, but never fall below it.
o Unlike interval-scaled data, ratio variables cannot take negative values.
 These variables can be meaningfully added, subtracted, multiplied, and divided
(ratios).

Examples of ratio scaled data are age, money, height, distance, or duration.
 For example, if you are 50 years old and your child is 25 years old, you can
accurately claim you are twice the age of your child.
Whereas you cannot claim that it is twice as warm outside, because temperature
is on an interval scale, you can say you are twice another person’s age, because
age is a ratio variable.
 Distance can be measured in different units like meters, kilometers, or miles.
o But no matter which unit we choose, a distance of zero will always have
the same meaning.
o Especially ratios, which do not make sense for interval scales, are often
useful for ratio scales: the quotient of distances is independent of the
measurement unit, so that the distance 20 km is always twice as long as
the distance 10 km, even if we change the unit kilometers to meters or
miles.
Absolute scaled data: For a ratio scale, only the value zero has a canonical meaning, and
the meaning of other values depends on the choice of the measurement unit. For an
absolute scale, there is a unique measurement unit. A typical example of an absolute
scale is any kind of counting procedure.


4.2 Data Quality


The saying “garbage in, garbage out” applies to data analysis as to any other area. The
results of an analysis cannot be better than the quality of the data, so we should be
concerned about data quality before we carry out any deeper analysis of the data.
Data quality refers to how well the data fit their intended use.

There are various data quality dimensions.


a) Accuracy:
Accuracy is defined as the closeness between the value in the data and the true value.
For numerical attributes, accuracy means how exact the value in the data set is
compared to the true value.
 Noise or limited precision in measurements can lead to reduced accuracy for
numerical attributes.
 The magnitude of noise can be estimated when measurements for the same value
have been taken repeatedly.
 Accuracy of numerical values can also be affected by wrong or erroneous
measurements or simply by errors like transposition of digits when measurements
are recorded manually.
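
As a small sketch of estimating the magnitude of noise from repeated measurements, the sample standard deviation of the repeated readings can serve as a noise estimate. The readings below are hypothetical values invented for this illustration, not data from the notes.

```python
import statistics

# Hypothetical repeated readings of the same quantity (illustrative values only).
readings = [20.1, 19.8, 20.3, 20.0, 19.9]

estimate = statistics.mean(readings)   # estimate of the underlying true value
noise = statistics.stdev(readings)     # sample standard deviation as noise estimate

print(f"estimate = {estimate:.2f}, noise (std dev) = {noise:.3f}")
```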

For categorical attributes, problems with accuracy can result from misspellings like
“fmale” for a value of the attribute gender, and also from erroneous entries.

Syntactic and semantic accuracy:


Syntactic accuracy means that a value belongs to the domain of its attribute. It is
violated when the data value does not belong to the domain.

 For a categorical attribute like gender, for which only the values in the domain
{male, female} are admitted, “fmale” violates syntactic accuracy.

 For numerical attributes, syntactic accuracy means more than that the value
must be a number and not a string or text: certain numerical values can also lie
outside the admissible range.
Example: if the range is the interval [0, 100] for the percentage of votes for a
candidate, negative values and values larger than 100 should not occur.

 Attributes like weight or duration will admit only positive values, and
therefore negative values would violate syntactic accuracy.

 For integer-valued attributes like the number of items a customer has bought,
floating-point values should be excluded.

 Discovering problems of syntactic accuracy in a data set is a relatively easy task.
Once we know the domains of the attributes, we can easily verify whether the
values lie in the corresponding domains or not.
 A simple measure for syntactic accuracy is the fraction of values that lie in the
domains of their corresponding attributes.
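
The syntactic accuracy measure described above can be sketched as follows. The domain checks and the sample values are assumptions made for this illustration: gender restricted to {male, female}, vote percentage in [0, 100], and item counts as non-negative integers.

```python
def in_gender_domain(v):
    """Only the two admitted category labels are in the domain."""
    return v in {"male", "female"}


def in_vote_share_domain(v):
    """A percentage must be a number in [0, 100] (bool is excluded on purpose)."""
    return isinstance(v, (int, float)) and not isinstance(v, bool) and 0 <= v <= 100


def in_item_count_domain(v):
    """A count must be a non-negative integer; floating-point values are excluded."""
    return isinstance(v, int) and not isinstance(v, bool) and v >= 0


def syntactic_accuracy(values, in_domain):
    """Fraction of values that lie in the attribute's domain."""
    if not values:
        raise ValueError("no values to check")
    return sum(1 for v in values if in_domain(v)) / len(values)


# Hypothetical attribute columns containing some syntactic errors.
genders = ["male", "female", "fmale", "female"]
vote_shares = [12.5, 101, -3, 47]
item_counts = [1, 2.5, 3, 0]

print(syntactic_accuracy(genders, in_gender_domain))
print(syntactic_accuracy(vote_shares, in_vote_share_domain))
print(syntactic_accuracy(item_counts, in_item_count_domain))
```

A check like this cannot detect semantic errors: a value can pass the domain test and still be wrong.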

Semantic accuracy means that a value might be in the domain of the corresponding
attribute, but it is not correct.
Example
 When the attribute gender has the value female for the customer John Smith, then
this is not a question of syntactic accuracy, since female is a possible value of the
attribute gender. But it is obviously a wrong value for a person named “John”.
 The true value of educational qualification for a person is BE and by mistake it is
entered as MTech.

The verification of semantic accuracy is much more difficult or often even impossible.


b) Completeness: Completeness can be divided into completeness with respect to
attribute values and completeness with respect to records.

Completeness with respect to attribute values refers to the absence of missing values.
When missing values are explicitly marked as such, a simple measure for this
dimension of data quality is the fraction of missing values among all values.
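
A minimal sketch of this measure, assuming missing values are explicitly marked with `None`. The rows reuse names from the earlier example table, but the missing entries are invented for illustration.

```python
# Rows of a small table; None marks an explicitly missing value (illustrative data).
rows = [
    {"Name": "Abhi", "Age": 34, "Salary": 13000},
    {"Name": "Alex", "Age": None, "Salary": 15000},
    {"Name": "Parikshit", "Age": 20, "Salary": None},
    {"Name": "Ram", "Age": 42, "Salary": 19020},
]


def missing_fraction(rows):
    """Fraction of missing values among all values in the table."""
    total = sum(len(row) for row in rows)
    missing = sum(1 for row in rows for value in row.values() if value is None)
    return missing / total


print(f"{missing_fraction(rows):.3f}")
```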

Completeness with respect to records means that the data set contains the necessary
information that is required for the analysis.
Reasons for incomplete data:
Some records might simply be missing for technical reasons.
Data might have been lost because a few years ago the underlying database system was
changed and only those data records were transferred to the new database that were
considered to be important at that point in time.
Example: Consider a bank that provides loans to private customers. If the aim
of the analysis is to predict for future loan applicants whether they will repay the
loan, we must take into account that the sample is biased in the sense that we only have
information about those customers who have been granted a loan. We should also have
records related to customers who are defaulters.

c) Unbalanced data: As an example, consider a production line for goods for which an
automatic quality control is to be installed. Based on suitable measurements, a classifier
is to be constructed that sorts out parts with flaws or faults. The scrap rate in production is
usually very small, so our data might contain far fewer than 1% examples of parts
with flaws or faults. (Similarly, a medical data set should contain records of both healthy
and diseased patients.)

d)Timeliness refers to whether the available data are too old to provide up to date
information or cannot be considered as representative for predictions of future data.
Timeliness is often a problem in dynamically changing domains, where only recently
collected data provide relevant information, while older data can be misleading and can
indicate trends that have vanished or even reversed.

In brief : Data Quality


 Syntactic accuracy: Entry is not in the domain. Can be checked quite easily.
Examples: fmale in gender, text in numerical attributes, ...

 Semantic accuracy : Entry is in the domain but not correct. Needs more
information to be checked (e.g. “business rules”).
Example: John Smith is female

 Completeness: With respect to attributes (missing data) and with respect to
records (records must be sufficient and must cover all class types).
Example: Complete records are missing, the data is biased


 Unbalanced data: The data set might be biased extremely to one type of records.
Example: Defective goods are a very small fraction of all.

 Timeliness: Is the available data up to date?

4.3.1 Visualization Methods for One and Two Attributes


One dimensional Plots: Bar chart, histogram, Box plots, Pie chart, stem and leaf
Two dimensional Plots: Scatter plots

a) Bar chart: A bar chart is a simple way to depict the frequencies of the values of a
categorical attribute. It’s a one dimensional plot.
A simple example for a categorical attribute with six values a, b, c, d, e, and f is
shown on the left in Fig. 4.2.

Plot a bar chart for the following:

Grade  Frequency
a      38
b      80
c      20
d      62
e      94
f      41

Plot a histogram for the following:

Temperature  Frequency
-3           10
-2           40
-1           135
0            180
1            115
2            128
3            180
4            148
5            48
6            5
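
As a rough sketch, the grade frequencies above can be drawn as a text bar chart without any plotting library. The function name `text_bar_chart` and the scaling to 40 characters are choices made for this example only.

```python
# Grade frequencies from the bar chart exercise.
grades = {"a": 38, "b": 80, "c": 20, "d": 62, "e": 94, "f": 41}


def text_bar_chart(freqs, width=40):
    """Return one line per category; bar length is proportional to frequency."""
    largest = max(freqs.values())
    lines = []
    for label, count in freqs.items():
        bar = "#" * round(count / largest * width)
        lines.append(f"{label} | {bar} {count}")
    return lines


chart = text_bar_chart(grades)
print("\n".join(chart))
```

The bars are separated and the categories could be listed in any order, which is what distinguishes a bar chart from a histogram.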


b) Histogram:
o A histogram shows the frequency distribution for a numerical attribute.
o The range of the numerical attribute is discretized into a fixed number of intervals
(called bins), usually of equal width.
o There is no generally best choice for the number of bins, but there are certain
recommendations.

Different approaches to find number of bins

(i) Sturges’ rule proposes to choose the number k of bins according to the following
formula

$k = \lceil \log_2(n) + 1 \rceil$    (4.1)

where n is the sample size.

Although Sturges’ rule is still very often used as a default in various statistics
software packages, it is tailored to data from normal distributions and data sets
of moderate size.

(ii) Square-root rule: $k = \sqrt{n}$, rounded to a whole number, where n is the number
of samples and k is the number of bins.


(iii) The number of bins can also be determined based on the width h of each bin:

$k = \left\lceil \frac{\max_i\{x_i\} - \min_i\{x_i\}}{h} \right\rceil$    (4.2)

where x1,...,xn is the sample to be displayed and the brackets $\lceil \cdot \rceil$
denote the ceiling function.

Reasonable values for h are given by

$h = \frac{3.5 \cdot s}{n^{1/3}}$    (4.3)

where s is the sample standard deviation, and

$h = \frac{2 \cdot \mathrm{IQR}(x)}{n^{1/3}}$    (4.4)

where IQR(x) is the interquartile range of the sample,
that is, the length of the interval which covers the middle 50% of the data.
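
The bin-count rules can be compared with a short Python sketch. It assumes the usual textbook forms of the formulas: Sturges' rule k = ceil(log2(n) + 1), the square-root rule rounded up, (4.2) k = ceil((max - min) / h), (4.3) h = 3.5 s / n^(1/3), and (4.4) h = 2 IQR / n^(1/3). The function names are hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose default method may differ slightly from other quartile conventions.

```python
import math
import statistics


def bins_sturges(n):
    """Sturges' rule: k = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)


def bins_sqrt(n):
    """Square-root rule, rounded up (the rounding direction is an assumption)."""
    return math.ceil(math.sqrt(n))


def bins_from_width(data, h):
    """Equation (4.2): k = ceil((max - min) / h)."""
    return math.ceil((max(data) - min(data)) / h)


def width_from_std(data):
    """Equation (4.3): h = 3.5 * s / n^(1/3), s = sample standard deviation."""
    return 3.5 * statistics.stdev(data) / len(data) ** (1 / 3)


def width_from_iqr(data):
    """Equation (4.4): h = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)


sample = [1, 2, 3, 4, 5, 6, 7, 8]   # small illustrative sample
print(bins_sturges(len(sample)), bins_sqrt(len(sample)))
print(bins_from_width(sample, width_from_std(sample)))
print(bins_from_width(sample, width_from_iqr(sample)))
```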

Points to take care with histogram

o The histogram can be misleading when the number of bins is chosen too small.

o Choosing the number of bins too high usually leads to a very scattered
histogram in which it is difficult to distinguish true peaks from random
peaks.

o All of these methods (that is, (4.2), (4.3), and (4.4) for determining the number of
bins or the length of the bins) are highly sensitive to outliers, since they divide
the range between the smallest and the largest value of the sample into bins of
equal size.

o A single outlier can make this range extremely large, so that for a smaller
number of bins, the bins themselves become very large, and for a larger
number of bins, most of the bins can be empty.

o To avoid this problem, one can either leave out extreme values from the
sample (for instance, the 3% smallest and the 3% largest values) for
calculating and displaying the histogram, or one can deviate from the
principle of bins of equal length.
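
Leaving out extreme values before building the histogram can be sketched as below. The function name `trim_extremes`, the integer `percent` parameter, and the sample data are assumptions for this illustration.

```python
def trim_extremes(data, percent=3):
    """Drop the given percentage of smallest and of largest values."""
    ordered = sorted(data)
    k = len(ordered) * percent // 100   # number of values cut at each end
    return ordered[k:len(ordered) - k]


# Hypothetical sample: 99 ordinary values plus a single extreme outlier.
sample = list(range(1, 100)) + [10_000]
trimmed = trim_extremes(sample)

print(max(sample) - min(sample))     # range before trimming
print(max(trimmed) - min(trimmed))   # range after trimming
```

With the outlier removed, equal-width bins cover the bulk of the data instead of mostly empty space.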


Eg. The depth of clarity of Lake Tahoe was measured at several different places with the results
in inches as follows:

15.4, 16.7, 16.9, 17.0, 20.2, 25.3, 28.8, 29.1, 30.4, 34.5,
36.7, 39.1, 39.4, 39.6, 39.8, 40.1, 42.3, 43.5, 45.6, 45.9,
48.3, 48.5, 48.7, 49.0, 49.1, 49.3, 49.5, 50.1, 50.2, 52.3

We use a frequency distribution table with class intervals of length 5.

N = 31

Number of bins = ⌈(max - min) / h⌉ = ⌈(52.3 - 15.4) / 5⌉ = ⌈7.38⌉ = 8 bins of width h = 5

Class Interval  Frequency (F)  Cumulative Frequency  Relative Frequency (F/N)  Cumulative Relative Frequency
15 - < 20       4              4                     0.129                     0.129
20 - < 25       1              5                     0.032                     0.161
25 - < 30       3              8                     0.097                     0.258
30 - < 35       2              10                    0.065                     0.323
35 - < 40       6              16                    0.194                     0.516
40 - < 45       3              19                    0.097                     0.613
45 - < 50       9              28                    0.290                     0.903
50 - < 55       3              31                    1.000
Total           31                                   1.000
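
A frequency table of this kind can be computed with the sketch below; `frequency_table` is a hypothetical helper written for this example. It uses only the 30 values listed above. Those values give a frequency of 5 for the class 35 - < 40 and a total of 30, whereas the table reports 6 and N = 31, so one value in that class appears to be missing from the listing.

```python
# The 30 depth values listed in the notes.
depths = [
    15.4, 16.7, 16.9, 17.0, 20.2, 25.3, 28.8, 29.1, 30.4, 34.5,
    36.7, 39.1, 39.4, 39.6, 39.8, 40.1, 42.3, 43.5, 45.6, 45.9,
    48.3, 48.5, 48.7, 49.0, 49.1, 49.3, 49.5, 50.1, 50.2, 52.3,
]


def frequency_table(data, start, width, bins):
    """Return (low, high, freq, cum_freq, rel_freq, cum_rel_freq) per class."""
    counts = [0] * bins
    for x in data:
        idx = int((x - start) // width)   # class index for the interval [low, high)
        if not 0 <= idx < bins:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[idx] += 1
    n = len(data)
    rows, cum = [], 0
    for i, freq in enumerate(counts):
        cum += freq
        low = start + i * width
        rows.append((low, low + width, freq, cum, freq / n, cum / n))
    return rows


table = frequency_table(depths, start=15, width=5, bins=8)
for low, high, freq, cum, rel, cum_rel in table:
    print(f"{low} - < {high}  {freq:2d}  {cum:2d}  {rel:.3f}  {cum_rel:.3f}")
```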


Bar Charts
 A bar chart is made up of columns that represent a categorical variable.
 The height of the column indicates the size of the group defined by the column
label.

The bar chart below shows average per capita income for the four "New" states - New
Jersey, New York, New Hampshire, and New Mexico.

The chart shows that per capita income is highest in New Jersey and lowest in New Mexico.

Histograms
 A histogram is made up of columns that represent a continuous, quantitative variable.
 The column label can be a single value or a range of values.
 The height of the column indicates the size of the group defined by the column
label.

Example : The histogram below shows per capita income for five age groups.

You can see from the chart that per capita income is greatest in the 45 to 54 age group.


The Difference Between Bar Charts and Histograms

Basis for Comparison | Bar graph | Histogram
Indicates | Comparison of discrete variables | Distribution of non-discrete variables, i.e. range of data
Presents | Categorical data | Continuous, quantitative data
Spaces | Bars do not touch each other, hence there are spaces between bars. | Bars touch each other, hence there are no spaces between bars.
Elements | Elements are taken as individual entities. | Elements are grouped together, so that they are considered as ranges.
Skewness | Categorical data, hence it is not appropriate to comment on the skewness of a bar chart. | Numerical data, hence it can be appropriate to talk about the skewness of a histogram; that is, the tendency of the observations to fall more on the low end or the high end of the X axis.
Width of bars | Same | Need not be the same


[ Note: The Shape of a Histogram. A histogram is unimodal if there is one peak, bimodal if there
are two peaks, and multimodal if there are many peaks. A histogram is called skewed
if it is not symmetric. If the upper tail is longer than the lower tail, it is positively skewed. If the
upper tail is shorter than the lower tail, it is negatively skewed. ]

Fig : histogram with unimodal ,bimodal and multimodal distribution

Histogram
In principle, a histogram looks like a bar chart, with the only difference that the
domain of the underlying attribute is metric (numerical). As a consequence, it is
usually impossible to simply enumerate the frequencies of the individual attribute
values (because there are usually too many different values), but one has to form
counting intervals, which are usually called bins or buckets. The width (or, if the
domain is fixed, equivalently the number) of these bins has to be chosen by a user.
All bins should have the same width, since histograms with varying bin widths are
usually more difficult to read—for the same reasons why area charts are more
difficult to interpret than bar charts (see above). In addition, whether a histogram
provides a good impression of the data depends on whether an appropriate bin width
has been chosen and on which values the borders of the bins fall.

c) Boxplot :
 Boxplots are a very compact way to visualize and summarize main
characteristics of a sample from a numerical attribute.
 The box plot is a standardized way of displaying the distribution of data based
on the five number summary: minimum, first quartile, median, third quartile,
and maximum.
 In the simplest box plot the central rectangle spans the first quartile to the
third quartile (the interquartile range or IQR).

 The line in the middle of a boxplot indicates the sample median.

 The box itself corresponds to the interquartile range covering the middle 50%
of the data.

 The whiskers are drawn in the following way: the maximum length of each whisker
is 1.5 times the length of the interquartile range. But if there is no data point at
the maximum length of a whisker, the corresponding whisker is shortened until it
reaches the next data point. Data points lying outside the whiskers are considered
outliers and are indicated in the form of small circles.

Steps for plotting boxplot

Step 1: Sort the given data in ascending order.
Step 2: Compute the median (Q2), Q1, Q3, IQR, upper limit, and lower limit:
Position of Q1 in the sorted data = (n+1)/4
Position of Q2 in the sorted data = (n+1)/2
Position of Q3 in the sorted data = 3(n+1)/4
IQR = Q3 - Q1
Upper limit = Q3 + (1.5 * IQR)
Lower limit = Q1 - (1.5 * IQR)
Step 3: Find the min and max values.
Step 4: Find outliers: values < lower limit and values > upper limit.
Step 5: If there are no outliers, go to step 8.
Step 6: If min is an outlier, change min to the smallest value that is >= lower limit.
Step 7: If max is an outlier, change max to the largest value that is <= upper limit.
Step 8: Plot the boxplot with min, max, Q1, Q2, Q3. Plot outliers, if any, as asterisks (*).

-------------------Q1-----------------------------Q2-------------------------Q3-------------------


Eg) For the given data construct box plot for ungrouped data


Detailed Boxplot Construction Steps

The following steps can be used to construct a box plot.

1. Put the data values in order.

2. Find the median, Q2, i.e. the middle data value when the scores are put in order:
the value at position ½(n+1) in the sorted data.

3. Find Q1: the value at position ¼(n+1) in the sorted data.

4. Find Q3: the value at position ¾(n+1) in the sorted data.

5. Find the minimum, or smallest, data value, and the maximum, or largest, data value.

6. Find Q3 minus Q1. This is the interquartile range, denoted IQR.

7. Multiply the IQR by 1.5. This is the maximum whisker length, denoted MWL.

8. Subtract the MWL from Q1. This is the Lower Fence. Reasonable data values should
be at or above the Lower Fence. Lower limit = Q1- ( 1.5*IQR)

9. Add the MWL to Q3. This is the Upper Fence. Reasonable data values should be at or
below the Upper Fence. Upper limit = Q3 + ( 1.5*IQR)

10. Mark any data values below the Lower Fence or above the Upper Fence as possible
outliers.

11. If the minimum is a possible outlier, replace it by the smallest data value that is not a
possible outlier. Call this the new minimum.

12. If the maximum is a possible outlier, replace it by the largest data value that is not a
possible outlier. Call this the new maximum.

13. Draw a number line that extends from the original minimum data value to the original
maximum data value.

14. Mark the new minimum, Q1, the median, Q3, and the new maximum as short vertical
lines above their corresponding position on the number line. Use the minimum if
there is no new minimum. Use the maximum if there is no new maximum.

15. Connect the segments for Q1, the median, and Q3 with horizontal lines through their
top points and their bottom points.

16. Draw a line from the middle of the segment for Q1 to the middle of the segment for
the new minimum (if you have one) or otherwise to the segment for the minimum.

17. Draw a line from the middle of the segment for Q3 to the middle of the segment for
the new maximum (if you have one) or otherwise to the segment for the maximum.

18. Mark the Upper Fence and the Lower Fence. Mark the locations of all possible outliers
(values below the Lower Fence or above the Upper Fence) with asterisks (*).

Example: We will use the following data representing tornadoes per year in
Oklahoma from 1995 until 2004 (Sullivan, 2nd edition, p. 167) to construct a
modified box plot. n = 10
79 47 55 83 145 44 61 18 78 62
Step 1: The data is put in order from smallest to largest.
18 44 47 55 61 62 78 79 83 145
Step 2: The median position is ½(n+1) = 5.5, so the median is the average of the
5th and 6th values: (61 + 62)/2 = 61.5
Step 3: Position of Q1 = ¼(n+1) = 2.75. Q1 = average of the 2nd and 3rd values = (44 + 47)/2 = 45.5
Step 4: Position of Q3 = ¾(n+1) = 8.25. Q3 = average of the 8th and 9th values = (79 + 83)/2 = 81
Step 5: The minimum is 18 and the maximum is 145.
Step 6: Now find the interquartile range: IQR = Q3 - Q1 = 81 - 45.5 = 35.5
Step 7: Next we find the upper limit = Q3 + (1.5 × IQR) = 81 + (1.5 × 35.5) = 134.25
Step 8: Next we find the lower limit = Q1 - (1.5 × IQR) = 45.5 - (1.5 × 35.5) = -7.75

Step 9: No data value is below the lower limit, hence there are no outliers on the left-hand side.
Step 10: 145 is above the upper limit (145 > 134.25), hence 145 is a possible outlier.
Step 11: Since there are no data values below the lower limit, i.e. no outliers on the left,
we leave the minimum unchanged.
Step 12: The original maximum was a possible outlier, so we use the maximum of the
remaining data, 83, as the new maximum.
Step 13: Draw a number line with a uniform scale that extends at least from the original
minimum to the original maximum, but not much farther.

Step 14: Mark the locations of the following five values with vertical line segments all
having the same length: the minimum, the first quartile, the median, the third quartile,
and the new maximum.

Step 15: Connect the tops of the line segments for the median and the other quartiles, and
then connect the bottoms of the same line segments to make the box.

Step 16 and 17: Draw a line from the first quartile to the minimum and another from the
third quartile to the new maximum to make the whiskers.

Step 18: Mark the locations of the Upper Fence and the Lower Fence, and mark the possible
outlier at 145 with an asterisk.
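
The numerical part of the steps above can be checked with a short Python sketch. It assumes the quartile convention used in these notes (position p(n+1) in the sorted data, averaging the two neighbouring values when the position is not a whole number); other textbooks and libraries use slightly different quartile rules, so their results may differ. The function names are illustrative only.

```python
import math


def value_at_position(sorted_data, pos):
    """Return the value at a 1-based position in sorted data.

    A fractional position is resolved by averaging the two neighbouring
    values, which is the convention used in the worked example.
    """
    lower = sorted_data[math.floor(pos) - 1]
    upper = sorted_data[math.ceil(pos) - 1]
    return (lower + upper) / 2


def boxplot_summary(data):
    """Quartiles, fences, whisker ends and possible outliers of a sample."""
    s = sorted(data)
    n = len(s)
    if n < 3:
        raise ValueError("need at least 3 values for the (n+1) position rule")
    q1 = value_at_position(s, (n + 1) / 4)
    median = value_at_position(s, (n + 1) / 2)
    q3 = value_at_position(s, 3 * (n + 1) / 4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are not possible outliers.
    inside = [x for x in s if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_min": inside[0], "whisker_max": inside[-1],
        "outliers": [x for x in s if x < lower_fence or x > upper_fence],
    }


tornadoes = [79, 47, 55, 83, 145, 44, 61, 18, 78, 62]
summary = boxplot_summary(tornadoes)
for name, value in summary.items():
    print(f"{name:12s} {value}")
```

For the tornado data this reproduces the values of the worked example: Q1 = 45.5, median = 61.5, Q3 = 81, fences -7.75 and 134.25, whiskers ending at 18 and 83, and 145 as the only possible outlier.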


18 44 47 55 61 62 78 79 83 145


Example
Let the data be 160, 170, 236, 269, 271, 278, 283, 291, 301, 303 and 400.
Therefore n = 11.


Boxplot with modified min and max values.

160, 170, 236, 269,271,278,283,291, 301, 303, and 341

Hence it is clear that any value above 398.5 or below 171 is an outlier. In the data
series 160, 170, 236, 269, 271, 278, 283, 291, 301, 303, 400, the outliers are 160, 170 and 400.

These 3 values, which lie at the extremes, can be considered abnormal and
should be discarded from the series so that any analysis made on it is not
influenced by these extreme values.

So the data series to be considered for further study after discarding the
outliers is:

236, 269, 271, 278, 283, 291, 301, 303



Note: Suppose the data contained 175 instead of 170 in the above problem. The box plot
would then be as depicted below, with new minimum = 175 and a left whisker.


Note: In the first box plot there is a large range of values between the median and the
minimum, while the data to the right of the median are densely populated (i.e. less
variation); hence the data are negatively skewed. The greater variation on the left
can also be observed from the left whisker being longer than the right whisker.
In the third box plot there is a large range of values between the median and the
maximum, while the data to the left of the median are densely populated (i.e. less
variation); hence the data are positively skewed. The greater variation on the right
can also be observed from the right whisker being longer than the left whisker.
In the second box plot the data are equally distributed on either side of the median
(i.e. Q1 and Q3 are equidistant from Q2). Further, the left and right whiskers have the
same length, indicating a symmetric distribution, as for normally distributed data.

d) Pie chart
• A pie chart is a one-dimensional circular chart divided into sectors,
illustrating numerical proportion.
• In a pie chart, the arc length of each sector (and consequently its central angle and
area) is proportional to the quantity it represents.

Angle = (frequency / total samples) × 360°

( http://www.mathsisfun.com/data/pie-charts.html )

Example: Imagine you just did a survey of your friends to find which kind of
movie they liked best. Plot the pie chart.
Here are the results of the survey:

Comedy   Action   Romance   Drama   SciFi   TOTAL
4        5        6         1       4       20

Solution:
First, divide each value (i.e. frequency) by the total and multiply by 100 to get a percentage.
Next, work out how many degrees each "pie slice" (correctly called a sector) needs,
using Angle = (frequency / total samples) × 360°.

A full circle has 360 degrees, so we do this calculation:

             Comedy          Action          Romance         Drama           SciFi           TOTAL
Frequency    4               5               6               1               4               20
Percentage   4/20 = 20%      5/20 = 25%      6/20 = 30%      1/20 = 5%       4/20 = 20%      100%
Angle        (4/20) × 360°   (5/20) × 360°   (6/20) × 360°   (1/20) × 360°   (4/20) × 360°   360°
             = 72°           = 90°           = 108°          = 18°           = 72°
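
The percentage and angle calculation can be written as a small Python sketch. The function name `pie_sectors` is illustrative only; the frequencies are those of the movie survey.

```python
def pie_sectors(frequencies):
    """Map each category to (percentage, central angle in degrees)."""
    total = sum(frequencies.values())
    return {
        label: (f * 100 / total, f * 360 / total)
        for label, f in frequencies.items()
    }


survey = {"Comedy": 4, "Action": 5, "Romance": 6, "Drama": 1, "SciFi": 4}
sectors = pie_sectors(survey)
for label, (percent, angle) in sectors.items():
    print(f"{label:8s} {percent:5.1f}%  {angle:6.1f} degrees")
```

The angles of all sectors add up to 360 degrees, which is a quick check that no category has been missed.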


Plot bar chart and pie chart

Country   Frequency   Relative frequency % = (Freq/total) × 100
US        6           0.30 × 100 = 30
Japan     7           0.35 × 100 = 35
Europe    2           0.10 × 100 = 10
Korea     1           0.05 × 100 = 5
None      4           0.20 × 100 = 20
Total     20          100


             US            Japan         Europe        Korea         None          TOTAL
Frequency    6             7             2             1             4             20
Percentage   6/20 = 30%    7/20 = 35%    2/20 = 10%    1/20 = 5%     4/20 = 20%    100%
Angle        6/20 × 360°   7/20 × 360°   2/20 × 360°   1/20 × 360°   4/20 × 360°   360°
             = 108°        = 126°        = 36°         = 18°         = 72°
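
For the bar chart part of this exercise, a minimal text-only sketch in Python is given below: each bar is a row of `#` characters whose length equals the frequency. The function name and the `#` symbol are arbitrary choices made for illustration; a plotting library would normally be used instead.

```python
countries = {"US": 6, "Japan": 7, "Europe": 2, "Korea": 1, "None": 4}


def text_bar_chart(frequencies, symbol="#"):
    """Return one text line per category; bar length equals the frequency."""
    return [
        f"{label:<7}|{symbol * f} {f}"
        for label, f in frequencies.items()
    ]


bars = text_bar_chart(countries)
print("\n".join(bars))
```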

e) Stem and leaf plot

• A stem and leaf plot is a one-dimensional plot used to organize data as they are
collected (i.e. we need not sort the data first).
• A stem and leaf plot looks something like a bar graph when it is turned on its
side.
• Each number in the data is broken down into a stem and a leaf, thus the name.
• The stem of the number includes all but the last digit.
• The leaf of the number is always a single digit.
• It shows how the data are spread, that is, the highest number, the lowest number and
the most common number (i.e. the mode).

Elements of a good stem and leaf plot

A good stem and leaf plot
• shows the first digits of the number (thousands, hundreds or tens) as the stem
and shows the last digit (ones) as the leaf.
• usually uses whole numbers. Anything that has a decimal point is rounded to
the nearest whole number. Examples: test results, speeds, heights,
weights, etc.

Construct a stem and leaf plot for the given data (results in % scored by Section A students):

56, 78, 82, 82, 90, 94, 93, 67, 67, 69, 74, 77, 92, 88, 81, 83, 84, 77, 72

Step 1: Sort the data (not compulsory): 56, 67, 67, 69, 72, 74, 77, 77, 78, 81, 82, 82, 83, 84, 88, 90, 92, 93, 94

Step 2: Create the plot with the stems as the tens and the leaves as the ones.

The stems will be 5, 6, 7, 8 and 9


Now we are ready to add the ones place from each of the values in the list we made.

Step 3: Add a key to the bottom of the stem and leaf plot.
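
The three steps can be sketched in Python as follows. This is only an illustration for non-negative whole numbers with the tens as stems and the ones as leaves; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group sorted values by stem (all but the last digit) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


marks = [56, 78, 82, 82, 90, 94, 93, 67, 67, 69,
         74, 77, 92, 88, 81, 83, 84, 77, 72]
plot = stem_and_leaf(marks)

# Print every stem between the smallest and largest, even if it has no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 5 | 6 means 56")
```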

Example 2 – Plot stem and leaf plot


The results of 41 students' math tests (with a best possible score of 70) are recorded
below:

31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65,
50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42
1. Prepare an ordered stem and leaf plot for the data and briefly describe
what it shows.
2. Are there any outliers? If so, which scores?


3. Look at the stem and leaf plot from the side. Describe the distribution's
main features such as:
a. number of peaks
b. symmetry
c. value at the centre of the distribution (i.e median)
Answers
1. The lowest value is 4 and the highest is 67. Therefore, the stem and leaf
plot that covers this range of values looks like this:

DATA: 31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43,
65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42

Table 10. Math scores of 41 students (leaves in the order recorded)
Stem   Leaf
0      4
1      9 8
2      4 3 6
3      1 2 5 7 4 9 5
4      9 5 8 0 3 1 4 5 2
5      0 1 5 4 7 0 5 3 0 6 4 7 2 1
6      2 0 5 7 3

Final Table 10. Math scores of 41 students (ordered leaves)
Stem   Leaf
0      4
1      8 9
2      3 4 6
3      1 2 4 5 5 7 9
4      0 1 2 3 4 5 5 8 9
5      0 0 0 1 1 2 3 4 4 5 5 6 7 7
6      0 2 3 5 7
Key: 3|4 represents a score of 34, with stem 3 and leaf 4

2. Note: The notation 2|4 represents stem 2 and leaf 4.


3. The stem and leaf plot reveals that the largest group of students (14 of 41) scored in
the interval between 50 and 59. The large number of students who obtained high results
could mean that the test was too easy, that most students knew the material
well, or a combination of both.
4. The result of 4 could be an outlier, since there is a large gap between 4 and
the next result, 18.
5. If the stem and leaf plot is turned on its side, it will look like the following:


• The distribution has a single peak within the 50–59 interval.

• The mode is 50.

• The left tail extends farther from the data centre than the right tail.
Therefore, the distribution is skewed to the left, or negatively skewed.

• Since there are 41 observations, the distribution centre (the median
value) will occur at the 21st observation. Counting 21 observations up
from the smallest, the median or the centre is 48. (Note that the same
value would have been obtained if 21 observations were counted down
from the highest observation.)
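
The mode, the median and the size of the peak interval can be verified with a few lines of Python. This is a minimal check using only the standard library; the variable names are illustrative.

```python
import statistics
from collections import Counter

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35,
          54, 26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56,
          44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42]

ordered = sorted(scores)
n = len(ordered)

# For an odd number of observations the median is the ((n + 1) / 2)-th value.
median = ordered[(n + 1) // 2 - 1]
mode = statistics.mode(scores)

# Number of scores per stem (tens digit), i.e. the height of each "bar"
# when the stem and leaf plot is turned on its side.
per_stem = Counter(score // 10 for score in scores)

print("n =", n, "median =", median, "mode =", mode)
print("scores per stem:", dict(sorted(per_stem.items())))
```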

Complete a stem-and-leaf plot for the following list of values:

23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09

If we try to use the last digit, the hundredths digit, as the leaf for these numbers, the
stem-and-leaf plot will be enormously long, because these values are so spread out. (With
the numbers' first three digits ranging from 232 to 270, we would have thirty-nine stems,
most of which would be empty.)

So instead of working with the given numbers, we will round each of the numbers to the
nearest tenth, and then use those new values for the plot. Rounding gives the
following list:

23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1


The plot then looks like this:

Stem   Leaf
23     3
24     1 8 8
25     0 3 6 9
26     3 3
27     1
Key: 23|3 means 23.3
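
The rounding and splitting can be reproduced with the Python sketch below. It assumes "round half up" (so 23.25 becomes 23.3, as in the list above) and therefore uses the `decimal` module; Python's built-in `round` on floats rounds 23.25 to 23.2. The function name is illustrative only.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

raw = ["23.25", "24.13", "24.76", "24.81", "24.98", "25.31",
       "25.57", "25.89", "26.28", "26.34", "27.09"]


def rounded_stem_and_leaf(values):
    """Round to the nearest tenth; stem = whole part, leaf = tenths digit."""
    plot = defaultdict(list)
    for text in values:
        tenths = Decimal(text).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        stem, leaf = divmod(int(tenths * 10), 10)
        plot[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(plot.items())}


decimal_plot = rounded_stem_and_leaf(raw)
for stem, leaves in decimal_plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
print("Key: 23|3 means 23.3")
```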

f) Scatter plot
A scatter plot displays a two-dimensional data set of metric attributes by interpreting
the sample values as coordinates of a point in a metric space.
A scatter plot is very well suited if one wants to see whether the two represented
quantities depend on each other or vary independently.

The figure above indicates:

(a) Positive correlation (correlation coefficient > 0; a strong positive correlation has a
correlation coefficient close to 1)
(b) Negative correlation (correlation coefficient < 0; a strong negative correlation has a
correlation coefficient close to -1)
(c) No correlation (correlation coefficient equal to or close to 0)
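
The three cases can be illustrated with the Pearson correlation coefficient, one common way to compute the correlation coefficient mentioned here. The sketch below is plain Python; all data values in it are made up for illustration and do not come from the notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, chosen only to show the three cases.
temperature = [20, 24, 28, 32, 36]
ice_cream_sales = [110, 150, 205, 260, 300]   # rises with temperature
heater_sales = [90, 70, 55, 30, 10]           # falls with temperature
unrelated = [2, 1, 3, 1, 2]                   # no linear relation

print("positive:", pearson_r(temperature, ice_cream_sales))
print("negative:", pearson_r(temperature, heater_sales))
print("none:    ", pearson_r(temperature, unrelated))
```

The first value comes out close to 1, the second close to -1 and the third close to 0, matching cases (a), (b) and (c).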

Scatter plot above indicating no correlation (the computed correlation coefficient will be close to zero)

Scatter plot below indicating positive correlation between ice-cream sales and
temperature

Examples of positive correlation

• Sales and quality of product
• Salary and skills (technical + communication) acquired by the students
• Salary and job experience
• Height and weight

Plots below for

(a) Positive correlation between grade point average (GPA) and achievement motivation
(b) Negative correlation between GPA and number of absences


A few more examples of negative correlation

• As the temperature decreases, more heaters are purchased.

• As the slope of a hill increases, the speed a walker reaches may decrease.

• As more employees are laid off, satisfaction among remaining employees decreases.

• If a train increases speed, the length of time to get to the final point decreases.

• Sales decrease when there is an increase in cost.

Example of no correlation: IQ vs shoe size

Examples of no correlation

• Politics and education

• Number of visits to temple vs. marks scored


The frequency of a particular event is the number of times that the event occurs. The relative
frequency is the proportion of observed responses in a category. We make a circle graph,
often called a pie chart, of these data by placing wedges in the circle with sizes
proportional to the frequencies.

Graphical representations serve the purpose to make tabular data more easily
comprehensible. The main tool to achieve this is to use geometric quantities—like
lengths, areas, and angles—to represent numbers, since such geometric properties
are more quickly interpretable for humans than abstract numbers. The most
important types of graphical representations are:

g) Pole/stick/bar chart
Numbers, which may be, for instance, the frequencies of different attribute values in
a sample, are represented by the lengths of poles, sticks, or bars. In this way a good
impression especially of ratios can be achieved (see Figs. A.1a and b, in which the
frequencies of Table A.2 are displayed).

h)Area and volume charts


Area and volume charts are closely related to pole and bar charts: the difference is
merely that they use areas and volumes instead of lengths to represent numbers and
their ratios (see Fig. A.2, which again shows the frequencies of Table A.2). However,
area and volume charts are usually less comprehensible (maybe except if the
represented quantities are actually areas and volumes), since human beings usually
have trouble comparing areas and volumes and often misjudge their numerical
ratios. This can already be seen in Fig. A.2: only very few people correctly estimate
that the area of the square for the value 3 (frequency 9) is three times as large as
that of the square for the value 5 (frequency 3).
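
The misjudgement is easy to see numerically. The short Python sketch below assumes each square is drawn with an area proportional to the frequency, so its side length grows only with the square root of the frequency; the helper name is illustrative.

```python
import math


def square_side(frequency):
    """Side length of a square whose area equals the frequency."""
    return math.sqrt(frequency)


area_ratio = 9 / 3                                 # frequencies 9 and 3
side_ratio = square_side(9) / square_side(3)

print("area ratio:", area_ratio)
print("side ratio:", round(side_ratio, 3))
```

The areas differ by a factor of 3, but the side lengths differ only by a factor of about 1.73, which is what the eye tends to pick up.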

i) Frequency polygons and line chart


A frequency polygon results if the ends of the poles of a pole diagram are connected
by lines, so that a polygonal course results. This can be advantageous if the attribute
values have an inherent order and one wants to show the development of the
frequency along this order (see Fig. A.1c). In particular, it can be used if numbers
are to be represented that depend on time. This particular case is usually referred to
as a line chart, even though the name is not exclusively reserved for this case.


Table A.2
Value   Frequency
1       1
2       6
3       9
4       5
5       4
j) Pie and stripe chart


Pie and stripe charts are particularly well suited if proportions or fractions of a total,
for instance, relative frequencies, are to be displayed. In a pie chart proportions are
represented by angles, and in a stripe chart by lengths (see Fig. A.3).
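
A stripe chart can be sketched in text form with a few lines of Python, using the frequencies of Table A.2 as listed in these notes. The chart width of 50 characters and the function name are arbitrary assumptions; each attribute value gets a segment whose length is proportional to its relative frequency.

```python
table_a2 = {1: 1, 2: 6, 3: 9, 4: 5, 5: 4}


def stripe_chart(frequencies, width=50):
    """One text stripe; each segment repeats the attribute value as a character."""
    total = sum(frequencies.values())
    segments = []
    for value, f in frequencies.items():
        # Rounded lengths may not add up exactly to `width` for other data.
        length = round(f * width / total)
        segments.append(str(value) * length)
    return "|".join(segments)


stripe = stripe_chart(table_a2)
print(stripe)
```

A pie chart uses the same proportions, only expressed as angles out of 360 degrees instead of lengths out of the stripe width.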

k). Mosaic chart


Contingency tables (that is, two- or generally multidimensional frequency tables) can
nicely be represented as mosaic charts. For the first attribute, the horizontal
direction is divided like in a stripe diagram. Each section is then divided according to
the second attribute along the vertical direction—again like in a stripe diagram (see
Fig. A.4). Mosaic charts can have advantages over two-dimensional bar charts,
because bars at the front can hide bars at the back, making it difficult to see their
height, as shown in Fig. A.5. In principle, arbitrarily many attributes can be displayed
by subdividing the resulting mosaic pieces alternatingly along the horizontal and
vertical axis. However, even if one uses the widths of the gaps and colors in order to
help a viewer to identify attribute values, mosaic charts can easily become confusing
if one tries to use too many attributes.


Contingency Table (attribute A = marks, attribute B = section)
      A1    A2    A3    A4    ∑
B1    8     3     5     2     18
B2    2     6     1     3     12
B3    4     1     2     7     14
∑     14    10    8     12    44
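
The geometry of a mosaic chart follows directly from this table. The Python sketch below assumes attribute A is laid out along the horizontal direction and attribute B along the vertical direction: the width of each column is its share of the grand total, and within a column the tile heights are the shares of B1, B2 and B3. Variable names are illustrative.

```python
a_labels = ["A1", "A2", "A3", "A4"]
counts = {
    "B1": [8, 3, 5, 2],
    "B2": [2, 6, 1, 3],
    "B3": [4, 1, 2, 7],
}

# Marginal sums, as in the last row and last column of the table.
column_totals = [sum(col) for col in zip(*counts.values())]
row_totals = {b: sum(row) for b, row in counts.items()}
grand_total = sum(column_totals)

# Horizontal split: relative width of each A column.
widths = [t / grand_total for t in column_totals]

# Vertical split inside each column: relative height of each B tile.
heights = {
    a: [counts[b][j] / column_totals[j] for b in counts]
    for j, a in enumerate(a_labels)
}

for a, w in zip(a_labels, widths):
    tiles = ", ".join(f"{h:.2f}" for h in heights[a])
    print(f"{a}: width {w:.2f}, tile heights (B1, B2, B3) = {tiles}")
```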

Example of Mosaic plot from data in Contingency Table


Gender   Survived   1st Class   2nd Class   3rd Class   Crew
Male     No         118         154         422         670
Male     Yes        62          25          88          192
Female   No         4           13          106         3
Female   Yes        141         93          90          20

Questions:
1. Types of data
2. Data quality measures
3. Data visualization methods
4. Plot box plot, histogram, stem and leaf plot, pie chart, mosaic plot, etc.
5. Problems on frequency distribution table, ogive plot
6. Problems on location, dispersion and shape measures for grouped and ungrouped data
