Topic 1: Summarizing and Visualizing Data
- Business school A claims that its average starting offer is higher than that of business school B. Is the claim true?
- Plant X has two assembly lines. Employees face one of three kinds of accidents: sprains, cuts, or burns. Does the pattern of accident types differ between the two assembly lines?
- A bank wants to assess the creditworthiness of a loan applicant. Should it approve the loan or reject it?
Descriptive statistics
- Used to describe the main features of a collection of data quantitatively
- Aims to summarize a data set quantitatively without employing a probabilistic formulation
Inferential statistics
- Aims at drawing conclusions from data that are subject to random variation
- Used for: estimation; hypothesis testing; prediction/forecasting
Types of Variables
Quantitative Variables
- A quantitative variable can be described by a number for which arithmetic operations such as averaging make sense.
Qualitative Variables
- A qualitative (or categorical) variable simply records a quality. If a number is used to distinguish members of different categories of a qualitative variable, the number assignment is arbitrary.
Scales of Measurement
Nominal Scale
e.g. North = 1, East = 2, South = 3, West = 4
Ordinal Scale
e.g. Very good = 4, Good = 3, Fair = 2, Unacceptable = 1
Interval Scale
- In interval-scale measurement, the value of zero is assigned arbitrarily, so we cannot take ratios of two measurements.
- However, we can take ratios of intervals.
- e.g. 100 deg C is not twice as hot as 50 deg C.
Ratio Scale
- We can take ratios of the measurements.
- The zero on this scale is an absolute zero.
- e.g. money: a sum of Rs. 100 is twice as large as Rs. 50.
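The interval-versus-ratio distinction can be checked numerically. The sketch below is an illustration only; it brings in the Kelvin scale (not mentioned on the slide) as a temperature scale with an absolute zero, and the helper name `celsius_to_kelvin` is made up for this example.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, absolute zero)."""
    return c + 273.15

hot_c, warm_c = 100.0, 50.0

# Ratio of two interval-scale measurements: the number 2.0 comes out,
# but it depends entirely on where the Celsius zero happens to sit.
naive_ratio = hot_c / warm_c

# The same two temperatures on a scale with an absolute zero.
kelvin_ratio = celsius_to_kelvin(hot_c) / celsius_to_kelvin(warm_c)

# Ratios of intervals are meaningful: shifting the zero does not change them.
interval_ratio_c = (100.0 - 50.0) / (50.0 - 25.0)
interval_ratio_k = (celsius_to_kelvin(100.0) - celsius_to_kelvin(50.0)) / (
    celsius_to_kelvin(50.0) - celsius_to_kelvin(25.0)
)

print(naive_ratio, round(kelvin_ratio, 3))
print(interval_ratio_c, round(interval_ratio_k, 3))
```

On the Kelvin scale the ratio of the two temperatures is far from 2, while the ratio of the two intervals is the same on both scales.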
A statistician had his head in an oven and his feet in ice, and he said that on the average he feels fine.
Measures of Dispersion
- Dispersion indicates the variability or spread in a variable
- Most commonly used measures: variance; standard deviation; inter-quartile range
Variance: describes how far the values lie from the mean
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
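As a worked instance of the variance definition, here is a minimal sketch that assumes the population form (sum of squared deviations from the mean, divided by N). The data values are made up for illustration, and the result is cross-checked against `statistics.pvariance` from the Python standard library.

```python
import statistics

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations, mean = 5

variance = population_variance(data)
print(variance)                    # squared deviations sum to 32, and 32 / 8 = 4.0
print(statistics.pvariance(data))  # library cross-check
```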
Standard deviation: square root of the variance
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
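- Low standard deviation: data points tend to be very close to the mean
- High standard deviation: data is spread out over a large range of values
- The sample standard deviation (s) is used as an estimator of the population standard deviation $\sigma$:
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$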
[What is a sample? What is an estimator? Why N-1 in the denominator instead of N?]
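The N versus N-1 question can be explored with a short sketch. It assumes the usual reading: dividing by N describes a complete population, while dividing by N-1 compensates for the fact that the mean itself was estimated from the same sample. The function name `std_dev` and the data are illustrative only.

```python
import math
import statistics

def std_dev(values, sample=False):
    """Standard deviation; divides by N-1 when sample=True, else by N."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (n - 1 if sample else n))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

sigma = std_dev(data)           # treats data as the whole population
s = std_dev(data, sample=True)  # treats data as a sample from a larger population

print(sigma, s)
print(statistics.pstdev(data), statistics.stdev(data))  # library cross-check
```

The sample version is always slightly larger than the population version for the same numbers, and the gap shrinks as N grows.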
Skewness
- Can be positive, negative, or undefined
- Negative skew: the tail is to the left
- Positive skew: the tail is to the right
Kurtosis
- High kurtosis: sharper peak and longer, fatter tails
- Low kurtosis: rounded peak and shorter, thinner tails
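One common way to put numbers on these two shape measures uses central moments. The sketch below assumes the moment-based definitions (third moment over variance^1.5 for skewness, fourth moment over variance^2 minus 3 for excess kurtosis); other definitions with small-sample corrections exist, and the slides do not say which one is intended. The data sets are made up.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean)^k, dividing by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Moment-based skewness; None when the variance is zero (undefined)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return None
    return central_moment(values, 3) / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3; None when the variance is zero."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return None
    return central_moment(values, 4) / m2 ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]    # long tail to the right
left_tailed = [-x for x in right_tailed]    # mirror image: tail to the left
constant = [5, 5, 5]                        # no spread at all

for name, data in [("symmetric", symmetric), ("right", right_tailed),
                   ("left", left_tailed), ("constant", constant)]:
    print(name, skewness(data), excess_kurtosis(data))
```

The constant data set illustrates the "undefined" case: with zero variance the ratio cannot be formed.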
Quartiles
Quartiles are the three values which divide the sorted data set into four equal parts
- First quartile (Q1) = lower quartile: cuts off the lowest 25% of the data
- Second quartile (Q2) = median: cuts the data set in half
- Third quartile (Q3) = upper quartile: cuts off the highest 25% of the data
- There is no universal method for calculating quartiles
- One method: Lk = N x (k / 4), where k = 1 for Q1, 2 for Q2, 3 for Q3
  - If Lk is a whole number, then Qk = average of the values at positions Lk and Lk + 1
  - If Lk is not a whole number, then Qk = the value at the position obtained by rounding Lk up to the next whole number
- Another method:
  - The median of the data set gives Q2
  - Divide the data set into two halves. [If the original set has an odd number of data points, include the median in both halves.] The medians of the lower and upper halves give Q1 and Q3 respectively
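The two quartile methods described above can be sketched as follows. Positions are 1-based as on the slide, so the code subtracts 1 when indexing. The function names and the data are made up for illustration.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles_position_method(values):
    """Quartiles via Lk = N x (k / 4), for k = 1, 2, 3."""
    data = sorted(values)
    n = len(data)
    result = []
    for k in (1, 2, 3):
        if (n * k) % 4 == 0:
            # Lk is a whole number: average the values at positions Lk and Lk + 1
            pos = n * k // 4
            result.append((data[pos - 1] + data[pos]) / 2)
        else:
            # Lk is not whole: round up to the next whole position
            pos = (n * k + 3) // 4  # integer ceiling of n * k / 4
            result.append(data[pos - 1])
    return tuple(result)

def quartiles_median_method(values):
    """Quartiles via medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2:
        # Odd count: the median belongs to both halves
        lower, upper = data[: half + 1], data[half:]
    else:
        lower, upper = data[:half], data[half:]
    return median(lower), median(data), median(upper)

seven = [1, 3, 4, 6, 7, 8, 10]
eight = [1, 3, 4, 6, 7, 8, 10, 12]

print(quartiles_position_method(seven), quartiles_median_method(seven))
print(quartiles_position_method(eight), quartiles_median_method(eight))
```

For the seven-point data set the two methods give different Q1 and Q3, which is a concrete instance of quartiles having no universal definition; for the eight-point set they agree.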
- Interquartile range: IQR = Q3 - Q1
- IQR is a more robust measure of variability
- It is not affected much by skewness or outliers
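To see the robustness claim in action, the sketch below compares the IQR and the standard deviation before and after one value is replaced by an extreme one. It uses `statistics.quantiles` with `method="inclusive"` (Python 3.8+), which is yet another quartile convention and an assumption of this example rather than the method in the slides. The data are made up.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1 using the library's inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [10, 12, 13, 15, 16, 18, 19, 21]
with_outlier = [10, 12, 13, 15, 16, 18, 19, 210]  # largest value made extreme

print("IQR:    ", iqr(clean), iqr(with_outlier))
print("Std dev:", round(statistics.stdev(clean), 2),
      round(statistics.stdev(with_outlier), 2))
```

Only the largest observation changes, so Q1 and Q3 stay where they were, while the standard deviation grows many times over.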
Histogram
- A graphical display of tabular frequencies
- Shown in the form of adjacent rectangles
- Provides a good visual representation of the distribution of the data
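The "tabular frequencies" behind a histogram are just counts per bin. A minimal sketch, assuming equal-width bins chosen by hand and made-up data; the function name `histogram_counts` is hypothetical, and the bars are drawn as text rather than rectangles.

```python
def histogram_counts(values, start, width, num_bins):
    """Count how many values fall in each of num_bins equal-width bins.

    Bin i covers [start + i*width, start + (i+1)*width); the last bin also
    includes its right edge. Values outside the overall range are ignored.
    """
    counts = [0] * num_bins
    end = start + width * num_bins
    for x in values:
        if x == end:
            index = num_bins - 1
        else:
            index = int((x - start) // width)
        if 0 <= index < num_bins:
            counts[index] += 1
    return counts

values = [3, 7, 8, 12, 13, 14, 15, 18, 22, 27]  # made-up observations
counts = histogram_counts(values, start=0, width=10, num_bins=3)

# Text rendering: one row per bin, one '#' per observation
for i, count in enumerate(counts):
    low, high = i * 10, (i + 1) * 10
    print(f"{low:>2}-{high:<2} | {'#' * count}")
```

The choice of bin width changes the picture considerably, so it is worth trying a few widths before reading much into the shape.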
Identifying the relation between the mean, the median, and the shape of a histogram
- Symmetric: mean ≈ median
- Left (or negatively) skewed: mean < median (generally)
- Right (or positively) skewed: mean > median (generally)
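A quick numerical check of these relationships, using three small made-up data sets. This illustrates the general tendency only; as the slide notes, the inequalities hold "generally", not always.

```python
import statistics

symmetric = [10, 20, 30, 40, 50]
right_skewed = [20, 22, 25, 27, 30, 35, 120]  # a few very large values
left_skewed = [5, 80, 85, 88, 90, 92, 95]     # a few very small values

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    mean = statistics.mean(data)
    med = statistics.median(data)
    print(f"{name:>12}: mean = {mean:.2f}, median = {med}")
```

The mean is pulled toward the long tail, while the median stays with the bulk of the data.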
Box Plot
- Also known as a box-and-whisker plot
- Provides a five-number summary:
  - The smallest observation (minimum)
  - Lower quartile (Q1)
  - Median (Q2)
  - Upper quartile (Q3)
  - Largest observation (maximum)
- Also indicates outlier observations, if any
- Spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and help identify outliers
- Box plots are very effective for comparing values across two or more categories
[Figure: box plot annotated with Q1, Q2, and the lower whisker end at max(Xmin, Q1 - 1.5 x IQR)*]
*Note: Generally this is the value beyond which readings are considered outliers. However, there is no universal definition.
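The five-number summary and the 1.5 x IQR rule can be computed directly. In the sketch below, the lower whisker follows the max(Xmin, Q1 - 1.5 x IQR) expression; the upper whisker at min(Xmax, Q3 + 1.5 x IQR) is assumed by symmetry. Quartiles come from `statistics.quantiles` with `method="inclusive"`, which is one convention among several, and the readings are made up.

```python
import statistics

def box_plot_summary(values):
    """Five-number summary plus whisker ends and outliers (1.5 x IQR rule)."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "min": data[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": data[-1],
        "lower_whisker": max(data[0], lower_fence),
        "upper_whisker": min(data[-1], upper_fence),  # assumed by symmetry
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

readings = [10, 12, 13, 15, 16, 18, 19, 60]  # made-up observations
summary = box_plot_summary(readings)

for key, value in summary.items():
    print(f"{key:>13}: {value}")
```

Some plotting tools instead end each whisker at the most extreme data point that lies inside the fence, so whisker positions can differ slightly between packages.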
Scatter Plot
- Displays values for two variables for a set of data
- Provides a visual representation of the relationship between two variables
Bar graphs are often used to display categorical data where there is no emphasis on the percentage of a total represented by each category. The scale of measurement is nominal or ordinal.
An Illustration
NAME: Car Data
TYPE: Multiple Regression
SIZE: 804 observations, 12 variables
DESCRIPTIVE ABSTRACT: Data collected from Kelley Blue Book for several hundred 2005 used GM cars allow students to develop a multivariate regression model to determine car value based on a variety of characteristics such as miles driven, make, model, engine size, interior style, and cruise control.
SOURCES:
For this data set, a representative sample of over eight hundred 2005 GM cars was selected; retail prices were then estimated using an algorithm developed following the 2005 Central Edition of the Kelley Blue Book.
VARIABLE DESCRIPTIONS:
- Price: suggested retail price of the used 2005 GM car in excellent condition. The condition of a car can greatly affect price. All cars in this data set were less than one year old when priced and considered to be in excellent condition.
- Miles: number of miles the car has been driven
- Make: manufacturer of the car such as Saturn, Pontiac, and Chevrolet
- Model: specific models for each car manufacturer such as Ion, Vibe, Cavalier
- Trim (of car): specific type of car model such as SE Sedan 4D, Quad Coupe 2D
- Type: body type such as sedan, coupe, etc.
- Cylinder: number of cylinders in the engine
- Liter: a more specific measure of engine size
- Doors: number of doors
- Cruise: indicator variable representing whether the car has cruise control (1 = cruise)
- Sound: indicator variable representing whether the car has upgraded speakers (1 = upgraded)
- Leather: indicator variable representing whether the car has leather seats (1 = leather)