Notes (Chapter 1 - 3)
Data Sources
• EXISTING SOURCES
Statistical Analysis Using Microsoft Excel
Data Mining
• is used to identify related products that customers who
have already purchased a specific product are also likely to
purchase (and then pop-ups are used to draw attention to
those related products).
• the most effective data mining systems use automated
procedures to discover relationships in the data and predict
future outcomes, … prompted by only general, even vague,
queries by the user.
• the major applications of data mining have been made by
companies with a strong consumer focus such as retail,
financial, and communication firms.
• as another example, data mining is used to identify
customers who should receive special discount offers based
on their past purchasing volumes.
• Requirements:
• Statistical methodology (e.g. multiple regression, logistic
regression, and correlation) is heavily used.
• Computer science technologies involving artificial
intelligence and machine learning are also needed.
• Significant investment in time and money
• Model Reliability:
• Statistical model for a particular sample may not be
applicable to other data
• Data set can be partitioned into: training set (model
development) & test set (validating the model)
• overfitting the model is a danger → misleading
associations & conclusions can appear to exist
• careful interpretation of results and extensive testing are
important
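A minimal sketch of the training/test partition described under Model Reliability. The function name `train_test_split`, the 30% test fraction, and the `records` list are assumptions made for illustration, not part of the notes.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into a training set and a test set."""
    shuffled = list(rows)                     # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)     # seeded shuffle for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled rows validate the model, the rest develop it
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records standing in for customer data
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is fitted on `train` only; checking it against `test` is one way to notice overfitting, because associations that exist only in the training sample tend not to hold on the held-out data.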
Ethical Guidelines of Statistical Practice
• unethical behavior can take a variety of forms including:
• improper sampling
• Inappropriate analysis of the data
• Development of misleading graphs
• Use of inappropriate summary statistics
• Biased interpretation of the statistical results
• Be fair, thorough, objective, and neutral as you collect,
analyze, and present data.
• “Ethical Guidelines for Statistical Practice” → developed by
American Statistical Association
• It contains 67 guidelines organized into 8 topic areas:
▪ Professionalism
▪ Responsibilities to Funders, Clients, Employers
▪ Responsibilities in Publications and Testimony
▪ Responsibilities to Research Subjects
▪ Responsibilities to Research Team Colleagues
▪ Responsibilities to Other Statisticians/Practitioners
▪ Responsibilities Regarding Allegations of Misconduct
▪ Responsibilities of Employers Including
Organizations, Individuals, Attorneys, or Other Clients
Analytics
• Scientific process of transforming data into insight for
making better decisions.
• Types:
• DESCRIPTIVE A. → Analytical techniques that describe
what happened in the past
• PREDICTIVE A. → Analytical techniques that use models
constructed from past data to predict the future. They help
assess the impact of one variable on another
• PRESCRIPTIVE A. → Analytical techniques that yield a
best course of action to take
Data Warehousing
• is capturing, storing, and maintaining the data and it is a
significant undertaking.
CHAPTER 2A: DESCRIPTIVE STATISTICS
(TABULAR AND GRAPHICAL DISPLAYS)
Categorical Data
• FREQUENCY DISTRIBUTION
- is a tabular summary of data showing the number
(frequency) of observations in each of several
non-overlapping categories or classes.
- Objective: to provide insights about the data that cannot
be quickly obtained by looking only at the original data.
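A small sketch of building a frequency distribution for categorical data, together with the relative and percent frequencies. The `purchases` list is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical categorical observations (one label per observation)
purchases = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
             "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]

freq = Counter(purchases)   # frequency of each non-overlapping category
n = len(purchases)

print(f"{'Class':<8}{'Freq':>5}{'Rel':>7}{'Pct':>7}")
for category, count in freq.most_common():
    # relative frequency = count / n, percent frequency = 100 * count / n
    print(f"{category:<8}{count:>5}{count / n:>7.2f}{100 * count / n:>6.0f}%")
```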
• BAR CHART
- is a graphical display for depicting qualitative data
- bar charts are used to identify the most important causes
of problems.
- horizontal axis → we specify the labels that are used for
each class
- vertical axis → frequency, relative frequency, or percent
frequency scale
- bar of fixed width → drawn above each class label, with
its height extended to match the class value
- bars are separated → to emphasize the fact that each
class is separate.
- Pareto Diagram → a bar chart whose bars are arranged in
descending order of height from left to right (with the
most frequently occurring cause appearing first); named
after Vilfredo Pareto, an Italian economist.
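A rough sketch of the Pareto ordering: sort the categories by frequency, largest first, and draw the bars in that order. The `causes` counts are hypothetical, and the text bars only stand in for a real chart.

```python
# Hypothetical counts of problem causes
causes = {"Scratches": 12, "Misalignment": 30, "Wrong colour": 5, "Dents": 18}

# Pareto order: descending frequency, most frequent cause first
pareto = sorted(causes.items(), key=lambda item: item[1], reverse=True)

for cause, count in pareto:
    print(f"{cause:<14}{'#' * count} {count}")
```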
• PIE CHART
- is a commonly used graphical display for presenting
relative frequency and percent frequency distributions
for categorical data.
- Insights obtained from Percent Frequency Distribution:
• 40% of the audits required from 15 to 19 days
• Another 25% of the audits required 20 to 25 days
• Only 5% of the audits required more than 30 days
• DOT PLOT
- one of the simplest graphical summaries of data
- horizontal axis → range of data values
- then each data value is represented by a dot placed
above the axis
- Example: Sanderson and Clifford
• CUMULATIVE DISTRIBUTIONS
- Cumulative Frequency D. → shows the number of
items with values less than or equal to the upper limit
of each class. (Last entry = total no. of observations)
- Cumulative Relative FD. → shows the proportion of
items with values less than or equal to the upper limit
of each class. (Last entry = 1.00)
- Cumulative Percent FD → shows the percentage of
items with values less than or equal to the upper limit
of each class. (Last entry = 100)
- Example: Sanderson and Clifford
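The three cumulative distributions can be sketched as running totals over the class frequencies. The class limits and frequencies below are hypothetical values chosen for illustration; they are not data from the notes.

```python
from itertools import accumulate

# Hypothetical frequency distribution: (upper class limit, frequency)
classes = [(14, 4), (19, 8), (24, 5), (29, 2), (34, 1)]

freqs = [f for _, f in classes]
n = sum(freqs)                                 # total number of observations

cum_freq = list(accumulate(freqs))             # last entry equals n
cum_rel = [c / n for c in cum_freq]            # last entry equals 1.00
cum_pct = [100 * c / n for c in cum_freq]      # last entry equals 100

for (upper, _), cf, cr, cp in zip(classes, cum_freq, cum_rel, cum_pct):
    print(f"<= {upper:>2}: {cf:>3} {cr:>5.2f} {cp:>4.0f}%")
```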
• HISTOGRAM
- Common graphical display of quantitative data
- Horizontal axis → variable of interest
- A rectangle is drawn above each class interval with its
height corresponding to the interval’s frequency,
relative frequency, or percent frequency.
- has no natural separation between rectangles of
adjacent classes.
- Example: Sanderson and Clifford
• STEM-AND-LEAF DISPLAY
- shows both the rank order and shape of the
distribution of the data.
- It is similar to a histogram on its side, but it has the
advantage of showing the actual data values.
- the first digits of each data item are arranged to the
left of a vertical line.
- to the right of the vertical line we record the last digit
for each item in rank order.
- each line (row) in the display is referred to as a stem.
- Each digit on a stem is a leaf.
- Example: The number of questions answered correctly
on an aptitude test by 50 students is analysed with the
help of a stem-and-leaf display. The relevant
data are given in the following table.
No. of questions answered correctly by 50 students
112 73 126 82 92 115 95 84 68 100
72 92 128 104 108 76 141 119 98 85
69 76 118 132 96 91 81 113 115 94
97 86 127 134 100 102 80 98 106 106
107 73 124 83 92 81 106 75 95 119
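A short sketch that builds the stem-and-leaf display for the 50 scores in the table above. The leading digits form the stem and the last digit is the leaf; the variable names are chosen for the example.

```python
from collections import defaultdict

# The 50 aptitude-test scores from the table above
scores = [
    112, 73, 126, 82, 92, 115, 95, 84, 68, 100,
    72, 92, 128, 104, 108, 76, 141, 119, 98, 85,
    69, 76, 118, 132, 96, 91, 81, 113, 115, 94,
    97, 86, 127, 134, 100, 102, 80, 98, 106, 106,
    107, 73, 124, 83, 92, 81, 106, 75, 95, 119,
]

stems = defaultdict(list)
for value in sorted(scores):          # sorting puts the leaves in rank order
    stem, leaf = divmod(value, 10)    # first digits = stem, last digit = leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in stems[stem])}")
```

Read sideways, the rows give the same shape a histogram with class width 10 would, while the individual values stay visible.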
Data Dashboard
• Data dashboard → widely used data visualization tool
• It organizes and presents key performance indicators
(KPIs) used to monitor an organization or process.
• It provides timely, summary information that is easy to
read, understand, and interpret.
• Some additional guidelines include . . .
o Minimize the need for screen scrolling
o Avoid unnecessary use of color or 3D
o Use borders between charts to improve readability
CHAPTER 3A: DESCRIPTIVE STATISTICS
(NUMERICAL MEASURES)
Numerical Measures
• Sample statistics → if the measure is computed for data
from a sample.
• Population parameters → If the measures are computed
for data from a population.
• Point estimator → a sample statistic used to estimate the
corresponding population parameter.
Measures of Location
• MEAN [Excel Function → AVERAGE(data cell range)]
- most important measure of location
- provides a measure of central location
- the mean of a data set is the average of all the data
values
- The sample mean 𝑥̅ is the point estimator of the
population mean µ.
- Sample Mean: 𝑥̅ = Σxᵢ / n (n = number of observations in the sample)
- Population Mean: µ = Σxᵢ / N (N = number of observations in the population)
- Example: Monthly Starting Salary
A placement office wants to know the
average starting salary of business graduates. Monthly
starting salaries for a sample of 12 business school
graduates are provided here.
- Trimmed Mean
o another measure sometimes used when extreme
values are present
o it is obtained by deleting a percentage of the
smallest and largest values from a data set and
then computing the mean of the remaining values.
o Example: the 5% trimmed mean is obtained by
removing the smallest 5% and the largest 5% of the
data values and then computing the mean of the
remaining values.
• MODE [Excel Function → MODE.SNGL(data cell range)]
- is the value that occurs with greatest frequency.
- greatest frequency can occur at two or more different
values
- Bimodal → If the data have exactly two modes
- Multimodal → If the data have more than two modes
- Example: Monthly Starting Salary
The only monthly starting salary that occurs more
than once is $3,880. Mode = 3,880
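A minimal sketch of the mean, trimmed mean, and mode using Python's standard `statistics` module. The helper `trimmed_mean` and the `data` values are assumptions made for illustration (the notes' salary data is not reproduced here).

```python
import statistics

def trimmed_mean(values, percent):
    """Mean after removing `percent`% of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)     # values to drop at each end
    return statistics.mean(ordered[k:len(ordered) - k])

# Hypothetical data with one extreme value
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 100]

print(statistics.mean(data))        # pulled upward by the extreme value 100
print(trimmed_mean(data, 10))       # drops the smallest and the largest value
print(statistics.multimode(data))   # every value tied for greatest frequency
```

`statistics.multimode` returns every value tied for the greatest frequency, so a list of two values signals bimodal data.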
• INTERQUARTILE RANGE
- is the difference between the third quartile and the
first quartile.
- It is the range for the middle 50% of the data
- It overcomes the sensitivity to extreme data values
- Example: Monthly Starting Salary
Q3 = (3,950 + 4,050) / 2 = 4,000
Q1 = (3,880 + 3,850) / 2 = 3,865
IQR = Q3 - Q1 = 4,000 - 3,865 = 135
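A brief sketch showing why the interquartile range is less sensitive to extreme values than the range. The data are hypothetical. `statistics.quantiles` with `method="inclusive"` is only one of several quartile conventions, so it will not necessarily reproduce hand calculations that average two adjacent ordered values.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data, then the same data with the largest value made extreme
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]

for data in (clean, with_outlier):
    print("range =", max(data) - min(data), " IQR =", iqr(data))
```

The range jumps when the largest value becomes extreme, while the IQR is unchanged because it depends only on the middle 50% of the data.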
Box Plot
- is a graphical summary of data that is based on a
five-number summary.
- A key to the development of a box plot is the computation
of the median and the quartiles Q1 and Q3.
- Box plots provide another way to identify outliers
- Example: Monthly Starting Salary
o A box is drawn with its ends located at the first and
third quartiles.
o A vertical line is drawn in the box at the location of
the median (second quartile).
o CORRELATION COEFFICIENT
- Correlation → is a measure of linear association
and not necessarily causation.
- Just because two variables are highly correlated, it
does not mean that one variable is the cause of
the other.
- The coefficient can take on values between -1 and
+1
o Strong negative linear relationship → values
near -1
o Strong positive linear relationship → values
near +1
- The closer the correlation is to zero, the weaker
the relationship.
- Example: Stereo and Sound Equipment Store
The store’s manager wants to determine
the relationship between the number of weekend
television commercials shown and the sales at the
store during the following week
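A minimal sketch of the (Pearson) correlation coefficient computed from paired observations. The function name `pearson_r` and the `commercials` and `sales` values are hypothetical, invented for illustration; they are not the store's actual data.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient between paired lists x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical weekly data: number of commercials and sales in the following week
commercials = [1, 2, 3, 4, 5]
sales = [40, 46, 50, 53, 61]

r = pearson_r(commercials, sales)
print(f"r = {r:.2f}")
```

A value of r near +1 would indicate a strong positive linear relationship between commercials and sales. It would not by itself show that the commercials caused the sales.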