Probability and Statistics (SH552) Lecture 1
Lecture: 1
Chapter: 1 (Descriptive Statistics 1.1)
Introduction to Statistics, Diagrammatic and graphical representation of data,
Measure of central tendency, Measure of dispersion or Measure of variation.
(Final exam 6 marks)
Syllabus
Course Objective:
To provide students with practical knowledge of the principles and concepts of
probability and statistics and their applications in the engineering field.
• Statistics in planning.
• Statistics in state.
• Statistics in mathematics.
• Statistics in economics.
• Statistics in business and management.
• Statistics in accounting and auditing.
• Statistics in industry.
• Statistics in insurance.
• Statistics in social science.
• Statistics in biology and medical science.
• Statistics in astronomy.
• Statistics in physical science.
• Statistics in psychology and education.
• Statistics in war.
• Statistics in engineering.
Application of statistics in the field of engineering
Statistical techniques are used in:
• Design of experiments to test and construct models of engineering components and systems.
• Quality control and process control as a tool to manage conformance to the specifications of
manufacturing processes and their products.
• Time and method engineering to study repetitive operations in manufacturing in order to set
standards and find optimum manufacturing procedures.
• Reliability engineering to measure the ability of a system to perform its intended
function and as a tool for improving performance.
• Probabilistic design, which applies probability to product and system design.
• Analysis of electric and magnetic fields and their interaction with materials and structure.
• Digital signal processing and image processing in electronics engineering.
• Fluid flow and stress analysis in mechanical engineering.
• Computer graphics (zoom, rotation, transformation, animation, etc.) and Google search
algorithms in electronics engineering.
• Traffic engineering and modelling, and structural engineering (trusses).
Functions of statistics
• To represent facts from numerical figures in a definite form.
• To condense huge and voluminous data by using appropriate statistical measure.
• To help in classification of data according to its nature.
• To help in formulating policies.
• To find relationship between different phenomena.
• To help in predicting future trends.
• To formulate and test hypotheses.
• To draw valid inferences or conclusions.
Limitations of statistics
• “There are three types of lies – lies, damn lies, and statistics.” – Benjamin Disraeli
• Statistical analyses have historically been a stalwart of the high tech and advanced
business industries, and today they are more important than ever. With the rise of
advanced technology and globalized operations, statistical analyses grant
businesses an insight into solving the extreme uncertainties of the market. Studies
foster informed decision-making, sound judgments and actions carried out on the
weight of evidence, not assumptions.
• As businesses are often forced to follow a difficult-to-interpret market road map,
statistical methods can help with the planning that is necessary to navigate a
landscape filled with potholes, pitfalls and hostile competition. Statistical studies
can also assist in the marketing of goods or services, and in understanding each
target market's unique value drivers. In the digital age, these capabilities are only
further enhanced and harnessed through the implementation of advanced
technology and business intelligence software. If all this is true, what is the problem
with statistics?
• Actually, there is no problem per se – but there can be. Statistics are infamous for
their ability and potential to exist as misleading and bad data.
• What Is A Misleading Statistic?
• Misleading statistics are simply the misuse, purposeful or not, of numerical data. The results
provide misleading information to the receiver, who then believes something wrong if he or
she does not notice the error or does not have the full data picture.
• Given the importance of data in today’s rapidly evolving digital world, it is important to be familiar
with the basics of misleading statistics and oversight. As an exercise in due diligence, we will review
some of the most common forms of misuse of statistics, and various alarming (and sadly, common)
misleading statistics examples from public life.
Another way of creating misleading statistics, also linked with the choice of
sample, is the size of said sample. When an experiment
or a survey is run on a sample that is far too small, not only will the
results be unusable, but the way of presenting them, namely as
percentages, will be totally misleading.
Compare asking a question to a sample of 20 people, where 19 answer "yes"
(95% say yes), with asking the same question to 1,000 people, where
950 answer "yes" (95% as well): the validity of the percentage is clearly
not the same. Providing solely the percentage of change without the total
numbers or sample size is totally misleading. An xkcd comic illustrates
this very well, showing how a "fastest-growing" claim is an entirely relative
piece of marketing speech.
Likewise, the needed sample size is influenced by the kind of question you
ask, the statistical significance you need (clinical study vs business study),
and the statistical technique. If you perform a quantitative analysis,
sample sizes under 200 people are usually invalid.
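The 19-out-of-20 versus 950-out-of-1,000 comparison can be made concrete with a confidence interval for the proportion. The following is a minimal sketch, assuming a 95% Wilson score interval (one common choice; the text does not prescribe a method). The function name `wilson_interval` is illustrative only.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96 assumed)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# Same observed percentage (95%), very different sample sizes.
small = wilson_interval(19, 20)
large = wilson_interval(950, 1000)

for label, (low, high) in (("n = 20", small), ("n = 1000", large)):
    print(f"{label}: 95% interval from {low:.3f} to {high:.3f}")
```

Under these assumptions the interval for 20 respondents runs from roughly 76% to 99%, while the interval for 1,000 respondents runs from roughly 93% to 96%. The reported percentage is identical, but the uncertainty behind it is not.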
How Statistics Can Be Misleading
• Remember, misuse of statistics can be accidental or purposeful. While a malicious intent to blur lines with
misleading statistics will surely magnify bias, intent is not necessary to create misunderstandings. The
misuse of statistics is a much broader problem that now permeates through multiple industries and fields
of study. Here are a few potential mishaps that commonly lead to misuse.
• Faulty polling
• The manner in which questions are phrased can have a huge impact on the way an audience answers
them. Specific wording patterns have a persuasive effect and induce respondents to answer in a
predictable manner. For example, on a poll seeking tax opinions, let’s look at the two potential questions:
• Do you believe that you should be taxed so other citizens don’t have to work?
• Do you think that the government should help those people who cannot find work?
• These two questions are likely to provoke far different responses, even though they deal with the same
topic of government assistance. These are examples of “loaded questions.”
• A more accurate way of wording the question would be, “Do you support government assistance
programs for unemployment?” or (even more neutrally) “What is your point of view regarding
unemployment assistance?”
• These two rewordings of the original questions eliminate any inference or suggestion from the pollster
and are thus significantly more impartial. Another unfair method of polling is to ask a question but
precede it with a conditional statement or a statement of fact. Staying with our example, that would look
like this: “Given the rising costs to the middle class, do you support government assistance programs?”
• A good rule of thumb is to always take polling with a grain of salt, and to try to review the questions that
were actually presented. They provide great insight, often more so than the answers.
Flawed correlation
• The problem with correlations is this: if you measure enough variables, eventually it will appear
that some of them correlate. At the usual 5% significance level, about one test in twenty will be deemed
significant by chance alone, so studies can be manipulated (with enough data) to show a correlation that does
not exist or that is not strong enough to support a claim of causation.
• To illustrate this point further, let’s assume that a study has found a correlation between an
increase in car accidents in the state of New York in the month of June (A), and an increase in bear
attacks in the state of New York in the month of June (B).
• That means there are six possible explanations:
- Car accidents (A) cause bear attacks (B)
- Bear attacks (B) cause car accidents (A)
- Car accidents (A) and bear attacks (B) partly cause each other
- Car accidents (A) and bear attacks (B) are caused by a third factor (C)
- Bear attacks (B) are caused by a third factor (C) which correlates to car accidents (A)
- The correlation is only chance
• Any sensible person would easily identify the fact that car accidents do not cause bear attacks. Each
is likely a result of a third factor, that being: an increased population, due to high tourism season in
the month of June. It would be preposterous to say that they cause each other... and that is exactly
why it is our example. It is easy to see a correlation.
• But what about causation? What if the measured variables were different? What if it were
something more believable, like Alzheimer’s and old age? Clearly there is a correlation between the
two, but is there causation? Many would falsely assume so, solely based on the strength of the
correlation. Tread carefully, for either knowingly or ignorantly, correlation hunting will continue to
exist within statistical studies.
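The "one out of twenty" effect can be reproduced with purely random numbers. The sketch below is an illustration under stated assumptions, not a method from the lecture: it correlates 1,000 unrelated noise variables with one unrelated target (30 observations each) and counts how many pass a two-sided 5% significance test. The critical t value of 2.048 for 28 degrees of freedom is taken from standard tables, and all variable names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)   # fixed seed so the run is repeatable
n_obs = 30               # observations per variable (assumed)
n_vars = 1000            # number of unrelated variables tested (assumed)

# Two-sided 5% critical value of Student's t with n_obs - 2 = 28 d.f. (from tables),
# converted to the equivalent critical correlation coefficient.
T_CRIT = 2.048
r_crit = T_CRIT / math.sqrt(T_CRIT ** 2 + (n_obs - 2))

target = [rng.gauss(0, 1) for _ in range(n_obs)]
false_hits = 0
for _ in range(n_vars):
    noise = [rng.gauss(0, 1) for _ in range(n_obs)]   # independent of target by construction
    if abs(pearson_r(noise, target)) > r_crit:
        false_hits += 1

false_rate = false_hits / n_vars
print(f"critical |r| = {r_crit:.3f}")
print(f"{false_hits} of {n_vars} unrelated variables look 'significant' ({false_rate:.1%})")
```

Every variable here is independent of the target by construction, so each "significant" correlation is a false positive. The share should land near 5%, which is why testing many variables without a prior hypothesis almost guarantees some apparent findings.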
Data fishing
• This misleading data example is also referred to as “data dredging” (and related to flawed
correlations). It is a data mining technique where extremely large volumes of data are analyzed for
the purposes of discovering relationships between data points. Seeking a relationship between data
is not a data misuse per se; doing so without a hypothesis, however, is.
• Data dredging is a self-serving technique often employed for the unethical purpose of
circumventing traditional data mining techniques, in order to seek additional data conclusions that
do not exist. This is not to say that there is no proper use of data mining, as it can in fact lead to
surprise outliers and interesting analyses. However, more often than not, data dredging is used to
assume the existence of data relationships without further study.
• Often, data fishing results in studies that are highly publicized due to their important or
outlandish findings. These studies are very soon contradicted by other important or outlandish
findings. These false correlations often leave the general public very confused, and searching for
answers regarding the significance of causation and correlation.
• Likewise, another common practice with data is omission: after looking at a large
data set of answers, you pick only the ones that support your views and findings and leave
out those that contradict them. As mentioned in the beginning of this article, it has been shown that a
third of scientists admitted to questionable research practices, including
withholding analytical details and modifying results. But then again, we are facing a study that
could itself fall into this 33% of questionable practices, faulty polling, selective bias... It becomes
hard to believe any analysis.
• Misleading data visualization
• Insightful graphs and charts include a very basic, but essential, grouping of elements. Whatever
type of data visualization you choose to use, it must convey:
- The scales used
- The starting value (zero or otherwise)
- The method of calculation (e.g., dataset and time period)
• Absent these elements, visual data representations should be viewed with a grain of salt, taking
into account the common data visualization mistakes one can make. Intermediate data points
should also be identified and context given if it would add value to the information presented. With
the increasing reliance on intelligent solution automation for variable data point comparisons, best
practices (i.e., design and scaling) should be implemented prior to comparing data from different
sources, datasets, times and locations.
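To see why the starting value of an axis matters, the arithmetic can be sketched as follows. This is an illustrative example with made-up values (50 and 52) and a hypothetical helper `apparent_ratio`; it compares how tall two bars appear relative to each other when the axis starts at zero and when it starts just below the smaller value.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of bars b and a when the axis starts at `baseline`."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (b - baseline) / (a - baseline)

# Hypothetical values: the true difference between them is 4%.
honest = apparent_ratio(50, 52)          # axis starts at 0
truncated = apparent_ratio(50, 52, 49)   # axis starts at 49

print(f"axis from 0:  second bar looks {honest:.2f}x as tall")
print(f"axis from 49: second bar looks {truncated:.2f}x as tall")
```

With the axis starting at zero the second bar is drawn 1.04 times as tall as the first; with the axis starting at 49 it is drawn 3 times as tall, although the data have not changed.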
• Purposeful and selective bias
• The last of our most common examples for misuse of statistics and misleading data is, perhaps, the
most serious. Purposeful bias is the deliberate attempt to influence data findings without even
feigning professional accountability. Bias is most likely to take the form of data omissions or
adjustments.
• Selective bias is slightly more discreet for those who do not read the fine print. It usually comes
down to the sample of people surveyed, for instance the nature of the group:
asking a class of college students about the legal drinking age, or a group of retired people about the
elderly care system. You will end up with a statistical error called “selective bias”.
1.2 Diagrammatic and graphical representation of data
• The presentation of statistical data in the form of geometrical figures such as
points, lines, bars, rectangles and circles is called diagrammatic and graphical
representation of data. Examples include the bar diagram, pie diagram, histogram,
frequency polygon, frequency curve, ogive, box plot, etc.
IMPORTANCE:
1. Diagrams and graphs give a bird's-eye view of a set of numerical data. They
can present the data in a simple and intelligible form.
2. Diagrams are generally more attractive and impressive than the
numerical data. They give delight to the eyes and leave an everlasting
impression on the mind.
3. Diagrams help in deriving the required information in less time and
without any mental strain.
4. They facilitate comparison of two or more sets of data at a time.
Difference between diagrams and graphs:
• Diagrams are constructed on plain paper whereas graphs are constructed on graph
paper.
• Diagrams are used only for comparison, but graphs help in studying the
mathematical relationship between two variables.
• In diagrams, the numerical data are presented by bars, rectangles, circles,
cubes, etc., whereas in graphs the data are presented in terms of points, curves
and lines.
• Diagrams may be one-, two- or three-dimensional, but graphs are generally
two-dimensional.
• Construction of diagrams is not so easy, whereas construction of graphs is
easier.
• Diagrams are not used to present frequency distributions, whereas graphs are
more appropriate for presenting frequency distributions and time series.
• Diagrams are rarely used by statisticians and research workers, but graphs
are frequently used.
Limitations of Diagrams and Graphs:
• They help in simplifying the textual and tabulated facts to statistical table but not vice
versa.
• They give only a general idea of the data so as to make it readily intelligible, and thus furnish
only limited and approximate information.
• They are subjective in character and may therefore be interpreted differently by
different people.
• Not all diagrams and graphs are easy to construct. Two- and three-dimensional
diagrams and ratio graphs require more time and a great amount of expertise and skill
for their construction and interpretation, and are not readily perceptible to a
non-mathematical person.
• In the case of large figures, such a presentation fails to reveal small differences between them.
• The wrong type of diagram or graph may lead to very fallacious and misleading
conclusions.
• Diagrammatic presentation should be used only for comparison of different sets of
data which relate either to the same phenomenon or to different phenomena which are capable of
being measured.
DIAGRAMS AND GRAPHS
DIAGRAMS
• Simple bar diagram
• Multiple bar diagram
• Sub-divided or component bar diagram
• Percentage bar diagram
• Pie diagram or circular diagram
GRAPHS
• Histogram
• Frequency polygon
• Frequency curve
• Ogive (cumulative frequency curve)
- Less-than ogive
- More-than ogive
• Box plot
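As a small illustration of the two kinds of ogive, the sketch below computes the points that would be plotted for each. The class intervals and frequencies are hypothetical, chosen only for the example. The less-than ogive plots cumulative frequency against the upper class boundaries, and the more-than ogive plots it against the lower class boundaries.

```python
from itertools import accumulate

# Hypothetical grouped data: class intervals and their frequencies.
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [4, 9, 15, 8, 4]
total = sum(frequencies)

# Less-than ogive: running total up to each upper class boundary.
less_than_cf = list(accumulate(frequencies))
less_than_points = [(upper, cf) for (_, upper), cf in zip(class_limits, less_than_cf)]

# More-than ogive: observations remaining at or above each lower class boundary.
more_than_cf = [total - below for below in [0] + less_than_cf[:-1]]
more_than_points = [(lower, cf) for (lower, _), cf in zip(class_limits, more_than_cf)]

print("less-than ogive points:", less_than_points)
print("more-than ogive points:", more_than_points)
```

For these figures the less-than cumulative frequencies are 4, 13, 28, 36, 40 and the more-than cumulative frequencies are 40, 36, 27, 12, 4. When both curves are drawn on the same axes they cross at the median.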