
CE-613

Modeling, Analysis and Simulation


Major Topics
• Unit 1: Fundamentals of probability, distributions, and applications
• Unit 2: Fundamentals of Statistics, Hypothesis Testing, Analysis of Variance
• Unit 3: Linear Regression, General Linear Model
• Unit 4: Simulation and its Application
• Unit 5: Reliability Analysis and Risk Assessment
What will you learn
• Collect data on a problem and describe the data using graphical and descriptive
measures;
• Develop a probabilistic model and perform probability operations and evaluations;
• Perform statistical analyses of the data, such as histogram development, univariate
analysis, probability model selection, analysis of variance, hypothesis testing,
parameter estimation, confidence-interval estimation, and selection of sample sizes;
• Perform correlation and regression analyses for fitting a curve or a model to data;
• Perform random-number and random-variable generation for the Monte Carlo
simulation of selected random variables;
• Evaluate the reliability of a component of a system or the reliability of the entire
system;
• Perform risk analyses and risk-based decision making;
• Increase the efficiency of the process by using computers (Excel, R, Python, and others)
Probability and Statistics
• Probability: predicting the likelihood of future events
  • Population: given
  • Sample: need to find (not estimate) the chance of certain events
  • Example: given the information about the balls in a closed bag, find the probability of drawing a ball of a certain color
• Statistics: analysis of the frequency of past events
  • Population: need to estimate (parameters)
  • Sample: given
  • Example: given the color of the ball (drawn from the bag) in your hand, infer (estimate) the colors of the balls in the closed bag

[Diagram: Population -> Sample via sampling; Sample -> Population via inference, provided you have enough data or an appropriate sample size]
Data Analytics or Predictive Technology
• Descriptive Analytics
• Descriptive Statistics (Distribution, central tendency and dispersion)
• Exploratory Data Analysis (interactively discover and visualize trends, behaviors,
and relationships in data)
• Diagnostic Analytics
• Why did it happen? What factors caused this crash or congestion condition?
• Predictive Analytics
• Critical for anticipating future events well in advance and implementing corrective actions (e.g., detecting crash-prone conditions)
• Prescriptive Analytics
• Draws upon descriptive, diagnostic, and predictive analytics
• Modeling and evaluating various what-if scenarios through simulation techniques
to answer what should be done to maximize the occurrence of good outcomes
while preventing the occurrence of potentially bad outcomes
Data Detective: Steps in Data Analytics
1. Identify Problem and Objective
2. Develop Data Collection Plan
3. Clean and Manage Data
4. Analysis (Descriptive -> Diagnostic -> Predictive)
5. Interpretation, Discussion, and Conclusion
Course Text Book
• Mostly for guiding the syllabus
Data Measurement Scales: Qualitative
• Nominal/Categorical Scale (nominal ~ name)
• Label variables without quantitative value
• Data in category without specific order (Eye color, land use, transportation modes)
• Dichotomous: Failed and Passed, Male and Female
• Graphs: Bar, Pie
• Ordinal Scale (ord ~ order)
• Label/Categorize in natural order
• Examples: (1) agree, neutral, disagree; (2) strong, moderately strong, weak
• Size of steps between items is unequal/meaningless
• Graphs: Bar, Pie, Stem and Leaf
Data Measurement Scales: Quantitative
• Interval Scale (Interval = space in between)
• Labelling, in natural order and difference between categories is identical
• “Zero-point” does not mean absence of value. Zero is just a number on the scale by convention
• Negative value is possible
• Ratio is meaningless. We cannot say 20 deg C is twice as warm as 10 deg C
• Example: Temperature in deg C. Time shown on a clock.
• Location of zero point is not fixed. No pre-decided starting point or a true zero value
• Graphs: Bar, Pie, Stem and Leaf, Box plot and Histogram
• Ratio scale
• Items have order, equal intervals between units, and an absolute zero, so ratios are meaningful (the ratio scale provides the most detailed information)
• Example: Age, Weight, temperature in Kelvin
• Since an absolute zero exists, negative values do not exist
• The zero point characteristic makes ratio meaningful
• Example: Temperature in Kelvin scale. Travel time in minutes/hours
• 40 kg is twice as heavy as 20 kg
• Graphs: Bar, Pie, Stem and Leaf, Box plot and Histogram, Line Plot, Scatter Plot
Data Measurement Scales
Property                         Nominal     Ordinal                  Interval      Ratio
Example                          Eye Color   Level of Satisfaction    Temperature   Height
Named                            Yes         Yes                      Yes           Yes
Natural order                    -           Yes                      Yes           Yes
Equal intervals between values   -           -                        Yes           Yes
True zero (ratios meaningful)    -           -                        -             Yes
Types of Data
• Qualitative/Categorical
• Discrete: Only categories
• Graphs: Frequency distribution, Bar Plot, Pie Chart, Pareto Chart
• Quantitative/Numerical
• Discrete: data is countable and takes only whole-number values
• Continuous: data can take any value within a range
• Graphs: Histogram, Line Plot, Scatter Plot
What you can do with the data
Ability                                        Nominal   Ordinal   Interval   Ratio
1 Order values                                 -         Yes       Yes        Yes
2 Counts, frequency of distribution            Yes       Yes       Yes        Yes
3 Estimate mode                                Yes       Yes       Yes        Yes
4 Estimate median                              -         Yes       Yes        Yes
5 Estimate mean                                -         -         Yes        Yes
6 Quantify the difference between each value   -         -         Yes        Yes
7 Add and subtract values                      -         -         Yes        Yes
8 Multiply and divide values                   -         -         -          Yes
9 Quantify true/absolute zero                  -         -         -          Yes
Simulation for Data Analysis
• Real-world data may not cover extreme situations that are important in design
• Nothing stops extreme cases from occurring in the future
• But we do not know when, and therefore any system is always at risk
• Uncertainty is everywhere in engineering
• We can simulate the extreme cases to check how the system would behave
• Is the system ready for those events?
• To find we need to simulate such cases and assess the impact in a “model environment” and
then select a course of action
• We need to assess how risk-sensitive a system is by studying the consequences of hazardous events. That is where simulation comes in: it helps us treat uncertainty and quantify risk
• We can use simulation to sample and obtain realizations of outcomes and answer "what would happen in various scenarios?" This is also called "what-if analysis"
Transformation of random number
• Linear or Non-linear
• Continuous or discrete
• Rolling a fair die: each of the values 1 through 6 is equally probable. That is a uniform distribution in a discrete sense, since values other than those 6 are not possible.
• Similarly there is the uniform distribution in the continuous sense where each
value between (and including) two values is equally probable (or likely to
occur)
• Tossing a balanced/unbiased coin: Head or Tail are equally probable.
• How do we generate the flip of a coin using a die? Think about even/odd
• Can we transform the random outcome(s) of a coin to simulate the numbers generated by an unbiased six-sided die?
Simulating Data: Linear transformation
• Simulation is done by using a random number which can be mapped to a
variable of interest. E.g. Excel has the rand() function
• Example 1 → The height h of waves is between 0 and 5 meters. A random number u ∈ [0, 1] can be used to generate h
• h_min = 0 and h_max = 5
• h_random = u * (h_max − h_min) + h_min (this is a linear transformation)
• If u = {0.13, 0.69, 0.10}, then h = {0.65, 3.45, 0.50}
• The generated random value would depend on the distribution from which it is
generated/drawn. If a distribution is not provided, the random variable is
generally drawn from a uniform distribution.
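As a quick illustration, here is a minimal Python sketch of this linear transformation using only the standard library; the function name linear_transform is my own, not from the course:

```python
import random

def linear_transform(u, x_min, x_max):
    """Map u in [0, 1] onto [x_min, x_max] via a linear transformation."""
    return u * (x_max - x_min) + x_min

# Reproduce the slide's example: u = 0.13, 0.69, 0.10 -> h = 0.65, 3.45, 0.50
for u in [0.13, 0.69, 0.10]:
    print(linear_transform(u, 0.0, 5.0))

# In practice u comes from a uniform random generator, e.g. random.random()
h_random = linear_transform(random.random(), 0.0, 5.0)
print(h_random)
```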
Simulating Data: Non-linear transformation
• Example2 → Stress at the extreme fibers of a structural steel
beam of a bridge is given by
s = Mc / I ≤ f_y

• Here M is the applied moment, c is the distance from the neutral axis to the extreme fiber, and I is the centroidal moment of inertia of the cross section. f_y is the yield stress of the beam material.
• 𝑀, 𝑐, 𝐼, and 𝑓𝑦 are basic random variables that can assume any value within given ranges.
• We can simulate each individually, e.g. M_simulated = u * (M_max − M_min) + M_min (similarly c, I, and f_y can be simulated)
• u is a uniform continuous random variable over [0, 1]
• Then, using a non-linear transformation, the stress can be simulated as s_simulated = (M_simulated * c_simulated) / I_simulated
• s_simulated is the transformed random variable
• We can then check the failure criterion, s_simulated > f_y
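A hedged Monte Carlo sketch of this procedure in Python; the variable ranges and the trial count below are hypothetical placeholders chosen for illustration, not values from the course:

```python
import random

# Hypothetical ranges for the basic random variables (illustrative only)
M_min, M_max = 50.0, 150.0   # applied moment, kN*m
c_min, c_max = 0.10, 0.20    # extreme-fiber distance, m
I_min, I_max = 1e-4, 3e-4    # moment of inertia, m^4
f_y = 250_000.0              # yield stress, kN/m^2 (assumed constant here)

failures = 0
n_trials = 100_000
for _ in range(n_trials):
    # Simulate each basic variable with its own uniform random number
    M = random.random() * (M_max - M_min) + M_min
    c = random.random() * (c_max - c_min) + c_min
    I = random.random() * (I_max - I_min) + I_min
    s = M * c / I            # non-linear transformation
    if s > f_y:              # failure criterion
        failures += 1

print("Estimated failure probability:", failures / n_trials)
```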
Transformation from continuous to discrete
• How can we simulate a "fair" die using a random number u ∈ [0, 1]?
• Each value (1, 2, 3, …, 6) is equally likely with probability 1/6. The following transformation can be used to simulate outcome Y:
Y = 1 if X ≤ 1/6
    2 if 1/6 < X ≤ 2/6
    3 if 2/6 < X ≤ 3/6
    4 if 3/6 < X ≤ 4/6
    5 if 4/6 < X ≤ 5/6
    6 if 5/6 < X ≤ 1
• The transformation can also be depicted in the form of a graph using the cumulative probability
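A minimal Python sketch of this continuous-to-discrete transformation; the helper simulate_die is illustrative and mirrors the piecewise function above:

```python
import random

def simulate_die(u):
    """Transform u ~ Uniform[0, 1] into a fair die outcome Y in 1..6."""
    for face in range(1, 7):
        if u <= face / 6:
            return face

rolls = [simulate_die(random.random()) for _ in range(10)]
print(rolls)
```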
Simulating a Dice
• X is sampled from a uniform continuous random generator over [0, 1]
• For a six-faced fair die with faces labeled A through F, could we do the following transformation?

Y = A if X ≤ 1/6
    B if 1/6 < X ≤ 2/6
    C if 2/6 < X ≤ 3/6
    D if 3/6 < X ≤ 4/6
    E if 4/6 < X ≤ 5/6
    F if 5/6 < X ≤ 1
Simulating a Dice contd.
• X is sampled from a uniform continuous random generator over [0, 1]
• In the fair die example, could we do the following transformation?

Y = 1 if X ≤ 1/6
    2 if 1/6 < X ≤ 2/6
    6 if 2/6 < X ≤ 3/6
    5 if 3/6 < X ≤ 4/6
    4 if 4/6 < X ≤ 5/6
    3 if 5/6 < X ≤ 1

• What measurement scale is Y measured in?


Simulating Coin Flip Using a Dice
• How can you simulate the flip of a coin using a six-faced fair die?
• Coin flip: Y = Head if X ∈ {2, 4, 6}; Tail otherwise
• Could we do the following? State why or why not.
• Coin flip: Y = Head if X ≤ 3; Tail otherwise
Simulating Dice using a Coin Flip
• Can we simulate a four-faced die using coin flips? (See the sketch below.)
• How about simulating a six-faced die?
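One possible answer, sketched in Python under the assumption that a fair coin flip yields one random bit: two flips give four equally likely outcomes, while six faces need three flips plus rejection sampling. The helper names are my own:

```python
import random

def coin():
    """One fair coin flip: 0 = Tail, 1 = Head."""
    return random.randint(0, 1)

def four_faced_die():
    """Two flips give 4 equally likely outcomes, mapped to 1..4."""
    return 2 * coin() + coin() + 1

def six_faced_die():
    """Three flips give 8 equally likely outcomes (0..7); reject 6 and 7
    and retry, leaving 6 equally likely faces."""
    while True:
        value = 4 * coin() + 2 * coin() + coin()  # uniform on 0..7
        if value < 6:
            return value + 1

print(four_faced_die(), six_faced_die())
```

The rejection step is what keeps the six faces equally likely: mapping 8 outcomes directly onto 6 faces would make some faces more probable than others.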
Prob-1
• Assume the value of X is random and can take on values from 0 to 1
• Assume the following transformation graph relates X to the annual number of
fatal accidents (N) at an intersection:
• A random-number generator yields the following six values of X
• 0.37, 0.82, 0.64, 0.25, 0.02, 0.94
• How many fatal accidents occurred in each of the six years?
Prob-1 (Graphical solution)
• For X = {0.37, 0.82, 0.64, 0.25, 0.02, 0.94}, the values can be extracted from the graph
• You may come up with any other graphical representation as long as you can map X (from
uniform continuous distribution) to these frequencies

𝑓(𝑋 = 0.94) = 3
𝑓(𝑋 = 0.82) = 2
𝑓(𝑋 = 0.64) = 1

𝑓(𝑋 = 0.37) = 0
𝑓(𝑋 = 0.25) = 0

𝑓(𝑋 = 0.02) = 0
Prob-1 (Analytical Solution)
• Here is the transformation function
• f(X ≤ 0.4) = 0
• f(0.4 < X ≤ 0.7) = 1
• f(0.7 < X ≤ 0.9) = 2
• f(0.9 < X ≤ 1.0) = 3
• This function can be used to compute the number of fatal accidents corresponding to the generated random numbers (0.37, 0.82, 0.64, 0.25, 0.02, 0.94)
• Is this a linear or non-linear transformation?

X      f
0.37   0
0.82   2
0.64   1
0.25   0
0.02   0
0.94   3
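The same step function, written as a small Python helper (the function name is illustrative):

```python
def fatal_accidents(x):
    """Step (non-linear) transformation from the Prob-1 solution."""
    if x <= 0.4:
        return 0
    elif x <= 0.7:
        return 1
    elif x <= 0.9:
        return 2
    return 3

X = [0.37, 0.82, 0.64, 0.25, 0.02, 0.94]
print([fatal_accidents(x) for x in X])  # -> [0, 2, 1, 0, 0, 3]
```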
Prob-2
• The probabilities of the largest magnitude of an earthquake in any decade are
as follows:
Magnitude     3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9
Probability   0.78     0.13     0.04     0.03     0.01     0.01

• Construct a transformation graph that could transform a random number (E) over the range from 0 to 1 to the magnitude of an earthquake (M), where M takes on the center value for each interval of the magnitude
• For each of the following values of E, find the simulated value of M:
• E = {0.27, 0.62, 0.13, 0.49, 0.96, 0.06, 0.84}
Prob-2 (Analytical Solution)
• The individual probabilities are given; the cumulative probabilities (CP) can be computed.

M (Range)   M (center)   P(M)    P(value ≤ M)
3-4         3.5          0.78    0.78
4-5         4.5          0.13    0.91
5-6         5.5          0.04    0.95
6-7         6.5          0.03    0.98
7-8         7.5          0.01    0.99
8-9         8.5          0.01    1.00

• For E = {0.27, 0.62, 0.13, 0.49, 0.96, 0.06, 0.84} the corresponding values of M
are M={ 3.5, 3.5, 3.5, 3.5, 6.5, 3.5, 4.5}
• You can check this graphically using the following bar chart
Prob-2 (Graphical Solution)
• For E = {0.27, 0.62, 0.13, 0.49, 0.96, 0.06, 0.84} the corresponding values of M
are M={ 3.5, 3.5, 3.5, 3.5, 6.5, 3.5, 4.5}

𝑀(𝐸 = 0.96) = 6.5


𝑀(𝐸 = 0.84) = 4.5

𝑀(𝐸 = 0.62) = 3.5


𝑀(𝐸 = 0.49) = 3.5

𝑀(𝐸 = 0.27) = 3.5


𝑀(𝐸 = 0.13) = 3.5
𝑀(𝐸 = 0.06) = 3.5
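A small Python sketch of this inverse-CDF lookup, using the cumulative probabilities from the table above (the helper name is illustrative):

```python
# Cumulative probabilities and interval-center magnitudes from the table
cum_prob = [0.78, 0.91, 0.95, 0.98, 0.99, 1.00]
center_M = [3.5, 4.5, 5.5, 6.5, 7.5, 8.5]

def magnitude(e):
    """Map e ~ Uniform[0, 1] to a magnitude via the cumulative probabilities."""
    for cp, m in zip(cum_prob, center_M):
        if e <= cp:
            return m

E = [0.27, 0.62, 0.13, 0.49, 0.96, 0.06, 0.84]
print([magnitude(e) for e in E])  # -> [3.5, 3.5, 3.5, 3.5, 6.5, 3.5, 4.5]
```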
Types of Graphs
Primary types of charts
• Pictograph
• Using pictures to represent counts
• Pie Chart
• Representing fraction of total
• Bar Chart
• Comparing counts
• Column Chart
• Comparing counts within multiple groups
• Tree Map
• Substitute for a pie chart: values are represented by rectangular areas rather than sectors of a circle
• Can display hierarchical data using nested rectangles
• Scatter Plot
• Comparing two or more numeric variables so as to identify the underlying relationships
• Line Chart
• Used to illustrate mathematical equations that might have been formulated using the scatter plot
Pictograph
Pie Chart: Pros and cons
• A pie chart displays data as a percentage of the whole. Each pie section should have a label and
percentage. A total data number should be included.
• Advantage
• Visually appealing
• Shows percent of total for each category

• Disadvantage
• No exact numerical data
• Hard to compare 2 data sets
• "Other" category can be a problem
• Total unknown unless specified
• Best for 2-3 categories
• Use only with discrete data
Bar/Column Chart
[Figures: a simple bar/column chart and a column chart with groups]
Pie Chart versus Bar Chart
Treemap
• Comparison is easier with rectangles than with the sectors of a circle
Scatter Plot
Line based on the scatter plot
Errors or Miscommunication with Data Visualization
Start from nothing
• Bar charts are generally easy to read
• Simply compare the relative heights of the bars
• Truncation equals misrepresentation
Distorted reality
Ditch the pie
Size matters
Over the rainbow
Spare the ink
A dimension too far
Stick to the point
Keep it simple
• Annotation is the most straightforward, but
often the most neglected
A tale of two stories
• Charting two sets of data with one scale on
the left and another on the right can be
confusing, and suggests a relationship that
may not exist.
Stand on the right
Back to basics
• If you have only one or two data points, just state them; a graph is not necessary.
Data Analysis Fallacies to Avoid
Texas Sharpshooter Bias
• Arises when a person has a large amount of data but focuses only on a small subset of it, in many cases because this subset leads to the most interesting conclusion.
• It is named after a fictitious sharpshooter who lets off a lot of shots at the side
of a barn, looks at it, finds a tight grouping of hits, paints a target around it,
and then claims to be a “great sharpshooter”.
• This bias is related to the clustering illusion, which is the tendency in human
cognition to interpret patterns where none actually exist.
• To Avoid: First formulate a hypothesis, and then test it. Do not use
the same information to both construct and test your hypothesis.
The Gambler’s Fallacy, or the Monte Carlo Fallacy
• Belief that if something happens more frequently than normal, it will
happen less frequently in the future.
• Consider a coin toss. After fifteen “heads” in a row, you might feel
there must be an end to the pattern. You may think the chances of
“tails” on the next flip are higher than before. However, this is not
true. The coin does not have any memory of past flips, and the
chances of flipping “heads” or “tails” do not change over time.
• To Avoid: Ensure you evaluate whether your assumptions are based on statistical likelihood or mere personal intuition.
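A quick simulation makes this concrete. The sketch below (plain Python, names my own) flips a long sequence of fair coins and records the outcome immediately following every 15-head streak; the recorded fraction of heads hovers around 0.5, not below it:

```python
import random

# Flip a long sequence; whenever the previous 15 flips were all heads,
# record the outcome of the next flip.
streak, after_streak = 0, []
for _ in range(5_000_000):
    flip = random.randint(0, 1)  # 1 = heads
    if streak >= 15:
        after_streak.append(flip)
    streak = streak + 1 if flip == 1 else 0

# Streaks of 15+ heads are rare, so this estimate is rough but centers on 0.5
print("flips observed after a 15-head streak:", len(after_streak))
print("fraction heads:", sum(after_streak) / len(after_streak))
```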
Simpson’s Paradox or Yule-Simpson’s Effect
• A trend appears in different groups of data, but disappears when these groups are
combined.
• A famous example occurred in the 1970s when UC Berkeley was accused of sexism.
Female applicants were less likely to be accepted than male applicants. However,
when researchers tried to find the source of the problem, they noticed that for
individual departments, the acceptance rates were better for women than for men.
The paradox existed because a greater proportion of the female applicants were
applying to highly competitive departments that had lower acceptance rates for both
genders.
• To Avoid: Difficult to overcome beforehand. If you ever encounter this weird
phenomenon by finding a bias that reverses if you look at different groups within
your data, then know that you have not necessarily made a mistake. You may simply
have found an example of Simpson’s paradox.
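A tiny numeric sketch in Python shows how the reversal can happen; the admission counts below are hypothetical, chosen only to reproduce the effect, and are NOT the actual Berkeley data:

```python
# Hypothetical counts per department:
# (women admitted, women applicants, men admitted, men applicants)
depts = {
    "A (competitive)":      (15, 100, 2, 20),
    "B (less competitive)": (19, 20, 150, 200),
}

w_in = w_app = m_in = m_app = 0
for name, (wi, wa, mi, ma) in depts.items():
    print(f"{name}: women {wi / wa:.2f}, men {mi / ma:.2f}")
    w_in += wi; w_app += wa; m_in += mi; m_app += ma

# Women do better in every department, yet worse in the aggregate,
# because most women applied to the competitive department
print(f"overall: women {w_in / w_app:.2f}, men {m_in / m_app:.2f}")
```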
Cherry Picking
• Practice of selecting results that fit your claim and excluding those
results which do not fit your claim.
• One of the most harmful examples of dishonesty in data analysis, and one of the simplest to overcome.
• May lead to poor choices and negative consequences.
• To Avoid: Make sure you use the entire body of results.
Data Dredging
• Data Dredging, Data Fishing, Data Snooping, Data Butchery.
• Misuse of data analysis to find patterns when there is no real underlying
causality/relationship.
• This is done by performing many tests and only looking at the ones that
come back with an interesting result.
• With Cherry Picking you pick the data that is most interesting, and with Data
Dredging you pick the conclusion that is most interesting.
• The Texas Sharpshooter Bias is actually a specific version of Data Dredging,
and so the solution is the same:
• To Avoid: First formulate a hypothesis, and then test it. Do not use the same
data to both construct and test your hypothesis.
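A short Python sketch of the mechanism: run 100 comparisons on pure noise and count how many look "significant". The t-statistic and the 2.0 cutoff (roughly p < 0.05 at this sample size) are rough illustrations, not a rigorous test:

```python
import random
import statistics

random.seed(1)

def noise_experiment(n=30):
    """Compare two groups drawn from the SAME distribution:
    any 'significant' difference is a false positive."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # crude two-sample t statistic
    t = (statistics.mean(a) - statistics.mean(b)) / (
        (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    )
    return abs(t) > 2.0  # roughly p < 0.05 for this sample size

hits = sum(noise_experiment() for _ in range(100))
print(f"{hits} of 100 tests look 'significant' despite pure noise")
```

Around 5 of the 100 tests come back "interesting" even though no relationship exists; reporting only those is data dredging.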
Survivorship Bias or Survival Bias
• Error of focusing on people or things that make it past a selection process and overlooking
those that did not make it past a selection process, because of their lack of visibility.
• It is named after the common fallacy we all experience when sharing information following a
dangerous incident. You could be fooled into thinking that the incident under discussion was
not particularly dangerous because everyone you communicate with afterwards survived.
However, it may be that a number of people died in the incident. These people would not be
able to add their voice to the conversation, which leads to the bias.
• In highly competitive careers you often hear movie stars, athletes, and musicians tell the
story of how the determined individual who pursues their dreams will beat the odds.
However, there is much less focus on the people that are similarly skilled and determined
but do not succeed due to factors beyond their control. This leads to the false perception
that anyone can achieve great things, whereas the reality is often a lot less equal.
• To Avoid: When concluding something about the data that has survived a selection process,
make sure you do not generalise this conclusion for the entire population. The incident may
not be harmless just because the survivors survived. You must also consider those who did not survive.
False Causality
• Often when two variables correlate, our brains tend to make up a story, and
find causation.
• “Children who watch a lot of TV are the most violent.” easily leads us to believe
that TV makes children more violent, but it could easily be the other way
around (violent children like watching TV more than less violent children).
There could be an underlying cause (bored children tend to be more violent,
and tend to watch more TV), or it could be a complete coincidence.
• When you see correlation, no conclusion can be drawn regarding the existence or the direction of a cause-and-effect relationship. Luckily, in practice, you often do not even need to know about causality.
• To Avoid: Knowing about correlation is in a lot of cases enough. And if you do
need to know about causation, then you need to do more research.
Sampling Bias
• Bias that occurs when your data sample does not accurately represent the
population.
• A classic example occurred in the 1948 US presidential election. The Chicago
Tribune printed the headline “Dewey defeats Truman”, expecting Thomas
Dewey to become the next US president. However, they had not considered
that their survey was done via telephone and that only the most prosperous
part of the population owned a telephone. This caused a bias in the data, and
so they were mistaken about Dewey’s victory.
• To Avoid: To prevent yourself from falling for the Sampling Bias, make sure
that your data sample represents the population accurately to generalize
sample findings to population.
Hawthorne Effect or Observer Effect
• Named after experiments done in the Hawthorne factories of Western Electric.
• Effect that something changes just because you are observing it. This is
something you will encounter often when collecting data on human research
subjects
• Scientists researched the influence of working conditions, like light and
heating, on the productivity of the workers. To their surprise, both the group
where they did change the working conditions and the group where they did
not change anything, showed better productivity during the experiment. After
some additional research, the scientists concluded that both groups’
productivity increased because they found the scientists’ interest motivating
and stimulating.
• To Avoid: It is still good practice to ask yourself whether you are in any way
affecting the data by collecting it.
McNamara Fallacy
• The mistake of making a decision based solely on quantitative metrics while ignoring all other information.
• It is named for Robert McNamara, the US secretary of defense (1961 – 1968),
who measured success in the Vietnam War by enemy body count. This meant
that other relevant insights like the mood of the US public and feelings of the
Vietnamese people were largely ignored.
• To Avoid: Although data and numbers can tell you a lot, you should not obsess
over optimising numbers while ignoring all other information.
Overfitting
• A sophisticated explanation will often describe your data better than a simple
one. However, a simple explanation is usually more representative of the actual
world than a complex explanation. Simpler models are usually more robust,
and so are a better fit for new data.
• The more complex the explanation, the better it works for the data that you
already have. However, it is likely to involve you needing to explain random
variations that you captured in your data. As soon as you add more or new
data, the random variations break down.
• Best-known fallacy. It is also an easy one to fall for, and a difficult one to
prevent.
• To Avoid: If you find yourself coming up with very good results on your test
data, but not when you proceed to test these theories on new data, then you
might be overfitting. In that case redo model formulation.
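A short numpy sketch of the symptom: a degree-9 polynomial fits the training data almost perfectly but generalizes worse than a straight line. The data, noise level, and degrees are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# The true relationship is linear; observations carry noise
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(0, 0.2, x_test.size)

# numpy may warn about conditioning at degree 9; that warning is
# itself a symptom of overfitting the 10 training points
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The complex model "explains" the random variations in the training sample, and those variations do not reappear in new data.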
