Noakhali Science and Technology University
Sonapur, Noakhali.
Department of Environmental Science & Disaster Management
Submitted to
Imrul Kayes
Assistant Professor
Department of Environmental Science & Disaster Management
Submitted by
Roll: ASH1609075M
Session: 2015-16
Part 1
Linear regression:
Linear regression is one of the most widely used statistical techniques in the life and earth sciences. It is used to model the relationship between a response (also called dependent) variable y and one or more explanatory (also called independent or predictor) variables x1, x2, …, xn.
For example, we could use linear regression to test whether temperature (the explanatory variable) is a good predictor of plant height (the response variable).
In simple linear regression, with a single explanatory variable, the model takes the form:
y = α + βx + ε
where α is the intercept (the value of y when x = 0), β is the slope (the amount of change in y for each unit of x), and ε is the error term. It is the inclusion of the error term, also called the stochastic part of the model, that makes the model statistical rather than mathematical. The error term is drawn from a statistical distribution that captures the random variability in the response; in standard linear regression this is assumed to be a normal (Gaussian) distribution.
The "linear" in "linear model" does not imply a straight-line relationship but rather that the
response is a linear (additive) combination of the effects of the explanatory variables.
However, because we tend to start by fitting the simplest relationship, many linear models
are represented by straight lines.
A linear regression is just a special case of a linear model, where both the response and
predictor variables are continuous.
Example:
One of the most common graphs in science plots one measurement variable on the x (horizontal) axis vs. another on the y (vertical) axis. For example, I dusted off the elliptical machine in our basement and measured my pulse after one minute of ellipticizing at various speeds:
Speed (kph)   Pulse (bpm)
0             57
1.6           69
3.1           78
4             80
5             85
6             87
6.9           90
7.7           92
8.7           97
12.4          108
15.3          119
y = β0 + β1x + ε
where:
y is the predicted value of the dependent variable for any given value of the independent variable x;
β0 is the intercept, the predicted value of y when x is 0;
β1 is the regression coefficient: how much we expect y to change as x increases by one unit;
x is the independent variable (the variable we expect is influencing y);
ε is the error term: the random variation in y that the model does not explain.
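A minimal sketch of fitting this model in R with the speed and pulse data from the table above; lm() is base R's linear-model function, and the variable names are my own:

    # Simple linear regression of pulse on speed, using the data above.
    speed <- c(0, 1.6, 3.1, 4, 5, 6, 6.9, 7.7, 8.7, 12.4, 15.3)
    pulse <- c(57, 69, 78, 80, 85, 87, 90, 92, 97, 108, 119)
    fit <- lm(pulse ~ speed)          # fits pulse = b0 + b1*speed + error
    summary(fit)                      # estimates, standard errors, R-squared
    plot(speed, pulse); abline(fit)   # scatter plot with the fitted line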
Multiple linear regression
Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.
Y = b + m1x1 + m2x2 + m3x3
where b is the constant (intercept) and m1, m2, m3 are the regression coefficients of the predictors x1, x2, x3.
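A minimal sketch in R, assuming made-up data, showing that each predictor gets its own coefficient:

    # Multiple linear regression with simulated data (values are illustrative).
    set.seed(1)
    d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
    d$y <- 2 + 1.5 * d$x1 - 0.8 * d$x2 + 0.3 * d$x3 + rnorm(50)
    fit <- lm(y ~ x1 + x2 + x3, data = d)   # one coefficient per predictor
    summary(fit)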
Nonlinear regression
Nonlinear regression is a form of regression analysis in which data are fit to a model expressed as a mathematical function. Simple linear regression relates two variables (X and Y) with a straight line (y = mx + b), while nonlinear regression relates the two variables through a nonlinear (curved) relationship.
Nonlinear regression modeling is similar to linear regression modeling in that both seek to track
a particular response from a set of variables graphically. Nonlinear models are more complicated
than linear models to develop because the function is created through a series of approximations
(iterations) that may stem from trial-and-error. Mathematicians use several established methods,
such as the Gauss-Newton method and the Levenberg-Marquardt method.
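As a minimal sketch, R's nls() function fits such models iteratively (its default algorithm is Gauss-Newton); the exponential curve and simulated data below are my own illustration:

    # Nonlinear regression: fitting y = a * exp(b * x) to simulated data.
    set.seed(2)
    x <- seq(0, 10, length.out = 40)
    y <- 3 * exp(0.25 * x) + rnorm(40, sd = 0.5)
    fit <- nls(y ~ a * exp(b * x), start = list(a = 1, b = 0.1))  # iterative fit
    summary(fit)   # estimates of a and b found by successive approximations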
Analysis of variance
Analysis of variance (ANOVA) is a statistical analysis tool that splits the observed aggregate variability in a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, while the random factors do not. Analysts use the ANOVA test to determine the influence that independent variables have on the dependent variable in a regression study.
One-Way ANOVA
A one-way ANOVA is a type of statistical test that compares the variance in the group means
within a sample whilst considering only one independent variable or factor. It is a hypothesis-
based test, meaning that it aims to evaluate multiple mutually exclusive theories about our data.
Hypothesis
In a one-way ANOVA there are two possible hypotheses.
The null hypothesis (H0) is that there is no difference between the groups: all group means are equal.
The alternative hypothesis (H1) is that at least one group mean differs from the others.
Two-Way ANOVA
A two-way ANOVA is, like a one-way ANOVA, a hypothesis-based test. However, in the two-way ANOVA each sample is defined in two ways and, as a result, is put into two categorical groups. The two-way ANOVA therefore examines the effect of two factors (here, month and gender) on a dependent variable (in this case, weight), and also examines whether the two factors affect each other to influence the continuous variable.
For example, thinking about walruses, researchers might use a two-way ANOVA if their question is: "Are walruses heavier in early or late mating season, and does that depend on the gender of the walrus?" In this example, both "month in mating season" and "gender of walrus" are factors, meaning in total there are two factors. Once again, each factor's number of groups must be considered; for "gender" there will be only two groups, "male" and "female".
Hypothesis
Because the two-way ANOVA considers the effect of two categorical factors, and the effect of the categorical factors on each other, there are three pairs of null and alternative hypotheses for the two-way ANOVA. Here we present them for our walrus experiment, where month of mating season and gender are the two independent variables:
H0: month of mating season has no effect on walrus weight; H1: month of mating season affects walrus weight.
H0: gender has no effect on walrus weight; H1: gender affects walrus weight.
H0: there is no interaction between month and gender; H1: month and gender interact to affect walrus weight.
Assumptions
A two-way ANOVA also makes the following assumptions:
Dependent variable – here, "weight" – should be continuous, that is, measured on a scale which can be subdivided using increments (e.g. grams, milligrams)
Two independent variables – here, “month” and “gender”, should be in categorical,
independent groups.
Sample independence – that each sample has been drawn independently of the other
samples
Variance Equality – That the variance of data in the different groups should be the same
Normality – That each sample is taken from a normally distributed population
Three-Way ANOVA
Example
For example, a pharmaceutical company may do a three-way ANOVA to determine the effect of a drug on a medical condition. One factor would be the drug, another may be the gender of the subject, and another may be the ethnicity of the subject. These three factors may each have a distinguishable effect on the outcome, and they may also interact with each other. The drug may have a positive effect on male subjects, for example, but it may not work on males of a certain ethnicity. A three-way ANOVA allows the scientist to quantify the effects of each factor and to test whether the factors interact.
Fixed effects model:
In statistics, a fixed effects model is a statistical model in which the model parameters are fixed or non-random quantities. This is in contrast to random effects models and mixed models, in which all or some of the model parameters are random variables. In many applications, including econometrics and biostatistics, a fixed effects model refers to a regression model in which the group means are fixed (non-random), as opposed to a random effects model in which the group means are a random sample from a population.
Analysis of variance:
Analysis of variance (ANOVA) is a collection of statistical models and their associated
estimation procedures (such as the "variation" among and between groups) used to analyze the
differences among group means in a sample. ANOVA was developed by the statistician Ronald
Fisher. The ANOVA is based on the law of total variance, where the observed variance in a
particular variable is partitioned into components attributable to different sources of variation. In
its simplest form, ANOVA provides a statistical test of whether two or more population means
are equal, and therefore generalizes the t-test beyond two means.
One Way ANOVA:
A one-way ANOVA is used to compare the means of two or more independent (unrelated) groups using the F-distribution. The null hypothesis for the test is that all group means are equal. Therefore, a significant result means that at least two of the means are unequal.
Situation : You have a group of individuals randomly split into smaller groups and completing
different tasks. For example, you might be studying the effects of tea on weight loss and form
three groups: green tea, black tea, and no tea.
In a one-way ANOVA, variability is due to the differences between groups and the differences
within groups. In factorial ANOVA, each level and factor are paired up with each other
(“crossed”). This helps you to see what interactions are going on between the levels and factors.
If there is an interaction then the differences in one factor depend on the differences in another.
Let’s say you were running a two-way ANOVA to test male/female performance on a final
exam. The subjects had either had 4, 6, or 8 hours of sleep.
A one way ANOVA will tell you that at least two groups were different from each other. But it
won’t tell you which groups were different.
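A minimal sketch in R for the tea example, assuming simulated weight-loss data; a post-hoc test such as TukeyHSD() is one common way to find out which groups differ:

    # One-way ANOVA for three tea groups; the data are made up.
    set.seed(4)
    tea <- factor(rep(c("green", "black", "none"), each = 12))
    mu  <- c(green = 3, black = 2.5, none = 1)        # assumed group means (kg)
    loss <- rnorm(36, mean = mu[as.character(tea)], sd = 1)
    fit <- aov(loss ~ tea)
    summary(fit)    # overall F-test: do the group means differ at all?
    TukeyHSD(fit)   # post-hoc: which pairs of groups differ?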
A Two Way ANOVA is an extension of the One Way ANOVA. With a One Way, you have one
independent variable affecting a dependent variable. With a Two Way ANOVA, there are two
independents. Use a two way ANOVA when you have one measurement variable (i.e. a
quantitative variable) and two nominal variables. In other words, if your experiment has a
quantitative outcome and you have two categorical explanatory variables, a two way ANOVA is
appropriate.
A two-way factorial ANOVA would help you answer the following questions:
1. Is sex a main effect? In other words, do men and women differ significantly on their
exam performance?
2. Is sleep a main effect? In other words, do people who have had 4, 6, or 8 hours of sleep differ significantly in their performance?
3. Is there a significant interaction between factors? In other words, how do hours of sleep
and sex interact with regards to exam performance?
4. Can any differences in sex and exam performance be found in the different levels of
sleep?
The results from a Two Way ANOVA will calculate a main effect and an interaction effect. The
main effect is similar to a One Way ANOVA: each factor’s effect is considered separately. With
the interaction effect, all factors are considered at the same time. Interaction effects between
factors are easier to test if there is more than one observation in each cell. For the above example, multiple exam scores could be entered into cells. If you do enter multiple observations into cells, the number in each cell must be equal.
Two null hypotheses are tested if you are placing one observation in each cell. For this example, those hypotheses would be:
H01: the mean exam scores of men and women are equal (sex is not a main effect).
H02: the mean exam scores are equal across the sleep groups (sleep is not a main effect).
For multiple observations in cells, you would also be testing a third hypothesis.
H03: The factors are independent or the interaction effect does not exist.
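A minimal sketch of this two-way design in R, assuming simulated and balanced exam data:

    # Two-way ANOVA: exam score by sex and hours of sleep; data are made up.
    set.seed(5)
    d <- expand.grid(sex = c("male", "female"), sleep = factor(c(4, 6, 8)))
    d <- d[rep(1:6, each = 5), ]              # 5 students per cell (balanced)
    d$score <- rnorm(30, mean = 70, sd = 8)   # illustrative scores only
    fit <- aov(score ~ sex * sleep, data = d)
    summary(fit)   # rows: sex main effect, sleep main effect, sex:sleep interaction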
Three-way ANOVA:
A three-way ANOVA has three factors (independent variables) and one dependent variable. For example, time spent studying, prior knowledge, and hours of sleep are factors that affect how well you do on a test.
Factors and Levels in a Three-Way ANOVA:
Let’s say you wanted to find out if there is an interaction between income, age, and gender for
how much anxiety job applicants experience at job interviews. The amount of anxiety is the
outcome, or the variable that can be measured. The categorical variables Gender, Age, and
Income are the three factors.
Factors can also be split into levels. In the above example, income could be split into three
levels: low, middle and high income. Age could be split into multiple levels (e.g. under 30, 30-
50, over 50). Gender could be split into three levels: male, female, and transgender. If you’re
working with treatment groups, you’ll want to include all possible combinations of all factors. In
this example there would be 3 x 3 x 3 = 27 treatment groups.
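A minimal sketch of the 27-cell design in R, assuming simulated anxiety scores:

    # Three-way ANOVA: anxiety by income, age, and gender; data are made up.
    set.seed(6)
    d <- expand.grid(income = c("low", "middle", "high"),
                     age    = c("under 30", "30-50", "over 50"),
                     gender = c("male", "female", "transgender"))
    d <- d[rep(1:27, each = 4), ]                # 4 applicants per treatment group
    d$anxiety <- rnorm(108, mean = 50, sd = 10)  # illustrative scores only
    fit <- aov(anxiety ~ income * age * gender, data = d)
    summary(fit)   # three main effects, three two-way interactions, one three-way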
Part 2
Generalized Linear Model:
The term general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only). The form is yi ∼ N(xiᵀβ, σ²), where xi contains known covariates and β contains the coefficients to be estimated. These models are fit by least squares and weighted least squares using, for example, SAS Proc GLM or the R functions lsfit() (older, uses matrices) and lm() (newer, uses data frames).
The term generalized linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989).
Generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models, etc. The summary of GLM components below follows Agresti (ch. 4, 2013):
1. Random Component – refers to the probability distribution of the response variable (Y); e.g. a normal distribution for Y in linear regression, or a binomial distribution for Y in binary logistic regression. Also called the noise model or error model: it describes how random error is added to the prediction that comes out of the link function.
2. Systematic Component – specifies the explanatory variables (X1, X2, ..., Xk) in the model, more specifically their linear combination in creating the so-called linear predictor; e.g., β0 + β1x1 + β2x2, as we have seen in linear regression or as we will see in logistic regression.
3. Link Function, η or g(μ) - specifies the link between random and systematic components. It
says how the expected value of the response relates to the linear predictor of explanatory
variables; e.g., η = g(E(Yi)) = E(Yi) for linear regression, or η = logit(π) for logistic
regression.
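As a minimal sketch of the three components, R's glm() fits a binary logistic regression; the data below are simulated:

    # GLM: binomial random component, logit link, linear predictor b0 + b1*x.
    set.seed(7)
    x <- rnorm(100)
    y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * x))  # assumed true model
    fit <- glm(y ~ x, family = binomial(link = "logit"))
    summary(fit)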
Examples:
The following are examples of GLM components for models with which we are already familiar, such as linear regression, and for some models covered later, such as logistic regression and log-linear models.
Limitations:
The systematic component can contain only a linear predictor (a linear function of the parameters).
Responses must be independent.
Although many software packages still refer to certain procedures as 'GLM', the concept of the general linear model is the subject of some debate concerning just how 'general' it is.
Stroup prefers the term generalized linear mixed model (GLMM), of which GLM is a subtype. GLMMs combine GLMs with mixed models, which allow random effects (GLMs have only fixed effects). However, the GLMM is a newer approach: "GLMMs are still part of the statistical frontier, and not all the answers about how to use them are known (even by experts)" (Bolker).
Mixed effects model:
A mixed effects model is a statistical model containing both fixed effects and random effects. These models are useful in a wide variety of disciplines in the physical, biological, and social sciences. They are particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal studies), or where measurements are made on clusters of related statistical units. Because of their advantage in dealing with missing values, mixed effects models are often preferred over more traditional approaches such as repeated measures ANOVA.
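A minimal sketch in R, assuming the lme4 package is installed; the repeated measurements are simulated, with a random intercept for each sampling site:

    # Mixed effects model: fixed slope for x, random intercept per site.
    library(lme4)
    set.seed(8)
    d <- data.frame(site = factor(rep(1:10, each = 6)), x = rnorm(60))
    d$y <- 2 + 0.5 * d$x + rnorm(10, sd = 1)[d$site] + rnorm(60, sd = 0.5)
    fit <- lmer(y ~ x + (1 | site), data = d)
    summary(fit)   # fixed effects plus the estimated random-effect variances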
Thus, mixed effects modeling is used in different sectors of environmental science, which are described below.
The dynamics of specific plant parts and individual plants within a community are key determinants of plant and vegetation structure. For example, the long-lived leaves of some conifers may develop into very deep and dense canopies compared with those of deciduous hardwood trees, which shed their leaves annually.
Figure: Plant function and structure-dynamics, shaped by plant competition, arrivals, microbes (mycorrhizae, rhizosphere), parasites, mutualists, pollinators, herbivores, and pathogens.
Models are created for plant structure and population dynamics, with disturbance limits set by climatic and catastrophic events.
In this case, spatio-temporal data consisting of Landsat images of the study area from past years are first collected; the data are polygons with numerical attributes (year and area) and categorical attributes (land use and soil type).
Secondly, the data are analyzed using four mixed models able to incorporate both the fixed and the random effects underlying the clustered data.
The proposed models allow the analysis of complex data structures, such as multilevel data, taking into account the particularities of each land type as a function of year; the models are fitted to identify land-use change over time.
Forest inventory data often consist of measurements taken on field plots as well as values predicted from statistical models, e.g. tree biomass.
As mixed models and their applications in forestry become more generally understood, their use in forest management is increasing day by day.
Human infrastructures can modify ecosystems, thereby affecting the occurrence and spatial distribution of organisms as well as ecosystem functionality. In such cases, the large-scale, long-term effects of important human alterations of benthic habitats can be studied with an integrated approach combining engineering and ecological modeling, an investigation that is impossible without mixed effects models.
Mixed effects modeling is also helpful for describing the species distribution in an area: it helps us to track changes in the distribution, habitat, food habits, etc. of a species.
All of the sectors above would be difficult or impossible to investigate without mixed effects modeling, so it is clear that mixed effects modeling is essential in environmental science.
Time series:
Here, time is just a way in which one can relate the entire phenomenon to suitable reference points. Time can be hours, days, or years.
A time series depicts the relationship between two variables: time is one of those variables, and the second is some quantitative variable. It is not necessary that the relationship always shows an increase in the variable with reference to time; the relation may be decreasing too.
The most important use of studying time series is that it helps us to predict the future behavior
of the variation based on past experience.
It is helpful for business planning, as it helps in comparing the actual current performance with the expected one.
From time series, we get to study the past behavior of the phenomenon or the variable under
consideration.
We can compare the changes in the values of different variables at different times or places etc.
Components of time series:
1. Trend
2. Seasonal variations
3. Cyclic variations
4. Random or irregular movements
Seasonal and cyclic variations together constitute the short-term movements.
Trend:
The trend shows the general tendency of the data to increase during a long period of time. A trend is a
smooth, general long-term average tendency. It’s not always necessary that the increase or decrease is
in the same direction throughout the given period of time.
If we plot the time series values on a graph in accordance with time t, the pattern of the data clustering shows the type of trend: if the data cluster around a more or less straight line, the trend is linear; if the pattern of clustering shows a curved line, the trend is nonlinear (curvilinear).
Seasonal variations:
These are the rhythmic forces which operate in a regular and periodic manner over a span of less than a year. They have the same or almost the same pattern during a period of 12 months. This variation will be present in a time series if the data are recorded hourly, daily, weekly, quarterly, or monthly.
Periodic fluctuation: cyclic variations
The variations in a time series which operate over a span of more than one year are the cyclic variations. This oscillatory movement has a period of oscillation of more than a year; one complete period is a cycle. This cyclic movement is sometimes called the business cycle. It is a four-phase cycle comprising the phases of prosperity, recession, depression, and recovery.
Random or irregular movements:
There is another factor which causes variation in the variable under study. These variations are not regular; they are purely random or irregular, mainly unforeseen, uncontrolled, unpredictable, and erratic. Such forces are earthquakes, wars, floods, famines, and other disasters.
Mathematically, a time series can be written as
Yt = f(t)
Here, Yt is the value of the variable under study at time t. If the population is the variable under study at various times t1, t2, …, tn, then the time series is the set of values Yt1, Yt2, …, Ytn.
If Yt is the time series value at time t, and Tt, St, Ct, and Rt are the trend, seasonal, cyclic, and random fluctuations at time t respectively, then according to the additive model a time series can be expressed as
Yt = Tt + St + Ct + Rt
This model assumes that all four components of the time series act independently of each other.
The multiplicative model assumes that the various components in a time series operate proportionately to each other. According to this model,
Yt = Tt × St × Ct × Rt
Mixed models:
Yt = Tt + St + Ct × Rt
(one of several possible combinations of the additive and multiplicative forms)
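As a minimal illustration, base R's decompose() splits a series into trend, seasonal, and random components under either the additive or the multiplicative model (cyclic movements are absorbed into the trend); AirPassengers is a built-in monthly data set:

    # Classical decomposition of a monthly time series.
    dec <- decompose(AirPassengers, type = "multiplicative")  # Yt = Tt * St * Rt
    plot(dec)   # panels: observed, trend, seasonal, random
    # type = "additive" would assume Yt = Tt + St + Rt instead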
Assignment-2
Model Evaluation, Analysis and Optimization
Model Evaluation: Model evaluation is an integral part of the model development process. It helps us to find the best model that represents our data and to judge how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data science because it can easily generate overoptimistic and overfitted models.
Basically, there are two methods of evaluating models in data science: Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance.
Hold-Out: In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set
2. Validation set
3. Test set
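A minimal sketch of such a split in R, assuming a 60/20/20 division of the built-in iris data set:

    # Hold-out method: random train/validation/test split.
    set.seed(9)
    n <- nrow(iris)
    idx <- sample(rep(c("train", "valid", "test"), times = n * c(0.6, 0.2, 0.2)))
    train <- iris[idx == "train", ]   # used to fit candidate models
    valid <- iris[idx == "valid", ]   # used to choose between models
    test  <- iris[idx == "test", ]    # used once, for the final performance estimate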
Cross-Validation:
When only a limited amount of data is available, to achieve an unbiased estimate of the model
performance we use k-fold cross-validation. In k-fold cross-validation, we divide the data
into k subsets of equal size. We build models k times, each time leaving out one of the subsets
from training and use it as the test set. If k equals the sample size, this is called "leave-one-out".
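A minimal sketch of k-fold cross-validation in R (k = 5), using the built-in mtcars data and a simple linear model as the model under evaluation:

    # 5-fold cross-validation: average out-of-fold RMSE.
    set.seed(10)
    k <- 5
    folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels
    rmse <- numeric(k)
    for (i in 1:k) {
      fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])  # train on k-1 folds
      pred <- predict(fit, newdata = mtcars[folds == i, ])    # predict held-out fold
      rmse[i] <- sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
    }
    mean(rmse)   # cross-validated estimate of prediction error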
The Importance of Model Evaluation Technique:
1. Confidence Intervals: Confidence intervals are used to assess how reliable a statistical estimate is.
2. Gain and Lift Charts: Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model.
3. ROC Curve: Unlike the lift chart, the ROC curve is almost independent of the response rate.
4. Predictive Power: This metric was developed internally at Data Science Central by their data scientists.
From this discussion we can understand the importance of model evaluation techniques.
There are mainly four techniques of model evaluation. They are:
1. Graphical Analysis
2. Quantitative Analysis
3. Sensitivity Analysis
4. Uncertainty Analysis
Let us describe these techniques of model evaluation below:
Graphical Analysis:
Quantitative Analysis:
Sensitivity Analysis:
Model components describing the conditions of the run (the input variables and fixed parameters) are adjusted so that the changes in the response of the components can be assessed.
The output from the analysis is a plot of changes in the simulated values against changes
in the model components.
The validity of the model response can be assessed in a qualitative way from expert
judgment and in a quantitative way against experimental data.
Figure: Sensitivity Analysis
Uncertainty Analysis:
It determines how much uncertainty is introduced into the model output by each component of the model.
A series of different starting values is defined, and the simulation is run for each value and for different combinations.
The values of the components of the model may be defined as input variables or fixed parameters.
Sensitivity Analysis:
Sensitivity analysis is the study of how the uncertainty in the output of a mathematical model or system can be apportioned to different sources of uncertainty in its inputs.
It determines how highly correlated the model result is to the value of a given input component.
Importance:
1. Decision Making
2. Communication
3. Model Development.
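A minimal sketch of a one-at-a-time sensitivity analysis in R; the toy model and the ±10% perturbations are my own assumptions:

    # Perturb each input of a toy model y = a * exp(b * t) and record the response.
    model <- function(a, b, t = 10) a * exp(b * t)
    base <- list(a = 2, b = 0.1)                # baseline input values
    for (p in names(base)) {
      for (delta in c(0.9, 1.1)) {              # -10% and +10% perturbations
        args <- base
        args[[p]] <- base[[p]] * delta
        cat(p, "scaled by", delta, "->", do.call(model, args), "\n")
      }
    }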
Uncertainty analysis:
Uncertainty analysis investigates the uncertainty of variables that are used in decision making.
Uncertainty analysis identifies the model components for which variability in the inputs exerts the greatest influence on variability in the model output.
Importance:
It is important because:
1. If the variability of the input is known or can be estimated, then an uncertainty analysis
can be conducted.
2. The input parameter is varied within the range of its statistical distribution and the
variability in the model output is measured.
3. More complex forms of uncertainty analysis involve using partial differentiation of the
model in its aggregated form, often called differential analysis.
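A minimal Monte Carlo sketch in R: the input distributions are assumed, and the variability they induce in the output of the same toy model is measured:

    # Propagate input uncertainty through y = a * exp(b * t) with t = 10.
    set.seed(11)
    a <- rnorm(1000, mean = 2,   sd = 0.2)    # assumed distribution of input a
    b <- rnorm(1000, mean = 0.1, sd = 0.01)   # assumed distribution of input b
    y <- a * exp(b * 10)
    quantile(y, c(0.025, 0.5, 0.975))         # spread of the model output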
(i) A quantitative analysis of a model simulation tells us how well the simulated values match measured data. Two types of analysis are most frequently used:
1. An analysis of coincidence
2. An analysis of association
An analysis of coincidence tells us how different the simulated and the measured values are. An analysis of association tells us how well trends in the measured values are simulated.
Fig: An illustration of coincidence and association between measured and simulated values (each panel plots simulated values against measured values).
(a) High coincidence and association (good model)
(b) High coincidence but low association (bad model)
(c) High association but low coincidence (bad model)
(ii) The most crucial component of a quantitative analysis is the measured data against which
simulations are compared. The nature of the available data dictates which statistics are most
appropriate and determines how much information can be obtained about the model.
The accuracy of the measurements limits the accuracy of the model evaluation. A model can only be evaluated properly against independent data, that is, data that were not used to develop the model.
(iii) A quantitative analysis should compare simulated results to independent measured data: A
quantitative analysis of model performance should use independent measurements for the full
range of conditions for which the model is to be used.
A quantitative analysis requires a collection of statistics used by modelers to evaluate their models, and a scheme for deciding which statistics to use. If the association is good, the errors are small, and the model shows no significant bias, it is almost ready to use.
(iv) Coincidence can be expressed as:
The total difference (root mean squared error)
The bias (relative error or mean difference)
The error excluding error due to variations in the measurements (lack of fit)
(v) The significance of the coincidence can be determined by:
A comparison to the 95% confidence interval.
A direct comparison to the P value obtained in the t-test or the F-test.
(vi) The association can be expressed as the simple correlation coefficient: the simple correlation coefficient can have any value from -1 to +1.
A correlation coefficient of +1 indicates a perfect positive correlation; that is, the simulated values are so strongly associated with the measured values that the model is performing well.
A simple correlation coefficient of -1 indicates a perfect negative correlation.
A simple correlation coefficient of 0 indicates no correlation between the simulated and measured values.
(vii) The significance of the association can be determined using a t-test.
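A minimal sketch in R of these coincidence and association statistics, using made-up measured and simulated values:

    # Total difference (RMSE), bias (mean difference), and association (r).
    measured  <- c(10, 12, 15, 18, 22, 25)
    simulated <- c(11, 12, 14, 19, 21, 26)
    sqrt(mean((simulated - measured)^2))   # root mean squared error (coincidence)
    mean(simulated - measured)             # bias / mean difference
    cor(simulated, measured)               # simple correlation coefficient
    cor.test(simulated, measured)          # t-test for the significance of r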
Importance index:
A quantitative expression of the importance of one input value in determining the uncertainty of the model output:
Importance index = Σi (Ii − Ī)² / Σi (Pi − P̄)²
where both sums run over i = 1, …, n.
Relative deviation:
Relative deviation = √[ Σi (Pi − P̄)² / (n − 1) ] / P̄
where the sum runs over i = 1, …, n, and n is the total number of values in the sample.
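A minimal sketch of the relative deviation in R, using made-up sample values; note that it is simply the sample standard deviation divided by the mean:

    # Relative deviation of a sample P.
    P <- c(4.1, 3.8, 4.5, 4.0, 4.2)   # illustrative values only
    sqrt(sum((P - mean(P))^2) / (length(P) - 1)) / mean(P)
    sd(P) / mean(P)                   # equivalent built-in form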