
CHAPTER 2

LITERATURE REVIEW

Important definitions:

(i) Logistic regression is a statistical model used to describe a binary response variable as a
function of explanatory variables through the logistic function.
(ii) An odds ratio is the ratio of the odds of an outcome occurring in one group to the odds of it
occurring in another group.
(iii) Sexually transmitted infections are infections that are transmitted from one person to
another through sexual contact (vaginal, oral or anal).
(iv) Binary data are data that can be categorised into exactly two groups, conventionally coded as zero or
one.

Statistical Review

Regression Analysis

Regression analysis is a technique used to explore the relationship between two or more
variables. It is concerned with the construction of models in which one variable, called the dependent or
response variable (denoted by Y), is described in terms of one or more independent variables, also called
regressor or explanatory variables. The dependent variable may depend on a single independent variable or on several.

General Linear Models

It is a statistical model that can be written as $Y = X\beta + \varepsilon$, where Y is the vector of observed
responses, X is the design matrix of predictor variables, $\beta$ is the vector of unknown parameters and $\varepsilon$ is the
vector of error terms.

Assumptions of General Linear Models

(i) The design matrix X is non-random.

(ii) Y and $\varepsilon$ are random vectors with independent elements.

(iii) The error terms have constant variance.

(iv) The error terms are normally distributed.

(v) The error terms have zero mean at all values of X.

Linear Models

The linear regression model for n observations $y_1, y_2, \dots, y_n$, where the response variable is continuous, is
written as

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, \dots, n.$

In this model the response for the ith observation depends linearly on the values
of the k independent variables $x_{1i}, \dots, x_{ki}$ through the parameters $\beta_0, \dots, \beta_k$. The error terms
$\varepsilon_i$ are assumed to have mean zero and constant variance, since they represent residual variation, and the
explanatory variables are treated as fixed constants. The expectation of the response is

$E[Y_i] = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}$ and its variance is $\operatorname{Var}[Y_i] = \sigma^2$, so $Y_i$ follows a normal distribution with mean $E[Y_i]$
and variance $\sigma^2$. Linear models can also include qualitative variables, known as factors, which take a limited
set of values called the levels of the factors. The general linear model is not appropriate when the
response variable is binary, so a generalized linear model must be used instead.
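As a brief illustrative sketch (using synthetic data and numpy; the coefficient values are assumptions for illustration and not from this study), such a linear model can be fitted by ordinary least squares:

# Illustrative sketch (synthetic data, not from the original text): fitting the
# linear model y_i = b0 + b1*x_1i + e_i by ordinary least squares with numpy.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 + rng.normal(scale=1.0, size=n)   # assumed "true" model

X = np.column_stack([np.ones(n), x1])                # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares estimates
print(beta_hat)                                      # approximately [2.0, 0.5]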

Generalized Linear Models

Generalized linear models use a link function. The link function in a generalized linear model relates the
mean response to the linear predictor in the model. If p is the proportion of subjects responding
to a stimulus of intensity X, then $g(p) = X\beta$ is a generalized linear model for p in terms of X, where g is the link
function that relates the mean of the stochastic element of Y to the linear predictor.

Regression with Binary Response

Let $Y_i$ be a binary variable with two possible outcomes, represented by 0 and
1. The aim is to relate this categorical response to a set of regressor variables through

$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i$

for $i = 1, 2, \dots, n$. Since $Y_i$ ranges between 0 and 1, its expectation can be taken as $p_i$, the
probability of a positive response at the given values of the explanatory variables. That is, $p_i = P(Y_i = 1)$, which means that $1 - p_i = P(Y_i = 0)$ for $i = 1, 2, \dots, n$, so for n distinct data points there are n probabilities, each of which is the parameter of a
Bernoulli distribution.

Bernoulli distribution

The Bernoulli distribution is the discrete probability distribution of a random variable that takes the value 1 for success and 0 for failure. It is a special case of the binomial distribution. Suppose Y is a random variable that
assumes one of two possible outcomes, success or failure, with $P(Y = \text{success}) = p$ and $P(Y = \text{failure}) = 1 - p$. Then Y is said to follow a Bernoulli distribution with parameter p, and its probability mass function is

$f(y) = p^{y}(1-p)^{1-y}, \qquad y \in \{0, 1\}.$
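As a short worked illustration (not part of the original text), the mean and variance of a Bernoulli variable follow directly from this probability mass function:

$E[Y] = \sum_{y \in \{0,1\}} y\, p^{y}(1-p)^{1-y} = 0 \cdot (1-p) + 1 \cdot p = p,$

$\operatorname{Var}[Y] = E[Y^{2}] - (E[Y])^{2} = p - p^{2} = p(1-p).$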

Logistic transformation

The logistic transformation is widely used in the medical and health fields because it allows us to model odds. Instead of modelling p directly, we
model $\log\left(\frac{p}{1-p}\right)$, which is written as logit(p):

$\log\left(\frac{p}{1-p}\right) = \operatorname{logit}(p) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}.$

It uses the odds of success, that is $P(\text{success})/P(\text{failure})$, and is appropriate for analysing a response that is confined to a
finite interval such as (0, 1).
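A minimal sketch of this transformation (illustrative values only, not from the original text), converting between a probability, its odds and its log-odds and back again:

# Converting between a probability p, the odds p/(1-p), and the logit (log-odds),
# and recovering p with the inverse-logit (expit) function.
from scipy.special import logit, expit

p = 0.25                      # hypothetical probability of success
odds = p / (1 - p)            # odds of success = 0.333...
log_odds = logit(p)           # logit(p) = log(p / (1 - p)) = -1.0986...

p_back = expit(log_odds)      # inverse transformation recovers p = 0.25
print(odds, log_odds, p_back)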
Logistic Regression Model

Logistic regression was developed in the 1940s as an alternative to Fisher's 1936 classification method, linear discriminant
analysis. When the dependent variable is dichotomous (binary), logistic regression is the appropriate
regression strategy to use. Like other regression methods, logistic regression is a predictive analysis.
It is a data analysis technique used to describe and explain the relationship
between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level
independent variables, and it is used mainly in the social and medical sciences. Logistic regression can be
multinomial, where the response has three or more categories, or binomial, where there
are two outcomes. In binary logistic regression the response is coded as 0 or 1, where 0 denotes failure and 1 denotes success.

Logistic regression is normally used to predict the probability of success from the values of the
independent variables, which can be either continuous or categorical. Because it is based on the Bernoulli distribution, the logistic regression model
predicts binary rather than continuous outcomes of the response variable.

The natural logarithm of the odds of the dependent variable provides a continuous criterion that serves as a
transformed version of the dependent variable. The logit transformation is therefore the link function in
logistic regression: although the dependent variable is binomial, the logit is the continuous scale on which the regression is
carried out.

The logit of success can then be related to the predictor variables in the same way as in linear regression analysis, and predicted logit values can be converted back to predicted odds using the exponential
function. In this way logistic regression can also be used to estimate the odds as a continuous quantity. There are three different types of logistic
regression: binary, multinomial and ordinal logistic regression.

The Purpose of Logistic Regression

Logistic regression is used when the data are categorical. When the response is dichotomous or
categorical, logistic regression is used instead of linear regression. Most software constructs the necessary dummy variables automatically, which makes logistic regression straightforward to apply. The basic objective of logistic
regression is to predict categorical outcomes, and its results are expressed in
the form of odds ratios. A second purpose is to show how strongly the explanatory variables are associated with the response.

Assumptions of logistic regression models

Logistic regression differs from linear regression in that it does not make some of the fundamental
assumptions that linear and general linear models (as well as other models based on the ordinary least squares
algorithm) rely on. In particular:

i) It does not require a linear relationship between the regressors and the response variable.

ii) The error terms do not need to be normally distributed.

iii) The variance of the error terms does not need to be constant; homoscedasticity is not required.

iv) The response variable is not measured on an interval or ratio scale.
However, logistic regression shares some assumptions with linear regression:

i) Independence of observations - Logistic regression requires that the observations be independent of
one another. In other words, the observations should not come from repeated measurements or
matched data.

ii) Absence of multicollinearity - Logistic regression requires little or no multicollinearity among
the independent variables. This implies that the independent variables should not be too highly correlated
with one another.

iii) Linearity of regressor variables and log odds - Logistic regression assumes that the
independent variables are linearly related to the log odds of the outcome. Although the analysis does not
require a linear relationship between the dependent and independent variables, it does require that each
independent variable be linearly related to the log odds.

iv) Large sample size - Finally, logistic regression generally requires a large sample size.
A common rule of thumb is that the model should have at least 10 events of the least frequent outcome for each
independent variable. For example, if you have five independent variables and the
expected probability of your least frequent outcome is 0.10, you will require a sample size of 500 (10 × 5
/ 0.10); a small sketch of this calculation is given below.
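The following hypothetical helper (an illustration of the rule of thumb above, not part of the original text) makes the calculation explicit:

# The "10 events per variable" rule of thumb for logistic regression sample size.
def minimum_sample_size(n_predictors: int, p_rarest_outcome: float,
                        events_per_variable: int = 10) -> float:
    """Rule-of-thumb sample size: events_per_variable events of the
    least frequent outcome are needed for each predictor."""
    required_events = events_per_variable * n_predictors
    return required_events / p_rarest_outcome

# Example from the text: 5 predictors, rarest outcome probability 0.10
print(minimum_sample_size(5, 0.10))   # -> 500.0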

The logistic regression model is given by

$p_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})} \qquad (2.0.6)$

$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} \qquad (2.0.7)$

The linear logistic model is a member of the class of generalized linear models, where g(p) is the link function
and $\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)$.
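A minimal sketch of fitting such a model by maximum likelihood (synthetic data and assumed coefficient values, not the data analysed in this study), using the statsmodels library:

# Fitting a binary logistic regression model by maximum likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                       # one continuous predictor
true_logit = -0.5 + 1.2 * x                  # assumed "true" coefficients
p = 1.0 / (1.0 + np.exp(-true_logit))        # inverse logit gives probabilities
y = rng.binomial(1, p)                       # binary (0/1) response

X = sm.add_constant(x)                       # adds the intercept column
model = sm.Logit(y, X).fit()                 # maximum likelihood estimation
print(model.summary())                       # coefficients, Wald tests, etc.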

Limitations of Logistic regression model

Logistic regression does not require multivariate normal distributions, but it does require random
independent sampling and linearity between the regressors and the logit. The model is more likely to be accurate
near the centre of the distributions and less accurate at the extremes. Although P(Y = 1) can be
estimated for every combination of values, not all combinations may occur in the population.

Models can become misleading when important variables are omitted. Using SPSS's
hierarchical options, it is simple to examine the impact of additional variables. Adding extraneous
variables, on the other hand, may dilute the effects of more important ones. Although multicollinearity
may not result in biased estimates, it inflates the standard errors of the coefficients, and the unique contribution
of overlapping variables may become very small and difficult to detect statistically.

More data is always better. When the sample size is small, models can become unstable. Watch out for
outliers, which can distort the estimated relationships. Some combinations of values may be only
weakly represented among correlated variables, especially in small samples, and estimates based on
cells with small expected counts are unstable and lack power. It may be possible to collapse small categories
meaningfully. Plot the data to check whether the model is appropriate, and consider whether interaction terms are needed. Take
care not to misinterpret odds ratios as risk ratios.
Binary Logistic Regression

Statistical modelling

The main goal of modelling is to provide a mathematical representation of the relationship between an
observed variable and a set of predictor variables. The resulting mathematical models are used for a variety of
purposes, such as:

(i) Forecasting the dependent variable based on the predictor variables' values

(ii) Comprehending how the predictor variables affect or connect to the dependent variable

Odds Ratios

Definition: Odds

It is the ratio of the probability of an event occurring to the probability of it not occurring.

Definition: Retrospective Study

It is an etiological research in which subjects with (cases) and without (controls) a certain disease are
compared.

Definition: Confounding Variable

According to Collett (1991), it is a variable that completely or partially accounts for the apparent
association between a disease and an exposure factor.

The odds ratio describes how strongly the presence or absence of one attribute A is associated with the presence or absence of another attribute B in a
particular population. It is also used to compare the relative likelihood of the outcome of
interest occurring given exposure to the variable of interest. Odds are usually written as a:b in favour of
the first result, which means that the first result occurs a times for every b occurrences of
the second result; the odds equal the probability divided by one minus the probability.

The odds ratio is one of the three primary measures used in epidemiology to assess the relationship between an
outcome and an exposure; the relative risk (RR) and the absolute risk reduction (ARR) are the other two
basic measures of association. The relative risk of a disease measures how likely an
individual who is exposed to a specific factor is to develop the disease compared with someone who is not
exposed. Relative risk is similar to the odds ratio, except that it uses probabilities instead of odds, and it is
mainly used in clinical studies.

Because the present work is a retrospective study, it calls for the use of odds ratios. For each unit
increase in a predictor, the odds ratio measures the change in the odds of membership in the
target group, and it is obtained by exponentiating the regression coefficient. In the Statistical Package for the Social Sciences (SPSS) these quantities are calculated and
reported as Exp(B), which indicates how a one-unit change in the
corresponding predictor affects the odds. If Exp(B) is larger than one, the odds of a positive
result increase; if it is less than one, each increase in the predictor reduces the odds of
the result occurring; and if the odds ratio equals one, there is no association between the variable and the outcome.
Model Estimation

Fitting a model to a set of data requires first estimating the model's unknown parameters. The two most
common general methods of estimation used in fitting linear regression models are the method of least
squares and maximum likelihood. Unknown parameters can be estimated in two ways, point
estimation and interval estimation. The methods used for point estimation are:

(i) Maximum Likelihood estimation

(ii) Methods of moments

(iii) Method of least squares

(iv) Judgemental methods

Maximum likelihood function

The maximum likelihood method gives estimates of the model parameters: it determines the set of parameter
values that maximizes the likelihood function for a fixed set of data and an assumed underlying model.
Assume a sample $x_1, \dots, x_n$ of independent, identically distributed observations from a distribution with an
unknown probability density function $f_0(\cdot)$. It is assumed that $f_0$ belongs to
a particular family of distributions $f(\cdot \mid \theta)$, where $\theta$ is a vector of parameters indexing this family (the
parametric model), so that $f_0 = f(\cdot \mid \theta_0)$. The value $\theta_0$ is unknown and is referred to as the
true value of the parameter vector.
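For completeness (a standard formulation assumed here, not reproduced from the original), the likelihood and log-likelihood of such a sample are

$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta), \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta),$

and the maximum likelihood estimate is $\hat{\theta} = \arg\max_{\theta} \ell(\theta)$.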

Method of least squares

Assume $y_1, \dots, y_n$ are n independent observations with $E(Y_i) = \sum_{j} \beta_j x_{ij}$ and $\operatorname{Var}(Y_i) = \sigma^2$ for
$i = 1, 2, \dots, n$. The least squares estimates of the unknown parameters in the model are the values of $\beta_0, \dots, \beta_p$
that minimize the sum of squared deviations of the observations from their predicted values, given
by

$S(\beta) = \sum_{i=1}^{n} \left( y_i - \sum_{j} \beta_j x_{ij} \right)^2. \qquad (2.3.9)$

Differentiating S with respect to each of the unknown parameters, equating the derivatives to zero and
solving the resulting set of linear equations yields the least squares estimates.

Fitting the logistic regression model

Once we have a model (the logistic regression model), we must fit it to a collection of data in order to estimate
the parameters $\beta_0$ and $\beta_1$. In linear regression, the straight line fitting the data is obtained by
minimizing the distance between each point on a plot and the regression line; more precisely, we minimize
the sum of the squares of the distances between the points and the regression line (squared in order to avoid
negative differences). This is known as the method of least squares: we find the values of $b_0$ and $b_1$ that give the
smallest sum of squares.

The procedure is more involved in logistic regression, where the method of maximum likelihood is used.
Maximum likelihood provides the values of $\beta_0$ and $\beta_1$ that maximize the probability of obtaining the observed data set.
Iterative computation is required, which is easily carried out with most statistical software.

Given the unknown parameters ($b_0$ and $b_1$), we use the likelihood function to evaluate the probability of
obtaining the observed data. A "likelihood" is a probability, more specifically the probability that the observed values of the
dependent variable can be predicted from the observed values of the independent variables. The
likelihood, like any other probability, ranges from 0 to 1.

In practice, it is easier to work with the logarithm of the likelihood function. The log-likelihood function will be
used for inference when comparing different models. The log-likelihood ranges between 0 and minus
infinity (it is negative because the natural logarithm of any number less than 1 is negative). The log-likelihood is
calculated as follows:

$\log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] \qquad (2.0.14)$

The likelihood depends on the unknown success probabilities $p_i$, which in turn depend on the $\beta$s through the equation

$p_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})}.$

The log-likelihood function is therefore given by

$\log L(\beta) = \sum_{i=1}^{n} \left\{ y_i (\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}) - \log\left[1 + \exp(\beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi})\right] \right\} \qquad (2.3.13)$

The derivatives with respect to the (r + 1) unknown parameters are given by

$\frac{\partial \log L(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - p_i)\, x_{ji}, \qquad j = 0, 1, \dots, r \qquad (2.3.14)$

Evaluating the derivatives at $\hat{\beta}$ and equating them to zero yields a set of r + 1 nonlinear equations in
the unknown parameters $\hat{\beta}_j$ that can be solved numerically. There are two methods for solving
these nonlinear equations:

(i) Newton Raphson

(ii) Fisher’s Method of scoring

Newton-Raphson method

The Newton-Raphson method is a general algorithm for fitting generalized linear models. The r + 1 derivatives of the log-likelihood function with respect
to $\beta_0, \dots, \beta_r$ are also known as the efficient scores and are collected in the vector

$U(\beta) = \frac{\partial \log L(\beta)}{\partial \beta} \qquad (2.3.15)$

Let $H(\beta)$ be the matrix of second partial derivatives of $\log L(\beta)$, with (j, k)th element equal to
$\partial^2 \log L(\beta) / \partial\beta_j \partial\beta_k$, where $j = 0, 1, \dots, r$ and $k = 0, 1, \dots, r$. $H(\beta)$ is known as the Hessian matrix. Considering
$U(\beta)$ evaluated at the maximum likelihood estimate $\hat{\beta}$, a Taylor series expansion of $U(\hat{\beta})$
about a value $\beta^{*}$ close to $\hat{\beta}$ gives the iterative scheme

$\hat{\beta} \approx \beta^{*} - H(\beta^{*})^{-1} U(\beta^{*}) \qquad (2.3.16)$
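A minimal sketch of these iterations for the logistic regression log-likelihood (illustrative code, not the thesis author's implementation), using the score vector and Hessian described above:

# Newton-Raphson iterations for logistic regression.
import numpy as np

def newton_raphson_logistic(X, y, tol=1e-8, max_iter=25):
    """X: n x (r+1) design matrix including a column of ones; y: 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted probabilities
        score = X.T @ (y - p)                     # U(beta)
        W = p * (1.0 - p)                         # Bernoulli variances
        hessian = -(X * W[:, None]).T @ X         # H(beta)
        step = np.linalg.solve(hessian, score)    # H(beta)^{-1} U(beta)
        beta = beta - step                        # update: beta - H^{-1} U
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Example with the synthetic data generated earlier:
# beta_hat = newton_raphson_logistic(X, y)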

Fisher Scoring method

The Fisher scoring method uses the information matrix rather than the Hessian matrix. The (j, k)th element of the information matrix is

$I_{jk}(\beta) = E\left[ -\frac{\partial^{2} \log L(\beta)}{\partial \beta_j \partial \beta_k} \right], \qquad j, k = 0, 1, \dots, r,$

and this matrix is denoted by $I(\beta)$. Replacing the Hessian matrix by $-I(\beta)$ in the iterative scheme described above gives

$\hat{\beta} \approx \beta^{*} + I(\beta^{*})^{-1} U(\beta^{*}) \qquad (2.3.17)$

In general, different iterative schemes give different results, but in the case of the logit link they converge to the same maximum likelihood estimate of $\beta$.

Model selection

The task of selecting a statistical model from a set of candidate models is known as model selection. When
several models are available, it is necessary to assess their performance using
statistical criteria. The Akaike Information Criterion (AIC) and the Bayesian Information Criterion
(BIC) are the most widely used measures.

Akaike information criterion

Hirotugu Akaike introduced the AIC in 1971 as a measure of the relative quality of a statistical model
for a given set of data. It provides a method for selecting among candidate models, but it
does not provide a test of a model in the sense of testing a null hypothesis. The preferred model is the one that
most closely approximates reality, hence the one with the lowest AIC value. The AIC is calculated as AIC = -2 ln L + 2p,
where ln L is the fitted model's log-likelihood and p is the number of parameters contained in the
model.

Bayesian Information Criterion

The BIC, also known as the Schwarz criterion, is a model selection criterion for choosing among a given set of models. It is
based on the likelihood function and is closely related to the AIC; the lower the BIC, the better the model. It is given by

BIC = -2 ln L + p log(n),

where ln L is the log-likelihood evaluated under the fitted model, p is the number of parameters and n is the number of
observations.
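A small illustrative calculation (reusing the hypothetical statsmodels fit from the earlier sketch) shows how both criteria follow from the maximized log-likelihood:

# Computing AIC and BIC directly from the maximized log-likelihood.
import numpy as np

log_lik = model.llf                 # maximized log-likelihood of the fitted model
p = len(model.params)               # number of estimated parameters
n = int(model.nobs)                 # number of observations

aic = -2 * log_lik + 2 * p          # AIC = -2 ln L + 2p
bic = -2 * log_lik + p * np.log(n)  # BIC = -2 ln L + p log(n)
print(aic, bic)                     # statsmodels also exposes model.aic and model.bic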

The interpretation of the fitted logistic model

The interpretation of the fitted logistic model is based mainly on the estimated parameters, which are the
coefficients of the explanatory variables. A positive estimated parameter indicates that the corresponding variable has a
positive influence on the response variable (it increases the log odds, and hence the probability, of the outcome); a negative parameter indicates a negative
influence on the response variable.

The Wald Test


The Wald test is also known as the Wald chi-squared test (Agresti, 1990). It is a method for
determining whether the explanatory variables in a model are significant; variables that add
nothing to the model can be removed without changing the model in any meaningful way. The hypotheses are

$H_0 : \beta_1 = 0 \quad \text{versus} \quad H_1 : \beta_1 \neq 0.$

The Wald statistic is given by

$W = \left( \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)} \right)^{2},$

which follows a $\chi^2$ distribution with 1 degree of freedom. We reject $H_0$ at significance level $\alpha$ if W is greater than or equal
to $\chi^2_{1}(\alpha)$.

If the test indicates that the parameter for a particular explanatory variable is zero, that variable can be
removed from the model; if the parameter is not zero, the variable should be retained in
the model.
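A minimal sketch of this test (hypothetical coefficient and standard error values, chosen only for illustration):

# Wald chi-squared statistic for a single logistic regression coefficient.
from scipy.stats import chi2

beta_hat = 1.18      # assumed estimated coefficient
se_beta = 0.31       # assumed standard error of the coefficient

W = (beta_hat / se_beta) ** 2          # Wald statistic, chi-squared with 1 df
p_value = chi2.sf(W, df=1)             # upper-tail probability
print(W, p_value)                      # reject H0: beta = 0 if p_value < alpha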

Goodness of fit of a Linear Logistic Model

(a) Test for the correctness of the link function.

(i) The Hosmer and Lemeshow test

The Hosmer and Lemeshow statistic compares observed and predicted numbers of outcome
occurrences in subgroups of subjects with similar predicted probabilities. Goodness of fit is
indicated by observed and predicted counts that are close in all subgroups, with differences
near zero. The test statistic is computed by first ranking the fitted values and dividing them into c
groups of approximately equal size; the observed and expected numbers in each group are then computed and a chi-squared
goodness-of-fit test is performed.

The null hypothesis is that the fitted model (and hence the link function) is correct, and the test statistic is

$\hat{C} = \sum_{g=1}^{c} \frac{(O_g - n_g \bar{p}_g)^2}{n_g \bar{p}_g (1 - \bar{p}_g)}, \qquad (2.3.22)$

where $O_g$ is the number of observed events in group g, $n_g$ is the number of subjects in group g and $\bar{p}_g$ is the mean predicted probability in group g.

Reject $H_0$ if $\chi^2_{\text{cal}} > \chi^2_{\text{crit}}$ at the chosen level of significance, indicating that the link function is not valid.
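An illustrative sketch of this grouping and statistic (not the thesis author's code; the number of groups and degrees of freedom follow the usual convention):

# A simple Hosmer-Lemeshow statistic from observed 0/1 outcomes and fitted probabilities.
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """y: 0/1 outcomes; p_hat: fitted probabilities; groups: number of bins c."""
    order = np.argsort(p_hat)
    y, p_hat = y[order], p_hat[order]
    bins = np.array_split(np.arange(len(y)), groups)   # c groups of ~equal size
    stat = 0.0
    for idx in bins:
        n_g = len(idx)
        o_g = y[idx].sum()                 # observed events in the group
        pbar_g = p_hat[idx].mean()         # mean predicted probability
        e_g = n_g * pbar_g                 # expected events
        stat += (o_g - e_g) ** 2 / (e_g * (1 - pbar_g))
    p_value = chi2.sf(stat, df=groups - 2)  # conventional c - 2 degrees of freedom
    return stat, p_value

# Example with the earlier hypothetical fit: stat, p = hosmer_lemeshow(y, model.predict(X))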

(ii) Hinkley Test

Hinkley (1985) proposed a simple technique to determine whether the link function is plausible. First fit the model
with linear predictor $\hat{\eta}_i = \operatorname{logit}(\hat{p}_i)$, then add $\hat{\eta}_i^{2}$ as a new predictor variable and test its significance using the Wald
chi-squared test. If the squared linear predictor is not significant, the logit link
function is considered appropriate.

(b) Overall model adequacy

Following the fitting of the logistic model, the Pearson chi-squared statistic, the Nagelkerke R² statistic and the
Cox and Snell R² statistic can be used to measure goodness of fit.

(i) Pearson chi-squared statistic

In generalized linear models, the Pearson chi-squared statistic is the most commonly used measure of goodness of fit.
It is given by

$X^{2} = \sum_{i=1}^{n} \frac{(y_i - \hat{p}_i)^2}{\hat{p}_i (1 - \hat{p}_i)},$

and is compared with a chi-squared distribution with n - p degrees of freedom,
where n is the number of cases and p is the number of parameters. The statistic is
evaluated against the null hypothesis that the model is adequate; we reject the null hypothesis at
significance level $\alpha$ if $X^{2} > \chi^{2}_{n-p}(\alpha)$.

(ii) Cox and Snell R² statistic

$R^{2}_{CS} = 1 - \left\{ \frac{L(M_{\text{intercept}})}{L(M_{\text{full}})} \right\}^{2/n} \qquad (2.0.30)$

where $L(M_{\text{full}})$ is the likelihood of the logistic model containing the k explanatory variables, $L(M_{\text{intercept}})$ is the likelihood of the null model containing just
a constant, and n is the sample size. Since the Cox and Snell R² cannot reach 1, it is adjusted to give the
Nagelkerke R² statistic, which can reach the value of 1.

(iii) Nagelkerke R² statistic

It is given by

$R^{2}_{N} = \frac{R^{2}_{CS}}{1 - \left\{ L(M_{\text{intercept}}) \right\}^{2/n}} \qquad (2.0.32)$

The value of R² increases as the model fit improves. This R² is not interpreted in the same manner as
the R² in a linear regression model; nevertheless, the larger the R², the better the model.

(iv) McFadden R² statistic

$R^{2}_{McF} = 1 - \frac{\log \hat{L}(M_{\text{full}})}{\log \hat{L}(M_{\text{intercept}})}$

where $M_{\text{full}}$ is the model with predictors, $M_{\text{intercept}}$ is the model without predictors and $\hat{L}$ is the estimated likelihood.

The likelihood ranges from 0 to 1, so the log of a likelihood is less than or equal to 0. A small ratio
of log-likelihoods indicates that the full model is a far better fit than the intercept-only model, and a model with a greater
likelihood therefore gives a higher McFadden R².
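A small illustrative sketch (reusing the hypothetical statsmodels fit from the earlier example) computes these pseudo-R² measures directly from the two log-likelihoods:

# McFadden, Cox and Snell, and Nagelkerke pseudo-R-squared values from the
# log-likelihoods of the fitted and intercept-only models.
import numpy as np

ll_full = model.llf        # log-likelihood of the fitted model
ll_null = model.llnull     # log-likelihood of the intercept-only model
n = int(model.nobs)

mcfadden = 1 - ll_full / ll_null
cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_full))
nagelkerke = cox_snell / (1 - np.exp((2 / n) * ll_null))
print(mcfadden, cox_snell, nagelkerke)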

Related Studies

Sexually transmitted infections (STIs), also known as sexually transmitted diseases (STDs), are illnesses
spread primarily through person-to-person sexual contact. Some STIs, however, can be transmitted
through non-sexual contact with infected blood and tissue, through breastfeeding, or during childbirth. The most
prevalent STIs include gonorrhoea, syphilis, genital warts and herpes, chlamydia, and HIV. STIs are more
dangerous to women since, if left untreated, they can lead to infertility, low birth weight in infants, and even
cancer. Many STIs do not cause symptoms, and the most effective preventive measure is abstinence
from sexual activity.

Several studies on the prevalence of STIs and their various causes have been conducted in Africa.
According to the Centre for Disease Control and Prevention (2013), teenage girls had the highest STI
incidence. This was confirmed by the study of Wilkinson (2010), together with Warner and Abdool Karim, on undiagnosed STIs in South African women.
For their approach and analysis, the researchers used survey methodologies,
descriptive statistics and prediction models. The study found that 13943 women aged 25 to 19 in the
Hlabisa region were infected with at least one STI on any given day. Although 52 percent of these
women experienced symptoms, only 65 percent of them sought treatment.

O.Ohene and O.Aloto (2008) conducted another study in Ghana, this time looking at variables associated
with sexually transmitted infections in Ghanaian women. The researchers wanted to find out which
characteristics were linked to a history of sexually transmitted infection in women aged 15 to 24. They
made use of data from the Ghana Demographic and Health Survey conducted in 2003, and used chi-square
and t-tests to compare individuals with a history or symptoms of STIs to those who denied such a
history on demographic, individual and partner-level variables.

In another study, at Harvard University, Oster (2013) presented a comprehensive model of HIV epidemic growth calibrated using
data on sexual behaviour, transmission rates and other epidemic characteristics. The resulting models demonstrated the importance of transmission rates and sexual
behaviour in epidemic growth. Using simulation models to forecast HIV
prevalence in the United States and Sub-Saharan Africa, the study observed that the HIV transmission rate is substantially
higher for those who have another untreated STI, and that sexual behaviour accounted for the higher
rates in Sub-Saharan Africa.

In Nigeria, Ayo Stephen Adebowade (2013) of the University of Ibadan studied the social risk
factors for sexually transmitted infections among young females aged 15-24 who had had sexual
intercourse. Logistic regression and chi-square tests were used to analyse the data. The results showed that females
aged 20-24 were the most likely to have contracted an STI in the last 12 months, and that socio-demographic factors such as age, wealth index
and marital status were associated with infection. The study concluded that wealth index and HIV and AIDS awareness are
important predictors of acquiring an STI, and that ways of reducing the risk of STI transmission include the provision of
condoms, teaching the importance of abstinence and improving knowledge of HIV and AIDS.
