
AD3301 DEV UNIT 5

MAILAM ENGINEERING COLLEGE


Mailam – 604 304
(Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai
& TATA Consultancy Services Accredited Institution)

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


II YEAR / III SEM
AD3301 DATA EXPLORATION AND VISUALIZATION
SYLLABUS
UNIT V
MULTIVARIATE AND TIME SERIES ANALYSIS

SYLLABUS: Introducing a Third Variable – Causal Explanations – Three-Variable Contingency Tables and Beyond – Longitudinal Data – Fundamentals of TSA – Characteristics of time series data – Data Cleaning – Time-based indexing – Visualizing – Grouping – Resampling.
PART A
1. Define Multivariate Analysis.
Multivariate analysis is a set of statistical models that examine patterns in multidimensional data by considering several data variables at once.
2. What is Simpson’s paradox?
Simpson's paradox is a statistical phenomenon that occurs when subgroups are combined into one group. The process of aggregating data can cause the apparent direction and strength of the relationship between two variables to change.
3. Define regression analysis.
Regression analysis is a method for predicting the values of a
continuously distributed dependent variable from an independent, or
explanatory, variable.
4. What is the principle of logistic regression model?
The principles behind logistic regression are very similar to those of ordinary (linear) regression, and the approach to building and interpreting the models is virtually identical; the difference is that logistic regression is used when the dependent variable can take only two values.
5. Distinguish longitudinal data from time series data.
It is important to distinguish longitudinal data from the time series
data. Although time series data can provide a picture of aggregate change, it
is only longitudinal data that can provide evidence of change at the level of
the individual.
6. Define longitudinal data.
Longitudinal data is data that is collected sequentially from the same respondents over time. This type of data can be very important in tracking trends and changes over time by asking the same respondents questions in several waves carried out over time.


7. What is time series data?


Time series data is a collection of observations obtained through
repeated measurements over time. Plot the points on a graph, and one of
your axes would always be time.
8. Define time series analysis.
 Time series analysis is the collection of data at specific intervals over a
period to identify trends, seasonality, and residuals to aid in forecasting a
future event.
 Time series analysis involves inferring what has happened to a series of
data points in the past and attempting to predict future values.
9. What are the fundamentals of TSA?
 Generate a normalized dataset randomly.
 Generate the dataset using the numpy library.
 Plot the time series data using the seaborn library.
 By plotting the data as a time series plot, a graph that shows the change in values over time is obtained.
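A minimal sketch of these steps is given below, assuming a randomly generated series of 100 points (no particular dataset is specified in the notes):
Python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate a normally distributed dataset (assumed: 100 random values)
np.random.seed(0)
values = np.random.randn(100)

# Plot the values as a time series line plot using seaborn
sns.lineplot(x=range(100), y=values)
plt.xlabel('Time step')
plt.ylabel('Value')
plt.show()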
10. What is data cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset.
11. What are the steps in cleaning a dataset for outliers?
 Check the shape of the dataset.
 Check a few entries inside the dataframe.
 Review the data types of each column in the df_power dataframe.
12. What is Time-based Indexing?
Time-based indexing is a very powerful method of the pandas
library when it comes to time series data. Having time-based indexing
allows using a formatted string to select data.
13. What is grouping time series data?
Grouped time series data involves more general aggregation structures
than hierarchical time series. The structure of grouped time series does not
naturally disaggregate in a unique hierarchical manner.
14. Explain briefly about resampling time series data.
Resampling is used with time series data. It is a convenience method for frequency conversion and resampling of time series data. It works on the condition that the object has a datetime-like index, for example a DatetimeIndex, PeriodIndex, or TimedeltaIndex.


PART B
1. Explain third variable and describe Causal Explanations in detail with
suitable example.
Explain in detail about Introducing a third variable.
There are several ways of holding a third variable constant while assessing the relationship between two others.
 Third Variable
o X1 denotes a predictor variable, Y denotes an outcome variable, and X2 denotes a third variable that may be involved in the X1, Y relationship.
o For example, age (X1) is predictive of systolic blood pressure (SBP) (Y)
when body weight (X2) is statistically controlled.
o A third-variable effect (TVE) refers to the effect conveyed by a third-
variable to an observed relationship between an exposure and a
response variable of interest.
o Depending on the causal relationship from the exposure variable to the third variable and then to the response, the third variable (denoted as M) is often called a mediator (when there is a causal relationship) or a confounder (when no causal relationship is involved).
o In third-variable analysis, besides the pathway that directly connects the exposure variable with the outcome, we explore the exposure → third-variable → response, or X → M → Y, pathways.

 Causal explanation
o It explains how and why an effect occurs, and provides information regarding when and where the relationship can be replicated.
o Causality refers to the idea that one event, behavior, or belief will result in the occurrence of another subsequent event, behavior, or belief; it is about cause and effect.
o "X causes Y" means that if X changes, it will produce a change in Y.
o Independent variables - may cause direct changes in another
variable.
o Control variables - remain unchanged during the experiment.
o Causation - describes the cause-and-effect relationship.
o Correlation - Any relationship between two variables in the
experiment.


o Dependent variables - may change or are influenced by the independent variable.

Direct and indirect effects


o Causal relations are commonly modeled as a Directed Acyclic Graph
(DAG), where a node represents a data dimension and a link
represents the dependency between two connected dimensions.
o The arrows of the links indicate the direction of the cause-effect
relationship.
o A path is a sequence of arrows connecting two nodes regardless of the
direction.
o There are three types of paths:
o The chain graph: T → X → Y
o The Fork: T ← X → Y
o The Immorality: T → X ← Y
o Causal Diagrams - depict the causal relationships between variables.

Figure 5.2 – Causal Example


o The above figure illustrates:
the independence of A and B,
the direct dependence of C on A and B,
the direct dependence of E on A and C,
the direct dependence of F on C, and
the direct dependence of D on B.

Simpson's paradox
o In some cases the relationship between two variables is not simply
reduced when a third, prior, variable is taken into account but indeed
the direction of the relationship is completely reversed.
o This is often known as Simpson's paradox (named after Edward Simpson).
o Simpson's paradox can be succinctly summarized as follows: every
statistical relationship between two variables may be reversed by
including additional factors in the analysis.
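A small numerical sketch of the paradox is shown below; the counts are hypothetical and chosen purely to demonstrate the reversal:
Python
import pandas as pd

# Hypothetical success counts for two groups split into two subgroups.
# Within each subgroup, group A has the higher success rate,
# but after combining the subgroups, group B has the higher rate overall.
df = pd.DataFrame({
    'group':    ['A', 'A', 'B', 'B'],
    'subgroup': ['X', 'Y', 'X', 'Y'],
    'success':  [90, 20, 850, 1],
    'total':    [100, 100, 1000, 10],
})

# Success rate within each subgroup
df['rate'] = df['success'] / df['total']
print(df)

# Aggregated success rate per group (subgroups combined)
agg = df.groupby('group')[['success', 'total']].sum()
agg['rate'] = agg['success'] / agg['total']
print(agg)   # the direction of the A vs. B comparison is reversed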

2. Explain in detail about the fundamentals of TSA – Time Series Analysis.


 Time Series Analysis (TSA) involves the study of data collected over time
to understand patterns, trends, and behaviors.
 The key fundamentals of TSA:
1. Time Series Definition:
o A time series is a series of data points indexed in time order.
o It is a sequence of observations or measurements taken at
successive, evenly spaced intervals.
2. Components of Time Series:
o Trend: The long-term movement or general direction in the data. It
indicates whether the data is increasing, decreasing, or stable over
time.
o Seasonality: Patterns that repeat at fixed intervals, often related to
calendar time, like daily, weekly, or yearly cycles.
o Cyclic Patterns: Repeating patterns that are not strictly periodic but
occur at irregular intervals.
o Irregular/Random Fluctuations (Noise): Unpredictable variations
that are not part of the trend, seasonality, or cyclic patterns.
3. Importance of TSA:
o Prediction and Forecasting: TSA is used to predict future values
based on historical patterns.
o Anomaly Detection: Identify unusual or unexpected patterns that
deviate from the norm.
o Pattern Recognition: Understand and characterize underlying trends
and cycles in the data.
4. Common TSA Techniques:
o Moving Averages: Smooth out short-term fluctuations to highlight trends (a short code sketch follows this list).
o Exponential Smoothing: Assign different weights to different
observations, emphasizing recent data.


o ARIMA (AutoRegressive Integrated Moving Average): A popular model combining autoregression, differencing, and moving averages for forecasting.
5. Data Exploration and Visualization:
o Time Plots: Visualize the time series data to observe trends and
patterns.
o Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF) Plots: Examine correlation between current and
past observations.
6. Stationarity:
o Stationary Series: A time series is considered stationary when
statistical properties like mean and variance remain constant over
time. Many time series models assume stationarity.
7. Modeling and Forecast Evaluation:
o Model Building: Develop models based on identified components and
patterns.
o Model Evaluation: Assess model accuracy using metrics like Mean
Absolute Error (MAE), Mean Squared Error (MSE), or others.
8. Software and Tools:
o Statistical Packages: Use tools like R or Python with libraries such as
Pandas, NumPy, and Statsmodels.
o Visualization Tools: Employ plotting libraries like Matplotlib or
Seaborn for graphical representation.
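As a brief illustration of the moving-average and exponential-smoothing techniques listed above, the following sketch applies both to a small hypothetical monthly series (the data and window sizes are assumptions, not taken from the notes):
Python
import numpy as np
import pandas as pd

# Hypothetical monthly series: upward trend plus random noise
idx = pd.date_range('2020-01-01', periods=36, freq='MS')
ts = pd.Series(np.linspace(10, 20, 36) + np.random.randn(36), index=idx)

# Moving average: 3-month rolling mean smooths short-term fluctuations
rolling_mean = ts.rolling(window=3).mean()

# Exponential smoothing: recent observations receive larger weights
exp_smooth = ts.ewm(span=3).mean()

print(pd.DataFrame({'original': ts,
                    'rolling_mean': rolling_mean,
                    'exp_smooth': exp_smooth}).head())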

3. Explain in detail about the TSD – Time Series Data.


1. Characteristics of Time Series Data:
 Temporal Order: Time series data is ordered chronologically, with
observations recorded over successive time intervals.
 Trend: Long-term movement or directionality in the data.
 Seasonality: Regular patterns that repeat at fixed intervals.
 Noise: Random fluctuations that are not part of the trend or
seasonality.
2. Data Cleaning for Time Series:
 Missing Values: Address and impute missing data points, which is
common in time series datasets.
 Outliers: Identify and handle outliers that may distort the analysis.


 Consistent Frequency: Ensure a consistent time interval between observations.
3. Time-based Indexing:
 Datetime Index: Assign a datetime index to the time series data,
allowing for easy temporal slicing and manipulation.
 Pandas Library: Utilize libraries like Pandas in Python to work
efficiently with time-based indexing.
4. Visualizing Time Series Data:
 Line Plots: Display the trend and fluctuations over time.
 Seasonal Decomposition: Separate the time series into components
(trend, seasonality, and residual) for clearer analysis.
 Autocorrelation Function (ACF) and Partial Autocorrelation
Function (PACF) Plots: Examine autocorrelation between
observations at different time lags.
5. Grouping in Time Series Analysis:
 Aggregation: Group data based on time periods (e.g., daily, weekly)
and aggregate values for analysis.
 Rolling Windows: Analyze trends over moving time windows,
providing a smoothed view of the data.
6. Resampling in Time Series Analysis:
 Upsampling: Increase the frequency of data (e.g., from daily to hourly)
by interpolation.
 Downsampling: Decrease the frequency of data (e.g., from hourly to
daily) by aggregation.
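A short sketch of both resampling operations on a hypothetical hourly series (the series and frequencies are illustrative assumptions):
Python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering two days
idx = pd.date_range('2021-01-01', periods=48, freq='H')
hourly = pd.Series(np.random.rand(48), index=idx)

# Downsampling: hourly -> daily by aggregation (mean)
daily = hourly.resample('D').mean()

# Upsampling: daily -> hourly by interpolation
hourly_interp = daily.resample('H').interpolate()

print(daily)
print(hourly_interp.head())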

Time Series Data Visualization using Python


Importing the Libraries
 Numpy – A Python library used for numerical computation and for handling multidimensional ndarrays; it also has a very large collection of mathematical functions that operate on these arrays.
 Pandas – A Python library built on top of NumPy for effective matrix operations and dataframe manipulation; it is also used for data cleaning, data merging, data reshaping, and data aggregation.
 Matplotlib – Used for plotting 2D and 3D visualizations; it supports a variety of output formats for graphs.


Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading The Dataset


 To load the dataset into a dataframe, use the pandas read_csv() function.
 The head() function prints the first five rows of the dataset.
 The 'parse_dates' parameter of read_csv converts the 'Date' column to the DatetimeIndex format.
Python
# reading the dataset using read_csv
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")

# displaying the first five rows of dataset


df.head()
Output:

Unnamed: 0 Open High Low Close Volume Name


Date
2006-01-03 NaN 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 NaN 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 NaN 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 NaN 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 NaN 43.10 43.66 42.82 43.42 16268338 AABA

Dropping Unwanted Columns


 Drop columns from the dataset that are not important for our
visualization.
Python
# deleting the column (assign the result back so the change persists)
df = df.drop(columns='Unnamed: 0')


Output:
Open High Low Close Volume Name
Date
2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA
2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA
2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA
2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA
2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA

Plotting Line plot for Time Series data.


Python
df['Volume'].plot()
Output:

Plot all other columns using a subplot.

Python
df.plot(subplots=True, figsize=(4, 4))

Output:


The line plots used above are good for showing seasonality.

Seasonality: In time-series data, seasonality is the presence of variations


that occur at specific regular time intervals less than a year, such as weekly,
monthly, or quarterly.

Resampling:
 Resampling is a methodology of economically using a data sample to
improve the accuracy and quantify the uncertainty of a population
parameter.
 Resampling for months or weeks and making bar plots is another very
simple and widely used method of finding seasonality.

Resample and Plot The Data


Python
# Resampling the time series data based on monthly 'M' frequency
df_month = df.resample("M").mean()

# using subplot
fig, ax = plt.subplots(figsize=(6, 6))

# plotting bar graph


ax.bar(df_month['2016':].index,
df_month.loc['2016':, "Volume"],
width=25, align='center')
Output:


There are 24 bars in the graph and each bar represents a month.

Differencing: Differencing computes the difference between values separated by a specified interval. By default the interval is one, but other values can be specified for the plots. It is a popular method for removing trends from the data.

Python

df.Low.diff(2).plot(figsize=(6, 6))

Output:

4. Explain in detail about data cleaning.


Clean the dataset for outliers:
1. Checking the shape of the dataset:
Code:
df_power.shape
Output:
(4383, 5)
The dataframe contains 4,383 rows and 5 columns.
2. Few entries can also be checked inside the dataframe.
The last 10 entries can be examined by using the following code:
df_power.tail(10)
Output:

3. The data types of each column in the df_power dataframe are reviewed by:
df_power.dtypes
Output:
Date            object
Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object
 The Date column has a data type of object. This is not correct.
So, the next step is to correct the Date column, as shown here:

# convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
 It should convert the Date column to Datetime format.
 This can be verified using the following code:
df_power.dtypes

Output:
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object

The Date column has been changed to the correct data type.
 The index of the dataframe can be changed to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)

Output:

 In the preceding screenshot, the Date column has been set as


DatetimeIndex.
 This can be simply verified by using the code snippet given here:
df_power.index
Output:
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03',
'2006-01-04', '2006-01-05', '2006-01-06', '2006-01-07',
'2006-01-08', '2006-01-09', '2006-01-10', ... '2017-12-22',
'2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26',
'2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30',
'2017-12-31'],
dtype='datetime64[ns]',name='Date', length=4383,freq=None)


Since the index is the DatetimeIndex object, it can be used to analyze the
dataframe. To make our lives easier, more columns need to be added to the
dataframe.
Adding Year, Month, and Weekday Name:
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
df_power['Weekday Name'] = df_power.index.weekday_name
# Note: in newer pandas versions, weekday_name has been removed; use df_power.index.day_name() instead.
Let's display five random rows from the dataframe:
# Display a random sampling of 5 rows
df_power.sample(5, random_state=0)

Output:

 Three more columns have been added: Year, Month, and Weekday Name.
 Adding these columns helps to make the analysis of the data easier.

Time-based indexing

 Time-based indexing is a very powerful method of the pandas library when it comes to time series data.
 Having time-based indexing allows a formatted string to be used to select data.

Code:
df_power.loc['2015-10-02']
Output:
Consumption     1391.05
Wind             81.229
Solar           160.641
Wind+Solar       241.87
Year               2015
Month                10
Weekday Name     Friday
Name: 2015-10-02 00:00:00, dtype: object
 The pandas dataframe loc accessor is used.
 In the preceding example, the date is used as a string to select a row.
 All sorts of techniques can be used to access rows, just as we can do with a normal dataframe index.
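A few more illustrative selections (assuming the df_power dataframe with its DatetimeIndex, as built above):
Code:
# Partial-string indexing: all rows from October 2015
df_power.loc['2015-10']

# Slicing a range of dates
df_power.loc['2015-10-02':'2015-10-05']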

Visualizing time series


Consider the df_power dataframe to visualize the time series dataset:
 The first step is to import the seaborn and matplotlib libraries:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize': (11, 4)})
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150

 Next, a line plot of the full time series of Germany's daily electricity consumption is generated:
df_power['Consumption'].plot(linewidth=0.5)

Output:

 In the above screenshot, the y-axis shows the electricity consumption and the x-axis shows the year.


However, the plot is too dense to show the details for every year clearly.
Using dots to plot the data for all the other columns:
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5,
                                   linestyle='None', figsize=(14, 6),
                                   subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')

Output:

 The output shows that electricity consumption can be broken down into two distinct patterns:
o One cluster roughly from 1,400 GWh and above
o Another cluster roughly below 1,400 GWh
 Moreover, solar production is higher in summer and lower in winter.
 Over the years, there seems to have been a strong increasing trend in the output of wind power.

 Investigation of a single year to have a closer look:
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)');
Output:


From the preceding screenshot, the consumption of electricity for 2016 can be seen clearly.
The graph shows a drastic decrease in the consumption of electricity at the end of the year (December) and during August.
The month of December 2016 can be examined with the following code block:
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');

Output:

 As shown in the preceding graph, electricity consumption is higher on weekdays and lowest at the weekends.

 The consumption can be observed for each day of the month.

 To see how consumption plays out in the last week of December, it can be zoomed in further.

 In order to indicate a particular week of December, a specific date range can be supplied as shown here:
ax = df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');

 As illustrated in the preceding code, the electricity consumption between 2016-12-23 and 2016-12-30 can be observed.


Output:

 As illustrated in the preceding screenshot, electricity consumption was lowest on the day of Christmas, probably because people were busy partying.
 After Christmas, consumption increased.

5. Explain about grouping time series data.


The data can be grouped by different time periods and box plots can be presented.
First, the data can be grouped by months, and then the box plots can be used to visualize the data:

fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')


Output:

The preceding plot illustrates that electricity consumption is generally higher in the winter and lower in the summer.
Wind production is higher during the summer.
Moreover, there are many outliers associated with electricity consumption, wind production, and solar production.
Next, the consumption of electricity can be grouped by the day of
the week, and presented in a box plot:
sns.boxplot(data=df_power, x='Weekday Name', y='Consumption');

Output:

 The preceding screenshot shows that electricity consumption is higher on weekdays than on weekends.
 Interestingly, there are more outliers on the weekdays.
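Beyond box plots, grouping can be combined with aggregation; a brief sketch, assuming the same df_power dataframe and the Year and Month columns added earlier:
# Mean electricity consumption per calendar year
yearly_mean = df_power.groupby(df_power.index.year)['Consumption'].mean()

# Mean electricity consumption per (Year, Month) group
monthly_mean = df_power.groupby(['Year', 'Month'])['Consumption'].mean()

print(yearly_mean.head())
print(monthly_mean.head())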


6. Explain about resampling time series data.


 It is often required to resample the dataset at lower or higher frequencies.
 This resampling is done based on aggregation or grouping operations.
 For example, the data can be resampled based on the weekly mean time series as follows:
1. To resample the data, the following code can be used:
columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
power_weekly_mean = df_power[columns].resample('W').mean()
power_weekly_mean

Output:

 The above screenshot shows that the first row, labeled 2006-01-01, includes the average of all the data.
 The daily and weekly time series can be plotted to compare the dataset over the six-month period.
2. Consider the first six months of 2016. Let's start by initializing the variables:
start, end = '2016-01', '2016-06'
3. To plot the graph, the following code can be used:
fig, ax = plt.subplots()
ax.plot(df_power.loc[start:end, 'Solar'], marker='.', linestyle='-',
        linewidth=0.5, label='Daily')
ax.plot(power_weekly_mean.loc[start:end, 'Solar'], marker='o',
        markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production in (GWh)')
ax.legend();

Output:

 The preceding screenshot shows that the weekly mean time series is
increasing over time and is much smoother than the daily time series.

7. Explain the Three-variable contingency tables and beyond.


Causal path models for three variables
 Each arrow linking two variables in a causal path diagram represents
the direct effect of one variable upon the other, controlling all other
relevant variables.
 When assessing the direct effect of one variable upon another, any third
variable which is likely to be causally connected to both variables and
prior to one of them should be controlled.
 Coefficient b in figure 5.8 shows the direct effect of being in a voluntary
association on the belief that most people can be trusted.
 To find its value, focus attention on the proportion who say that most
people can be trusted, controlling for level of qualifications.


Figure 5.8 Social trust by membership of voluntary association and level of qualifications: causal path diagram.

More complex models: going beyond three variables

Logistic regression models


 Regression analysis is a method for predicting the values of a continuously
distributed dependent variable from an independent, or explanatory,
variable.
 The principles behind logistic regression are very similar and the approach
to building models and interpreting the models is virtually identical.
 However, whereas regression (more properly termed Ordinary Least
Squares regression, or OLS regression) is used when the dependent
variable is continuous, a binary logistic regression model is used when the
dependent variable can only take two values.
 In many examples this dependent variable indicates whether an event
occurs or not and logistic regression is used to model the probability that
the event occurs.
 In the example discussed above, therefore, logistic regression would be used to model the probability that an individual believes that most people can be trusted.
When using a single explanatory variable, such as volunteering, the logistic regression model can be written as follows:
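The exact equation from the original figure is not reproduced in these notes; the standard form of a binary logistic regression with one explanatory variable X (for example, volunteering) is:

log(p / (1 − p)) = b0 + b1X

where p is the probability that a respondent believes most people can be trusted, b0 is the intercept, and b1 is the coefficient for the explanatory variable.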

8. Explain in detail about longitudinal data.


o It is important to distinguish longitudinal data from time series data. Although time series data can provide a picture of aggregate change, it is only longitudinal data that can provide evidence of change at the level of the individual.
o Time series data could perhaps be understood as a series of snapshots
of society, whereas longitudinal research entails following the same
group of individuals over time and linking information about those
individuals from one time point to another.
 Collecting longitudinal data
Prospective and retrospective research designs
o Longitudinal data are frequently collected using a prospective
longitudinal research design, i.e. the participants in a research study
are contacted by researchers and asked to provide information about
themselves and their circumstances on a number of different occasions.
o This is often referred to as a panel study. However, it is not necessary to
use a longitudinal research design in order to collect longitudinal data
and there is therefore a conceptual distinction between longitudinal
data and longitudinal research.
o Indeed, the retrospective collection of longitudinal data is very common.
In particular, it has become an established method for obtaining basic
information about the dates of key life events such as marriages,
separations and divorces and the birth of any children (i.e. event
history data).
o This is clearly an efficient way of collecting longitudinal data and
obviates the need to re-contact the same group of individuals over a
period of time.
o A potential problem is that people may not remember the past
accurately enough to provide good quality data. While some authors
have argued that recall is not a major problem for collecting
information about dates of significant life events, other research
suggests that individuals may have difficulty remembering dates
accurately, or may prefer not to remember unfavorable episodes or
events in their lives.
o Large-scale quantitative surveys often combine a number of different
data collection strategies so they do not always fit neatly into the
classification of prospective or retrospective designs. In particular,
longitudinal event history data are frequently collected retrospectively
as part of an ongoing prospective longitudinal study.

PREPARED BY: Dr. S. ARTHEESWARI, Prof. / AI&DS