Unit 5 Dev 2023
PART B
1. Explain the third variable and describe causal explanations in detail
with a suitable example.
Explain in detail about Introducing a third variable.
Ways of holding a third variable constant while assessing the relationship
between two others.
Third Variable
o X1 denotes a predictor variable, Y denotes an outcome variable, and
X2 denotes a third variable that may be involved in the X1-Y
relationship.
o For example, age (X1) is predictive of systolic blood pressure (SBP) (Y)
when body weight (X2) is statistically controlled.
o A third-variable effect (TVE) is the effect that a third variable
conveys to an observed relationship between an exposure and a
response variable of interest.
o Depending on whether there is a causal relationship from the exposure
variable to the third variable and then to the response, the third
variable (denoted as M) is called a mediator (when these causal
relationships exist) or a confounder (when no causal relationship from
the exposure to the third variable is involved).
o In third-variable analysis, besides the pathway that directly connects
the exposure variable with the outcome, the exposure → third
variable → response (X → M → Y) pathways are also explored.
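One common way of holding a third variable constant is to include it as an
extra predictor in a regression. The sketch below mirrors the age, body
weight, and SBP example above using synthetic data; the coefficients, the
sample size, and the helper name ols are assumptions made for illustration,
not values from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (assumed for illustration): weight rises with age,
# and SBP depends on both age and weight.
age = rng.uniform(20, 70, n)                                 # X1
weight = 50 + 0.5 * age + rng.normal(0, 5, n)                # X2 (third variable)
sbp = 90 + 0.3 * age + 0.4 * weight + rng.normal(0, 5, n)    # Y


def ols(y, *predictors):
    """Least-squares coefficients; the first entry is the intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


slope_unadjusted = ols(sbp, age)[1]         # age alone, weight ignored
slope_adjusted = ols(sbp, age, weight)[1]   # age with weight held constant

print(f"age slope, weight ignored:    {slope_unadjusted:.2f}")
print(f"age slope, weight controlled: {slope_adjusted:.2f}")
```

In this simulation the unadjusted slope mixes the direct X → Y path (0.3)
with the X → M → Y path (0.5 × 0.4 = 0.2), while the adjusted slope
recovers roughly the direct path alone.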
Causal explanation
o A causal explanation states how and why an effect occurs, and provides
information about when and where the relationship can be replicated.
o Causality is the idea that one event, behavior, or belief results in
the occurrence of a subsequent event, behavior, or belief; it is about
cause and effect.
o "X causes Y" means that a change in X produces a change in Y.
o Independent variables - may cause direct changes in another
variable.
o Control variables - are held constant during the experiment.
o Causation - describes the cause-and-effect relationship.
o Correlation - any statistical association between two variables in the
experiment; it does not by itself establish causation.
Simpson's paradox
o In some cases the relationship between two variables is not merely
weakened when a third, prior, variable is taken into account; its
direction is completely reversed.
o This is often known as Simpson's paradox (named after Edward
Simpson).
o Simpson's paradox can be succinctly summarized as follows: any
statistical relationship between two variables may, in principle, be
reversed by including additional factors in the analysis.
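The reversal can be seen in a small table of counts. The sketch below uses
illustrative counts patterned on the widely quoted kidney-stone treatment
example; the column names and the dataframe layout are assumptions chosen
for this illustration.

```python
import pandas as pd

# Illustrative counts: two treatments, split by a third variable (stone size).
data = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "successes": [81, 192, 234, 55],
    "patients": [87, 263, 270, 80],
})

# Success rate within each level of the third variable.
data["rate"] = data["successes"] / data["patients"]
by_size = data.pivot(index="stone_size", columns="treatment", values="rate")

# Success rate when the third variable is ignored.
totals = data.groupby("treatment")[["successes", "patients"]].sum()
overall = totals["successes"] / totals["patients"]

print(by_size.round(3))
print(overall.round(3))
```

Treatment A has the higher success rate within each stone size, yet
treatment B has the higher rate once the two groups are pooled, because A
was given mostly to the harder (large stone) cases.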
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Output:
             Open   High    Low  Close    Volume  Name
Date
2006-01-03  39.69  41.22  38.79  40.91  24232729  AABA
2006-01-04  41.22  41.90  40.77  40.97  20553479  AABA
2006-01-05  40.93  41.73  40.85  41.53  12829610  AABA
2006-01-06  42.88  43.57  42.80  43.21  29422828  AABA
2006-01-09  43.10  43.66  42.82  43.42  16268338  AABA
Python
df.plot(subplots=True, figsize=(4, 4))
Output:
The line plots used above are good for showing seasonality.
Resampling:
Resampling is a methodology of economically using a data sample to
improve the accuracy and quantify the uncertainty of a population
parameter.
Resampling for months or weeks and making bar plots is another very
simple and widely used method of finding seasonality.
# using subplot
fig, ax = plt.subplots(figsize=(6, 6))
There are 24 bars in the graph and each bar represents a month.
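The monthly resampling described above can be sketched as follows. The
series here is synthetic (two years of daily values with an assumed yearly
cycle), and the names prices and monthly_mean are illustrative rather than
taken from the original code.

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering two full years (assumed data).
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
rng = np.random.default_rng(1)
seasonal = 10 * np.sin(2 * np.pi * dates.dayofyear.to_numpy() / 365.25)
prices = pd.Series(50 + seasonal + rng.normal(0, 2, len(dates)),
                   index=dates, name="Close")

# "MS" groups the rows by calendar month, labelled by the month start.
monthly_mean = prices.resample("MS").mean()

print(len(monthly_mean))              # number of monthly values
print(monthly_mean.head(3).round(2))
```

Two years of daily data give 24 monthly means, so a call such as
monthly_mean.plot.bar() would draw 24 bars, one per month, and the
seasonal rise and fall shows up as a repeating pattern in the bar heights.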
Python
df.Low.diff(2).plot(figsize=(6, 6))
Output:
Code:
df_power.tail(10)
Output:
Output:
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object
The Date column has been changed to the correct data type.
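A conversion of this kind is typically done with pd.to_datetime. The
sketch below is a minimal illustration on a made-up three-row dataframe;
df_demo and its values are hypothetical and only stand in for the power
dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the power dataset; values are made up.
df_demo = pd.DataFrame({
    "Date": ["2017-12-29", "2017-12-30", "2017-12-31"],
    "Consumption": [1200.0, 1100.0, 1000.0],
})
print(df_demo.dtypes)    # Date holds plain strings at this point

# Parse the strings into timestamps.
df_demo["Date"] = pd.to_datetime(df_demo["Date"])
print(df_demo.dtypes)    # Date now has a datetime64 dtype
```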
The index of the dataframe can be changed to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)
Output:
Since the index is the DatetimeIndex object, it can be used to analyze the
dataframe. To make our lives easier, more columns need to be added to the
dataframe.
Adding Year, Month, and Weekday Name:
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
# day_name() returns the weekday as a string (e.g. 'Friday')
df_power['Weekday Name'] = df_power.index.day_name()
Let's display five random rows from the dataframe:
# Display a random sampling of 5 rows
df_power.sample(5, random_state=0)
Output:
Time-based indexing
Wind 81.229
Solar 160.641
Wind+Solar 241.87
Year 2015
Month 10
Weekday Name Friday
Name: 2015-10-02 00:00:00, dtype: object
The pandas dataframe loc accessor is used.
In the preceding example, the date is passed as a string to select a row.
All the usual techniques for accessing rows can be used, just as with a
normal dataframe index.
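The following sketch shows a few forms of time-based selection with loc on
a DatetimeIndex. The dataframe df_demo and its values are hypothetical and
exist only to make the example self-contained.

```python
import pandas as pd

# Hypothetical daily data with a DatetimeIndex.
dates = pd.date_range("2015-09-28", periods=10, freq="D")
df_demo = pd.DataFrame({"Consumption": range(100, 110)}, index=dates)

# A single date string selects one row, returned as a Series.
one_day = df_demo.loc["2015-10-02"]

# A slice of date strings includes both end points.
three_days = df_demo.loc["2015-10-01":"2015-10-03"]

# A partial string selects every row in that month.
october = df_demo.loc["2015-10"]

print(one_day)
print(three_days)
print(len(october))
```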
Output:
However, there are too many data points to show every year clearly in a
single plot.
Using the dots to plot the data for all the other columns:
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5, linestyle='None',
                                   figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
Output:
The above screenshot shows that the first row, labeled 2006-01-01,
includes the average of all the data.
The daily and weekly time series can be plotted to compare the
dataset over the six-month period.
2. Consider the first six months of 2016. Let's start by initializing the
variables: start, end = '2016-01', '2016-06'
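3. To plot the graph, the following code can be used:
fig, ax = plt.subplots()
# Daily series; the opening of this call is missing in the source and is
# reconstructed here by analogy with the weekly call below.
ax.plot(df_power.loc[start:end, 'Solar'], linewidth=0.5, label='Daily')
# Weekly mean series
ax.plot(power_weekly_mean.loc[start:end, 'Solar'], marker='o', markersize=8,
        linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production (GWh)')
ax.legend();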
Output:
The preceding screenshot shows that the weekly mean time series is
increasing over time and is much smoother than the daily time series.
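The smoothing effect of a weekly mean can be checked numerically. The
sketch below uses a synthetic daily series with an assumed upward trend
and random noise; the names solar, weekly_mean, daily_roughness, and
weekly_roughness are illustrative and do not come from the original
dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for the first half of 2016 (assumed data):
# a steady upward trend plus day-to-day noise.
dates = pd.date_range("2016-01-01", "2016-06-30", freq="D")
rng = np.random.default_rng(2)
trend = np.linspace(20, 120, len(dates))
solar = pd.Series(trend + rng.normal(0, 15, len(dates)),
                  index=dates, name="Solar")

# Weekly mean: one value per week instead of one per day.
weekly_mean = solar.resample("W").mean()

# Average absolute step between consecutive points, as a rough
# measure of how jagged each series is.
daily_roughness = solar.diff().abs().mean()
weekly_roughness = weekly_mean.diff().abs().mean()

print(f"daily roughness:  {daily_roughness:.1f}")
print(f"weekly roughness: {weekly_roughness:.1f}")
```

Averaging within each week cancels much of the day-to-day noise, so the
weekly series changes less from point to point while still following the
underlying trend.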