
DEPENDENCE AND INTERDEPENDENCE METHODS

Introduction to dependence and interdependence methods


1. Supervised and unsupervised learning

• Supervised learning:
§ Supervised learning is a type of machine learning where the model is trained on labeled data. In this context, the data used for training includes both the input features (independent variables) and the correct output labels (dependent variables). The goal of supervised learning is to learn a mapping from inputs to the correct outputs (predictions).
§ Features:
o Labeled data: the training dataset includes both the input data and the corresponding output labels.
o Prediction task: the model is trained to predict the output based on input features.
o Goal: to learn a relationship between the input and the output so that the model can make accurate predictions on unseen data.
§ Focuses on predicting an outcome (dependent variable) from input features (independent variables), making it inherently related to dependence methods.

• Unsupervised learning:
§ Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. In this case, the training data only includes the input features, but no corresponding output labels. The model tries to identify patterns, structures, or relationships within the data without explicit guidance about what the correct output should be.
§ Features:
o Unlabeled data: the training dataset contains only the input data without any known output labels.
o Pattern discovery: the goal is to explore the data and identify underlying patterns or groupings (data structures).
o Goal: to find structure or relationships within the data, such as grouping similar data points or reducing the dimensionality of the data.
§ Focuses on finding patterns or structures in data without a dependent variable, and often deals with interdependence methods (e.g., clustering, feature reduction) or discovering the relationships among features.
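The contrast can be made concrete with a short sketch using scikit-learn and synthetic data (the library choice, the toy data, and the model choices are illustrative assumptions, not part of the original notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # supervised: needs labels
from sklearn.cluster import KMeans                 # unsupervised: no labels

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # input features (independent variables)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # labels

# Supervised learning: fit a mapping from inputs X to known outputs y.
model = LinearRegression().fit(X, y)
print("Prediction for a new point:", model.predict([[1.0, -1.0]]))

# Unsupervised learning: only X is given; the algorithm looks for structure.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments of first 10 points:", clusters[:10])
```

Note that the supervised model needs the labels y to train, while K-means receives only X and must discover the grouping on its own.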

2. Dependence methods

• These methods analyze causal or associative relationships between variables → supervised learning.
• There are two types of variables:
§ Independent variable.
§ Dependent variable (whose value depends on the independent variable value).
• Used for prediction (of the dependent variable as a function of the independent variable values).
• Types of dependence methods:
§ Regression (a sketch of these variants follows the list below):
o Linear regression: examines the linear relationship between two continuous (numerical) variables, one dependent variable and one independent variable.
o Logistic regression: examines the relationship between one categorical (not numerical) dependent variable and one continuous (numerical) independent variable.
o Multiple regression: examines the relationship between one continuous (numerical) dependent variable and two or more independent variables (numerical and/or categorical).
o Analysis of variance (ANOVA): examines how one or more independent categorical (not numerical) variables affect a continuous (numerical) dependent variable.
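A minimal sketch of these regression variants, here using the statsmodels formula API on synthetic data (the library, the variable names, and the data-generating choices are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x1": rng.normal(size=120),
    "x2": rng.normal(size=120),
    "group": rng.choice(["A", "B", "C"], size=120),  # categorical predictor
})
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=120)
df["success"] = (df["y"] > df["y"].median()).astype(int)  # binary outcome

simple = smf.ols("y ~ x1", data=df).fit()                  # linear regression
multiple = smf.ols("y ~ x1 + x2 + group", data=df).fit()   # multiple regression
logistic = smf.logit("success ~ x1", data=df).fit(disp=0)  # logistic regression
anova = anova_lm(smf.ols("y ~ group", data=df).fit())      # one-way ANOVA table

print(multiple.params)  # estimated coefficients
print(anova)
```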

§ Hypothesis testing methods (statistical inference); a sketch of both tests follows the list below:
o Student's t-test: determines if there is a statistically significant difference between the means of two groups.
o Analysis of variance (ANOVA): determines if there is a statistically significant difference between the means of three or more groups:
- One-way ANOVA: compares the means of more than two independent groups based on one factor.
- Two-way ANOVA: compares the means of more than two independent groups based on two factors.
- Repeated measures ANOVA: used when the same subjects are tested multiple times (for example before, during, and after a treatment).
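A minimal sketch of both tests using scipy.stats on synthetic groups (the library choice and the made-up group data are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
group_c = rng.normal(loc=12.0, scale=2.0, size=30)

# Student's t-test: are the means of two groups significantly different?
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# One-way ANOVA: are the means of three or more groups significantly different?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

print(f"t-test: t = {t_stat:.2f}, p = {p_ttest:.4f}")
print(f"ANOVA:  F = {f_stat:.2f}, p = {p_anova:.4f}")
```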
§ Probability-based methods related to logistic regression (a worked conversion follows the list below):
o Probability scale: refers to the range of values that probabilities can take, typically between 0 and 1. It is a way of expressing the likelihood of an event occurring, where 0 means the event never occurs and 1 means it always occurs. The probability scale is often used in dependence methods, particularly when the dependent variable is a binary outcome (such as success/failure, yes/no).
o Odds scale: the odds scale expresses the likelihood of an event occurring relative to it not occurring. It is often used in the context of binary outcomes (such as success/failure, yes/no, etc.). It is used to measure or model the relationship between a dependent variable (usually a binary outcome) and one or more independent variables.
o Odds ratio (OR): a measure of association that compares the odds of an event occurring in one group to the odds of it occurring in another group. It is commonly used in the analysis of binary outcomes (like success/failure or yes/no) to quantify how the odds of an event differ between two groups or conditions. It is used to measure or model the relationship between a dependent variable (usually a binary outcome) and one or more independent variables.
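A short worked conversion between the three scales (the groups and counts below are hypothetical, chosen only to make the arithmetic visible):

```python
# Converting between the probability scale and the odds scale, and computing
# an odds ratio (OR) for two groups. All counts are made-up examples.

def odds(p: float) -> float:
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

p_treated = 30 / 100  # 30 successes out of 100 in a hypothetical treated group
p_control = 15 / 100  # 15 successes out of 100 in a hypothetical control group

odds_treated = odds(p_treated)  # 0.30 / 0.70 ≈ 0.429
odds_control = odds(p_control)  # 0.15 / 0.85 ≈ 0.176

odds_ratio = odds_treated / odds_control  # ≈ 2.43
print(f"OR = {odds_ratio:.2f}")  # odds of the event are ~2.4x higher when treated
```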
§ Survival analysis: the dependent variable is typically time to an event (also called survival time), and it often involves censoring, which means that some individuals may not have experienced the event by the end of the study period. The "event" typically refers to an outcome like death, disease occurrence, equipment failure, or any other event of interest. The primary goal of survival analysis is to model and analyze the time it takes for this event to occur, and to identify factors that influence this duration. Survival analysis is generally considered a dependence method: the focus is on understanding how independent variables (predictors) influence the dependent variable (survival time). The independent variables may include factors such as age, gender, treatment type, and other covariates that are hypothesized to affect the survival time. A sketch follows the list below.
o Kaplan-Meier curve: a non-parametric method for estimating the survival function.
o Parametric models of survival analysis are continuous functions fitted to pre-existing data, so a more accurate characterization of the curve is achieved and inferences are easily determined:
- Weibull models.
- Log-normal models.
- Exponential models.
o Cox proportional hazards model: a semi-parametric regression model used to explore the relationship between the survival time and one or more predictor (independent) variables.
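A minimal sketch of a Kaplan-Meier estimate and a Cox model, here using the lifelines library on synthetic data (the library, the column names, and the data-generating choices are assumptions; the notes do not prescribe a package):

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "age": rng.normal(loc=60, scale=10, size=n),
    "treated": rng.integers(0, 2, size=n),
})
# Synthetic survival times; event_observed == 0 marks censored individuals.
df["duration"] = rng.exponential(scale=10 + 5 * df["treated"], size=n)
df["event_observed"] = rng.integers(0, 2, size=n)

# Kaplan-Meier: non-parametric estimate of the survival function.
kmf = KaplanMeierFitter()
kmf.fit(durations=df["duration"], event_observed=df["event_observed"])
print(kmf.survival_function_.head())

# Cox proportional hazards: how do the covariates influence survival time?
cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event_observed")
cph.print_summary()
```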
§ Classification methods: these are dependence methods because they are used to predict the value of a dependent variable based on one or more independent variables (predictors). This is supervised learning: the model is trained using labeled data, meaning that each training example has a corresponding target or dependent variable (the output or label). In other words, from known data, in which the events belong to a known type/class, we build an algorithm that looks for patterns in that known data; we then feed our unknown data to the model, which tries to classify it based on the patterns learned from the known data. Common classifiers (a sketch follows the list below):
o K-Nearest Neighbors (KNN): calculates the distance between the query point and the training data points. Based on the majority class of the nearest neighbors, it assigns the class label to the query point.
o Bayesian methods: classify using the probability distribution of the features and their relationship to the class labels to predict the most likely class for new data points.
o Decision trees: the tree is constructed by selecting the feature that best separates the data at each node. It continues to split the data based on feature values until it reaches a leaf node that corresponds to a predicted class label.
o Random forest: the model creates many decision trees (each trained on a random subset of features and data) and aggregates their predictions through a majority vote for classification (or averaging for regression).
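A minimal sketch fitting all four classifiers with scikit-learn on the classic iris dataset (the library and dataset choices are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # labeled data: features X, known classes y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)        # learn patterns from the known data
    acc = clf.score(X_test, y_test)  # classify the held-out "unknown" data
    print(f"{name}: accuracy = {acc:.2f}")
```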
3. Interdependence methods

• These methods do not assume causality (associative relationships between the variables).
• All variables are equally important (there are no dependent and independent variables).
• They are used to explore data architecture and identify patterns or associations that allow us to structure the data → unsupervised learning.
• Types of interdependence methods:
§ Interdependence methods used to analyze linear relationships (see the sketch after this list):
o Correlation (r): measures the strength and direction of the relationship between two variables without assuming causality or establishing a dependent relationship between them. It is used in linear regression.
o Goodness of fit (measures how well a model's predicted values match the observed values, i.e., how accurately it predicts the dependent variable based on the independent variables):
- Determination coefficient (R²): measures the percentage (%) of the variability of the dependent variable (Y) that can be explained by the theoretical linear model built from the regression between your independent and dependent variables.
- Adjusted R-squared: a variation of R² that adjusts for the number of predictors (independent variables) in the model. It is used to avoid overestimating goodness of fit when multiple independent variables are included.
- Sum of Squared Errors (SSE): a measure of the total error in a regression model. It is the sum of the squared differences between the actual (observed) values and the predicted values.
- Root Mean Squared Error (RMSE): a measure of the average magnitude of the residuals (errors) between the observed values and the predicted values. It is the square root of the average of the squared errors.
- Null model (or intercept-only model): a baseline model that only predicts the mean (average) of the dependent variable for all observations. It assumes that the independent variables have no effect on the dependent variable.
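A minimal sketch computing r, R², SSE, RMSE, and the null-model baseline, here with scipy and scikit-learn on synthetic data (the library choices and the toy data are assumptions):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(4)
x = rng.normal(size=80)
y = 1.5 * x + rng.normal(scale=0.8, size=80)

r, p_value = pearsonr(x, y)  # correlation coefficient r
model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

r2 = r2_score(y, y_pred)                       # determination coefficient R²
sse = np.sum((y - y_pred) ** 2)                # sum of squared errors
rmse = np.sqrt(mean_squared_error(y, y_pred))  # root mean squared error

# Null model baseline: always predict the mean of y.
sse_null = np.sum((y - y.mean()) ** 2)
print(f"r = {r:.3f}, R² = {r2:.3f}, SSE = {sse:.1f}, RMSE = {rmse:.3f}")
print(f"Null-model SSE = {sse_null:.1f}  (note R² = 1 - SSE/SSE_null)")
```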
§ Probability-based methods:
o Chi-square (χ²): determines if there is a significant association or relationship between two categorical variables. It compares the observed frequencies of the data to the frequencies we would expect if the variables were independent, testing whether the distribution of one variable is related to the distribution of the other. The chi-square test of independence is considered an interdependence method because it is specifically designed to test whether two categorical variables are independent of each other; in other words, it tests whether two categorical variables are independent or associated, rather than predicting or modeling a relationship between a dependent variable and one or more independent variables. A sketch follows below.
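A minimal sketch of the chi-square test of independence using scipy.stats.chi2_contingency on a hypothetical contingency table (the counts are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = smoker/non-smoker,
# columns = disease/no disease.
observed = np.array([[30, 70],
                     [15, 85]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi² = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
print("Expected frequencies if the variables were independent:")
print(expected)
```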
§ Cluster (conglomerate) analysis: these are not methods that try to classify; they try to understand the data structure and validate that it is not an artefact (the events inside the groups should present high intragroup homogeneity). The aim is to group data based on their homogeneity within the sample heterogeneity, not to predict the classification of a data event into a certain group (which is what classification methods do). See the sketch after this list.
o K-means clustering: a partitioning method, meaning that it divides the data into a set number of clusters (denoted by k), where each data point belongs to one and only one cluster. The main goal of K-means is to partition the data into k groups such that the data points within each group (or cluster) are as similar as possible. It can be visualized with a heatmap.
o Hierarchical clustering: a clustering method that builds a tree-like structure called a dendrogram to represent the hierarchical relationships between clusters. Unlike K-means, hierarchical clustering does not require the number of clusters to be specified in advance. The algorithm produces a hierarchy of clusters, which can be cut at any level to obtain a desired number of clusters. It can be visualized with a heatmap.
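A minimal sketch of both clustering methods, using scikit-learn for K-means and scipy for the hierarchical dendrogram (the library choices and the synthetic two-group data are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Two loose groups of points; the grouping is NOT given to the algorithms.
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=4.0, size=(50, 2))])

# K-means: the number of clusters (k) must be chosen in advance.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("K-means cluster sizes:", np.bincount(labels))

# Hierarchical clustering: builds a dendrogram; no k is needed up front.
Z = linkage(X, method="ward")
dendrogram(Z, no_plot=True)  # set no_plot=False with matplotlib to visualize
```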
§ Dimensionality reduction: used to reduce the number of features or variables in a dataset while retaining as much of the important information as possible. A sketch follows below.
o Principal component analysis (PCA): a linear dimensionality reduction method that identifies the directions (called principal components) along which the data varies the most and projects the data into a lower-dimensional space along those directions. Each PCA axis explains part of the variability between the samples.
o t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP): these are non-linear dimensionality reduction methods.
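A minimal sketch of PCA and t-SNE with scikit-learn on the iris features (the library and dataset choices are assumptions; UMAP is omitted here because it lives in the separate umap-learn package):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)  # labels are ignored: unsupervised

# PCA: linear projection onto the directions of maximum variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Variance explained by each PCA axis:", pca.explained_variance_ratio_)

# t-SNE: non-linear embedding that preserves local neighborhoods.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("t-SNE embedding shape:", X_tsne.shape)
```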
