Dependence and Interdependence Methods
• Supervised learning:
§ Supervised learning is a type of machine learning where the model is trained on labeled data. In this context,
the data used for training includes both the input features (independent variables) and the correct output
labels (dependent variables). The goal of supervised learning is to learn a mapping from inputs to the correct
outputs (predictions).
§ Features:
o Labeled data: the training dataset includes both the input data and the corresponding output labels.
o Prediction task: the model is trained to predict the output based on input features.
o Goal: to learn a relationship between the input and the output so that the model can make accurate
predictions on unseen data.
§ Focuses on predicting an outcome (dependent variable) from input features (independent variables), making
it inherently related to dependence methods, as sketched below.
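A minimal supervised-learning sketch in Python (assuming scikit-learn is available; the labeled toy data is invented purely for illustration):

```python
# Minimal supervised-learning sketch (assumes scikit-learn is installed;
# the toy dataset below is invented purely for illustration).
from sklearn.linear_model import LinearRegression

# Labeled data: input features X (independent variable) and known outputs y (dependent variable).
X = [[1.0], [2.0], [3.0], [4.0]]   # input features
y = [2.1, 3.9, 6.2, 8.1]           # correct output labels

model = LinearRegression()
model.fit(X, y)                    # learn the mapping from inputs to outputs

print(model.predict([[5.0]]))      # predict on unseen data
```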
• Unsupervised learning:
§ Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. In this
case, the training data only includes the input features, but no corresponding output labels. The model tries
to identify patterns, structures, or relationships within the data without explicit guidance about what the
correct output should be.
§ Features:
o Unlabeled data: the training dataset contains only the input data without any known output labels.
o Pattern discovery: the goal is to explore the data and identify underlying patterns or groupings (data
structures).
o Goal: to find structure or relationships within the data, such as grouping similar data points or reducing
the dimensionality of the data.
§ Focuses on finding patterns or structures in data without a dependent variable, and often deals with
interdependence methods (e.g., clustering, feature reduction) or discovering the relationships among features, as sketched below.
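A minimal unsupervised-learning sketch (again assuming scikit-learn; the unlabeled toy points are invented for illustration):

```python
# Minimal unsupervised-learning sketch (assumes scikit-learn; toy data
# invented for illustration).
from sklearn.cluster import KMeans

# Unlabeled data: only input features, no output labels.
X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # discover groupings without any guidance
print(labels)                      # e.g., [0 0 1 1] -- two clusters found in the data
```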
2. Dependence methods
• These methods analyze causal or associative relationships between variables → supervised learning.
• There are two types of variables:
§ Independent variable.
§ Dependent variable (whose value depends on the value of the independent variable).
• They are used for prediction (of the dependent variable as a function of the independent variable values).
• Types of dependence methods (a short sketch follows this list):
§ Regression:
o Linear regression: examines the linear relationship between two continuous (numerical) variables, one
dependent variable and one independent variable.
o Logistic regression: examines the relationship between one categorical (not numerical) dependent variable
and one continuous (numerical) independent variable.
o Multiple regression: examines the relationship between one continuous (numerical) dependent
variable and two or more independent variables (numerical and/or categorical).
o Analysis of variance (ANOVA): examines how one or more independent categorical (not numerical)
variables affect a continuous (numerical) dependent variable.
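A minimal sketch of these dependence methods (assuming numpy, scipy, and scikit-learn; all toy arrays are invented for illustration):

```python
# Minimal sketch of the dependence methods above (assumes numpy, scipy and
# scikit-learn; all toy data below is invented for illustration).
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: continuous Y from one continuous X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
slope, intercept, r, p, stderr = stats.linregress(x, y)

# Logistic regression: categorical Y (0/1) from one continuous X.
X_cont = np.array([[0.5], [1.5], [3.0], [4.5], [5.5], [6.5]])
y_cat = np.array([0, 0, 0, 1, 1, 1])
logit = LogisticRegression().fit(X_cont, y_cat)

# Multiple regression: continuous Y from two or more independent variables.
X_multi = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y_multi = np.array([2.2, 4.5, 6.1, 8.6])
multi = LinearRegression().fit(X_multi, y_multi)

# One-way ANOVA: does a categorical factor (three groups) affect a continuous Y?
group_a = [5.1, 4.9, 5.3]
group_b = [6.8, 7.0, 6.5]
group_c = [5.0, 5.2, 4.8]
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

print(slope, intercept, logit.predict([[2.0]]), multi.coef_, f_stat, p_anova)
```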
3. Interdependence methods
• These methods do not assume causality (only associative relationships between the variables).
• All variables are equally important (there is no split into dependent and independent variables).
• They are used to explore the data architecture and identify patterns or associations that allow us to structure the data →
unsupervised learning.
• Types of interdependence methods:
§ Methods used to analyze linear relationships (a sketch of these metrics follows this list):
o Correlation (r): measures the strength and direction of the relationship between two variables without
assuming causality or establishing a dependent relationship between them. It's used in linear
regression.
o Goodness of fit (measures how well a model's predicted values match the observed values, i.e., how
accurately the model predicts the dependent variable based on the independent variables):
- Determination coefficient (R²): measures the percentage (%) of the variance of the dependent
variable (Y) that can be explained by the theoretical linear model built by regressing the
dependent variable on the independent variable(s).
- Adjusted R-squared: a variation of R² that adjusts for the number of predictors (independent
variables) in the model. It's used to avoid overestimating the goodness of fit when multiple
independent variables are included.
- Sum of Squared Errors (SSE): a measure of the total error in a regression model. It is the sum of
the squared differences between the actual (observed) values and the predicted values.
- Root Mean Squared Error (RMSE): a measure of the average magnitude of the residuals (errors)
between the observed values and the predicted values. It is the square root of the average of the
squared errors.
- Null model (or intercept-only model): a baseline model that only predicts the mean (average) of
the dependent variable for all observations. It assumes that the independent variables have no
effect on the dependent variable.
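A sketch computing these fit metrics from their definitions with plain numpy (toy data invented for illustration; a least-squares line is fitted first):

```python
# Sketch computing the fit metrics above with plain numpy (toy data invented
# for illustration; a simple least-squares line is fitted first).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])          # observed values
n, k = len(y), 1                                  # n observations, k predictors

slope, intercept = np.polyfit(x, y, 1)            # fit y = slope*x + intercept
y_pred = slope * x + intercept                    # predicted values

r = np.corrcoef(x, y)[0, 1]                       # correlation coefficient
sse = np.sum((y - y_pred) ** 2)                   # Sum of Squared Errors
sst = np.sum((y - y.mean()) ** 2)                 # error of the null (mean-only) model
r2 = 1 - sse / sst                                # determination coefficient R²
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # adjusted R²
rmse = np.sqrt(sse / n)                           # Root Mean Squared Error

print(r, r2, adj_r2, sse, rmse)
```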
§ Probability-based methods (a sketch follows this item):
o Chi-Square (χ²): determines whether there is a significant association or relationship between two
categorical variables. It compares the observed frequencies of the data to the frequencies we would
expect if the variables were independent, testing whether the distribution of one variable is related to
the distribution of the other. The chi-square test of independence is considered an interdependence
method because it is specifically designed to test whether two categorical variables are independent of
each other; in other words, it tests whether the variables are independent or associated, not whether
we can predict or model a relationship between a dependent variable and one or more independent
variables.
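A minimal chi-square test of independence (assuming scipy; the contingency table of observed frequencies is invented for illustration):

```python
# Minimal chi-square test of independence (assumes scipy; the 2x2 contingency
# table of observed frequencies below is invented for illustration).
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: outcome present / outcome absent.
observed = [[30, 10],
            [20, 25]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # small p-value -> the two variables are likely associated
print(expected)        # frequencies expected if the variables were independent
```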
§ Cluster (conglomerate) analysis: these are not methods that try to classify; they try to understand the data structure and
validate that it is not an artefact (the events inside the groups should present high intragroup homogeneity).
The aim is to group data based on their homogeneity within the sample's heterogeneity, not to predict the
classification of a data event into a certain group (that is the role of classification methods). A sketch follows this list:
o K-means clustering: a partitioning method, meaning that it divides the data into a set number of
clusters (denoted by k), where each data point belongs to one and only one cluster. The main goal of K-
Means is to partition the data into k groups such that the data points within each group (or cluster) are
as similar as possible. It can be visualized with a heatmap.
o Hierarchical clustering: a clustering method that builds a tree-like structure called a dendrogram to
represent the hierarchical relationships between clusters. Unlike K-Means, hierarchical clustering does
not require the number of clusters to be specified in advance. The algorithm produces a hierarchy of
clusters, which can be cut at any level to obtain a desired number of clusters. It can be visualized with a
heatmap.
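A minimal clustering sketch (assuming scikit-learn and scipy; the toy 2-D points are invented for illustration):

```python
# Minimal clustering sketch (assumes scikit-learn and scipy; toy 2-D points
# invented for illustration).
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9], [4.0, 4.2]])

# K-means: k must be chosen in advance; every point gets exactly one cluster.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: build the full dendrogram, then cut it at 2 clusters.
tree = linkage(X, method="ward")
hc_labels = fcluster(tree, t=2, criterion="maxclust")

print(km_labels, hc_labels)
```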
§ Dimensionality reduction: it is used to reduce the number of features or variables in a dataset while
retaining as much of the important information as possible. A sketch follows this list:
o Principal component analysis (PCA): a linear dimensionality reduction method that identifies the
directions (called principal components) along which the data varies the most and projects the data
into a lower-dimensional space along those directions. The PCA axes explain the variability between the
samples.
o t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and
Projection (UMAP): these are non-linear dimensionality reduction methods.
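A minimal PCA sketch (assuming scikit-learn; the toy three-feature dataset is invented for illustration):

```python
# Minimal PCA sketch (assumes scikit-learn; the toy 3-feature dataset is
# invented for illustration).
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 0.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.3],
              [3.1, 3.0, 0.6]])

pca = PCA(n_components=2)        # keep the two most informative directions
X_low = pca.fit_transform(X)     # project 3-D data into 2-D

# Each ratio is the fraction of the samples' variability explained by that axis.
print(pca.explained_variance_ratio_)
print(X_low)
```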