Advanced Techniques for Multivariate Data Analysis Using PYTHON. Predictive Models for Classification and Segmentation
About this ebook
This book develops multivariate predictive or dependency analysis techniques (supervised learning techniques in the modern language of Machine Learning) and more specifically classification techniques from a methodological point of view and from a practical point of view with applications through Python software. The following techniques are studied in depth: Generalised Linear Models (Logit, Probit, Count and others), Decision Trees, Discriminant Analysis, K-Nearest Neighbour (kNN), Support Vector Machine (SVM), Naive Bayes, Ensemble Methods (Bagging, Boosting, Voting, Stacking, Blending and Random Forest), Neural Networks, Multilayer Perceptron, Radial Basis Networks, Hopfield Networks, LSTM Networks, RNN Recurrent Networks, GRU Networks and Neural Networks for Time Series Prediction. These techniques are a fundamental support for the development of Artificial Intelligence.
Advanced Techniques for Multivariate Data Analysis Using PYTHON. Predictive Models for Classification and Segmentation - César Pérez López
1.1 INTRODUCTION TO MULTIVARIATE ANALYSIS
When researchers have many variables measured or observed on a usually very large collection of individuals and intend to study them together, they turn to Multivariate Data Analysis. They face a variety of techniques and must select the most appropriate for their data and, above all, for their scientific objective.
The researcher will have to consider whether all variables carry equivalent importance, i.e. whether no single variable stands out as the main dependent variable for the research objective. If so, because the data are simply a set of different aspects observed and collected in the sample, they can be treated en bloc with what might be called descriptive multivariate or interdependence analysis techniques (unsupervised learning techniques in the modern parlance of Machine Learning).
If an equivalent importance of the variables would not be scientifically acceptable, because some variable stands out as the main dependent variable in the objective of the research, the researcher will have to use multivariate predictive or dependency analysis techniques (supervised learning techniques in the modern language of Machine Learning), considering the dependent variable as the variable explained by the other explanatory (independent) variables, and trying to relate all the variables by means of an equation or model that links them. The method of choice would then be Regression, generally with all the variables quantitative. If the dependent variable is a dichotomous qualitative variable (1/0; yes or no), it can be used as a classifier, studying its relationship with the rest of the classifying variables by means of a Generalised Linear Model. If the observed qualitative dependent variable records the assignment of each individual to previously defined groups (two, or more than two), it can be used to classify new cases for which the group they probably belong to is unknown; in that case we are dealing with techniques such as Decision Trees, Neural Networks, k-Nearest Neighbour (kNN) or Naive Bayes, which solve the important problem of assignment according to a quantitative profile of classificatory variables.
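As a small illustration of this classification workflow, the sketch below assigns new cases to one of two previously defined groups with k-Nearest Neighbour. It assumes scikit-learn is installed, and the two classifying variables and group profiles are simulated assumptions, not data from the book:

```python
# Hypothetical example: assigning cases to two predefined groups with kNN.
# The groups and their quantitative profiles are simulated assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two groups with clearly different profiles in two classifying variables
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # share of test cases assigned correctly
print(accuracy)
```

Because the groups are well separated, the classifier assigns almost all held-out cases to the correct group.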
1.2 CLASSIFICATION OF TECHNIQUES FROM A MACHINE LEARNING PERSPECTIVE
A more complete overview of multivariate data analysis techniques from a modern Machine Learning perspective would be as follows:
[Figure: overview of multivariate data analysis techniques from a modern Machine Learning perspective]
Chapter 2
GENERALISED LINEAR MODELS. DISCRETE CHOICE MODELS: LOGIT, PROBIT AND COUNT MODELS
2.1 SUPERVISED LEARNING: GENERALISED LINEAR MODEL
The generalised linear model extends the general linear model, so that the dependent variable y is linearly related to the factors and covariates through a link function g: if µi = E[yi], then ηi = g(µi) = xi′β.
2.1.1 Elements of a generalised linear model:
The response variables, y1,...,yn, are assumed to follow a common distribution that is a member of the exponential family.
A set of explanatory variables, x1,...,xp, and parameters β0, β1,...,βp, which define the linear predictor:

ηi = β0 + β1xi1 + ... + βpxip
A monotone link function g such that

g(µi) = ηi

where µi = E[yi].
We can also write the generalised linear model as:

g(E[y]) = Xβ
In addition, the model allows the dependent variable to have a non-normal distribution. The generalised linear model covers the most commonly used statistical models, such as linear regression for normally distributed responses, logistic models for binary data, log-linear models for count data, complementary log-log models for interval-censored survival data, as well as many other statistical models through the general formulation of the model itself.
The possibility of specifying a distribution for the dependent variable other than the normal, and a link function other than the identity, is the main improvement of the generalised linear model with respect to the general linear model. If the distribution of the dependent variable is normal and the link function is the identity, the generalised linear model reduces to the general linear model.
2.2 LIMITED DEPENDENT VARIABLE AND COUNT MODELS: LOGIT, PROBIT, POISSON AND NEGATIVE BINOMIAL
Discrete choice models directly predict the probability of an event that has two or more possible outcomes. Since the values of a probability lie between zero and one, the predictions made with discrete choice models must be bounded to fall in that range. The general model that satisfies this condition has the functional form:

Pi = Prob(yi = 1) = F(xi′β)
It is noted that if F is the distribution function of a random variable, then P varies between zero and one.
In the particular case in which the function F is the logistic distribution function, we are dealing with the Logit or Logistic Regression model, whose functional form is:

Pi = exp(xi′β) / (1 + exp(xi′β))
It is noted that:

Pi / (1 − Pi) = exp(xi′β)

The logit model can therefore also be expressed in the form:

Ln(Pi / (1 − Pi)) = xi′β
The link function turns out to be Ln(p / (1 − p)), which is called the logit link function and belongs to the binomial family.
In the particular case where the function F is the distribution function Φ of a unit normal, the Probit model has the following functional form:

Pi = Φ(xi′β)
The probit model can therefore also be expressed in the form:

Φ⁻¹(Pi) = xi′β

where Φ is the distribution function of a normal (0,1).
The link function turns out to be Φ⁻¹(p), which is called the probit link function and also belongs to the binomial family.
On the other hand, count data models are those whose dependent variable is a discrete variable taking a finite or countably infinite set of non-negative integer values. Poisson and Negative Binomial regression models are the most common of this type.
The Poisson regression model assumes that each yi is a realisation of a random variable with a Poisson distribution of parameter λi, and that this parameter is related to the vector of regressors xi. The basic equation of the model is:

Prob(yi = k) = e^(−λi) λi^k / k!,   k = 0, 1, 2,...
The most common formulation of λ is log-linear, i.e.:

Ln(λ) = Xβ ⇔ λ = exp(Xβ)
Therefore the link function is Ln(λ), called the log (logarithmic) link function, and belongs to the Poisson family.
The Negative Binomial regression model assumes that each yi is a realisation of a random variable with a Negative Binomial distribution of parameters µ and k. The probability function of this distribution is:

P(Y = y) = [Γ(y + k) / (Γ(k) y!)] (k / (k + µ))^k (µ / (µ + k))^y,   y = 0, 1, 2,...

It has E(Y) = µ and Var(Y) = µ + µ²/k.
The parameter 1/k is a dispersion parameter, so if 1/k → 0 then Var(Y ) → µ and the negative binomial distribution converges to a Poisson distribution.
On the other hand, for a fixed value of k this distribution belongs to the natural exponential family, so a negative binomial GLM can be defined. In general, a logarithmic link function is used.
2.3 DISTRIBUTIONS OF THE EXPONENTIAL FAMILY
The random variable y is said to be a member of the exponential family of distributions if its probability density function, f(y; θ), can be expressed as:

f(y; θ) = exp[a(y) b(θ) + c(θ) + d(y)]

If a(y) = y, the above distribution is said to be in its canonical form, and b(θ) is called the natural parameter of the distribution.
Let y be a random variable with probability density function f(y; θ), a member of the exponential family. Then, using the natural parametrisation, we can write:

f(y; θ, φ) = exp[(yθ − b(θ)) / a(φ) + c(y, φ)]

where θ is the natural or canonical location parameter and φ the dispersion parameter.
Below is a schematic showing, for each distribution belonging to the exponential family, its elements in terms of the general probability density of the exponential family.
For each of these distributions a Generalised Linear Model belonging to the corresponding distribution family, and with the canonical link function associated with θ, can be defined.
2.4 DISCRETE CHOICE MODELS
The functional expression of the multiple regression analysis model is y = β0 + β1x1 + ... + βkxk + u. Multiple regression admits the possibility of working with discrete rather than continuous dependent variables, to allow the modelling of discrete phenomena. When the dependent variable is a discrete variable reflecting individual decisions, in which the choice set consists of separate and mutually exclusive alternatives, we are dealing with discrete choice models. When the dependent variable is discrete and takes only a small number of values, it does not make sense to treat it as if it were continuous, and it is usually of interest to characterise the probability that an agent takes a certain discrete decision, conditional on the values of certain explanatory variables. The distribution functions that characterise these probabilities for each value of the explanatory variables are usually non-linear and do not usually have an analytical solution, so it is usually necessary to resort to numerical methods. Discrete choice models in which the choice set has only two possible alternatives are called binary choice models. When the choice set has several discrete values, they are called multiple choice or multinomial models.
Discrete choice models are called count data models when the values of the discrete dependent variable are numbers that do not reflect categories. In case the numerical values of the discrete dependent variable reflect categories, the models are called categorical discrete choice models, and are usually classified into ordered categorical discrete choice models (the numerical values have no quantitative meaning and reflect an ordering of categories) and unordered categorical discrete choice models (the numerical values reflect only categories).
2.5 BINARY DISCRETE CHOICE MODELS
Within the discrete choice models in which the choice set has only two mutually exclusive possible alternatives, we will consider the linear probability model, the Logit model and the Probit model.
2.5.1 LPM (Linear Probability Model)
We start from the usual linear regression model:

Y = β0 + β1X1 + ... + βkXk + u

one of whose hypotheses is:

E(u | X1,...,Xk) = 0

which leads us to write the model as:

E(Y | X1,...,Xk) = β0 + β1X1 + ... + βkXk

But in the case of discrete choice models where the choice set has only two mutually exclusive possible alternatives, Y is a Bernoulli random variable of parameter p, which allows us to write:

P(Y = 1 | X1,...,Xk) = E(Y | X1,...,Xk) = β0 + β1X1 + ... + βkXk
We are now dealing with the linear probability model, where, for example, β1 measures the variation in the probability of success (Y = 1) for a unit variation in X1 (all else constant).
As Y is a Bernoulli random variable:

E(Y) = p,   V(Y) = p(1 − p)

We have then, for each observation, V(ui) = pi(1 − pi), since Y is a Bernoulli random variable.
This is a model with heteroscedasticity, because the error variance is not constant: for each value of X1,...,Xk the error variance V(u) takes a different value. Moreover, Y is a Bernoulli variable, so the normality hypothesis is not satisfied either. This makes it necessary to estimate these models by an alternative method to ordinary least squares, for example maximum likelihood estimators or generalised least squares.
After estimating the linear probability model we have that the fitted value:

Ŷ = b0 + b1X1 + ... + bkXk   (where bj denotes the estimate of βj)

can be interpreted as an estimate of the probability of success (that Y = 1). In some applications it makes sense to interpret b0 as the probability of success when all Xj are 0.
Another important limitation of the linear probability model is that, for certain combinations of the explanatory variables X1,...,Xk, the estimated probabilities may be less than zero or greater than one.
2.5.2 Binary Logit and Probit models: maximum likelihood estimation
We can consider Logit and Probit models as binary response models:

P(Y = 1 | X) = G(Xβ)

which, to avoid the problems of the linear probability model, are specified through a function G that takes values strictly between 0 and 1 (0 < G(z) < 1 for all real numbers z). The different definitions of G give rise to the different binary choice models.
In the Logit model, G is the logistic distribution function, whose expression is:

G(z) = exp(z) / (1 + exp(z))
In the case of the Probit model, G is the distribution function Φ of the standard normal:

G(z) = Φ(z) = ∫ from −∞ to z of φ(v) dv

where φ is the density function of the normal (0,1). The expression of the Probit model will be:

P(Y = 1 | X) = Φ(Xβ)
The Probit and Logit models, as they are non-linear models, cannot be estimated by OLS and maximum likelihood methods must be used.
Suppose we have n identically and independently distributed observations (a random sample) that follow the model:

P(yi = 1 | xi) = G(xi′β)

To obtain the maximum likelihood (ML) estimator, conditional on the explanatory variables, we need the likelihood function:

L(β) = ∏i [G(xi′β)]^yi [1 − G(xi′β)]^(1 − yi)

with G the logistic or standard normal distribution function, as appropriate.
The ML estimator of β is the one that maximises the logarithm of the likelihood function:

ln L(β) = Σi { yi Ln G(xi′β) + (1 − yi) Ln[1 − G(xi′β)] }

which will be a consistent, asymptotically normal and asymptotically efficient estimator.
The first-order conditions are:

Σi [ yi − G(xi′β) ] g(xi′β) xi / { G(xi′β) [1 − G(xi′β)] } = 0

where g(·) is the corresponding density function (normal or logistic).
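These first-order conditions have no closed-form solution, so the log-likelihood is maximised numerically. The sketch below does this for the Logit case with scipy, on simulated data with assumed coefficients:

```python
# Sketch: ML estimation of a binary Logit by numerical maximisation of ln L.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit             # logistic cdf G(z)

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, expit(X @ np.array([0.2, 1.0])))  # assumed true beta

def neg_loglik(beta):
    # minus the log-likelihood; probabilities clipped to avoid log(0)
    p = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = minimize(neg_loglik, x0=np.zeros(2)).x
print(beta_hat)  # estimates near (0.2, 1.0)
```

For the Probit case, the same code works with `expit` replaced by the standard normal distribution function.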