Predictive Analytics
Poor quality data, such as data with missing values or outliers, can negatively impact
the accuracy of your models.
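For example, you can screen for missing values and outliers before training. A minimal pandas sketch (the tiny table and the 1.5 * IQR outlier rule are illustrative assumptions, not part of this text):

import pandas as pd

# Hypothetical table with one missing value and one obvious outlier.
df = pd.DataFrame({"age": [25, 31, None, 29, 410],
                   "income": [40, 52, 48, 50, 47]})

print(df.isna().sum())                 # missing values per column

# Flag rows whose age falls outside the 1.5 * IQR fences.
q1, q3 = df["age"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
print(df[(df["age"] < q1 - fence) | (df["age"] > q3 + fence)])

clean = df.dropna()                    # or impute, e.g. df.fillna(df.median())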
Overfitting occurs when your model is too complex and fits the training data too
closely, capturing noise rather than the underlying pattern. This can result in a model
that performs well on the training data but fails to generalize to new data.
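A minimal NumPy sketch of the effect, assuming noisy samples of a sine curve: the high-degree polynomial fits the training points more closely, yet its error on unseen data is worse.

import numpy as np

rng = np.random.default_rng(0)

# Noisy training samples of a simple underlying curve.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)        # unseen data from the same curve
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")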
Model interpretability can also be an issue if your model is too complex. This makes it
challenging for you to understand how it arrived at its predictions.
Selection bias can occur if your training data is not representative of the population
being studied. This can lead to inaccurate predictions and unfair outcomes.
Unforeseen changes in the future can render your model inaccurate since it is based on
historical data. Unexpected changes can be especially problematic for models that are
used for long-term predictions.
Examples of predictive modeling
Figure: scatter plot with blue data points and a red linear regression line, showing a
positive correlation.
Neural network
Neural network models are a type of predictive modeling
technique inspired by the structure and function of the
human brain.
The goal of these models is to learn complex
relationships between input variables and output
variables, and use that information to make predictions.
Neural network models are often used in fields such as
image recognition, natural language processing, and
speech recognition, for tasks such as object
recognition, sentiment analysis, and speech transcription.
Neural network model algorithms:
Multilayer Perceptron (MLP) consists of multiple layers of nodes, including an input layer, one or
more hidden layers, and an output layer. The nodes in each layer perform a mathematical
operation on the input data, with the output of one layer serving as the input for the next layer.
The weights between the nodes are adjusted during training using backpropagation to minimize
the error between the predicted output and the actual output. MLP is a versatile algorithm that
can be used for a wide range of predictive modeling tasks, including classification, regression, and
pattern recognition.
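As an illustrative sketch (not a prescribed implementation), scikit-learn's MLPClassifier fits exactly this kind of network; the digits dataset and the single 64-node hidden layer are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 nodes; fit() adjusts the weights between layers
# with backpropagation to minimize the prediction error.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(f"test accuracy: {mlp.score(X_test, y_test):.3f}")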
Convolutional neural networks (CNN) are commonly used for image recognition tasks, with each
layer processing increasingly complex features of the image.
Recurrent neural networks (RNN) are used for sequential data, such as natural language
processing, and incorporate feedback loops that allow previous output to be used as input for the
next prediction.
Long Short-Term Memory (LSTM) is a type of RNN that addresses the vanishing gradient problem
and is particularly useful for learning long-term dependencies in sequential data.
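A minimal sketch of a CNN and an LSTM in PyTorch; the layer sizes and input shapes are illustrative assumptions:

import torch
import torch.nn as nn

# Minimal CNN for 28 x 28 grayscale images: convolution and pooling
# layers extract local image features before a final linear classifier.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),
)

# Minimal LSTM classifier: the hidden state carries information across
# time steps, which is what lets it learn long-term dependencies.
class SequenceClassifier(nn.Module):
    def __init__(self, n_features, n_classes, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # classify from the last step

print(cnn(torch.randn(2, 1, 28, 28)).shape)                  # (2, 10)
print(SequenceClassifier(5, 3)(torch.randn(2, 7, 5)).shape)  # (2, 3)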
Backpropagation is a common algorithm used to train neural networks
by adjusting the weights between nodes in the network based on the
error between the predicted output and the actual output.
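In the simplest case of a single weight vector with a sigmoid output, the idea reduces to computing the error gradient and stepping the weights against it. A NumPy sketch on made-up data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)   # made-up labels

w = np.zeros(2)
for _ in range(200):
    pred = 1 / (1 + np.exp(-(X @ w)))   # forward pass (sigmoid output)
    grad = X.T @ (pred - y) / len(y)    # gradient of the error w.r.t. w
    w -= 0.5 * grad                     # step the weights against the error
print(w)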
Decision tree models can be used for both classification and regression tasks. In a
classification tree, the target variable is categorical, while in a regression tree, the
target variable is continuous. Decision tree models are easy to interpret and
visualize, making them useful for understanding the relationships between predictor
variables and the target variable. However, they can be prone to overfitting and may
not perform as well as other predictive modeling techniques on complex datasets.
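An illustrative scikit-learn sketch (the iris dataset and the depth cap are arbitrary choices); export_text prints the fitted tree as if-then rules, which is the interpretability mentioned above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Capping the depth is one simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
print(export_text(tree))    # the fitted tree as readable if-then rules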
Decision tree model algorithms:
• CART (Classification and Regression Tree) can be used for both classification and regression tasks. It
uses Gini impurity as a measure of the quality of a split, aiming to minimize it. CART constructs
binary trees, where each non-leaf node has two children.
• CHAID (Chi-squared Automatic Interaction Detection) is used for categorical variables and constructs
trees based on chi-squared tests to determine the most significant associations between the
predictor variables and the target variable. It can handle both nominal and ordinal categorical
variables.
• ID3 (Iterative Dichotomiser 3) is used to build decision trees for classification tasks. It selects the
attribute with the highest information gain at each node to split the data into subsets. Information
gain is calculated based on the entropy of the subsets.
• C4.5 is an extension of the ID3 algorithm that can handle both categorical and continuous variables.
It uses information gain ratio to select the splitting attribute, which takes into account the number of
categories and their distribution in the subsets.
These algorithms use various criteria to determine the optimal split at each node, such as
information gain, the Gini index, or a chi-squared test; a small worked example of two of
these criteria follows.
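Computed from scratch on a made-up split of eight labels (the data is illustrative):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])     # 4 of each class
left, right = parent[:6], parent[6:]            # a candidate 6 / 2 split

# Information gain: parent entropy minus the weighted child entropies.
gain = entropy(parent) \
    - (len(left) / len(parent)) * entropy(left) \
    - (len(right) / len(parent)) * entropy(right)
print(f"parent Gini {gini(parent):.3f}, information gain {gain:.3f}")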
Ensemble model algorithms
Boosting involves creating multiple weak models sequentially, where each model
tries to correct the errors of the previous model. Boosting is used to reduce the
bias of a single model and improve its accuracy.
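As one concrete instance, a sketch using scikit-learn's AdaBoostClassifier, a classic boosting algorithm; the synthetic dataset is an illustrative assumption:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each shallow tree (a weak model) is fitted to reweighted data so that
# it focuses on the examples the previous trees misclassified.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y).mean())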
Stacking involves training multiple models and using their predictions as input to
a meta-model, which then makes the final prediction. Stacking is used to combine
the strengths of multiple models and achieve better performance.
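A sketch using scikit-learn's StackingClassifier; the base models and the logistic-regression meta-model are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Base models produce predictions; a logistic-regression meta-model
# learns how to weigh those predictions into the final answer.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)
print(cross_val_score(stack, X, y).mean())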
Random Forest is an extension of bagging (bootstrap aggregating, in which base
models are trained on random samples of the training data and their outputs are
combined) that uses decision trees as the base models. Random Forest creates
multiple decision trees on different subsets of the training data, and then
aggregates their predictions to make the final prediction.
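A sketch using scikit-learn's RandomForestClassifier on synthetic data; the number of trees is an arbitrary choice:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is grown on a bootstrap sample of the rows and considers a
# random subset of features at each split; the forest votes over trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y).mean())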