Deep Learning and Neural Networks
2. A base model (suppose a decision tree) is fitted on 9 parts and predictions are made
for the 10th part. This is done for each part of the train set.
3. The base model (in this case, decision tree) is then fitted on the whole train dataset.
4. Using this model, predictions are made on the test set.
5. Steps 2 to 4 are repeated for another base model (say knn) resulting in another set
of predictions for the train set and test set.
6. The predictions from the train set are used as features to build a new model.
7. This model is used to make final predictions on the test prediction set.
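The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the two base models (`fit_stump`, a one-split stand-in for the decision tree, and `fit_nearest_neighbour`, a 1-nearest-neighbour stand-in for knn), the helper names and the toy data are all hypothetical, and the train set is assumed to be split into 10 parts by index.

```python
import random

def fit_nearest_neighbour(X, y):
    """Hypothetical base model: 1-nearest-neighbour classifier."""
    def predict(rows):
        labels = []
        for r in rows:
            dists = [sum((a - b) ** 2 for a, b in zip(r, x)) for x in X]
            labels.append(y[dists.index(min(dists))])
        return labels
    return predict

def fit_stump(X, y):
    """Hypothetical base model: a one-split decision tree on feature 0."""
    threshold = sum(row[0] for row in X) / len(X)
    def most_common(labels):
        return max(set(labels), key=labels.count) if labels else y[0]
    left = most_common([l for row, l in zip(X, y) if row[0] <= threshold])
    right = most_common([l for row, l in zip(X, y) if row[0] > threshold])
    return lambda rows: [left if r[0] <= threshold else right for r in rows]

def out_of_fold_predictions(fit, X, y, k):
    """Step 2: fit on k-1 parts and predict the held-out part, for each part."""
    preds = [None] * len(X)
    for fold in range(k):
        held = [i for i in range(len(X)) if i % k == fold]
        rest = [i for i in range(len(X)) if i % k != fold]
        model = fit([X[i] for i in rest], [y[i] for i in rest])
        for i, p in zip(held, model([X[i] for i in held])):
            preds[i] = p
    return preds

def stack(base_fits, meta_fit, X_train, y_train, X_test, k=10):
    train_cols, test_cols = [], []
    for fit in base_fits:                                  # step 5: repeat per base model
        train_cols.append(out_of_fold_predictions(fit, X_train, y_train, k))  # step 2
        full_model = fit(X_train, y_train)                 # step 3
        test_cols.append(full_model(X_test))               # step 4
    meta_train = [list(row) for row in zip(*train_cols)]   # predictions become features
    meta_test = [list(row) for row in zip(*test_cols)]
    meta_model = meta_fit(meta_train, y_train)             # step 6
    return meta_model(meta_test)                           # step 7

rng = random.Random(0)
# Toy data: class 0 clusters near (0, 0), class 1 near (5, 5).
X_train = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)] for c in (0, 5) for _ in range(20)]
y_train = [0] * 20 + [1] * 20
X_test = [[0.2, 0.1], [4.8, 5.1]]

predictions = stack([fit_stump, fit_nearest_neighbour], fit_nearest_neighbour,
                    X_train, y_train, X_test)
print(predictions)  # [0, 1]
```

The new model in step 6 never sees a prediction that a base model made on data it was trained on, which is the point of predicting each part from the other 9.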
What Are Random Forests?
As the name suggests, a forest is a collection of many trees. And that’s true for
random forests as well – they are a collection of many individual decision trees. In
machine learning, this is referred to as ensemble modelling. In general, ensemble
methods use multiple learning algorithms to obtain better predictive performance
than any of the constituent learning algorithms alone. So, in our case, the collection
of decision trees as a whole unit behaves much better than any stand-alone decision
tree. In short, random forests rely on combining the results of many different decision
trees.
The random forest algorithm is one of the non-neural-network models that often
achieve high accuracy on both regression and classification tasks. And while decision
trees do provide great interpretability, when it comes down to performance, they
usually lose against random forests. In fact, unless transparency of the model is a
priority, many data scientists and analysts will choose random forests over single
decision trees.
How Do Random Forests Work?
Through a process called bootstrapping (sampling the original data with replacement),
we can create many different datasets that share the general structure of the original
dataset. And what can we do with similar, though slightly different datasets? Well, we
can train ML algorithms with them and obtain similar, yet slightly different models.
Since we are talking about random forests, those models would be, of course,
decision trees. And so, we have, let’s say, a hundred decision trees trained on a
hundred bootstrap samples. What do we do with them?
Well, we aggregate them. In other words, we combine all the different outputs into
one, usually through a majority voting system. Therefore, the most common output
among the different decision trees is the one we consider the final output.
So, for example, if 30 models say that the input corresponds to a rocket, 26 say it is a
plane, and 44 determine that it corresponds to a car, the final output would be a car.
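The vote in this example can be reproduced in a few lines. The sketch below is only an illustration; the function name `majority_vote` is an assumption, and the list of tree outputs is built by hand to match the counts in the text.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# One predicted label per tree, matching the counts in the example above.
tree_outputs = ["rocket"] * 30 + ["plane"] * 26 + ["car"] * 44
final_output = majority_vote(tree_outputs)
print(final_output)  # car
```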
That ensemble is called bagged decision trees, “bagged” being short for bootstrap
aggregating. The advantages of having many different trees mainly come in the form
of reducing overfitting. Since each tree sees different data, it is harder for the
ensemble to model every individual datapoint, noise included, which is what
overfitting is. However, there’s something more this collection needs to become a
random forest.
Random forests employ one additional crucial detail: they don’t consider all the
features at once. Just as every decision tree is trained with a different dataset, it is also
trained with a random subset of the input features (in most implementations, a fresh
subset is drawn at every split).
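The two sources of randomness can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names and the toy rows are hypothetical, no tree is actually trained, and the feature subset is drawn once per tree as the text describes, whereas library implementations typically draw it again at each split.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, n_selected, rng):
    """Pick n_selected distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), n_selected))

rng = random.Random(42)          # fixed seed so the sketch is repeatable
rows = [[5.1, 3.5, 1.4, 0.2],    # toy data: 4 features per row
        [4.9, 3.0, 1.4, 0.2],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3.0, 5.1, 1.8]]

for tree_id in range(3):
    sample = bootstrap_sample(rows, rng)
    features = random_feature_subset(n_features=4, n_selected=2, rng=rng)
    # Each tree would be trained only on these rows and these columns.
    tree_data = [[row[j] for j in features] for row in sample]
    print(tree_id, features, tree_data)
```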
How it works
Random forest algorithms have three main hyperparameters, which need to be set
before training. These include node size, the number of trees, and the number of
features sampled. From there, the random forest can be used to solve regression or
classification problems.
The random forest algorithm is made up of a collection of decision trees, and each
tree in the ensemble is built from a data sample drawn from the training set with
replacement, called the bootstrap sample. Roughly one-third of the training
observations are not drawn into a given bootstrap sample; they are set aside as test
data for that tree, known as the out-of-bag (oob) sample, which we’ll come back to
later. Another instance of randomness is then injected through feature bagging,
adding more diversity to the dataset and reducing the correlation among decision
trees. Depending on the type of problem, the determination of the prediction will
vary. For a regression task, the predictions of the individual decision trees are
averaged, and for a classification task, a majority vote (i.e. the most frequently
predicted class) yields the predicted class. Finally, the oob sample is used as a
built-in validation set, similar in spirit to cross-validation, to estimate how well the
forest generalizes.
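A short sketch makes the "one-third" figure and the two aggregation rules concrete. It is only an illustration: the function names and sample values are assumptions, and no trees are trained. When n rows are drawn with replacement from n rows, the share of rows never drawn approaches 1/e (about 0.37) as n grows.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_indices(n, rng):
    """Indices of a bootstrap sample: n draws with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, sampled):
    """Indices that were never drawn; these form the oob sample."""
    drawn = set(sampled)
    return [i for i in range(n) if i not in drawn]

def aggregate(tree_predictions, task):
    """Average for regression, majority vote for classification."""
    if task == "regression":
        return mean(tree_predictions)
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
n = 10_000
oob = out_of_bag(n, bootstrap_indices(n, rng))
oob_fraction = len(oob) / n   # close to 1/e, i.e. roughly one-third
print(round(oob_fraction, 2))

print(aggregate([2.0, 3.0, 4.0], "regression"))              # 3.0
print(aggregate(["spam", "ham", "spam"], "classification"))  # spam
```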
Another great quality of the random forest algorithm is that it is very easy to measure
the relative importance of each feature on the prediction.
In a decision tree, each internal node represents a ‘test’ on an attribute (e.g., whether a
coin flip comes up heads or tails), each branch represents the outcome of the test, and
each leaf node represents a class label (decision taken after computing all attributes).
A node that has no children is a leaf.
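As a small illustration of this structure, a tree can be written as nested dictionaries and evaluated by walking from the root to a leaf. The attribute names, labels and the `classify` helper below are made up for the sketch.

```python
def classify(node, sample):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:                  # internal node: apply its test
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]        # follow the matching branch
    return node["label"]

# Hypothetical tree: attribute names and labels are made up for illustration.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "windy",
                  "branches": {True: {"label": "stay in"},
                               False: {"label": "play"}}},
        "rainy": {"label": "stay in"},
    },
}
print(classify(tree, {"outlook": "sunny", "windy": False}))  # play
```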
By looking at the feature importance you can decide which features to possibly drop
because they don’t contribute enough (or sometimes nothing at all) to the prediction
process. This is important because a general rule in machine learning is that the more
features you have the more likely your model will suffer from overfitting and vice
versa.
Below is a table and visualization showing the importance of 13 features, which are
used during a supervised classification project with the famous Titanic dataset on
Kaggle.
Benefits and challenges of random forest
There are a number of key advantages and challenges that the random forest
algorithm presents when used for classification or regression problems. Some of them
include:
Key Benefits
Reduced risk of overfitting: Decision trees run the risk of overfitting as they
tend to tightly fit all the samples within training data. However, when there’s a robust
number of decision trees in a random forest, the classifier is much less likely to
overfit, since the averaging of uncorrelated trees lowers the overall variance and
prediction error.
Provides flexibility: Since random forest can handle both regression and
classification tasks with a high degree of accuracy, it is a popular method among data
scientists. Feature bagging also makes the random forest classifier an effective tool
for estimating missing values as it maintains accuracy when a portion of the data is
missing.
Easy to determine feature importance: Random forest makes it easy to
evaluate variable importance, or contribution, to the model. There are a few ways to
evaluate feature importance. Gini importance, also known as mean decrease in
impurity (MDI), measures how much the splits on a given variable reduce node
impurity, averaged over all trees. Permutation importance, also known as mean
decrease accuracy (MDA), is another importance measure. MDA identifies the
average decrease in accuracy when the values of a feature are randomly permuted in
the oob samples.
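The permutation idea can be sketched without any library. The example below is a simplified illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained forest, the data is synthetic, and the shuffling is done on one evaluation set rather than on per-tree oob samples.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, rng):
    """Decrease in accuracy after shuffling each feature column in turn."""
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(predict, X_perm, y))
    return importances

rng = random.Random(1)
# Toy data: the label depends only on feature 0; feature 1 is pure noise.
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def toy_model(row):
    """Hypothetical stand-in for a trained forest."""
    return int(row[0] > 0.5)

importances = permutation_importance(toy_model, X, y, rng)
print(importances)
```

Shuffling feature 0 destroys the relationship the model relies on, so accuracy drops sharply; shuffling feature 1 changes nothing, so its importance is zero.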
Key Challenges
The random forest algorithm has been applied across a number of industries, allowing
them to make better business decisions. Some use cases include:
How Does Deep Learning Work?
The individual layers of neural networks can also be thought of as a sort of filter
that works from coarse to fine, which increases the likelihood of detecting and
outputting a correct result. The human brain works similarly. Whenever we
receive new information, the brain tries to compare it with known objects. The
same concept is also used by deep neural networks.
Many of the advances in artificial intelligence in recent years are due to deep
learning. Without deep learning, we would not have self-driving cars, chatbots
or personal assistants like Alexa and Siri. Google Translate would continue to be
as primitive as it was before Google switched to neural networks and Netflix
would have no idea which movies to suggest. Neural networks are behind all of
these deep learning applications and technologies.
Deep learning programs have multiple layers of interconnected nodes, with each
layer building upon the last to refine and optimize predictions and classifications.
Deep learning performs nonlinear transformations on its input and uses what it learns
to create a statistical model as output. Iterations continue until the output has reached
an acceptable level of accuracy. The number of processing layers through which data
must pass is what inspired the label deep.
Initially, the computer program might be provided with training data -- a set of
images for which a human has labeled each image dog or not dog with metatags. The
program uses the information it receives from the training data to create a feature set
for dog and build a predictive model. In this case, the model the computer first
creates might predict that anything in an image that has four legs and a tail should be
labeled dog. Of course, the program is not aware of the labels four legs or tail. It
simply looks for patterns of pixels in the digital data. With each iteration, the
predictive model becomes more complex and more accurate.
Unlike a toddler, who takes weeks or even months to understand the concept of
dog, a computer program that uses deep learning algorithms can be shown a training
set and sort through millions of images, accurately identifying which images have
dogs in them, within a few minutes.
To achieve an acceptable level of accuracy, deep learning programs require access to
immense amounts of training data and processing power, neither of which was easily
available to programmers until the era of big data and cloud computing. Because
deep learning programming can create complex statistical models directly from its
own iterative output, it is able to create accurate predictive models from large
quantities of unlabeled, unstructured data.
NO FEATURE EXTRACTION
The first advantage of deep learning over classical machine learning is that the
so-called feature extraction step becomes redundant.
The result of feature extraction is a representation of the given raw data that
these classic machine learning algorithms can use to perform a task. For
example, we can now classify the data into several categories or classes. Feature
extraction is usually quite complex and requires detailed knowledge of the
problem domain. This preprocessing layer must be adapted, tested and refined
over several iterations for optimal results.
Deep learning’s artificial neural networks don’t need the feature extraction step.
The layers are able to learn an implicit representation of the raw data directly
and on their own.
Here’s how it works: A more and more abstract and compressed representation
of the raw data is produced over several layers of an artificial neural net. We then
use this compressed representation of the input data to produce the result. The
result can be, for example, the classification of the input data into different
classes.
In other words, we can say that the feature extraction step is already part of the
process that takes place in an artificial neural network.
During the training process, this neural network optimizes this step to obtain the
best possible abstract representation of the input data. This means that deep
learning models require little to no manual effort to perform and optimize the
feature extraction process.
Let’s look at a concrete example. If you want to use a machine learning model to
determine whether a particular image shows a car or not, we humans first need to
identify the unique features of a car (shape, size, windows, wheels, etc.), then
extract these features and give them to the algorithm as input data. In this way, the
algorithm would perform a classification of the images. That is, in machine
learning, a programmer must intervene directly for the model to come to a
conclusion.
In the case of a deep learning model, the feature extraction step is completely
unnecessary. The model would recognize these unique characteristics of a car
and make correct predictions without human intervention.
The second huge advantage of deep learning, and a key part of understanding
why it’s becoming so popular, is that it’s powered by massive amounts of data.
The era of big data will provide huge opportunities for new innovations in deep
learning. But don’t take my word for it. Andrew Ng, the chief scientist of China’s
major search engine Baidu, co-founder of Coursera and one of the leaders of the
Google Brain Project, puts it this way:
AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. If
you have a large engine and a tiny amount of fuel, you won’t make it to orbit. If
you have a tiny engine and a ton of fuel, you can’t even lift off. To build a rocket
you need a huge engine and a lot of fuel.
The analogy to deep learning is that the rocket engine is the deep learning
model and the fuel is the huge amount of data we can feed to these algorithms.
Deep learning models tend to increase their accuracy with the increasing amount
of training data, whereas traditional machine learning models such as SVM and
naive Bayes classifiers tend to stop improving after a saturation point.
How Do Deep Learning Neural Networks Work?
BIOLOGICAL NEURAL NETWORKS
Artificial neural networks are inspired by the biological neurons found in our
brains. In fact, artificial neural networks simulate some basic functionalities
of biological neural networks, but in a very simplified way. Let’s first look at
biological neural networks to derive parallels to artificial neural networks. In
short, a biological neural network consists of numerous neurons.
A typical neuron consists of a cell body, dendrites and an axon. Dendrites are
thin structures that emerge from the cell body. An axon is a cellular extension
that emerges from this cell body. Most neurons receive signals through the
dendrites and send out signals along the axon.
At the majority of synapses, signals cross from the axon of one neuron to the
dendrite of another. All neurons are electrically excitable due to the maintenance
of voltage gradients in their membranes. If the voltage changes by a large enough
amount over a short interval, the neuron generates an electrochemical pulse
called an action potential. This potential travels rapidly along the axon and
activates synaptic connections.
ARTIFICIAL NEURAL NETWORKS
Now that we have a basic understanding of how biological neural networks
function, let’s take a look at the architecture of the artificial neural network.
When an artificial neural network learns, the weights between neurons change,
and with them the strength of the connections. What does that mean? Given
training data and a particular task such as the classification of handwritten
numbers, we are looking for a certain set of weights that allows the neural
network to perform the classification.
The set of weights is different for every task and every data set. We cannot
predict the values of these weights in advance, but the neural network has to
learn them. The process of learning is what we call training.
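To make "learning the weights" tangible, here is a minimal sketch that trains a single artificial neuron. It uses the classic perceptron update rule as a stand-in; this is an assumption chosen for brevity, since deep networks are normally trained with gradient-based methods, and the toy task (logical AND) and all names are illustrative.

```python
def predict(weights, bias, x):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train(samples, learning_rate=1.0, max_epochs=100):
    """Adjust the weights after every mistake until all samples are correct."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:      # every training sample is classified correctly
            break
    return weights, bias

# Toy task: output 1 only when both inputs are 1 (logical AND).
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(samples)
print([predict(weights, bias, x) for x, _ in samples])  # [0, 0, 0, 1]
```

The weights start at zero and are not set by hand; they emerge from repeated corrections on the training data, which is the sense in which the network "learns" them.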
Deep Learning Neural Network Architecture
The typical neural network architecture consists of several layers; we call the
first one the input layer.
The input layer receives the input x (i.e. the data from which the neural network learns).
In our previous example of classifying handwritten numbers, these inputs x
would represent the images of these numbers (x is basically an entire vector
where each entry is a pixel).
The input layer has the same number of neurons as there are entries in the
vector x. In other words, each input neuron represents one element in the vector.
The last layer is called the output layer, which outputs a vector y representing the
neural network’s result. The entries in this vector represent the values of the
neurons in the output layer. In our classification, each neuron in the last layer
represents a different class.
In this case, the value of an output neuron gives the probability that the
handwritten digit given by the features x belongs to the corresponding class
(one of the digits 0-9). As you can imagine, the number of output neurons must
equal the number of classes.
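One common way to obtain such probabilities is the softmax function; the text does not say which function is used, so treat softmax as an assumption of this sketch. The raw output values are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw output-neuron values into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Ten made-up raw values, one per output neuron (digits 0-9).
raw_outputs = [0.1, 0.3, 2.5, 0.0, -1.2, 0.4, 0.2, 1.1, -0.5, 0.0]
probabilities = softmax(raw_outputs)
predicted_digit = probabilities.index(max(probabilities))
print(predicted_digit)  # 2
```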
Consider a smaller neural network that consists of only two layers. The input layer
has two input neurons, while the output layer consists of three neurons.
As you can see in the picture, each connection between two neurons is represented by a
different weight w. Each of these weights w has two indices. The first index stands
for the number of the neuron in the layer from which the connection originates, the second
for the number of the neuron in the layer to which the connection leads.
All weights between two neural network layers can be represented by a matrix called the
weight matrix.
A weight matrix has the same number of entries as there are connections between neurons.
The dimensions of a weight matrix result from the sizes of the two layers that are connected
by this weight matrix.
The number of rows corresponds to the number of neurons in the layer from which the
connections originate and the number of columns corresponds to the number of neurons in
the layer to which the connections lead.
In this particular example, the number of rows of the weight matrix corresponds to the size
of the input layer, which is two, and the number of columns to the size of the output layer,
which is three.
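The 2-by-3 example can be written out directly. In this minimal sketch the weight values and the input are made up, and biases and activation functions are left out so that only the role of the weight matrix is visible.

```python
def layer_output(x, W):
    """Weighted sums for the next layer: z[j] = sum over i of x[i] * W[i][j]."""
    n_in, n_out = len(W), len(W[0])
    assert len(x) == n_in, "x needs one entry per input neuron"
    return [sum(x[i] * W[i][j] for i in range(n_in)) for j in range(n_out)]

# 2 rows (input neurons) x 3 columns (output neurons); values are made up.
# W[i][j] is the weight of the connection from input neuron i to output neuron j.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 2.0]

z = layer_output(x, W)
print(len(W), len(W[0]))  # 2 3
print(z)                  # [9.0, 12.0, 15.0]
```

The matrix has 2 x 3 = 6 entries, one for each connection between the two layers.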
Real-world deep learning applications are a part of our daily lives, but in most cases, they
are so well-integrated into products and services that users are unaware of the complex data
processing that is taking place in the background. Some of these examples include the
following:
Law enforcement: Deep learning algorithms can analyze and learn from transactional data
to identify dangerous patterns that indicate possible fraudulent or criminal activity. Speech
recognition, computer vision, and other deep learning applications can improve the
efficiency and effectiveness of investigative analysis by extracting patterns and evidence
from sound and video recordings, images, and documents, which helps law enforcement
analyze large amounts of data more quickly and accurately.
Financial services: Financial institutions regularly use predictive analytics to drive
algorithmic trading of stocks, assess business risks for loan approvals, detect fraud, and help
manage credit and investment portfolios for clients.
Customer service: Many organizations incorporate deep learning technology into their
customer service processes. Chatbots—used in a variety of applications, services, and
customer service portals—are a straightforward form of AI. Traditional chatbots use natural
language and even visual recognition, commonly found in call center-like menus. However,
more sophisticated chatbot solutions attempt to determine, through learning, if there are
multiple responses to ambiguous questions. Based on the responses it receives, the chatbot
then tries to answer these questions directly or route the conversation to a human user.
Virtual assistants like Apple's Siri, Amazon Alexa, or Google Assistant extend the idea of a
chatbot by enabling speech recognition functionality. This creates a new method to engage
users in a personalized way.
Healthcare: The healthcare industry has benefited greatly from deep learning capabilities
ever since the digitization of hospital records and images. Image recognition applications
can support medical imaging specialists and radiologists, helping them analyze and assess
more images in less time.