
MODULE I ARTIFICIAL INTELLIGENCE 9

Introduction, AI problems, foundations of AI and history of AI. Intelligent agents: agents
and environments, the concept of rationality, the nature of environments, structure of
agents, problem solving agents, problem formulation.

(a) Intelligence - The ability to apply knowledge in order to perform better in an environment.
(b) Artificial Intelligence - The study and construction of agent programs that perform well in a given environment, for a given agent architecture.
(c) Agent - An entity that takes action in response to percepts from an environment.
(d) Rationality - The property of a system which does the "right thing" given what it knows.
(e) Logical Reasoning - A process of deriving new sentences from old, such that the new sentences are necessarily true if the old ones are true.

Four Approaches of Artificial Intelligence:
➢ Acting humanly: the Turing test approach.
➢ Thinking humanly: the cognitive modelling approach.
➢ Thinking rationally: the laws of thought approach.
➢ Acting rationally: the rational agent approach.

FUTURE OF ARTIFICIAL INTELLIGENCE
• Transportation: Although it could take a decade or more to perfect them, autonomous cars will one day ferry us from place to place.
• Manufacturing: AI-powered robots work alongside humans to perform a limited range of tasks like assembly and stacking, and predictive-analysis sensors keep equipment running smoothly.
• Healthcare: In the comparatively AI-nascent field of healthcare, diseases are diagnosed more quickly and accurately, drug discovery is sped up and streamlined, virtual nursing assistants monitor patients, and big data analysis helps to create a more personalized patient experience.
• Education: Textbooks are digitized with the help of AI, early-stage virtual tutors assist human instructors, and facial analysis gauges the emotions of students to help determine who is struggling or bored and better tailor the experience to their individual needs.
• Media: Journalism is harnessing AI, too, and will continue to benefit from it. Bloomberg uses Cyborg technology to help make quick sense of complex financial reports. The Associated Press employs the natural language abilities of Automated Insights to produce 3,700 earnings report stories per year, nearly four times more than in the recent past.
• Customer Service: Last but hardly least, Google is working on an AI assistant that can place human-like calls to make appointments at, say, your neighborhood hair salon. In addition to words, the system understands context and nuance.
AGENTS AND THEIR TYPES

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
• Human sensors: eyes, ears, and other organs.
• Human actuators: hands, legs, mouth, and other body parts.
• Robotic sensors: microphones, cameras, and infrared range finders.
• Robotic actuators: motors, displays, speakers, etc.

An agent can be:
Human Agent: A human agent has eyes, ears, and other organs which work as sensors, and hands, legs, and the vocal tract which work as actuators.
Robotic Agent: A robotic agent can have cameras, infrared range finders, and NLP for sensors, and various motors for actuators.
Software Agent: A software agent can have keystrokes and file contents as sensory input, act on those inputs, and display output on the screen.

Hence the world around us is full of agents such as thermostats, cell phones, and cameras; even we ourselves are agents. Before moving forward, we should first know about sensors, effectors, and actuators.
Sensor: A sensor is a device which detects a change in the environment and sends the information to other electronic devices. An agent observes its environment through sensors.
Actuators: Actuators are the components of machines that convert energy into motion. The actuators are responsible for moving and controlling a system. An actuator can be an electric motor, gears, rails, etc.
Effectors: Effectors are the devices which affect the environment. Effectors can be legs, wheels, arms, fingers, wings, fins, and display screens.
MODULE II SEARCHING 9

Searching for solutions, uninformed search strategies - breadth first search, depth first
search. Search with partial information (heuristic search): greedy best first search, A*
search. Game playing: adversarial search, games, minimax algorithm, optimal decisions
in multiplayer games, alpha-beta pruning, evaluation functions, cutting off search.
Uninformed Search Algorithms:

The search algorithms in this section have no additional information on the goal node
other than the one provided in the problem definition. The plans to reach the goal
state from the start state differ only by the order and/or length of actions.
Uninformed search is also called blind search. These algorithms can only generate
the successors and differentiate between goal states and non-goal states.

The following uninformed search algorithms are discussed in this section.


1. Depth First Search
2. Breadth First Search
3. Uniform Cost Search
Each of these algorithms will have:
 A problem graph, containing the start node S and the goal node G.
 A strategy, describing the manner in which the graph will be traversed to
get to G.
 A fringe, which is a data structure used to store all the possible states
(nodes) that you can go from the current states.
 A tree, that results while traversing to the goal node.
 A solution plan, which is the sequence of nodes from S to G.
Depth First Search:
Depth-first search (DFS) is an algorithm for traversing or searching tree or graph
data structures. The algorithm starts at the root node (selecting some arbitrary node
as the root node in the case of a graph) and explores as far as possible along each
branch before backtracking. It uses a last-in, first-out strategy and hence is
implemented using a stack.

Example:
Question. Which solution would DFS find to move from node S to node G if run on
the graph below?
Solution. The equivalent search tree for the above graph is as follows. As DFS
traverses the tree “deepest node first”, it would always pick the deeper branch until
it reaches the solution (or it runs out of nodes, and goes to the next branch). The
traversal is shown in blue arrows.

Path: S -> A -> B -> C -> G

Let d = the depth of the search tree (the number of levels of the search tree), and let b^i = the number of nodes in level i, where b is the branching factor.

Time complexity: equivalent to the number of nodes traversed in DFS, 1 + b + b^2 + ... + b^d = O(b^d).
Space complexity: equivalent to how large the fringe can get, O(b * d).
Completeness: DFS is complete if the search tree is finite, meaning for a given
finite search tree, DFS will come up with a solution if it exists.
Optimality: DFS is not optimal, meaning the number of steps in reaching the
solution, or the cost spent in reaching it, may be high.
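
The stack-based strategy can be made concrete with a short program. Below is a minimal Python sketch of DFS; the adjacency list is a hypothetical stand-in for the example graph (the original figure is not reproduced in these notes), chosen so that DFS finds the path S -> A -> B -> C -> G:

def dfs(graph, start, goal):
    # Fringe is a stack (LIFO): the most recently added node is expanded first.
    stack = [(start, [start])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Push successors in reverse so the first-listed child is expanded first.
        for child in reversed(graph.get(node, [])):
            if child not in visited:
                stack.append((child, path + [child]))
    return None

graph = {'S': ['A', 'D'], 'A': ['B'], 'B': ['C'], 'C': ['G'], 'D': ['G']}
print(dfs(graph, 'S', 'G'))  # ['S', 'A', 'B', 'C', 'G']

Because the fringe is a stack, the deepest newly discovered node is always expanded next, which is exactly the "deepest node first" behaviour described above.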
Breadth First Search:
Breadth-first search (BFS) is an algorithm for traversing or searching tree or graph
data structures. It starts at the tree root (or some arbitrary node of a graph,
sometimes referred to as a ‘search key’), and explores all of the neighbor nodes at
the present depth prior to moving on to the nodes at the next depth level. It is
implemented using a queue.

Example:
Question. Which solution would BFS find to move from node S to node G if run on
the graph below?

Solution. The equivalent search tree for the above graph is as follows. As BFS
traverses the tree “shallowest node first”, it would always pick the shallower branch
until it reaches the solution (or it runs out of nodes, and goes to the next branch).
The traversal is shown in blue arrows.

Path: S -> D -> G


Let s = the depth of the shallowest solution and b^i = the number of nodes in level i.

Time complexity: equivalent to the number of nodes traversed in BFS until the shallowest solution, O(b^s).
Space complexity: equivalent to how large the fringe can get, O(b^s).
Completeness: BFS is complete, meaning for a given search tree, BFS will come up
with a solution if it exists.

Optimality: BFS is optimal as long as the costs of all edges are equal.
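
A minimal Python sketch of BFS follows, using the same hypothetical adjacency list as in the DFS sketch above; the only structural change is that the fringe is now a FIFO queue instead of a stack, so shallower nodes are expanded first:

from collections import deque

def bfs(graph, start, goal):
    # Fringe is a queue (FIFO): nodes are expanded in the order discovered.
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                queue.append((child, path + [child]))
    return None

graph = {'S': ['A', 'D'], 'A': ['B'], 'B': ['C'], 'C': ['G'], 'D': ['G']}
print(bfs(graph, 'S', 'G'))  # ['S', 'D', 'G'] - the shallowest solution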

Uniform Cost Search:

UCS is different from BFS and DFS because here the costs come into play. In other
words, traversing via different edges might not have the same cost. The goal is to
find a path where the cumulative sum of costs is the least.

Cost of a node is defined as:


cost(node) = cumulative cost of all nodes from root
cost(root) = 0
Example:
Question. Which solution would UCS find to move from node S to node G if run on
the graph below?

Solution. The equivalent search tree for the above graph is as follows. The cost of
each node is the cumulative cost of reaching that node from the root. Based on the
UCS strategy, the path with the least cumulative cost is chosen. Note that due to the
many options in the fringe, the algorithm explores most of them so long as their cost
is low, and discards them when a lower-cost path is found; these discarded
traversals are not shown below. The actual traversal is shown in blue.
Path: S -> A -> B -> G
Cost: 5

Let C* = the cost of the optimal solution and ε = the minimum arc cost.

Then the effective depth is roughly C*/ε.

Time complexity: O(b^(C*/ε)). Space complexity: O(b^(C*/ε)).
Advantages:
 UCS is complete if the number of states is finite and there is no loop with
zero weight.
 UCS is optimal only if there is no negative edge cost.
Disadvantages:
 Explores options in every “direction”.
 No information on goal location.
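
A minimal Python sketch of UCS is given below. The edge weights are hypothetical (the original figure is not reproduced in these notes), chosen so that the cheapest path is S -> A -> B -> G with cost 5; the fringe is a priority queue ordered by the cumulative cost g(n):

import heapq

def ucs(graph, start, goal):
    # Priority queue ordered by cumulative path cost.
    frontier = [(0, start, [start])]
    best = {start: 0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for child, w in graph.get(node, []):
            new_cost = cost + w
            if new_cost < best.get(child, float('inf')):
                best[child] = new_cost
                heapq.heappush(frontier, (new_cost, child, path + [child]))
    return None, float('inf')

graph = {'S': [('A', 1), ('D', 3)], 'A': [('B', 2)],
         'B': [('G', 2)], 'D': [('G', 6)]}
print(ucs(graph, 'S', 'G'))  # (['S', 'A', 'B', 'G'], 5)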

Informed Search Algorithms:

Here, the algorithms have information on the goal state, which helps in more
efficient searching. This information is obtained by something called a heuristic.
In this section, we will discuss the following search algorithms.
1. Greedy Search
2. A* Tree Search
3. A* Graph Search
Search Heuristics: In an informed search, a heuristic is a function that estimates
how close a state is to the goal state. Examples include the Manhattan distance, the
Euclidean distance, etc. (The smaller the distance, the closer the goal.) Different
heuristics are used in the different informed algorithms discussed below.
Greedy Search:

In greedy search, we expand the node closest to the goal node. The “closeness” is
estimated by a heuristic h(x).

Heuristic: A heuristic h is defined as-


h(x) = Estimate of distance of node x from the goal node.
The lower the value of h(x), the closer the node is to the goal.

Strategy: Expand the node closest to the goal state, i.e. expand the node with the
lowest h value.

Example:
Question. Find the path from S to G using greedy search. The heuristic value h of
each node is given below the name of the node.

Solution. Starting from S, we can traverse to A(h=9) or D(h=5). We choose D, as it


has the lower heuristic cost. Now from D, we can move to B(h=4) or E(h=3). We
choose E with a lower heuristic cost. Finally, from E, we go to G(h=0). This entire
traversal is shown in the search tree below, in blue.

Path: S -> D -> E -> G


Advantage: Works well with informed search problems, with fewer steps to reach a
goal.
Disadvantage: Can turn into unguided DFS in the worst case.
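
A minimal Python sketch of greedy best-first search follows. The heuristic values mirror the worked example (A=9, D=5, B=4, E=3, G=0; h(S)=7 is taken from the A* table later in this section), and the edges are assumed from the traversal described above:

import heapq

def greedy(graph, h, start, goal):
    # Priority queue ordered by the heuristic h(n) alone.
    frontier = [(h[start], start, [start])]
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for child in graph.get(node, []):
            if child not in visited:
                heapq.heappush(frontier, (h[child], child, path + [child]))
    return None

graph = {'S': ['A', 'D'], 'D': ['B', 'E'], 'E': ['G'], 'A': [], 'B': []}
h = {'S': 7, 'A': 9, 'D': 5, 'B': 4, 'E': 3, 'G': 0}
print(greedy(graph, h, 'S', 'G'))  # ['S', 'D', 'E', 'G']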

A* Tree Search:

A* Tree Search, commonly known simply as A* Search, combines the strengths of
uniform-cost search and greedy search. In this search, the evaluation function is the
sum of the cost used in UCS, denoted by g(x), and the heuristic used in greedy
search, denoted by h(x). The summed cost is denoted by f(x) = g(x) + h(x).

Heuristic: The following points should be noted with respect to heuristics in A*
search.
 Here, h(x) is called the forward cost and is an estimate of the distance of
the current node from the goal node.
 And, g(x) is called the backward cost and is the cumulative cost of a
node from the root node.
 A* search is optimal only when for all nodes, the forward cost for a node
h(x) underestimates the actual cost h*(x) to reach the goal. This property
of A* heuristic is called admissibility.

Admissibility: h(x) ≤ h*(x) for every node x, i.e. the heuristic never overestimates the true cost to the goal.

Strategy: Choose the node with the lowest f(x) value.

Example:
Question. Find the path to reach from S to G using A* search.

Solution. Starting from S, the algorithm computes g(x) + h(x) for all nodes in the
fringe at each step, choosing the node with the lowest sum. The entire work is shown
in the table below.

Note that in the fourth set of iterations, we get two paths with equal summed cost
f(x), so we expand them both in the next set. The path with a lower cost on further
expansion is the chosen path.

Path                   h(x)   g(x)    f(x)
S                      7      0       7
S -> A                 9      3       12
S -> D                 5      2       7
S -> D -> B            4      2+1=3   7
S -> D -> E            3      2+4=6   9
S -> D -> B -> C       2      3+2=5   7
S -> D -> B -> E       3      3+1=4   7
S -> D -> B -> C -> G  0      5+4=9   9
S -> D -> B -> E -> G  0      4+3=7   7

Path: S -> D -> B -> E -> G


Cost: 7
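
The table above can be reproduced with a short program. Below is a minimal Python sketch of A* tree search; the edge costs and heuristic values are read off the worked table (for example, S -> D costs 2 and h(D) = 5):

import heapq

def a_star(graph, h, start, goal):
    # Frontier ordered by f(n) = g(n) + h(n).
    frontier = [(h[start], 0, start, [start])]
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for child, w in graph.get(node, []):
            g2 = g + w
            heapq.heappush(frontier, (g2 + h[child], g2, child, path + [child]))
    return None, float('inf')

graph = {'S': [('A', 3), ('D', 2)], 'D': [('B', 1), ('E', 4)],
         'B': [('C', 2), ('E', 1)], 'C': [('G', 4)], 'E': [('G', 3)]}
h = {'S': 7, 'A': 9, 'D': 5, 'B': 4, 'E': 3, 'C': 2, 'G': 0}
print(a_star(graph, h, 'S', 'G'))  # (['S', 'D', 'B', 'E', 'G'], 7)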

A* Graph Search:
 A* tree search works well, except that it takes time re-exploring branches
it has already explored. In other words, if the same node is expanded twice
in different branches of the search tree, A* search might explore both of
those branches, thus wasting time.
 A* Graph Search, or simply Graph Search, removes this limitation by
adding one rule: do not expand the same node more than once.
 Heuristic. Graph search is optimal only when the forward cost between
two successive nodes A and B, given by h(A) - h(B), is less than or equal
to the backward cost between those two nodes, g(A -> B). This property of
the graph search heuristic is called consistency.

Consistency: h(A) - h(B) ≤ g(A -> B) for every pair of successive nodes A and B.

Example:
Question. Use graph search to find the path from S to G in the following graph.
Solution. We solve this question in much the same way as the last question, but in
this case we keep track of the nodes explored so that we do not re-explore them.

Path: S -> D -> B -> E -> G


Cost: 7
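
A minimal sketch of the graph-search variant follows; compared with the a_star function above, the only change is a closed set that prevents any node from being expanded more than once (safe when the heuristic is consistent):

import heapq

def a_star_graph(graph, h, start, goal):
    frontier = [(h[start], 0, start, [start])]
    closed = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in closed:
            continue  # already expanded via a path that is no worse
        closed.add(node)
        for child, w in graph.get(node, []):
            if child not in closed:
                g2 = g + w
                heapq.heappush(frontier, (g2 + h[child], g2, child, path + [child]))
    return None, float('inf')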

MODULE III SUPERVISED LEARNING 9


Introduction, Different types of learning, Linear regression, Logistic regression,
Gradient Descent: Introduction, Stochastic Gradient Descent, Subgradients, Stochastic
Gradient Descent for risk minimization, Support Vector Machines: Hard SVM, Soft
SVM, Optimality conditions, Duality, Kernel trick, Implementing Soft SVM with Kernels,
Decision Trees: Decision Tree algorithms, Random forests

Gradient Descent is an iterative optimization process that searches for an objective


function’s optimum value (Minimum/Maximum). It is one of the most used methods
for changing a model’s parameters in order to reduce a cost function in machine
learning projects.
The primary goal of gradient descent is to identify the model parameters that provide
the maximum accuracy on both training and test datasets. In gradient descent, the
gradient is a vector pointing in the direction of the function's steepest rise at
a particular point. The algorithm gradually descends towards lower values of the
function by moving in the opposite direction of the gradient, until it reaches the
minimum of the function.
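
In symbols, the standard gradient descent update rule, with parameters θ, learning rate α, and cost function J(θ), is:

    θ ← θ - α ∇J(θ)

Each step moves the parameters a small distance in the direction opposite to the gradient.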
Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
In this article, we will be discussing Stochastic Gradient Descent (SGD).
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm
that is used for optimizing machine learning models. It addresses the computational
inefficiency of traditional Gradient Descent methods when dealing with large datasets
in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random
training example (or a small batch) is selected to calculate the gradient and update
the model parameters. This random selection introduces randomness into the
optimization process, hence the term "stochastic" in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing
with large datasets. By using a single example or a small batch, the computational
cost per iteration is significantly reduced compared to traditional Gradient Descent
methods that require processing the entire dataset.
Stochastic Gradient Descent Algorithm
 Initialization: Randomly initialize the parameters of the model.
 Set Parameters: Determine the number of iterations and the learning rate
(alpha) for updating the parameters.
 Stochastic Gradient Descent Loop: Repeat the following steps until the
model converges or reaches the maximum number of iterations:
o Shuffle the training dataset to introduce randomness.
o Iterate over each training example (or a small batch) in
the shuffled order.
o Compute the gradient of the cost function with respect
to the model parameters using the current training
example (or batch).
o Update the model parameters by taking a step in the
direction of the negative gradient, scaled by the learning
rate.
o Evaluate the convergence criteria, such as the change in the
cost function between successive iterations.
 Return Optimized Parameters: Once the convergence criteria are met or
the maximum number of iterations is reached, return the optimized model
parameters.
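To make the loop concrete, here is a minimal NumPy sketch of SGD for linear regression with squared error; the synthetic data, learning rate, and epoch count are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)    # noisy targets

w = np.zeros(3)
alpha = 0.01                                   # learning rate
for epoch in range(20):
    for i in rng.permutation(len(X)):          # shuffle each epoch
        xi, yi = X[i], y[i]
        grad = 2 * (xi @ w - yi) * xi          # gradient on ONE example
        w -= alpha * grad                      # step against the gradient
print(w)  # approaches [2.0, -1.0, 0.5]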
In SGD, since only one sample from the dataset is chosen at random for each
iteration, the path taken by the algorithm to reach the minima is usually noisier
than that of the typical Gradient Descent algorithm. But this does not matter much:
the exact path taken by the algorithm is irrelevant as long as we reach the minimum,
and SGD does so with a significantly shorter training time.
The path taken by Batch Gradient Descent is shown below:

[Figure: batch gradient optimization path]

A path taken by Stochastic Gradient Descent looks as follows:

[Figure: stochastic gradient optimization path]

One thing to be noted is that, as SGD is generally noisier than typical Gradient
Descent, it usually takes a higher number of iterations to reach the minima because
of the randomness in its descent. Even though it requires more iterations to reach
the minima than typical Gradient Descent, each iteration is computationally much
less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient
Descent for optimizing a learning algorithm.
Difference between Stochastic Gradient Descent and Batch Gradient Descent
The comparison between Stochastic Gradient Descent (SGD) and Batch Gradient
Descent is as follows:

Aspect                     | Stochastic Gradient Descent (SGD)                                                 | Batch Gradient Descent
Dataset Usage              | Uses a single random sample or a small batch of samples at each iteration.       | Uses the entire dataset (batch) at each iteration.
Computational Efficiency   | Computationally less expensive per iteration, as it processes fewer data points. | Computationally more expensive per iteration, as it processes the entire dataset.
Convergence                | Faster convergence due to frequent updates.                                      | Slower convergence due to less frequent updates.
Noise in Updates           | High noise due to frequent updates with a single or few samples.                 | Low noise as it updates parameters using all data points.
Stability                  | Less stable as it may oscillate around the optimal solution.                     | More stable as it converges smoothly towards the optimum.
Memory Requirement         | Requires less memory as it processes fewer data points at a time.                | Requires more memory to hold the entire dataset.
Update Frequency           | Frequent updates make it suitable for online learning and large datasets.        | Less frequent updates make it suitable for smaller datasets.
Initialization Sensitivity | Less sensitive to initial parameter values due to frequent updates.              | More sensitive to initial parameter values.

Advantages of Stochastic Gradient Descent


 Speed: SGD is faster than other variants of Gradient Descent such as
Batch Gradient Descent and Mini-Batch Gradient Descent since it uses only
one example to update the parameters.
 Memory Efficiency: Since SGD updates the parameters for each training
example one at a time, it is memory-efficient and can handle large
datasets that cannot fit into memory.
 Avoidance of Local Minima: Due to the noisy updates in SGD, it has the
ability to escape from local minima and can converge towards a global minimum.
Disadvantages of Stochastic Gradient Descent
 Noisy updates: The updates in SGD are noisy and have a high variance,
which can make the optimization process less stable and lead to
oscillations around the minimum.
 Slow Convergence: SGD may require more iterations to converge to the
minimum since it updates the parameters for each training example one at
a time.
 Sensitivity to Learning Rate: The choice of learning rate can be critical in
SGD since using a high learning rate can cause the algorithm to overshoot
the minimum, while a low learning rate can make the algorithm converge
slowly.
 Less Accurate: Due to the noisy updates, SGD may not converge to the
exact global minimum and can result in a suboptimal solution. This can be
mitigated by using techniques such as learning rate scheduling and
momentum-based updates.

MODULE IV UNSUPERVISED LEARNING 9

Nearest Neighbour: k-nearest neighbour, Curse of dimensionality, Clustering: Linkage-
based clustering algorithms, k-means algorithm, Spectral clustering, Dimensionality
reduction: Principal Component Analysis, Random projections, Compressed sensing

MODULE V COMPUTATIONAL LEARNING THEORY AND DEEP NEURAL NETWORKS 9

Statistical Learning Framework: PAC learning, Agnostic PAC learning, Bias-complexity
tradeoff, No free lunch theorem, VC dimension, Structural risk minimization, Adaboost,
Foundations of Deep Learning: DNN, CNN, RNN, Autoencoders

Parameter          | CLASSIFICATION                                                                        | CLUSTERING
Type               | Used for supervised learning                                                          | Used for unsupervised learning
Basic              | Process of classifying the input instances based on their corresponding class labels | Grouping the instances based on their similarity without the help of class labels
Need               | Has labels, so a training and testing dataset is needed to verify the model created  | No need for a training and testing dataset
Complexity         | More complex as compared to clustering                                                | Less complex as compared to classification
Example Algorithms | Logistic regression, Naive Bayes classifier, Support vector machines, etc.           | k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.

2. KNN is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and finds intense application in
pattern recognition, data mining, and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does
not make any underlying assumptions about the distribution of data (as opposed to
other algorithms such as GMM, which assume a Gaussian distribution of the given
data). We are given some prior data (also called training data), which classifies
coordinates into groups identified by an attribute.
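
As a concrete illustration, here is a minimal scikit-learn sketch of KNN classification; the iris dataset and k = 5 are arbitrary choices for the example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point is labelled by a majority vote of its 5 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of correct test predictions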

3. Clustering consists of grouping certain objects that are similar to each other; it can
be used to decide if two items are similar or dissimilar in their properties. In a data
mining sense, the similarity measure is a distance with dimensions describing object
features. That means that if the distance between two data points is small then there is
a high degree of similarity between the objects, and vice versa. The similarity
is subjective and depends heavily on the context and application.
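
A small Python sketch of the distance-as-similarity idea follows; the data values are made up, with two tight pairs of points that any distance-based clusterer should group together:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])
# Small Euclidean distance => high similarity.
print(np.linalg.norm(points[0] - points[1]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # the two nearby pairs share a label, e.g. [0 0 1 1]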
4. Multiple dimensions are hard to think in, impossible to visualize, and, due to the
exponential growth of the number of possible values with each dimension, complete
enumeration of all subspaces becomes intractable with increasing dimensionality. This
problem is known as the curse of dimensionality.

5. In multivariate statistics, spectral clustering techniques make use of the spectrum


(eigenvalues) of the similarity matrix of the data to perform dimensionality reduction
before clustering in fewer dimensions.
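
A minimal scikit-learn sketch of spectral clustering follows, using the classic two-moons dataset, where plain k-means struggles but the spectrum of the nearest-neighbour similarity graph separates the clusters:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                            random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignments for the first ten points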

6. The Curse of Dimensionality significantly impacts machine learning algorithms in


various ways. It leads to increased computational complexity, longer training times, and
higher resource requirements.

7. Random projection is a dimension reduction tool. “Projection” means that the


technique projects the data from a high-dimensional space to a lower-dimensional space,
and “Random” means the projection matrix is randomly generated.
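
A minimal scikit-learn sketch of random projection follows; the dimensions (10,000 in, 50 out) are illustrative:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.RandomState(0).rand(100, 10000)   # high-dimensional data
# The projection matrix is randomly generated, as described above.
X_low = GaussianRandomProjection(n_components=50, random_state=0).fit_transform(X)
print(X.shape, '->', X_low.shape)  # (100, 10000) -> (100, 50)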

8. PAC (Probably Approximately Correct) learning is a framework used for mathematical


analysis. A PAC Learner tries to learn a concept (approximately correct) by selecting a
hypothesis from a set of hypotheses that has a low generalization error.

9. The bias-variance tradeoff implies that as we increase the complexity of a model, its
variance increases and its bias decreases. Conversely, as we decrease the model's
complexity, its variance decreases but its bias increases.

10.

Comparison Attribute                  | Feed-forward Neural Networks                                        | Recurrent Neural Networks
Signal flow direction                 | Forward only                                                        | Bidirectional
Delay introduced                      | No                                                                  | Yes
Complexity                            | Low                                                                 | High
Neuron independence in the same layer | Yes                                                                 | No
Speed                                 | High                                                                | Slow
Commonly used for                     | Pattern recognition, speech recognition, and character recognition | Language translation, speech-to-text conversion, and robotic control

11. The CNN model has a fast implementation time and requires fewer parameters for
training while model performance is maintained. The DNN model, although fast in
execution, requires the most parameters for training, and model performance is
compromised with lower accuracy.
12. Artificial intelligence (AI) is the overarching system. Machine learning is a subset of
AI. Deep learning is a subfield of machine learning, and neural networks make up the
backbone of deep learning algorithms.

13.b. The advancements in Data Science and Machine Learning have made it possible
for us to solve several complex regression and classification problems. However, the
performance of all these ML models depends on the data fed to them. Thus, it is
imperative that we provide our ML models with an optimal dataset. Now, one might
think that the more data we provide to our model, the better it becomes; however, this
is not the case. If we feed our model a dataset with an excessively large number of
features/columns, we run into the problem of overfitting, wherein the model
starts being influenced by outlier values and noise. This is called the Curse of
Dimensionality.

Dimensionality Reduction is a statistical/ML-based technique wherein we try to


reduce the number of features in our dataset and obtain a dataset with an optimal
number of dimensions.
One of the most common ways to accomplish Dimensionality Reduction is Feature
Extraction, wherein we reduce the number of dimensions by mapping a higher
dimensional feature space to a lower-dimensional feature space. The most popular
technique of Feature Extraction is Principal Component Analysis (PCA).
Dimensionality reduction is a technique that reduces the number of features or
variables in a dataset, while preserving the essential information or structure. It can
help you optimize your model performance by improving the speed, accuracy, and
interpretability of your data analysis.

What is Predictive Modeling: Predictive modeling is a probabilistic process that


allows us to forecast outcomes, on the basis of some predictors. These predictors are
basically features that come into play when deciding the final result, i.e. the outcome
of the model.
Dimensionality reduction is the process of reducing the number of features (or
dimensions) in a dataset while retaining as much information as possible. This can be
done for a variety of reasons, such as to reduce the complexity of a model, to improve
the performance of a learning algorithm, or to make it easier to visualize the data.
There are several techniques for dimensionality reduction, including principal
component analysis (PCA), singular value decomposition (SVD), and linear discriminant
analysis (LDA). Each technique uses a different method to project the data onto a
lower-dimensional space while preserving important information.
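
As a concrete illustration, here is a minimal scikit-learn sketch of PCA-based dimensionality reduction on the digits dataset; the 95% variance threshold is an arbitrary choice for the example:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)             # 64 features per image
pca = PCA(n_components=0.95)                    # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, '->', X_reduced.shape)           # far fewer dimensions
print(pca.explained_variance_ratio_.sum())      # ~0.95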

14. b. To overcome the curse of dimensionality, you can consider the following
strategies:

Dimensionality Reduction Techniques:

 Feature Selection: Identify and select the most relevant features from the
original dataset while discarding irrelevant or redundant ones. This reduces the
dimensionality of the data, simplifying the model and improving its efficiency.

 Feature Extraction: Transform the original high-dimensional data into a lower-


dimensional space by creating new features that capture the essential
information. Techniques such as Principal Component Analysis (PCA) and t-
distributed Stochastic Neighbor Embedding (t-SNE) are commonly used for
feature extraction.

Data Preprocessing:

 Normalization: Scale the features to a similar range to prevent certain features


from dominating others, especially in distance-based algorithms.

 Handling Missing Values: Address missing data appropriately through imputation


or deletion to ensure robustness in the model training process.

Feature Selection and Dimensionality Reduction


1. Feature Selection: SelectKBest is used to select the top k features based
on a specified scoring function (f_classif in this case). It selects the features
that are most likely to be related to the target variable.
2. Dimensionality Reduction: PCA (Principal Component Analysis) is then
used to further reduce the dimensionality of the selected features. It
transforms the data into a lower-dimensional space while retaining as much
variance as possible.
Training the classifiers
1. Training Before Dimensionality Reduction: Train a Random Forest
classifier (clf_before) on the original scaled features (X_train_scaled)
without dimensionality reduction.
2. Evaluation Before Dimensionality Reduction: Make predictions
(y_pred_before) on the test set (X_test_scaled) using the classifier trained
before dimensionality reduction, and calculate the accuracy
(accuracy_before) of the model.
3. Training After Dimensionality Reduction: Train a new Random Forest
classifier (clf_after) on the reduced feature set (X_train_pca) after
dimensionality reduction.
4. Evaluation After Dimensionality Reduction: Make predictions
(y_pred_after) on the test set (X_test_pca) using the classifier trained after
dimensionality reduction, and calculate the accuracy (accuracy_after) of the
model.
The accuracy before dimensionality reduction is 0.8745, while the accuracy
after dimensionality reduction is 0.9236. This improvement indicates that the
dimensionality reduction technique (PCA in this case) helped the model generalize
better to unseen data.
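
The original code for this walkthrough is not included in these notes; the sketch below is a plausible Python reconstruction using scikit-learn. The dataset (breast cancer), k = 15 for SelectKBest, and 5 PCA components are assumptions, so the printed accuracies will not match the quoted figures exactly:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical dataset; the walkthrough above does not say which one was used.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: train and evaluate before dimensionality reduction.
clf_before = RandomForestClassifier(random_state=42).fit(X_train_scaled, y_train)
accuracy_before = accuracy_score(y_test, clf_before.predict(X_test_scaled))

# SelectKBest keeps the features most related to the target (f_classif),
# then PCA compresses them further while retaining as much variance as possible.
selector = SelectKBest(f_classif, k=15)
X_train_sel = selector.fit_transform(X_train_scaled, y_train)
X_test_sel = selector.transform(X_test_scaled)
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train_sel)
X_test_pca = pca.transform(X_test_sel)

# Train and evaluate after dimensionality reduction.
clf_after = RandomForestClassifier(random_state=42).fit(X_train_pca, y_train)
accuracy_after = accuracy_score(y_test, clf_after.predict(X_test_pca))
print(accuracy_before, accuracy_after)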

What are Autoencoders?


Autoencoders are a specialized class of algorithms that can learn efficient
representations of input data with no need for labels. They are a class of artificial
neural networks designed for unsupervised learning. Learning to compress and
effectively represent input data without specific labels is the essential principle of an
autoencoder. This is accomplished using a two-fold structure that consists of an
encoder and a decoder. The encoder transforms the input data into a reduced-dimensional
representation, which is often referred to as the "latent space" or "encoding". From that
representation, the decoder rebuilds the initial input. The process of encoding and
decoding forces the network to learn meaningful patterns and essential features of
the data.
Architecture of Autoencoder in Deep Learning
The general architecture of an autoencoder includes an encoder, decoder, and
bottleneck layer.
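
A minimal Keras sketch of this encoder-bottleneck-decoder structure follows; the 32-unit latent space and 5 training epochs are illustrative choices:

from tensorflow import keras
from tensorflow.keras import layers

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation='relu')(inputs)       # encoder -> latent space
decoded = layers.Dense(784, activation='sigmoid')(encoded)  # decoder -> reconstruction
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# The input doubles as the target: the network learns to reproduce it.
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)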
What is Machine Learning?

Machine Learning is a subset of Artificial Intelligence, and Deep Learning is an important
part of its broader family, which includes deep neural networks, deep belief networks,
and recurrent neural networks. In Deep Learning, there are three fundamental neural
network architectures that perform well on different types of data: FFNN, RNN, and CNN.


Deep Neural Networks (DNNs)

Deep Neural Networks (DNNs) are typically Feed Forward Networks (FFNNs), in which
data flows from the input layer to the output layer without going backward; the links
between the layers are one way, in the forward direction, and they never touch a node
again.

The outputs are obtained by supervised learning with datasets of some information based
on "what we want", through back propagation. It is like going to a restaurant where the
chef gives you an idea of the ingredients of your meal: FFNNs work the same way, in that
you experience the flavor of those specific ingredients while eating, but just after
finishing your meal you forget what you have eaten. If the chef serves you a meal with
the same ingredients again, you cannot recognize the ingredients; you have to start from
scratch, as you have no memory of it. But the human brain does not work like that.

Recurrent Neural Network (RNN)

A Recurrent Neural Network (RNN) addresses this issue; it is an FFNN with a time twist.
This neural network is not stateless: it has connections between passes and connections
through time. RNNs are a class of artificial neural network where connections between
nodes form a directed graph along a sequence, with links from a layer back to previous
layers, allowing information to flow back into earlier parts of the network; thus each
step depends on past events, allowing information to persist.

In this way, RNNs can use their internal state (memory) to process sequences of inputs.
This makes them applicable to tasks such as unsegmented, connected handwriting
recognition or speech recognition. But they work not only on the information you feed
them now but also on related information from the past, which means that whatever you
feed in and train the network on matters: feeding it "chicken" then "egg" may give a
different output than "egg" then "chicken". RNNs also suffer from the vanishing (or
exploding) gradient / long-term dependency problem, where information rapidly gets lost
over time. Strictly speaking, it is the weights, not the neurons, that carry information
from the past, and when gradients shrink towards zero (or explode towards enormous
values) the earlier states stop being informative.


Long Short Term Memory (LSTM)

Thankfully, breakthroughs like Long Short Term Memory (LSTM) do not have this problem!
LSTMs are a special kind of RNN, capable of learning long-term dependencies, which makes
them good at remembering things that have happened in the past and at finding patterns
across time so that their next guesses make sense. LSTMs broke records for improved
machine translation, language modeling and multilingual language processing.
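
A minimal Keras sketch of an LSTM-based sequence classifier follows; the input shape (sequences of 20 steps with 8 features each) and the layer sizes are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20, 8)),
    layers.LSTM(16),                        # internal state carries context across time steps
    layers.Dense(1, activation='sigmoid'),  # binary label for the whole sequence
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()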

Convolutional Neural Network (CNN)

Next comes the Convolutional Neural Network (CNN, or ConvNet), a class of deep neural
networks most commonly applied to analyzing visual imagery. Other applications include
video understanding, speech recognition and natural language understanding. LSTMs
combined with Convolutional Neural Networks (CNNs) have also improved automatic image
captioning, like the captions seen on Facebook. Thus you can see that RNNs help us more
with sequence processing and predicting the next step, whereas CNNs help us with visual
analysis.
