Machine Learning Report
Abstract:
Introduction:
Ever since the technical revolution, we have been generating an immeasurable amount
of data. As per research, we generate around 2.5 quintillion bytes of data every single day, and it
was estimated that by 2020, 1.7 MB of data would be created every second for every person on
earth. With the availability of so much data, it is finally possible to build predictive models
that can study and analyze complex data to find useful insights and deliver more accurate
results. Top-tier companies such as Netflix and Amazon build such Machine Learning
models using tons of data in order to identify profitable opportunities and avoid unwanted
risks.
Improve Decision Making: By making use of various algorithms, Machine Learning can be
used to make better business decisions. For example, Machine Learning is used to forecast
sales, predict downfalls in the stock market, identify risks and anomalies, etc.
Uncover patterns & trends in data: Finding hidden patterns and extracting key insights
from data is the most essential part of Machine Learning. By building predictive models and
using statistical techniques, Machine Learning allows you to dig beneath the surface and
explore the data at a minute scale. Understanding data and extracting patterns manually will
take days, whereas Machine Learning algorithms can perform such computations in less than
a second.
Solve complex problems: From detecting the genes linked to the deadly ALS disease to
building self-driving cars, Machine Learning can be used to solve the most complex problems.
The term Machine Learning was first coined by Arthur Samuel in the year 1959.
Looking back, that year was probably the most significant in terms of technological
advancements.
If you browse through the net about ‘what is Machine Learning’, you’ll get at least 100
different definitions. However, the very first formal definition was given by Tom M.
Mitchell:
“A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as measured by P, improves with
experience E.”
In simple terms, Machine Learning is a subset of Artificial Intelligence (AI) which provides
machines the ability to learn automatically and improve from experience without being
explicitly programmed to do so. In essence, it is the practice of getting machines to solve
problems by gaining the ability to think. But wait, can a machine think or make decisions?
Well, if you feed a machine a good amount of data, it will learn how to interpret, process and
analyze this data by using Machine Learning algorithms, in order to solve real-world
problems.
Before moving any further, let’s discuss some of the most commonly used terminologies in
Machine Learning.
Algorithm:
A Machine Learning algorithm is a set of rules and statistical techniques used to learn
patterns from data and draw significant information from it. It is the logic behind a Machine
Learning model. An example of a Machine Learning algorithm is the Linear Regression
algorithm.
Model:
A model is the main component of Machine Learning. It is created by training an algorithm
on the data, and it maps the decisions to be taken for a given input.
Predictor Variable:
It is a feature (or set of features) of the data that is used as input to predict the output.
Response Variable:
It is the feature or the output variable that needs to be predicted by using the predictor
variable(s).
Training Data:
The Machine Learning model is built using the training data. The training data helps
the model to identify key trends and patterns essential to predict the output.
Testing Data:
After the model is trained, it must be tested to evaluate how accurately it can predict an
outcome. This is done using the testing data set.
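To make these terms concrete, here is a minimal sketch (assuming scikit-learn is available; the data points and split ratio are made up for illustration) of splitting a toy dataset into training and testing data and evaluating a model on the held-out part:

# Minimal sketch: splitting a toy dataset into training and testing data
# (assumes scikit-learn is installed; the values below are illustrative).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # predictor variable
y = np.array([3, 5, 7, 9, 11, 13, 15, 17])               # response variable

# 75% of the rows become training data, 25% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)          # model learns from training data
print(model.score(X_test, y_test))                        # model is evaluated on testing data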
Data Types:
To analyze data, it is important to know what type of data we are dealing with.
Numerical
Categorical
Ordinal
Numerical data are numbers, and can be split into two numerical categories:
Discrete Data: numbers that are limited to integers. Example: the number of cars passing by.
Continuous Data: numbers that can take any value within a range. Example: the price of an
item, or the size of an item.
Categorical data are values that cannot be measured up against each other.
Example: a color value, or any yes/no values.
Ordinal data are like categorical data, but can be measured up against each other. Example:
school grades where A is better than B and so on.
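As an illustration only (assuming pandas is installed; the values are invented), the three data types can be represented in Python as follows:

# Illustrative sketch of the three data types using pandas (made-up values).
import pandas as pd

df = pd.DataFrame({
    "cars_passing": [3, 7, 2, 5],                 # numerical, discrete (integers)
    "item_price":   [19.99, 4.50, 7.25, 120.0],   # numerical, continuous
    "colour":       pd.Categorical(["red", "blue", "red", "green"]),          # categorical
    "grade":        pd.Categorical(["B", "A", "C", "A"],
                                   categories=["C", "B", "A"], ordered=True)  # ordinal
})

print(df.dtypes)
print(df["grade"].min(), df["grade"].max())  # ordinal values can be compared against each other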
Introduction:
Variables:
Containers that store values are called variables in Python. Variables are only
designated memory spaces for the storage of values. This implies that you set aside some
memory when you create a variable. The interpreter allots memory and determines what can
be placed in the reserved memory based on the data type of a variable. Therefore, you may
store integers, decimals, or characters in these variables by giving them alternative data types.
Variables do not need to be declared or specified beforehand in Python, unlike many other
programming languages. A lot of values need to be managed while developing a program.
We utilize variables to store values. A variable’s value may be modified while a program is
running.
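A small illustrative example of Python variables (the names and values are arbitrary):

# Python variables: no declaration needed, and the same name can later hold a
# value of a different type.
count = 10            # integer stored in the variable 'count'
price = 99.5          # floating-point (decimal) value
name = "Machine"      # string of characters

count = count + 5     # the value of a variable can change while the program runs
print(count, price, name)

count = "ten"         # Python lets the same variable refer to a different data type
print(type(count))    # <class 'str'>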
Introduction:
So, for all those of you who do not know what Machine Learning is: Machine
Learning, in the simplest of terms, is teaching your machine about something. You collect
data, clean the
data, create algorithms, teach the algorithm essential patterns from the data and then expect
the algorithm to give you a helpful answer. If the algorithm lives up to your expectations, you
have successfully taught your algorithm. If not, just scrap everything and start from scratch.
That is how it works here. Oh, and if you are looking for a formal definition, Machine
Learning is the process of creating models that can perform a certain task without the need
for a human explicitly programming it to do something.
There are 3 types of Machine Learning which are based on the way the algorithms are
created. They are:
Supervised Learning – You supervise the learning process, meaning the data that
you have collected here is labelled and so you know what input needs to be mapped to
what output. This helps you correct your algorithm if it makes a mistake in giving you
the answer.
Unsupervised Learning – The data collected here has no labels and you are unsure
about the outputs. So you model your algorithm such that it can understand patterns
from the data and output the required answer. You do not interfere when the
algorithm learns.
Reinforcement Learning – There is no data in this kind of learning, nor do you teach
the algorithm anything. You model the algorithm such that it interacts with the
environment and if the algorithm does a good job, you reward it, else you punish the
algorithm. With continuous interactions and learning, it goes from being bad to being
the best that it can for the problem assigned to it.
Now that you have a basic idea of what Machine Learning is and the different types of
Machine Learning, let us delve into the actual topic for discussion here and answer: What is
Supervised Learning? Where is Supervised Learning used? What are the types of Supervised
Learning? Supervised Learning algorithms and much more!
Suppose you have a niece who has just turned 2 years old and is learning to speak.
She knows the words, Papa and Mumma, as her parents have taught her how she needs to call
them. You want to teach her what a dog and a cat is. So what do you do? You either show her
videos of dogs and cats or you bring a dog and a cat and show them to her in real-life so that
she can understand how they are different.
Now there are certain things you tell her so that she understands the differences between the 2
animals.
Dogs and cats both have 4 legs and a tail.
Dogs come in small to large sizes. Cats, on the other hand, are always small.
Dogs have a long mouth while cats have smaller mouths.
Dogs bark while cats meow.
Different dogs have different ears while cats have almost the same kind of ears.
Now you take your niece back home and show her pictures of different dogs and cats. If she
is able to differentiate between the dog and cat, you have successfully taught her.
So what happened here? You were there to guide her to the goal of differentiating between a
dog and a cat. You taught her every difference there is between a dog and a cat. You then
tested her to see if she was able to learn. If she had learned, she called the dog a dog and a
cat a cat. If not, you taught her more until she could. You acted as the supervisor and your
niece acted as the algorithm that had to learn. You already knew what was a dog and what was
a cat, making sure that she was learning the correct thing. That is the principle that Supervised
Learning follows.
Now with having a basic understanding of what Supervised Learning is, let’s also understand
what makes this kind of learning important.
Learning gives the algorithm experience which can be used to output the predictions
for new unseen data
Experience also helps in optimizing the performance of the algorithm
Real-world computations can also be taken care of by the Supervised Learning
algorithms
With the importance of Supervised Learning understood, let’s take a look at the types of
Supervised Learning along with the algorithms!
Regression
Classification
Regression is the kind of Supervised Learning that learns from labelled datasets and is
then able to predict a continuous-valued output for the new data given to the algorithm. It is
used whenever the required output is a number, such as money or height.
Classification is the kind of Supervised Learning where the output to be predicted is a
discrete category or label, such as yes/no or spam/not spam, rather than a continuous number.
Supervised Learning Algorithms are used in a variety of applications. Let’s go through some
of the most well-known applications.
Those were some of the places where Supervised Learning has shined and shown its grit in
the real world of today. With that, let us move over to the differences between Supervised
and Unsupervised learning.
Now that we know what Machine Learning is and the different types of Machine Learning,
let us delve into the actual topic for discussion here and answer: What is Unsupervised
Learning? Where is Unsupervised Learning used? Unsupervised Learning algorithms and
much more.
Let me give you a real-life example of where Unsupervised Learning may have helped you
to learn about something. Suppose you are watching an India vs Australia cricket match with
your friends, knowing nothing about cricket, and you make the following observations:
There are 2 teams with jerseys of colour Blue and Yellow. Since Virat Kohli belongs
to India and you see the score of India on the screen, you conclude that India has the
blue jersey, which means Australia has the yellow jersey.
There are different types of players on the field. 2 which belong to India have bats in
their hand meaning that they are batting. There is someone who runs up and bowls the
ball, making him a bowler. There are around 9 players around the field who try to stop
the ball from reaching the boundary of the stadium. There is someone behind the
wickets and 2 umpires to manage the match.
If the ball hits the wickets or if the ball is caught by the fielders, the batsman is out
and has to walk back.
Virat Kohli has the number 18 and his name on the back of his jersey and if this
player scores a 4 or a 6, you need to cheer.
You make these observations one-by-one and now know when to cheer or boo when the
wickets fall. From knowing nothing to knowing the basics of cricket, you can now enjoy the
match with your friends.
What happened here? You had every material that you needed to learn about the basics of
cricket: the TV, and observing when and for whom your friends cheered. This made you learn
about cricket by yourself, without someone guiding you about anything. This is the principle
that Unsupervised Learning follows. So, having understood what Unsupervised Learning is,
let us move over and understand what makes it so important in the field of Machine Learning.
Why is it important?
So what does Unsupervised Learning help us obtain? Let me tell you all about it.
Unsupervised Learning algorithms work on datasets that are unlabelled and find
patterns which would previously not be known to us.
These patterns obtained are helpful if we need to categorize the elements or find
an association between them.
They can also help detect anomalies and defects in the data which can be taken care of
by us.
Lastly and most importantly, data which we collect is usually unlabelled which makes work
easier for us when we use these algorithms.
Now that we know the importance, let us move ahead and understand the different types of
Unsupervised Learning:
Clustering
Association
Clustering is the type of Unsupervised Learning where you find patterns in the data that you
are working on. It may be the shape, size, colour etc. which can be used to group data items
or create clusters.
K-Means Clustering – This algorithm works step by step, where the main goal is to
achieve clusters which have labels to identify them. The algorithm creates clusters of
different data points which are as homogeneous as possible by calculating the centroid
of each cluster and making sure that the distance between this centroid and a new data
point is as small as possible. The smallest distance between the data point and a
centroid determines which cluster it belongs to, while making sure the clusters do not
overlap with each other. The centroid acts like the heart of the cluster. This ultimately
gives us clusters which can be labelled as needed. (A minimal code sketch follows after
this list.)
K-Nearest Neighbours (K-NN) – Strictly speaking a supervised classification algorithm,
though it is often discussed alongside clustering, this is probably the simplest of the
Machine Learning algorithms, as it does not really learn a model but rather classifies a
new data point based on the datasets that it has stored. This algorithm is also called a
lazy learner because it learns only when it is given a new data point. It works well with
smaller datasets, as huge datasets take time to process.
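As a minimal sketch of the clustering idea above (assuming scikit-learn is available; the points below are made up), K-Means can be used like this:

# Minimal K-Means sketch: each point is assigned to the cluster whose centroid is closest.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],    # one natural group
                   [8.0, 8.2], [7.9, 8.5], [8.3, 7.8]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)                 # cluster label for each data point
print(kmeans.cluster_centers_)        # the centroids ("hearts") of the clusters
print(kmeans.predict([[1.0, 1.0]]))   # a new point goes to the nearest centroid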
Association is the kind of Unsupervised Learning where you find the dependencies of
one data item to another data item and map them such that they help you profit better. Some
popular algorithms in Association Rule Mining are discussed below:
Apriori algorithm – The Apriori algorithm is a breadth-first-search-based algorithm which
calculates the support between items. This support basically maps the dependency of
one data item on another, which can help us understand which data item influences
the possibility of something happening to the other data item. For example, bread
influences the buyer to buy milk and eggs, so that mapping helps increase profits for
the store. That sort of mapping can be learnt using this algorithm, which yields rules as
its output. (A small illustration of support follows after this list.)
FP-Growth Algorithm – The Frequent Pattern (FP) Growth algorithm finds the count of the
pattern that has been repeated, adds that to a table and then finds the most plausible
item and sets that as the root of the tree. Other data items are then added into the tree
and their support is calculated. If a particular branch fails to meet the support
threshold, it is pruned. Once all the iterations are completed, a tree rooted at that
item will be created, which is then used to make the association rules. This
algorithm is faster than Apriori because the support is calculated and checked over increasing
iterations rather than by creating a rule and checking the support against the dataset.
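As a small, purely illustrative sketch of the support and confidence ideas behind these algorithms (the transactions are made up; this is not a full Apriori or FP-Growth implementation):

# Support: how often an itemset appears across all transactions.
# Confidence: how often the consequent appears in transactions containing the antecedent.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs", "butter"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"bread", "milk"}))        # 3/5 = 0.6
print(confidence({"bread"}, {"milk"}))   # 3/4 = 0.75: bread tends to influence buying milk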
Now that you have a clear understanding between the two kinds of Unsupervised
Learning, let us now learn about some of the
applications of Unsupervised Learning.
Unsupervised Learning helps in a variety of ways which can be used to solve various real-
world problems.
They help us in understanding patterns which can be used to cluster the data points
based on various features.
Understanding various defects in the dataset which we would not be able to detect
initially.
They help in mapping the various items based on the dependencies of each other.
Cleansing the datasets by removing features which are not really required for the
machine to learn from.
This ultimately leads to applications which are helpful to us. Certain examples of where
Unsupervised Learning algorithms are used are discussed below:
AirBnB – This is a great application which helps host stays and experiences
connecting people all over the world. This application uses Unsupervised Learning
where the user queries his or her requirements and Airbnb learns these patterns and
recommends stays and experiences which fall under the same group or cluster.
Amazon – Amazon also uses Unsupervised Learning to learn the customer’s purchases
and recommend the products which are most frequently bought together, which is an
example of association rule mining.
Reinforcement Learning:
Value-Based:
In a value-based RL method, you try to maximize a value function V(s): the agent expects a
long-term return from the current state under a particular policy.
Policy-based:
In a policy-based RL method, you try to come up with such a policy that the action
performed in every state helps you to gain maximum reward in the future.
Stochastic Policy:
In a stochastic policy, the action to be performed in a given state is chosen according to a
probability distribution over the possible actions, rather than being fixed.
Model-Based:
In this Reinforcement Learning method, you need to create a virtual model for each
environment. The agent learns to perform in that specific environment.
Positive:
Positive reinforcement occurs when an event caused by a specific behaviour increases the
strength and frequency of that behaviour; in other words, a reward is added.
Negative:
Negative reinforcement is the strengthening of a behaviour that occurs because a negative
condition is stopped or avoided.
Q-Learning uses the following terms:
Set of actions – A
Set of states – S
Reward – R
Policy – π
Value – V
Q-Learning:
Consider an agent that has to find its way through a building whose rooms are connected by
doors, with the goal of reaching the outside of the building.
The outside of the building can be treated as one big outside area (5).
Doors which are not directly connected to the target room give zero reward.
As doors are two-way, two arrows are assigned for each room.
Explanation:
In the state diagram for this problem, a state is described as a node, while the arrows show
the actions. For example, an agent traverses from room number 2 to room number 5.
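A minimal Q-learning sketch for a room-navigation problem like the one described above is given below. The reward matrix, the reward of 100 for reaching the goal, and the learning parameters are assumptions made for illustration, not values taken from the report.

# Q-learning sketch: 6 states (rooms 0-4 plus the outside area 5); -1 marks "no door",
# 0 is a door not leading directly to the goal, 100 is a door leading to the goal.
import numpy as np

R = np.array([  # R[state, next_state]: reward for moving through that door
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.zeros_like(R, dtype=float)
gamma, alpha, goal = 0.8, 1.0, 5      # discount factor, learning rate, target state

rng = np.random.default_rng(0)
for episode in range(500):
    state = rng.integers(0, 6)                     # start in a random room
    while state != goal:
        actions = np.where(R[state] >= 0)[0]       # doors available from this room
        action = rng.choice(actions)               # explore randomly
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = R[state, action] + gamma * Q[action].max()
        Q[state, action] += alpha * (target - Q[state, action])
        state = action                             # the chosen door leads to that room

print((Q / Q.max() * 100).round())                 # normalised Q-table after training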
Linear Regression:
Linear regression performs the task to predict a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds out a linear relationship between
x (input) and y(output). Hence, the name is Linear Regression.
In the figure above, X (input) is the work experience and Y (output) is the salary of a person.
The regression line is the best fit line for our model.
When training the model – it fits the best line to predict the value of y for a given value of x.
The model gets the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally using
our model for prediction, it will predict the value of y for the input value of x.
By achieving the best-fit regression line, the model aims to predict the y value such that
the error difference between the predicted value and the true value is minimum. So, it is very
important to update the θ1 and θ2 values to reach the best values that minimize the error
between the predicted y value (pred) and the true y value (y).
The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between
the predicted y value (pred) and the true y value (y).
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE
value) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with
random θ1 and θ2 values and then iteratively update them until the minimum cost is reached.
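A small sketch of this procedure (NumPy only; the data points, learning rate, and iteration count are made up) that fits y = θ1 + θ2·x by gradient descent on the mean squared error, which has the same minimizer as the RMSE:

# Fitting y = theta1 + theta2 * x with gradient descent (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # e.g. years of work experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0]) # e.g. salary

theta1, theta2 = 0.0, 0.0                    # start from arbitrary values
lr = 0.01                                    # learning rate

for _ in range(5000):
    pred = theta1 + theta2 * x
    error = pred - y
    grad1 = 2 * error.mean()                 # gradient of the MSE w.r.t. theta1
    grad2 = 2 * (error * x).mean()           # gradient of the MSE w.r.t. theta2
    theta1 -= lr * grad1
    theta2 -= lr * grad2

print(theta1, theta2)                        # intercept and coefficient of x
print(theta1 + theta2 * 6.0)                 # predicted y for a new x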
This section discusses the basics of Logistic Regression and its implementation in
Python. Logistic Regression is basically a supervised classification algorithm. In a
classification problem, the target variable (or output), y, can take only discrete values for a
given set of features (or inputs), X.
Contrary to popular belief, logistic regression IS a regression model. It builds a
regression model to predict the probability that a given data entry belongs to the category
numbered as “1”. Just as Linear Regression assumes that the data follows a linear function,
Logistic Regression models the data using the sigmoid function.
The decision for the value of the threshold value is majorly affected by the values
of precision and recall. Ideally, we want both precision and recall to be 1, but this seldom is
the case.
In the case of a Precision-Recall trade-off, we use the following arguments to decide upon the
threshold (a short sketch of threshold tuning follows the list):
1. Low Precision/High Recall: In applications where we want to reduce the number of false
negatives without necessarily reducing the number of false positives, we choose a decision
value that has a low value of Precision or a high value of Recall. For example, in a cancer
diagnosis application, we do not want any affected patient to be classified as not affected
without giving much heed to whether the patient is being wrongfully diagnosed with cancer. This is
because the absence of cancer can be detected by further medical tests, but the presence of
the disease cannot be detected in an already rejected candidate.
2. High Precision/Low Recall: In applications where we want to reduce the number of false
positives without necessarily reducing the number of false negatives, we choose a decision
value that has a high value of Precision or a low value of Recall. For example, if we are
classifying customers whether they will react positively or negatively to a personalized
advertisement, we want to be absolutely sure that the customer will react positively to the
advertisement because otherwise, a negative reaction can cause a loss of potential sales from
the customer.
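A short, illustrative sketch of this trade-off (assuming scikit-learn; the dataset is synthetic) that scores the same logistic regression model at different probability thresholds:

# Lower thresholds push recall up; higher thresholds push precision up.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]          # probability of class "1"

for threshold in (0.2, 0.5, 0.8):
    pred = (proba >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_test, pred), 2),
          round(recall_score(y_test, pred), 2))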
Based on the number of categories, Logistic regression can be classified as:
1. binomial: target variable can have only 2 possible types: “0” or “1” which may
represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
2. multinomial: target variable can have 3 or more possible types which are not ordered
(i.e. types have no quantitative significance) like “disease A” vs “disease B” vs
“disease C”.
3. ordinal: it deals with target variables with ordered categories. For example, a test
score can be categorized as:“very poor”, “poor”, “good”, “very good”. Here, each
category can be given a score like 0, 1, 2, 3.
First of all, we explore the simplest form of Logistic Regression, i.e Binomial Logistic
Regression.
Now, if we try to apply Linear Regression to such a binary classification problem, we are
likely to get continuous values using the linear hypothesis discussed above. Also, it does not
make sense for the predicted value to lie outside the range 0 to 1 when it is meant to represent
a probability. So, Logistic Regression uses the sigmoid function as its hypothesis:
h(x) = 1 / (1 + e^(-θ·x)), where θ is the parameter vector.
So, now, we can define the conditional probabilities for the 2 labels (0 and 1) for an
observation x as: P(y = 1 | x; θ) = h(x) and P(y = 0 | x; θ) = 1 - h(x).
Likelihood is nothing but the probability of the data (training examples), given a model and
specific parameter values (here, θ). It measures the support provided by the data for each
possible value of θ. We obtain it by multiplying P(y | x; θ) over all the given observations.
The cost function for Logistic Regression is proportional to the inverse of the likelihood of
the parameters. Hence, we can obtain an expression for the cost function J using the negative
log-likelihood:
J(θ) = -(1/m) Σ [ y log(h(x)) + (1 - y) log(1 - h(x)) ]
and our aim is to estimate θ so that the cost function is minimized. Using the Gradient
Descent algorithm, the parameters are updated as θj := θj - (α/m) Σ (h(x) - y) · xj.
Here, y and h(x) represent the response vector and the predicted response vector (respectively).
Also, xj is the vector representing the observation values for the j-th feature.
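A minimal sketch of binomial logistic regression trained with gradient descent (NumPy only; the tiny dataset, learning rate, and iteration count are made up):

import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # single feature
y = np.array([0, 0, 0, 1, 1, 1])                          # binary labels

X_b = np.hstack([np.ones((len(X), 1)), X])   # add a column of 1s for the intercept
theta = np.zeros(X_b.shape[1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(5000):
    h = sigmoid(X_b @ theta)                 # predicted probabilities h(x)
    gradient = X_b.T @ (h - y) / len(y)      # gradient of the log-loss cost J(theta)
    theta -= lr * gradient

h = sigmoid(X_b @ theta)                     # final predicted probabilities
cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print(theta, cost)
print(sigmoid(np.array([1.0, 2.0]) @ theta)) # P(y = 1) for a new point with x = 2.0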
The Decision Tree algorithm is one of the popular supervised Machine Learning
algorithms used for classification. This algorithm generates the outcome as an
optimized result based upon a tree structure with conditions or rules. The decision tree
algorithm is associated with three major components: Decision Nodes, Decision Links, and
Decision Leaves. It operates with a splitting, pruning, and tree-selection process. It
supports both numerical and categorical data for constructing the decision tree. Decision tree
algorithms are efficient for large data sets with less time complexity. This algorithm is mostly
used in customer segmentation and marketing strategy implementation in business.
Decision Nodes, which is where the data is split or, say, it is a place for the attribute.
Decision Link, which represents a rule.
Decision Leaves, which are the final outcomes.
There are many steps that are involved in the working of a decision tree:
1. Splitting – It is the process of partitioning the data into subsets. Splitting can be done on
various factors, for example on a gender basis, height basis, or based on class.
2. Pruning – It is the process of shortening the branches of the decision tree, hence
limiting the tree depth.
Pre-Pruning – Here, we stop growing the tree when we do not find any statistically
significant association between the attributes and class at any particular node.
Post-Pruning – In order to post prune, we must validate the performance of the test
set model and then cut the branches that are a result of overfitting noise from the
training set.
3. Tree Selection – The third step is the process of finding the smallest tree that fits the data.
We can also set some threshold values if the features are continuous.
In simple words, entropy is the measure of how disordered your data is. While you
might have heard this term in your Mathematics or Physics classes, it’s the same here. The
reason Entropy is used in the decision tree is because the ultimate goal in the decision tree is
to group similar data groups into similar classes, i.e. to tidy the data.
Consider an initial dataset of red circles and blue crosses to which we are required to
apply a decision tree algorithm in order to group together similar data points in one category.
After the decision split, most of the red circles fall under one class while most of the blue
crosses fall under another class. Hence, the decision split classifies the data points based on
the chosen attribute.
Let us say that we have got N items, and these items fall into two categories, and now, in
order to group the data based on labels, we introduce the following ratio.
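The formula itself did not survive here; assuming the intended ratio is the fraction of items falling in each category, the standard entropy measure used by decision trees can be sketched as:

# Entropy H = -sum(p_i * log2(p_i)) over the category fractions p_i.
# 0 means a perfectly tidy (pure) node; 1 (for two classes) means maximally mixed.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([10, 0]))    # 0.0   -> all items in one class (pure node)
print(entropy([5, 5]))     # 1.0   -> evenly mixed classes (most disordered)
print(entropy([8, 2]))     # ~0.72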
Random Forest:
Decision trees:
Since the random forest model is made up of multiple decision trees, it would be
helpful to start by describing the decision tree algorithm briefly. Decision trees start with a
basic question, such as, “Should I surf?” From there, you can ask a series of questions to
determine an answer, such as, “Is it a long period swell?” or “Is the wind blowing offshore?”.
These questions make up the decision nodes in the tree, acting as a means to split the data.
Each question helps an individual to arrive at a final decision, which would be denoted by the
leaf node. Observations that fit the criteria will follow the “Yes” branch and those that don’t
will follow the alternate path. Decision trees seek to find the best split to subset the data, and
they are typically trained through the Classification and Regression Tree (CART) algorithm.
Metrics, such as Gini impurity, information gain, or mean square error (MSE), can be used to
evaluate the quality of the split.
An Internship Report on
AWS Academy Cloud Virtual Internship
BACHELOR OF TECHNOLOGY
IN
MECHANICAL ENGINEERING
Submitted by
Mallepogu Sagar
(228A1A0309)
Under the esteemed guidance of
Mr. Ramesh Babu, Assistant Professor
CERTIFICATE
This is to certify that the report entitled “MACHINE LEARNING”, that is being submitted by
MALLEPOGU SAGAR of III Year I Semester bearing (228A1A0309), in partial fulfilment for
the award of the Degree of Bachelor of Technology in Mechanical Engineering, Rise Krishna Sai
Prakasam Group Of Institutions, is a record of bonafide work carried out by him.
Vision of the Department:
To be a center of excellence in computer science and engineering for value-based education
to serve humanity and contribute for socio-economic development.
Provide professional knowledge by a student-centric teaching-learning process to contribute
to the software industry.
Inculcate training on cutting-edge technologies for industry needs.
PO2 Problem Analysis: Identify, formulate, review research literature, and analyze complex
engineering problems, reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO5 Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools, including prediction and modeling, to complex
engineering activities with an understanding of the limitations.
PO6 The Engineer and Society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO11 Project Management and Finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary
environments.
PO12 Life-long Learning: Recognize the need for, and have the preparation and ability to
engage in, independent and life-long learning in the broadest context of technological
change.
Program Educational Objectives (PEOs):
PEO1: Develop software solutions for real world problems by applying Mathematics
PEO2: Function as members of multi-disciplinary teams and to communicate effectively using
modern tools.
PEO3: Pursue career in software industry or higher studies with continuous learning and apply
professional knowledge.
PEO4: Practice the profession with ethics, integrity, leadership and social responsibility.
I take this opportunity to express my deep gratitude and appreciation to all those who
encouraged me towards the successful completion of the internship.
I am thankful to Dr. A.V. BHASKARA RAO, Principal of Rise Krishna Sai Prakasam Group Of
Institutions, for his suggestions.
I am also thankful to all who helped me directly and indirectly in the successful
completion of this internship.
Project associate:
Mallepogu Sagar
(228A1A0309)
Activity Log (format): Day & Date | Brief description of the daily activity | Learning
Outcome | Person In-Charge Signature
WEEK 2
Objective of the activity done: The main objective of the second week's activities
is to know about Total Cost of Ownership and Technical Support, and how to
complete the given modules as a part of the internship.
WEEK-3
The objective of the activity done: The main objective of the third week's activities is to
know about Lightning Web Components (LWC) and the Lightning Web Components (LWC)
API.
Detailed Report:
WEEK 4
The objective of the activity done: The main objective of the fourth week is to
test the knowledge gained over the course by completing modules on AWS Identity and
Access Management (IAM), securing a new AWS account, securing accounts, etc.
WEEK-5
The objective of the activity done: The main objective of this week's activities is to know
about Validations and Networking basics, Amazon VPC, etc., based on certain modules as
part of the internship.
The random forest algorithm is made up of a collection of decision trees, and each
tree in the ensemble is built from a data sample drawn from the training set with
replacement, called the bootstrap sample. Of that training sample, about one-third is set aside
as test data, known as the out-of-bag (oob) sample, which we'll come back to later.
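A minimal random forest sketch (assuming scikit-learn; the Iris dataset is used purely as a stand-in) showing the bootstrap and out-of-bag ideas:

# Each tree trains on a bootstrap sample; oob_score=True uses the out-of-bag
# rows for a built-in accuracy estimate.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                oob_score=True, random_state=0).fit(X, y)

print(forest.oob_score_)          # accuracy estimated on the out-of-bag samples
print(forest.predict(X[:3]))      # prediction is the majority vote of the trees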
K-Nearest Neighbour:
K-Nearest Neighbours is one of the most basic yet essential classification algorithms
in Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not
make any underlying assumptions about the distribution of the data (as opposed to other
algorithms such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute.
As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points a group
by analyzing the training set. Note that the unclassified points are marked as ‘White’.
Intuition:
If we plot these points on a graph, we may be able to locate some clusters or groups.
Now, given an unclassified point, we can assign it to a group by observing what group its
nearest neighbours belong to. This means a point close to a cluster of points classified as
‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the
second point (5.5, 4.5) should be classified as ‘Red’.
Algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[]. This means each element
of this array represents a tuple (x, y).
2. for i = 0 to m: calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained; each of these distances corresponds
to an already classified data point.
4. Return the majority label among S.
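A small NumPy implementation of this procedure (the training points, labels, and the value of K are made up):

import numpy as np
from collections import Counter

train_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],      # class 'Red'
                         [6.0, 6.5], [6.5, 6.0], [7.0, 7.0]])     # class 'Green'
train_labels = ['Red', 'Red', 'Red', 'Green', 'Green', 'Green']

def knn_classify(p, k=3):
    # Euclidean distance from the unknown point p to every stored training point
    distances = np.linalg.norm(train_points - np.asarray(p), axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k smallest distances
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]              # majority label among the neighbours

print(knn_classify([2.5, 2.0]))   # expected: 'Red'
print(knn_classify([6.2, 6.8]))   # expected: 'Green'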
This section discusses the theory behind the Naive Bayes classifiers and their implementation.
Consider a fictional dataset that describes the weather conditions for playing a game of golf.
Given the weather conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”)
for playing golf.
The dataset is divided into two parts, namely, the feature matrix and the response vector.
Feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the features. In the above dataset, the features are
‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable (prediction or output) for
each row of feature matrix. In above dataset, the class variable name is ‘Play
golf’.
Assumption:
The fundamental Naive Bayes assumption is that each feature makes an:
independent
equal
contribution to the outcome.
We assume that no pair of features are dependent. For example, the temperature
being ‘Hot’ has nothing to do with the humidity or the outlook being ‘Rainy’ has
no effect on the winds. Hence, the features are assumed to be independent.
Secondly, each feature is given the same weight(or importance). For example,
knowing only temperature and humidity alone can’t predict the outcome
accurately. None of the attributes is irrelevant and assumed to be
contributing equally to the outcome.
Note: The assumptions made by Naive Bayes are not generally correct in real-world
situations. In-fact, the independence assumption is never correct but often works well in
practice.
Now, before moving to the formula for Naive Bayes, it is important to know about Bayes’
theorem.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. Bayes’ theorem is stated mathematically as the following
equation:
P(A | B) = P(B | A) · P(A) / P(B)
Basically, we are trying to find the probability of event A, given that event B is true.
Event B is also termed as evidence.
P(A) is the priori of A (the prior probability, i.e. the probability of the event before
the evidence is seen). The evidence is an attribute value of an unknown instance (here,
it is event B).
P(A | B) is the a posteriori probability of A, i.e. the probability of the event after the
evidence is seen.
Now, with regard to our dataset, we can apply Bayes’ theorem as
P(y | X) = P(X | y) · P(y) / P(X), where y is the class variable and X is a feature vector
(of size n): X = (x1, x2, ..., xn).
Just to be clear, an example of a feature vector and its corresponding class variable is the
1st row of the dataset.
Naive assumption:
Now, it’s time to put a naive assumption to the Bayes’ theorem, which
is independence among the features. So now, we split the evidence into its independent parts.
If any two events A and B are independent, then
P(A, B) = P(A) · P(B)
so the theorem reduces to P(y | x1, ..., xn) being proportional to P(y) · P(x1 | y) · ... · P(xn | y).
Now, we need to create a classifier model. For this, we find the probability of a given set of
inputs for all possible values of the class variable y and pick up the output with the maximum
probability. This can be expressed mathematically as:
y = argmax over y of P(y) · Π P(xi | y)
So, finally, we are left with the task of calculating P(y) and P(xi | y).
Please note that P(y) is also called class probability and P(xi | y) is called conditional
probability.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding
the distribution of P(xi | y).
Let us try to apply the above formula manually on our weather dataset. For this, we need to
do some precomputations on our dataset.
We need to find P(xi | yj) for each xi in X and yj in y. All these calculations have been
demonstrated in the tables below:
So, in the figure above, we have calculated P(xi | yj) for each xi in X and yj in y manually in
tables 1-4. For example, the probability of playing golf given that the temperature is cool, i.e.
P(temp. = cool | play golf = Yes) = 3/9.
Also, we need to find class probabilities (P(y)) which has been calculated in the table 5. For
example, P(play golf = Yes) = 9/14.
So now, we are done with our pre-computations and the classifier is ready!
The method that we discussed above is applicable for discrete data. In case of continuous
data, we need to make some assumptions regarding the distribution of values of each feature.
As noted above, the different Naive Bayes classifiers differ mainly by the assumptions they
make regarding the distribution of P(xi | y). One such classifier is Gaussian Naive Bayes,
which assumes that the values of each continuous feature follow a Gaussian (normal)
distribution within each class.
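An illustrative sketch of Gaussian Naive Bayes with scikit-learn (the Iris dataset is used as a stand-in, since the golf dataset above is categorical):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)  # estimates a mean/variance per feature per class
print(gnb.score(X_test, y_test))          # accuracy on unseen data
print(gnb.predict_proba(X_test[:1]))      # P(y | x) for one test sample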
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Though it can be used for regression problems as well, it is
best suited for classification. The objective of the SVM algorithm is to find a hyperplane in
an N-dimensional space that distinctly classifies the data points. The dimension of the
hyperplane depends upon the number of features. If the number of input features is two, then
the hyperplane is just a line. If the number of input features is three, then the hyperplane
becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
Let’s consider two independent variables x1, x2 and one dependent variable which is either
a blue circle or a red circle.
Linearly Separable Data points :
From the figure above it is very clear that there are multiple lines (our hyperplane here is a
line because we are considering only two input features, x1 and x2) that segregate our data
points or do a classification between the red and blue circles. So how do we choose the best
line, or in general the best hyperplane, that segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest separation
or margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2.
Here we have one blue ball in the boundary of the red ball. So how does SVM classify the
data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The
SVM algorithm has the characteristics to ignore the outlier and finds the best hyperplane that
maximizes the margin. SVM is robust to outliers.
So in this type of data points, what SVM does is find the maximum margin as done with the
previous data sets, and along with that it adds a penalty each time a point crosses the margin.
The margins in these types of cases are called soft margins. When there is a soft margin in
the data set, the SVM tries to minimize (1/margin + λ·(∑penalty)). Hinge loss is a commonly
used penalty: if there are no violations, there is no hinge loss; if there are violations, the
hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data(the group of blue balls and red balls
are separable by a straight line/linear line). What to do if data are not linearly separable?
Say our data is as shown in the figure above. SVM solves this by creating a new variable
using a kernel. We call a point xi on the line, and we create a new variable yi as a function of
its distance from the origin O. If we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates such a new variable is referred to as a kernel.
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and transforms it into
a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem.
It is mostly useful in non-linear separation problems. Simply put, the kernel does some
extremely complex data transformations and then finds out the process to separate the data
based on the labels or outputs defined.
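A minimal sketch of both cases (assuming scikit-learn; both datasets are synthetic): a linear kernel for linearly separable data and an RBF kernel for data that is not.

from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable case: a straight-line hyperplane with a (soft) margin
X_lin, y_lin = make_blobs(n_samples=100, centers=2, random_state=0)
linear_svm = SVC(kernel='linear', C=1.0).fit(X_lin, y_lin)
print(linear_svm.score(X_lin, y_lin))

# Non-linear case: the RBF kernel maps the points to a higher-dimensional space
X_circ, y_circ = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X_circ, y_circ)
print(rbf_svm.score(X_circ, y_circ))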
Advantages of SVM:
SVM works well when there is a clear margin of separation between classes, is effective in
high-dimensional spaces, and is relatively memory efficient because the decision function
uses only a subset of the training points (the support vectors).
Artificial neural networks are a technology based on studies of the brain and nervous system
as depicted in Fig. 1. These networks emulate a biological neural network but they use a
reduced set of concepts from biological neural systems. Specifically, ANN models simulate
the electrical activity of the brain and nervous system. Processing elements (also known as
either a neurode or perceptron) are connected to other processing elements. Typically the
neurodes are arranged in a layer or vector, with the output of one layer serving as the input to
the next layer and possibly other layers. A neurode may be connected to all or a subset of the
neurodes in the subsequent layer, with these connections simulating the synaptic
connections of the brain. Weighted data signals entering a neurode simulate the electrical
excitation of a nerve cell and consequently the transference of information within the network
or brain. The input values to a processing element, i_n, are multiplied by a connection
weight, w_n,m, that simulates the strengthening of neural pathways in the brain. It is through the
adjustment of the connection strengths or weights that learning is emulated in ANNs.
All of the weight-adjusted input values to a processing element are then aggregated using a
vector-to-scalar function such as summation (i.e., y = Σ w_ij · x_i), averaging, input maximum, or
mode value to produce a single input value to the neurode. Once the input value is calculated,
the processing element then uses a transfer function to produce its output (and consequently
the input signals for the next processing layer). The transfer function transforms the neurode's
input value. Typically this transformation involves the use of a sigmoid, hyperbolic-tangent,
or other nonlinear function. The process is repeated between layers of processing elements
until a final output value, o_n, or vector of values is produced by the neural network.
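A small sketch of a single processing element (the inputs, weights, and bias below are made up):

# One neurode: weighted inputs are summed and passed through a sigmoid transfer function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9])        # input values i_n
w = np.array([0.4, -0.7, 0.2])       # connection weights w_n,m
b = 0.1                              # bias term

net_input = np.dot(w, x) + b         # aggregation: y = sum(w_i * x_i) + b
output = sigmoid(net_input)          # transfer function produces the neurode's output
print(net_input, output)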
Theoretically, to simulate the asynchronous activity of the human nervous system, the
processing elements of the artificial neural network should also be activated with the
weighted
input signal in an asynchronous manner. Most software and hardware implementations
of artificial neural networks, however, implement a more discretized approach that
guarantees that each processing element is activated once for each presentation of a vector of
input values
The human brain processes a huge amount of information the second we see an image. Each
neuron works in its own receptive field and is connected to other neurons in a way that they
cover the entire visual field. Just as each neuron responds to stimuli only in the restricted
region of the visual field called the receptive field in the biological vision system, each
neuron in a CNN processes data only in its receptive field as well. The layers are arranged in
such a way so that they detect simpler patterns first (lines, curves, etc.) and more complex
patterns (faces, objects, etc.) further along. By using a CNN, one can enable sight to
computers.
Convolutional Neural Network Architecture:
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully
connected layer.
Architecture of a CNN
Convolution Layer
The convolution layer is the core building block of the CNN. It carries the main portion of the
network’s computational load.
This layer performs a dot product between two matrices, where one matrix is the set of
learnable parameters otherwise known as a kernel, and the other matrix is the restricted
portion of the receptive field. The kernel is spatially smaller than an image but is more in-
depth. This means that, if the image is composed of three (RGB) channels, the kernel height
and width will be spatially small, but the depth extends up to all three channels.
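An illustrative sketch of this dot product (NumPy only; the 5×5 image and 3×3 kernel are made up, with no padding and a stride of 1):

# Sliding a 3x3 kernel over a 5x5 single-channel "image" produces a 3x3 feature map.
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
kernel = np.array([[1.0, 0.0, -1.0],               # toy 3x3 learnable kernel
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

out_size = image.shape[0] - kernel.shape[0] + 1    # no padding, stride 1 -> 3x3 output
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i+3, j:j+3]                # restricted receptive field
        feature_map[i, j] = np.sum(patch * kernel) # dot product of kernel and patch

print(feature_map)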
A recurrent neural network (RNN) is a type of artificial neural network which uses
sequential data or time series data. These deep learning algorithms are commonly used for
ordinal or temporal problems, such as language translation, natural language processing (nlp),
speech recognition, and image captioning; they are incorporated into popular applications
such as Siri, voice search, and Google Translate. Like feedforward and convolutional neural
networks (CNNs), recurrent neural networks utilize training data to learn. They are
distinguished by their “memory” as they take information from prior inputs to influence the
current input and output. While traditional deep neural networks assume that inputs and
outputs are independent of each other, the output of a recurrent neural network depends on
the prior elements within the sequence. While future events would also be helpful in
determining the output of a given sequence, unidirectional recurrent neural networks cannot
account for these events in their predictions.
Figure: Comparison of Recurrent Neural Networks (left) and Feedforward Neural Networks
(right).
Let’s take an idiom, such as “feeling under the weather”, which is commonly used when
someone is ill, to aid us in the explanation of RNNs. In order for the idiom to make sense, it
needs to be expressed in that specific order. As a result, recurrent networks need to account
for the position of each word in the idiom and they use that information to predict the next
word in the sequence.
Looking at the visual below, the “rolled” visual of the RNN represents the whole neural
network, or rather the entire predicted phrase, like “feeling under the weather.” The
“unrolled” visual represents the individual layers, or time steps, of the neural network. Each
layer maps to a single word in that phrase, such as “weather”. Prior inputs, such as “feeling”
and “under”, would be represented as a hidden state in the third timestep to predict the output
in the sequence, “the”.
Another distinguishing characteristic of recurrent networks is that they share parameters
across each layer of the network. While feedforward networks have different weights across
each node, recurrent neural networks share the same weight parameters within each layer of
the network. That said, these weights are still adjusted through the processes of
backpropagation and gradient descent to facilitate learning.
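A small sketch of an unrolled recurrent step (NumPy only; the sizes and values are made up) showing the same weight matrices being reused at every time step:

import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))    # input-to-hidden weights (shared across time steps)
W_hh = rng.normal(size=(4, 4))    # hidden-to-hidden weights (the "memory" connection)
b_h = np.zeros(4)

sequence = [rng.normal(size=3) for _ in range(5)]  # e.g. 5 word vectors of size 3
h = np.zeros(4)                                    # initial hidden state

for x_t in sequence:
    # the hidden state carries information from prior inputs into the current step
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print(h)   # final hidden state summarising the whole sequence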
Through this process, RNNs tend to run into two problems, known as exploding gradients
and vanishing gradients. These issues are defined by the size of the gradient, which is the
slope of the loss function along the error curve. When the gradient is too small, it continues to
become smaller, updating the weight parameters until they become insignificant—i.e. 0.
When that occurs, the algorithm is no longer learning. Exploding gradients occur when the
gradient is too large, creating an unstable model. In this case, the model weights will grow
too large, and they will eventually be represented as NaN. One solution to these issues is to
reduce the number of
hidden layers within the neural network, eliminating some of the complexity in the RNN
model.