ML Internship
ON
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
SIR C R REDDY COLLEGE OF ENGINEERING
CERTIFICATE
This is to certify that this Virtual Internship report entitled “AI & ML DEVELOPER
(MACHINE LEARNING)” submitted by NEELAM SASI PRIYA (20B81A05C0) is in
partial fulfilment for the award of the degree of BACHELOR OF TECHNOLOGY in
COMPUTER SCIENCE AND ENGINEERING, at SIR C R REDDY COLLEGE OF
ENGINEERING, ELURU, affiliated to Jawaharlal Nehru Technological University,
Kakinada.
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
First, I would like to thank Mr. Upendar, Chip Electronics, Vijayawada, for giving
me this opportunity to do an internship within the organization. I would also like to
thank all the people who worked along with me at Chip Electronics, Vijayawada; with
their patience and openness they created an enjoyable working environment.
It is indeed with a great sense of pleasure and immense gratitude that I
acknowledge the help of these individuals.
I would like to thank my Head of the Department, Dr. A. YESU BABU, for his
constructive criticism throughout my internship.
I would like to thank Mr. V. PRANAV, Internship Coordinator, Department of CSE, for
his support and advice in getting and completing the internship in the above-said organization.
(20B81A05C0)
ABSTRACT
Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to imitate the way that
humans learn, gradually improving its accuracy.
Over the last couple of decades, the technological advances in storage and
processing power have enabled some innovative products based on machine
learning, such as Netflix’s recommendation engine and self-driving cars.
ORGANISATION INFORMATION
LEARNING OBJECTIVES/INTERNSHIP
OBJECTIVES
➢ An objective for this position should emphasize the skills you already possess in
the area and your interest in learning more.
➢ Utilizing internships is a great way to build your resume and develop skills that
will be emphasized in your resume for future jobs.
➢ When you are applying for a Training Internship, make sure to highlight any
special skills or talents that can make you stand apart from the rest of the
applicants so that you have an improved chance of landing the position.
INDEX
1. Introduction
2. History and Evolution of Machine Learning
3. Life Cycle of Machine Learning
4. Classification of Machine Learning
5. Supervised Machine Learning
   Types of Supervised Machine Learning Algorithms
   Types of Classification
   Types of Regression
6. Unsupervised Machine Learning
   Clustering in Machine Learning
   Types of Clustering Methods
   Types of Clustering Algorithms
   Principal Component Analysis
7. Applications of Machine Learning
8. Conclusion
INTRODUCTION
Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to
“self-learn” from training data and improve over time, without being explicitly programmed.
Machine learning algorithms are able to detect patterns in data and learn from them, in
order to make their own predictions. In short, machine learning algorithms and models learn
through experience.
In traditional programming, a computer engineer writes a series of directions that instruct a
computer how to transform input data into a desired output. Instructions are mostly based
on an IF-THEN structure: when certain conditions are met, the program executes a specific
action.
Machine learning, on the other hand, is an automated process that enables machines to
solve problems with little or no human input, and take actions based on past observations.
While artificial intelligence and machine learning are often used interchangeably, they are
two different concepts. AI is the broader concept – machines making decisions, learning new
skills, and solving problems in a similar way to humans – whereas machine learning is a
subset of AI that enables intelligent systems to autonomously learn new things from data.
Instead of programming machine learning algorithms to perform tasks, you can feed them
examples of labelled data (known as training data), which helps them make calculations,
process data, and identify patterns automatically.
Put simply, Google’s Chief Decision Scientist describes machine learning as a fancy labelling
machine. After teaching machines to label things like apples and pears, by showing them
examples of fruit, eventually they will start labelling apples and pears without any help –
provided they have learned from appropriate and accurate training examples.
Machine learning can be put to work on massive amounts of data and, on many such
tasks, can perform far faster and more consistently than humans.
HISTORY AND EVOLUTION OF MACHINE
LEARNING
About 40 to 50 years ago, machine learning was science fiction, but today it is part of our
daily life. Machine learning is making our day-to-day life easy, from self-driving cars to Amazon's
virtual assistant "Alexa". The idea behind machine learning, however, is old and has a long history.
Some milestones in the history of machine learning are given below.

The early history of Machine Learning (pre-1940):
1834: Charles Babbage, the father of the computer, conceived a device that could be programmed
with punch cards. The machine was never built, but all modern computers rely on its logical structure.
1936: Alan Turing gave a theory of how a machine can determine and execute a set of instructions.

The era of stored-program computers:
1943: A human neural network was modelled with an electrical circuit. Around 1950, scientists
started applying this idea and analysed how human neurons might work.
1945: "ENIAC", the first electronic general-purpose computer, was completed. Stored-program
computers such as EDSAC (1949) and EDVAC (1951) followed.

Computer machinery and intelligence:
1950: Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the
topic of artificial intelligence. In his paper, he asked, "Can machines think?"
Machine intelligence in games:
1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an IBM
computer play checkers. It performed better the more it played.
1959: The term "Machine Learning" was first coined by Arthur Samuel.

The first "AI winter": The period from 1974 to 1980 was a tough time for AI and ML
researchers and was called the AI winter. During this period, machine translation failed,
people lost interest in AI, and government funding for research was reduced.
Machine Learning from theory to reality:
1959: The first neural network was applied to a real-world problem, removing echoes over
phone lines using an adaptive filter.
1985: Terry Sejnowski and Charles Rosenberg invented a neural network, NETtalk, which was
able to teach itself how to correctly pronounce 20,000 words in one week.
1997: IBM's Deep Blue intelligent computer won a chess match against the chess expert Garry
Kasparov, becoming the first computer to beat a human world chess champion.

Machine Learning in the 21st century:
2006: Computer scientist Geoffrey Hinton gave neural network research the new name "deep
learning," and nowadays it has become one of the most trending technologies.
2012: Google created a deep neural network which learned to recognize images of humans and
cats in YouTube videos.
2014: The chatbot "Eugene Goostman" passed a Turing Test event; it was the first chatbot to
convince 33% of the human judges that it was not a machine.
2014: DeepFace was a deep neural network created by Facebook, which they claimed could
recognize a person with the same precision as a human.
2016: AlphaGo beat the world's second-ranked Go player, Lee Sedol. In 2017 it beat the
number-one-ranked player, Ke Jie.
2017: Alphabet's Jigsaw team built an intelligent system that was able to learn about
online trolling. It read millions of comments from different websites to learn to stop online
trolling.
Machine Learning at present: Machine learning research has now advanced greatly, and
it is present everywhere around us, in self-driving cars, Amazon Alexa, chatbots,
recommender systems, and many more. It includes supervised, unsupervised, and reinforcement
learning, with clustering, classification, decision tree, and SVM algorithms, etc. Modern machine
learning models can be used for making various predictions, including weather prediction,
disease prediction, stock market analysis, etc.
LIFE CYCLE OF MACHINE LEARNING
The main purpose of the life cycle is to find a solution to the problem or project. The machine
learning life cycle involves seven major steps, which are given below:
• Gathering Data
• Data preparation
• Data Wrangling
• Analyse Data
• Train the model
• Test the model
• Deployment
1. Gathering Data: Data gathering is the first step of the machine learning life cycle. The goal of this
step is to identify and obtain all the data related to the problem. In this step, we need to identify the
different data sources, as data can be collected from sources such as files, databases, the internet, or
mobile devices. It is one of the most important steps of the life cycle, since the quantity and quality of
the collected data determine the efficiency of the output: the more data there is, the more accurate
the prediction will be. This step includes the tasks below:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset, which will be
used in further steps.
2. Data preparation: After collecting the data, we need to prepare it for further steps. Data
preparation is a step where we put our data into a suitable place and prepare it to use in our
machine learning training. In this step, first, we put all data together, and then randomize the
ordering of data.
Data exploration: It is used to understand the nature of the data that we have to work with. We
need to understand the characteristics, format, and quality of the data; a better understanding of the
data leads to an effective outcome. In this step, we find correlations, general trends, and outliers.
Data pre-processing: Now the next step is preprocessing of data for its analysis.
3. Data Wrangling: Data wrangling is the process of cleaning and converting raw data into a usable
format. It involves cleaning the data, selecting the variables to use, and transforming the data into a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning the data is required to address quality issues; the
data we have collected is not always useful to us, as some of it may be irrelevant. In real-world
applications, collected data may have various issues, including:
• Missing Values
• Duplicate data
• Invalid data
• Noise
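The cleaning steps above can be sketched with pandas; the toy DataFrame and its column names below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data exhibiting the issues listed above:
# a missing value, a duplicate row, and an invalid (out-of-range) age.
raw = pd.DataFrame({
    "age":    [25, 32, np.nan, 32, 140],
    "salary": [50000, 64000, 58000, 64000, 61000],
})

clean = (
    raw.drop_duplicates()          # remove the duplicate row
       .query("0 < age < 100")     # drop invalid ages (NaN rows also fail this test)
       .reset_index(drop=True)
)
print(clean)                       # two valid rows remain
```

Real pipelines would also impute missing values or correct noisy entries instead of simply dropping them; this sketch only shows the mechanics.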
4. Data Analysis: Now the cleaned and prepared data is passed on to the analysis step. The aim of
this step is to build a machine learning model to analyse the data using various analytical
techniques and review the outcome. It starts with determining the type of problem, where we
select machine learning techniques such as classification, regression, cluster analysis,
association, etc.; then we build the model using the prepared data and evaluate it. Hence, in this
step, we take the data and use machine learning algorithms to build the model.
5. Train Model: The next step is to train the model. In this step we train our model to improve
its performance and produce a better outcome for the problem. We use datasets to train the model
with various machine learning algorithms. Training a model is required so that it can learn the
various patterns, rules, and features.
6. Test Model : Once our machine learning model has been trained on a given dataset, then we test
the model. In this step, we check for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirement of
project or problem.
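The train and test steps can be sketched with scikit-learn; the Iris dataset, the k-NN model, and the 80/20 split below are illustrative choices, not part of the report:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing (steps 5 and 6 of the life cycle).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)               # train the model on the training split
accuracy = model.score(X_test, y_test)    # test the model on unseen data
print(f"test accuracy: {accuracy:.2f}")
```

Holding out a test split is what lets the reported accuracy estimate performance on data the model has never seen.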
7. Deployment: The last step of machine learning life cycle is deployment, where we deploy the
model in the real-world system. If the above-prepared model is producing an accurate result as per
our requirement with acceptable speed, then we deploy the model in the real system. But before
deploying the project, we will check whether it is improving its performance using available data or
not. The deployment phase is similar to making the final report for a project.
CLASSIFICATION OF MACHINE LEARNING
These ML algorithms help to solve different business problems like regression, classification,
forecasting, clustering, and association, etc. Based on the methods and way of learning, machine
learning is divided into mainly four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1.Supervised Machine Learning: As its name suggests, Supervised machine learning is based on
supervision. It means in the supervised learning technique, we train the machines using the
"labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled
data specifies that some of the inputs are already mapped to the output. More precisely, we can
say that first we train the machine with the input and corresponding output, and then we ask the
machine to predict the output using the test dataset.
The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud Detection,
Spam filtering, etc.
Categories of Supervised Machine Learning Supervised machine learning can be classified into two
types of problems, which are given below:
• Classification
• Regression
2. Unsupervised Machine Learning: In unsupervised learning, the models are trained with data
that is neither classified nor labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset
according to similarities, patterns, and differences. Machines are instructed to find the hidden
patterns in the input dataset.
Categories of Unsupervised Machine Learning: Unsupervised machine learning can be classified
into two types of problems, which are given below:
• Clustering
• Association
Categories of Reinforcement Learning: Reinforcement learning is categorized mainly into two
types of methods/algorithms:
Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency
that the required behaviour would occur again by adding something. It enhances the strength of the
behaviour of the agent and positively impacts it.
Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to the
positive RL. It increases the tendency that the specific behaviour would occur again by avoiding the
negative condition.

Real-world Use Cases of Reinforcement Learning:
Video Games: RL algorithms are very popular in gaming applications, where they are used to
achieve superhuman performance. Some popular game-playing systems that use RL algorithms
are AlphaGo and AlphaGo Zero.
Resource Management: The "Resource Management with Deep Reinforcement Learning" paper
showed how RL can be used to automatically learn to allocate and schedule computer resources
across waiting jobs in order to minimize average job slowdown.
Robotics: RL is widely used in robotics applications. Robots are used in industrial and
manufacturing areas, and these robots are made more powerful with reinforcement learning.
Different industries have a vision of building intelligent robots using AI and machine
learning technology.
Text Mining: Text mining, one of the great applications of NLP, is now being implemented with the
help of reinforcement learning by Salesforce.
SUPERVISED MACHINE LEARNING
In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on held-out
test data, and then it predicts the output.
Types of supervised Machine learning Algorithms: Supervised learning can be further divided
into two types of problems
Regression: Regression algorithms are used if there is a relationship between the input variable and
the output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under supervised
learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Classification: Classification algorithms are used when the output variable is categorical, meaning
there are two classes such as Yes-No, Male-Female, True-False, etc., as in spam filtering. Below are
some popular Classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines
Advantages of Supervised Learning:
• With the help of supervised learning, the model can predict the output on the basis of prior
experience.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning models help us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of Supervised Learning:
• Supervised learning models are not suitable for handling complex tasks.
• Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.
The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:
Binary Classifier: If the classification problem has only two possible outcomes, then it is called a
Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a
Multi-class Classifier.
Learners in Classification Problems: In the classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner's case, classification is done on the basis of the most related data stored
in the training dataset. It takes less time in training but more time for predictions.
2. Eager Learners: Eager learners develop a classification model based on a training dataset before
receiving a test dataset. Opposite to lazy learners, eager learners take more time in learning and less
time in prediction.
Types of Machine Learning Classification
Algorithms
Classification algorithms can be further divided mainly into two categories:
• Linear Models
  Logistic Regression
  Support Vector Machines
• Non-linear Models
  K-Nearest Neighbours
  Kernel SVM
  Naïve Bayes
  Decision Tree Classification
  Random Forest Classification
Evaluating a Classification Model:
1. Log Loss or Cross-Entropy Loss: It is used for evaluating the performance of a classifier whose
output is a probability value between 0 and 1. For a good binary classification model, the value of
log loss should be near 0.
2. Confusion Matrix: The confusion matrix provides us a matrix/table as output and describes the
performance of the model. It is also known as the error matrix.
The matrix summarizes the prediction results, showing the total number of correct and incorrect
predictions. For a binary problem it has four cells: true positives, false positives, false negatives,
and true negatives.
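As a small sketch, the four cells of a binary confusion matrix can be computed with scikit-learn; the label vectors below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model's predictions

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")   # TP=3 FP=1 TN=3 FN=1
```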
3. AUC-ROC Curve: ROC stands for Receiver Operating Characteristic and AUC stands for
Area Under the Curve. The ROC curve is a graph that shows the performance of the classification
model at different thresholds; it is plotted with TPR (True Positive Rate) on the Y-axis and FPR
(False Positive Rate) on the X-axis. The AUC-ROC curve can also be used to visualize the
performance of multi-class classification models.
TYPES OF CLASSIFICATION
• Binary Classification
• Multi -class Classification
Binary Classification
It is a process or task of classification in which given data is classified into two classes; it is
basically a prediction of which of two groups the thing belongs to.
Binary classification uses algorithms such as Logistic Regression, k-Nearest Neighbours, Decision
Trees, Support Vector Machines, and Naïve Bayes. Common metrics for evaluating a binary
classifier include:
1. PRECISION: Precision measures how many of the cases the classifier labelled positive are
actually positive; it is the ratio of true positives to the total number of predicted positives.
2. RECALL: Recall is also known as sensitivity. In binary classification (Yes/No), recall measures
how "sensitive" the classifier is in detecting positive cases; to put it another way, how many of the
real positive cases did we "catch" in our sample? Note that this metric can be gamed: classifying
every result as positive yields perfect recall.
3. F1 SCORE: The F1 score is the harmonic mean of precision and recall,
F1 = 2 x (precision x recall) / (precision + recall), with the best value being 1 and the worst
being 0. Precision and recall make an equal contribution to the F1 score.
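These three metrics can be sketched with scikit-learn on made-up labels; with TP=3, FP=1, and FN=1, all three come out to 0.75:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # TP=3, FP=1, FN=1

precision = precision_score(y_true, y_pred)   # 3 / (3 + 1) = 0.75
recall    = recall_score(y_true, y_pred)      # 3 / (3 + 1) = 0.75
f1        = f1_score(y_true, y_pred)          # harmonic mean = 0.75
print(precision, recall, f1)
```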
Multiclass Classification
Multi-class classification is the task of classifying elements into one of three or more classes;
unlike binary classification, it is not restricted to two classes.
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to each training point.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is
maximum.
Advantages of the KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.
Disadvantages of the KNN Algorithm:
• The value of K always needs to be determined, which may be complex at times.
• The computation cost is high, because the distance to every training sample must be calculated.
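The five steps above can be sketched from scratch in a few lines; the toy 2-D points and the helper name `knn_predict` are illustrative:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Step 2: Euclidean distance from the new point to every training point.
    dists = [(math.dist(x, new_point), label) for x, label in zip(train_X, train_y)]
    # Step 3: take the K nearest neighbours.
    nearest = sorted(dists)[:k]
    # Steps 4-5: count the categories and pick the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: class "A" clusters near the origin, class "B" near (5, 5).
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X, y, (0.5, 0.5), k=3))   # "A"
print(knn_predict(X, y, (5.5, 5.5), k=3))   # "B"
```

The brute-force distance loop is exactly why the disadvantage above notes a high computation cost on large training sets.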
Decision Tree Algorithm
Decision trees are a popular machine learning algorithm that can be used for both regression and
classification tasks.
A decision tree is a hierarchical model used in decision support that depicts decisions and their
potential outcomes, incorporating chance events, resource expenses, and utility. This algorithmic
model utilizes conditional control statements and is non-parametric, supervised learning, useful for
both classification and regression tasks. The tree structure is comprised of a root node, branches,
internal nodes, and leaf nodes, forming a hierarchical, tree-like structure.
It is a tool that has applications spanning several different areas. Decision trees can be used for
classification as well as regression problems. The name itself suggests that it uses a flowchart-like
tree structure to show the predictions that result from a series of feature-based splits. It starts
with a root node and ends with a decision made by the leaves.
Before learning more about decision trees let’s get familiar with some of the terminologies:
Root Node – the node present at the beginning of a decision tree; from this node, the
population starts dividing according to various features.
Decision Nodes – the nodes we get after splitting the root node are called decision nodes.
Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or terminal
nodes.
Sub-tree – just as a small portion of a graph is called a subgraph, a sub-section of this
decision tree is called a sub-tree.
Entropy:
Entropy is a measure of the uncertainty or disorder in our dataset. For a node whose samples
fall into c classes, it is
E = -Σ (i = 1 to c) pi log2(pi)
where pi is the proportion of samples belonging to class i.
Information Gain
Information gain measures the reduction of uncertainty given some feature and it is also a deciding
factor for which attribute should be selected as a decision node or root node.
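A minimal sketch of entropy and information gain on toy yes/no labels (the labels and the perfect split are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """E = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A 50/50 class mix is maximally uncertain.
p = entropy(["yes", "no", "yes", "no"])
print(p)   # 1.0

# Information gain of a split = parent entropy minus the
# weighted average entropy of the child nodes.
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]
gain = entropy(parent) - (len(left) / 4 * entropy(left)
                          + len(right) / 4 * entropy(right))
print(gain)   # 1.0: a perfect split removes all uncertainty
```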
Pruning
It is another method that can help us avoid overfitting. It helps in improving the performance of the
tree by cutting the nodes or sub-nodes which are not significant. It removes the branches which have
very low importance.
Pre-pruning – we can stop growing the tree earlier, which means we can prune/remove/cut a node
if it has low importance while growing the tree.
Post-pruning – once our tree is built to its depth, we can start pruning the nodes based on their
significance.
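Both pruning styles can be sketched with scikit-learn's decision tree: `max_depth` for pre-pruning and cost-complexity `ccp_alpha` for post-pruning (the Iris dataset and the hyperparameter values are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing the tree beyond depth 3 while it is being built.
pre = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Post-pruning: grow the full tree, then prune branches whose
# contribution is below the cost-complexity threshold ccp_alpha.
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(pre.get_depth(), post.tree_.node_count, full.tree_.node_count)
```

The pruned trees trade a little training accuracy for a simpler structure that generalizes better, which is the point of avoiding overfitting.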
TYPES OF REGRESSION
Linear Regression: The linear regression algorithm shows a linear relationship between a
dependent (y) variable and one or more independent (x) variables, hence it is called linear
regression. Since linear regression shows a linear relationship, it finds how the value of the
dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables.
y = a0 + a1x + ε
Here:
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line
a1 = linear regression coefficient (slope of the line)
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression: If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
Multiple Linear regression: If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
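A simple linear regression sketch: fitting y = a0 + a1x to noiseless data recovers the chosen coefficients (the values a0 = 2 and a1 = 3 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10).reshape(-1, 1)   # single independent variable
y = 3 * x.ravel() + 2              # true relationship: a0 = 2, a1 = 3

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])   # recovers approximately 2.0 and 3.0
```

With more than one column in `x`, the identical call performs multiple linear regression.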
Polynomial Regression
Polynomial Regression is a type of regression which models the non-linear dataset using a linear
model. It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and corresponding conditional values of y. Suppose there is a dataset which consists of datapoints
which are present in a non-linear fashion, so for such case, linear regression will not best fit to those
datapoints. To cover such datapoints, we need polynomial regression. In polynomial regression, the
original features are transformed into polynomial features of a given degree and then fitted with a
linear model, which means the datapoints are best fitted by a polynomial curve.
The polynomial regression model takes the form Y = b0 + b1x + b2x^2 + ... + bnx^n, where Y is
the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is our
independent/input variable.
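This transform-then-fit idea can be sketched in scikit-learn, assuming an illustrative quadratic dataset and degree 2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2    # Y = b0 + b1*x + b2*x^2

# Transform x into polynomial features [x, x^2], then fit a linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

pred = model.predict([[2.0]])
print(pred[0])   # approximately 1 + 4 + 2 = 7
```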
Support Vector Regression in Machine Learning
Support Vector Machine is a supervised learning algorithm which can be used for regression as well
as classification problems. When we use it for regression problems, it is termed Support Vector
Regression (SVR).
Support Vector Regression is a regression algorithm which works for continuous variables. Below are
some keywords which are used in Support Vector Regression:
• Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
• Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a
line which helps to predict the continuous variables and cover most of the datapoints.
• Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a
margin for datapoints.
• Support vectors: Support vectors are the datapoints which are nearest to the hyperplane
and of the opposite class.
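A minimal SVR sketch on noisy sine data; the RBF kernel, C, and epsilon values are illustrative hyperparameters, with epsilon setting the width of the boundary-line margin described above:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.05, 80)   # continuous target with noise

# The RBF kernel maps the data into a higher-dimensional space;
# epsilon defines the tube around the hyperplane that tolerates errors.
svr = SVR(kernel="rbf", C=10, epsilon=0.1).fit(X, y)

pred = svr.predict([[1.57]])   # near sin(pi/2) = 1
print(pred[0])
```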
Random Forest Regression in Machine Learning
• Random forest is one of the most powerful supervised learning algorithms, capable of
performing regression as well as classification tasks.
• Random forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output based on the average of each tree's output.
The combined decision trees are called base models, and the ensemble can be represented
more formally as: g(x) = f1(x) + f2(x) + f3(x) + ...
• Random forest uses the Bagging or Bootstrap Aggregation technique of ensemble
learning, in which the aggregated decision trees run in parallel and do not interact with
each other.
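The averaging of base models can be verified directly in scikit-learn; the linear toy data below is made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X.ravel() + rng.normal(0, 0.5, 200)   # roughly y = 2x plus noise

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The forest's prediction is the mean of its individual trees' predictions.
pred = forest.predict([[5.0]])[0]
manual_avg = np.mean([tree.predict([[5.0]])[0] for tree in forest.estimators_])
print(pred, manual_avg)   # the two values agree
```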
Logistic Regression in Machine Learning
• Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. But
instead of giving the exact values 0 and 1, it gives probabilistic values which lie between
0 and 1.
• Logistic regression is similar to linear regression except in how they are used: linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, which predicts two maximum values (0 or 1).
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification. The logistic
(sigmoid) function itself is f(x) = 1 / (1 + e^(-x)), which maps any real input to a value
between 0 and 1.
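A sketch of logistic regression on a made-up "study hours vs. pass/fail" dataset, showing the probabilistic output described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. whether the student passed (0/1).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [4.0], [4.5], [5.0], [6.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# The sigmoid output is a probability between 0 and 1, not a hard 0/1 value.
p_low = clf.predict_proba([[1.0]])[0, 1]    # small chance of passing
p_high = clf.predict_proba([[5.5]])[0, 1]   # high chance of passing
print(round(p_low, 2), round(p_high, 2))
```

Thresholding these probabilities (by default at 0.5) is what turns the S-shaped curve into a binary classifier.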
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variable should not have multi-collinearity.
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modelling, and
determining the causal-effect relationship between variables.
In regression, we plot a graph between the variables which best fits the given datapoints;
using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the datapoints on the target-
predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line tells whether
the model has captured a strong relationship or not.
Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
Independent Variable: The factors which affect the dependent variable, or which are used
to predict the values of the dependent variable, are called independent variables, also
called predictors.
Outliers: Outlier is an observation which contains either very low value or very high value in
comparison to other observed values. An outlier may hamper the result, so it should be
avoided.
Multicollinearity: If the independent variables are highly correlated with each other, this
condition is called multicollinearity. It should not be present in the dataset, because it
creates problems when ranking the most influential variables.
Underfitting and Overfitting: If our algorithm works well with the training dataset but not
with the test dataset, the problem is called overfitting. If our algorithm does not
perform well even on the training dataset, the problem is called underfitting.
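As a minimal sketch of the regression idea above, the following fits a line to synthetic noisy points with scikit-learn and predicts a continuous output; the data and the true relationship (y ≈ 3x + 5) are invented for illustration:

```python
# Fit a straight line to noisy synthetic data and make a prediction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.arange(20).reshape(-1, 1).astype(float)    # one predictor variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 20)  # target = 3x + 5 plus noise

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # recovered slope and intercept,
                                         # close to the true 3 and 5
print(model.predict([[25.0]]))           # continuous prediction for a new x
```

The fitted line minimizes the vertical distances described above (ordinary least squares).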
UNSUPERVISED MACHINE LEARNING
The goal of unsupervised learning is to find the underlying structure of a dataset, group the
data according to similarities, and represent the dataset in a compressed format.
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have few or no similarities with the objects of
another group. Cluster analysis finds the commonalities between the data objects and
categorizes them according to the presence or absence of those commonalities.
Association: An association rule is an unsupervised learning method used for
finding relationships between variables in large databases. It determines the sets of
items that occur together in the dataset. Association rules make marketing strategies more
effective: for example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical application of association rules is Market Basket Analysis.
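The bread-and-butter example can be made concrete with the two basic association-rule measures, support and confidence, computed over a hypothetical list of shopping baskets (no external library needed):

```python
# Support and confidence over made-up shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Of the baskets containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))       # {bread, butter} appears in 3 of 5 baskets
print(confidence({"bread"}, {"butter"}))  # 3 of the 4 bread baskets also have butter
```

Algorithms such as Apriori automate exactly this counting over large transaction databases.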
Types of Clustering Methods
The clustering methods are broadly divided into hard clustering (a data point belongs to only one
group) and soft clustering (a data point can belong to more than one group). There are also various
other approaches to clustering. Below are the main clustering methods used in machine
learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means
clustering algorithm.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability
that a data point belongs to a particular distribution. The grouping is done by assuming some
distributions, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian
Mixture Models (GMM).
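A hedged sketch of EM-based clustering with scikit-learn's GaussianMixture; the two synthetic blobs of points are invented for illustration:

```python
# Fit a two-component Gaussian Mixture Model to two synthetic blobs.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # blob centred near (0, 0)
               rng.normal(5, 0.5, (50, 2))])  # blob centred near (5, 5)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)     # each point assigned to its most probable component
print(gmm.means_)           # estimated component centres, near (0,0) and (5,5)
```

Internally, `fit` alternates the E-step (soft-assign points to components) and M-step (re-estimate each Gaussian's mean and covariance) until convergence.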
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. In this technique, the dataset is
divided into clusters to create a tree-like structure, also called a dendrogram. Any
number of clusters can then be selected by cutting the tree at the appropriate level. The
most common example of this method is the agglomerative hierarchical clustering algorithm.
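A minimal agglomerative-clustering sketch with scikit-learn; the six 2-D points below are made up so that two well-separated groups are obvious:

```python
# Agglomerative (bottom-up) hierarchical clustering on toy points.
from sklearn.cluster import AgglomerativeClustering

X = [[0, 0], [0, 1], [1, 0],     # one tight group near the origin
     [9, 9], [9, 10], [10, 9]]   # another group far away

# Cutting the dendrogram at two clusters is expressed via n_clusters=2.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)   # the first three points share one label, the last three the other
```

Each point starts in its own cluster, and the two closest clusters are merged repeatedly until only the requested number remains.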
Fuzzy Clustering
Fuzzy clustering is a soft-clustering method in which a data object may belong to more than one group
or cluster. Each data point has a set of membership coefficients that indicate its degree of
membership in each cluster. This approach is sometimes also known as the Fuzzy C-means algorithm.
Clustering Algorithms
K-Means clustering is an unsupervised learning algorithm used to solve clustering
problems in machine learning and data science. In this topic, we will learn what the K-means
clustering algorithm is and how it works.
• It determines the best positions for the K centre points (centroids) through an iterative process.
• It assigns each data point to its closest centroid; the data points nearest a particular
centroid form a cluster.
The below diagram explains the working of the K-means clustering algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as initial centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third and fourth steps, reassigning each data point to its new closest centroid,
until the assignments no longer change.
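The steps above can be sketched with scikit-learn's KMeans; the toy 2-D points are invented so that the two clusters are easy to see:

```python
# K-means on six toy points forming two obvious clusters.
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1],   # points near (1, 1)
     [8, 8], [8, 9], [9, 8]]   # points near (8, 8)

# n_init restarts the centroid initialization several times and keeps the best run.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # each point assigned to its nearest centroid
print(km.cluster_centers_)  # final centroids after the iterative re-placement
```

`fit` performs exactly the assign-then-recompute loop from Steps 3-5 until the centroids stop moving.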
Naive Bayes Classifier in Machine Learning
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used
for solving classification problems.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.
The name Naïve Bayes comprises two words, Naïve and Bayes, which can be described as:
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape,
and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature
individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability
of a hypothesis given prior knowledge. It depends on conditional probability. The formula for
Bayes' theorem is:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the posterior probability of hypothesis A given the evidence B, P(B|A) is the
likelihood, P(A) is the prior probability of A, and P(B) is the probability of the evidence.
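A minimal Gaussian Naive Bayes sketch with scikit-learn, reusing the fruit idea from above; the (weight, sweetness) features and their values are hypothetical:

```python
# Gaussian Naive Bayes on made-up fruit features.
from sklearn.naive_bayes import GaussianNB

X = [[150, 8], [160, 9], [140, 7],   # apples: heavier, sweeter (hypothetical units)
     [20, 3], [25, 2], [22, 4]]      # grapes: lighter, less sweet
y = ["apple", "apple", "apple", "grape", "grape", "grape"]

clf = GaussianNB().fit(X, y)
# Each feature contributes an independent likelihood term, multiplied
# together via Bayes' theorem to score each class.
print(clf.predict([[155, 8], [21, 3]]))
```

The "naïve" independence assumption is what lets the classifier treat weight and sweetness as separate likelihood factors.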
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for
dimensionality reduction in machine learning. It is a statistical process that converts the observations
of correlated features into a set of linearly uncorrelated features with the help of an orthogonal
transformation. These new transformed features are called the principal components. It is one of the
popular tools used for exploratory data analysis and predictive modelling. It draws out strong
patterns from a dataset by reducing its dimensionality while retaining as much of the variance as
possible.
PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional
data. The PCA algorithm is based on some mathematical concepts such as:
Dimensionality: The number of features or variables present in the given dataset; more simply,
the number of columns in the dataset.
Correlation: Signifies how strongly two variables are related to each other, so that if one
changes, the other also changes. The correlation value ranges from -1 to +1: -1
occurs when the variables are inversely proportional to each other, and +1 indicates that they are
directly proportional.
Orthogonal: Indicates that variables are uncorrelated with each other, so the correlation
between the pair of variables is zero.
Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if
Mv is a scalar multiple of v.
Covariance Matrix: A matrix containing the covariance between the pair of variables is called the
Covariance Matrix.
Steps for PCA algorithm
1. Standardize the dataset so that each feature has zero mean.
2. Compute the covariance matrix of the standardized data.
3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue and select the top components.
5. Project the data onto the selected components to obtain the reduced dataset.
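A hedged sketch of the standard PCA steps (centre, covariance, eigendecomposition, sort, project) in NumPy, on a synthetic dataset whose two features are deliberately correlated:

```python
# PCA from scratch on two strongly correlated synthetic features.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100)
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, 100)])  # feature 2 ≈ 2 × feature 1

Xc = X - X.mean(axis=0)               # 1. centre each feature
cov = np.cov(Xc, rowvar=False)        # 2. covariance matrix
vals, vecs = np.linalg.eigh(cov)      # 3. eigenvalues and eigenvectors
order = np.argsort(vals)[::-1]        # 4. sort components by explained variance
top = vecs[:, order[:1]]              #    keep the first principal component
Z = Xc @ top                          # 5. project onto that component

ratio = vals[order] / vals.sum()
print(ratio)      # the first component explains almost all the variance here
print(Z.shape)    # data reduced from 2 columns to 1
```

Because the second feature is nearly a multiple of the first, one principal component captures almost all of the variation, which is exactly the dimensionality reduction PCA aims for.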
Applications of Machine Learning
Machine learning is one of the most exciting technologies that one could come across.
As is evident from the name, it gives the computer the ability that makes it more similar to humans:
the ability to learn. Machine learning is actively being used today, perhaps in many more places
than one would expect.
Image Recognition
One of the most notable machine learning applications is image recognition, a method for
cataloguing and detecting an object or feature in a digital image. This technique is also used for
further analysis, such as pattern recognition, face detection, and face recognition.
Speech Recognition
ML software can measure spoken words using a collection of numbers that represent
the speech signal. Popular applications that employ speech recognition include Amazon’s Alexa,
Apple’s Siri, and Google Maps.
Medical Diagnosis
As ML algorithms can identify critical traits in complicated datasets, machine learning is applied
in cancer research. It is used to construct prediction models using techniques like Artificial Neural
Networks (ANNs), Bayesian Networks (BNs), and Decision Trees (DTs). This helps in precise
decision-making and in modelling the evolution and therapy of malignant diseases.
Fraud Detection
Fraud prevention is one of the most significant uses of machine learning in the banking and finance
industry. This technology is implemented to search through large volumes of transactional data and
spot patterns for unusual behaviour. Every purchase a customer makes is evaluated in real-time, and
the likelihood that the transaction is fraudulent is indicated by a fraud score. The transaction is
subsequently blocked or frozen for manual examination in the event of a fraudulent transaction. This
entire process takes place in just a few seconds.
Product Recommendation
One of the prominent elements of almost any e-commerce website is product recommendation,
which involves a sophisticated use of machine learning algorithms. Websites track customer
behaviour based on past purchases, browsing habits, and cart history, and then recommend
products using machine learning and AI.
CONCLUSION