Machine Learning Notes


2022

MACHINE LEARNING
MANDATORY COURSE
PRAMOTH SABARI S
COURSE PLAN AND PORTION:

INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Why AI? - Demystifying Intelligent Systems
What is AI? - Technical Overview
AI and Business
Sneak Peek into Popular AI platforms

Why AI
• AI is self-learning and adaptive.
• AI mines patterns and becomes knowledgeable on its own.
• It is able to interpret and analyze a huge amount of data in a short time.
• The data can be images, videos, folders, documents, etc.
• It solves problems for which a formula or procedure does not exist.
Note: Video Analysis, Natural Language Analysis

AI Business Opportunity
• It helps to optimize the data and analyze the trend for a particular product or solution, and suggests a solution accordingly.
• Walmart and Rolls-Royce are some of the companies that use AI technology to prevent losses and increase profit.
• AI is helping them achieve:
1. Creation of new business lines
2. Competitive advantages
3. Cost reduction, increased productivity, process improvements
4. Personalized communication at scale
5. AI-centered innovations of products and processes.

Artificial Intelligence:
History of AI
The Dartmouth Conference in 1956 was organized by John McCarthy.
According to Marvin Minsky, AI is the science of making machines do things that would require intelligence if done by men.
It is the ability of a machine to mimic the cognitive functions associated with humans.

Types of AI
Two types of AI
• Weak AI (Artificial Narrow Intelligence)
  o Self-driving cars, Google Assistant
• Strong AI (Artificial General Intelligence)
  o Machines that exhibit human intelligence and can perform any task that humans do.

Aspects of AI
• Planning – Decision-making tasks
• Speech recognition – Ability of a computer to recognize speech and convert it into text.
• Natural Language Processing – Understanding human spoken language
• Robotics – Giving intelligent behavior to a robot
• Expert Systems – Give solutions based on the large amount of data provided to them
• Machine Learning
• Vision – Making the machine understand images and videos.

Robotic Process Automation
Robotic process automation (RPA) is a software technology that makes it easy to build, deploy, and manage software robots that emulate human actions interacting with digital systems and software.

Machine Learning
It learns patterns from the given data and applies them to new data. Machine Learning includes Supervised Learning, Unsupervised Learning and Reinforcement Learning.

It deals with predictive analytics.

Machine learning learns the patterns from past data and builds a model. This model is used to make predictions for new data. This is the way machine learning predicts.

The choice of machine learning algorithm depends on the type of problem and the type of data we have.

Supervised learning
An input variable (X) is mapped to an output variable.
It is used when the data is well labelled.

Types of Supervised learning
• Classification – Predicts discrete values
• Regression – Predicts continuous numeric values

Classification
Support Vector Machine
A widely used classification technique that helps in deciding the optimal classifier between the classes of data.

Decision Tree – It represents the data in the form of a tree containing decision nodes and leaf nodes.

The more input data there is, the more nodes the decision tree tends to have. This can improve accuracy on the training data, although a very large tree may overfit.

KNN – A widely used classification technique. It compares a new data point with its K nearest neighbors among the existing data points.
K is the number of neighbors.

Logistic Regression – Predicts the probability of occurrence of an event; the predicted value will be between 0 and 1.
It uses the logistic function, which is S-shaped, to fit the data and predict the outcome for given data. This converts the occurrence of the event into a probability.

Regression
It learns the relation between the given input data and the output. In its simplest form the regression is based on a single input feature: when a new value of that feature is given, the model predicts the output based on the older data.
Regression is of two types, based on the number of features:
• Simple Linear Regression
• Multiple Linear Regression

Deep Learning or Artificial Neural Networks
Artificial Neural Networks are inspired by biological neurons.
They contain three kinds of nodes: input nodes, hidden nodes and output nodes. The connection and combination of many such nodes is called a neural network.
This can often produce more accurate results than the traditional methods of machine learning.
The neural network model is created by feeding it different types of input, which are mapped to the correct output through the layers of connected nodes.

Types of ANN – Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).

Overfitting – It happens when a model learns the details as well as the noise in the training data. The model performs well on the training data but badly on the test data.

Underfitting – It happens when the model cannot capture the underlying trends in the given training data.

Both of these give poor predictions. Overfitting is the more common problem.

Ensemble method – It combines multiple learning models and arrives at one prediction to increase the accuracy of the output.

Two aspects of error: Bias and Variance
Bias – Refers to errors made by the model due to erroneous assumptions.
Variance – Refers to errors made by the model due to sensitivity to small fluctuations in the training data set.
Low model complexity – High bias and low variance
Increasing complexity – Low bias and high variance

Unsupervised Learning
We train the algorithm with unlabeled data. It groups the input data based on patterns, similarities and differences.

Types – Clustering and Association mining
Clustering – Processes the input data and finds the inherent clusters or groups.
Association mining – Depends on association rules that help discover relationships in a large data set.

Reinforcement Learning
The algorithm is not explicitly told how to perform the task. It learns by experiencing reward and pain.
It learns by itself to maximize the rewards and minimize the penalties. This is used when training data is not available, and it is done by interacting with the environment.

Time series forecasting – Predicting future outputs from past input data.

Stages involved in machine learning
1. Define the goal
2. Identify the machine learning technique that should be used
3. Data is collected for input
4. Feature engineering
5. The data is divided into two – training data and testing data.
6. Choosing the ML method according to the ML technique.
7. Model validation
8. Final model – New data
9. Test prediction in real time

AI Architecture
Data
Information Management
• Databases
• Spark
• Hadoop
AI Technologies
• Machine Learning
• NLP
• Robotics
• Image analytics
• Expert systems
Models
Action
FNOL – First notice of loss.

AI in Business
Managing data well can help businesses improve their profit and value. The use cases are taken from retail, health care, banking, manufacturing and energy.

AI in Retail
• Conversational Commerce
• Contextual Commerce
• Actionable analytics – getting access to relevant data in the correct context
• Predictive Marketing – marketing that uses big data to develop accurate forecasts of future customer behavior
• Guided Sales – understanding needs and suggesting the best match

AI in Healthcare
• Automated Image Diagnosis
• Robot Assisted Surgery
• Virtual Nursing Assistants
• Fraud Detection
• Preliminary Diagnosis
• Administrative Workflow Assistant

AI in Banking
• Risk assessment
• Investment Management
• Trading
• Credit Approval Process
• Customer Support
• Regulations and Compliance

AI in Manufacturing
• Digital Twins
• Adaptive Manufacturing
• Predictive Maintenance
• Automated Quality Control
• Demand Driven Production

AI in Energy
• Smart Exploration
• Better Development
• Faster production

Some of the programming languages, tools and frameworks popularly used are listed below
• Programming Languages: Python, Java, R, Scala, SQL
• Big Data Tools: Spark, HBase, Kafka, HDFS, Hive, Hadoop, MapReduce, Pig
• Machine Learning Frameworks: TensorFlow, PyTorch, Keras, Theano, Caffe, MXNet
• Cloud Platforms: AWS, Azure, GCP, IBM Cloud
• Computer Vision: Keras, OpenCV, PyTorch
• Natural Language Processing: NLTK, Gensim, spaCy, Keras, Stanford CoreNLP, TextBlob
• ML Operations (MLOps): Flask, MLflow, Apache Airflow, Kubeflow, Seldon
• AI Governance: What-If Tool, LIME, DeepLIFT, Skater, Shapley, AIX360
INTRODUCTION TO DATA SCIENCE
Data Science
Data science is an amalgamation of different scientific methods, algorithms and systems which enable us to gain insights and derive knowledge from data in various forms.

Data science is a cross-functional field, emerging at the intersection of probability and statistics, linear algebra, calculus and other mathematical branches, machine learning, and computer science.

Very large volumes of data must be analyzed properly to be useful. Organizations need to methodically align data to their advantage.

Adoption of Data Science has led to the following benefits in business:
• Cost reduction
• Increase in productivity
• Reduction in time taken to solve problems
• Process improvements
• Competitive advantages

Methodical Alignment of data
This method begins by asking a number of questions.

What is Data Science
Data science is primarily a combination of Data & Science.
Data Science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

Components of Data Science
1. Probability and Statistics
2. Linear Algebra
3. Machine Learning
4. Computer Science

Probability and Statistics
Probability is a mathematical subject which enables us to determine or predict how likely it is that an event will happen.
Statistics is another mathematical subject which deals primarily with data. It helps us draw inferences from data by having procedures in place for collecting, classifying and presenting the data in an organized manner.

Linear Algebra
This deals with the theory of systems of linear equations, matrices, vector spaces and linear transformations.
Most complex science problems are converted into problems of vectors and matrices and then solved using linear models. Linear Algebra works as a computational engine for most data science problems because of its performance advantages over iterative methods.

Machine learning
Making the machine learn and make its own decisions.
• Supervised
• Semi-supervised
• Unsupervised
• Reinforcement

Computer Science
Computer Science provides us with the necessary programming languages, database management systems, statistical analysis and machine learning tools.
To solve a given business problem, many blocks should be integrated. The major steps are
1. Writing the core algorithm
2. Implementing the algorithm using linear algebra
3. Performing statistical computations on the given data
4. Managing all structured, unstructured and semi-structured data using data management systems.

Life Cycle of a Data Science Project

Characteristics of a successful data science project
1. A clear, verifiable and quantifiable goal
2. Realistic expectations set for all stakeholders
3. An unbiased data set covering examples that are sufficient to represent the entire data
4. The right model based on the business scenario
5. A correctly represented and deployed model, in order to meet the predefined goal

PYTHON FOR DATA SCIENCE
Python Libraries
Python libraries are collections of pre-written code for performing specific tasks. This eliminates the need to rewrite the code from scratch.
Python libraries for machine learning:
• NumPy – Scientific Computation
• Pandas – Data Structures and Data Analysis
• Matplotlib – Plotting and Visualization
• Scikit-Learn – Machine Learning Tools

NumPy
NumPy is a Python library used for working with arrays. It is also known as an array processing package.
Numeric Python (NumPy) is a Python library that is used for numeric and scientific operations. It serves as a building block for many libraries available in Python.

Data Structure in NumPy
The main data structure of NumPy is the ndarray or n-dimensional array: a multidimensional container of elements of the same type.

Advantages of NumPy
• As the array size increases, NumPy can operate on many elements at once (vectorization), making the program run faster than an equivalent Python loop.
• It has many optimized built-in mathematical functions.
• It has a multidimensional array data structure that can represent vectors and matrices.

NumPy object creation
Syntax: np.array(object, dtype)
object – a Python object (for example, a list)
dtype – data type of the elements (for example, an integer)

Accessing using index
Elements are accessed using square brackets.

Images as a NumPy Matrix
Images are stored as arrays of hundreds, thousands or even millions of picture elements called pixels.
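The np.array syntax, square-bracket indexing and the idea of an image as a matrix can be tried out in a few lines. This is only a minimal sketch: the variable names and all the values, including the tiny 2x2 "image", are made up for illustration.

```python
import numpy as np

# np.array(object, dtype): build an ndarray from a Python list
marks = np.array([35, 72, 88, 41], dtype=np.int64)
print(marks.dtype)    # data type shared by all elements
print(marks[0])       # first element, accessed with square brackets
print(marks[1:3])     # slice: elements at positions 1 and 2

# A 2-D array (matrix) is indexed as [row, column]
matrix = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
print(matrix.shape)
print(matrix[1, 2])   # second row, third column

# A made-up 2x2 grayscale "image": one intensity value (0-255) per pixel
image = np.array([[0, 255], [128, 64]], dtype=np.uint8)
print(image.shape, image.max())
```

A real photograph loaded into NumPy has the same structure, only with far more rows and columns (and usually a third dimension for the colour channels).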

Pandas
Pandas is an open-source library for real-world data analysis in Python. Using Pandas, data can be cleaned, transformed, manipulated, and analyzed.
The steps involved in performing data analysis using Pandas are:
o Read the data – Data can be read from multiple formats such as '.csv', '.json', '.xlsx' etc.
o Explore the data
o Perform operations on the data – Grouping, sorting, masking, merging, concatenating
o Visualize the data – To get a clear picture of the various relationships among the data. Scatter plot, box plot, bar plot, histogram and many more.
o Generate insights – All the above steps generate insight about the data.

Advantages of Pandas:
1. Has the capability to load large amounts of data easily
2. Provides us with extremely streamlined forms of data representation
3. Can handle heterogeneous data, has an extensive set of data manipulation features, and makes data flexible and customizable.

Pandas Objects
Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices. The basic data structures of Pandas are Series and DataFrame.

Series – one-dimensional labelled array
Syntax: pd.Series(data, index, dtype)
data – it can be a list, a list of lists or even a dictionary
index – the index can be explicitly defined for different values if required
dtype – the data type used in the series
o By default, a Series creates an integer index. A custom index can be defined.

DataFrame – a collection of Series where each Series represents a column from a table
Syntax: pd.DataFrame(data, index, columns)
data - data can contain Series or list-like objects. If data is a dictionary, column order follows insertion order.
index - index for the DataFrame that is created. By default, it will be RangeIndex(0, 1, 2, …, n) if no explicit index is provided.
columns - if data contains column labels, it will use the same. Else, it defaults to RangeIndex(0, 1, 2, …, n).

There are different approaches to creating a DataFrame:
o From a single Series object
o From a list of dictionaries
o From a dictionary of Series objects
o From an existing file

The axis Keyword
One of the important parameters used while performing operations on DataFrames is 'axis'. Axis takes two values: 0 and 1.
• axis = 0 refers to the rows (the index). For example, dropping with axis = 0 removes rows, and an aggregation such as a sum with axis = 0 runs down the rows, giving one result per column.
• axis = 1 refers to the columns. For example, dropping with axis = 1 removes columns, and a sum with axis = 1 runs across the columns, giving one result per row.

Pandas can read a variety of file formats, such as the '.csv', '.json' and '.xlsx' files mentioned above.
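The Series and DataFrame constructors and the axis keyword described above can be tried on a small table. This is a minimal sketch; the student names and marks are made-up values.

```python
import pandas as pd

# Series: one-dimensional labelled array with a custom index
marks = pd.Series([35, 72, 88], index=["asha", "ravi", "meena"], dtype="int64")

# DataFrame from a dictionary of lists: each key becomes a column
df = pd.DataFrame(
    {"maths": [35, 72, 88], "physics": [50, 65, 91]},
    index=["asha", "ravi", "meena"],
)

# axis=0 runs down the rows: one total per column
column_totals = df.sum(axis=0)
# axis=1 runs across the columns: one total per row
row_totals = df.sum(axis=1)

# drop() uses the same keyword: axis=0 removes a row, axis=1 removes a column
without_ravi = df.drop("ravi", axis=0)
without_physics = df.drop("physics", axis=1)

print(column_totals, row_totals, sep="\n")
```

Because every operation here is label-based, the results keep the row and column labels of the original table.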

Common operations for exploring the data:
1. Head and tail – to view the first few or last few rows
2. Describe – used to generate a quick summary of data statistics
3. Info – to know about the data types and the number of rows containing null values for the respective columns
4. Dropping null values
5. Selecting a subset of the data

'iloc' and 'loc' are two indexing techniques that help us in selecting specific rows and columns.

iloc – access by integer index
• Syntax – df.iloc[rows, columns]

loc – access a group by custom index

Operations in Pandas

Masking - The masking operation replaces values where the condition is True.
• Syntax – DataFrame.mask(cond, other=nan, inplace=False, axis=None)
• f = marks_df < 33
• marks_df.mask(f, 'Fail')

Index Preservation - Pandas preserves the index and column labels in the output. For binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the functions.

Broadcasting refers to a set of rules for operating between data of different sizes and shapes.

Apply - This method is used to apply a function along an axis of the DataFrame.

Pandas Plot
It can visualize the data in the form of plots.
• Syntax – df.plot(x, y, marker, kind)
x = value on the x axis
y = value on the y axis
marker = marker shape, for specific plots like a scatter plot
kind = type of plot

Matplotlib
The graphical representation of data or information using visual elements like graphs, charts and maps is known as data visualization.

Plot – The plot is the basic visualization element that helps to visualize the data.
• Figure - It is the top-level container that acts as the window or page on which everything is drawn.
• Axes - The axes are the area on which data is plotted.
• The plot comprises several elements such as title, label, axes, legend, etc.

Matplotlib is a Python library for data visualization. matplotlib.pyplot is used for two-dimensional graphics in Python programming.

There are two approaches to plotting in Matplotlib:
1. The MATLAB way of plotting using matplotlib.pyplot. It is simple to use.
2. The object-oriented way of plotting, for more control and customization.

The library is imported by import matplotlib.pyplot as plt.

Syntax – plt.plot(x, y)

Plotting using the object-oriented approach involves the following steps:
• Creating a figure
• Setting up the axes
• Creating a plot using the axes object
• Creating multiple plots using the same axes object
• Setting up the title, label, and legend for a plot
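The steps of the object-oriented approach can be sketched as follows. The data and labels are made up, and the "Agg" backend is an assumption added here only so that the script runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
squares = [v ** 2 for v in x]
cubes = [v ** 3 for v in x]

fig = plt.figure(figsize=(6, 4))        # 1. create a figure
ax = fig.add_subplot(1, 1, 1)           # 2. set up the axes
ax.plot(x, squares, label="x squared")  # 3. create a plot using the axes object
ax.plot(x, cubes, label="x cubed")      # 4. a second plot on the same axes object
ax.set_title("Powers of x")             # 5. title, labels and legend
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()
```

The MATLAB-style equivalent would call plt.plot, plt.title and so on directly. The object-oriented form is preferable once a figure holds more than one set of axes, because each call states explicitly which axes it acts on.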

Types of plots
1. Boxplot - Gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on the five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum").
2. Scatter plot – Uses dots and markers to represent a value on the axes. It is the simplest plot that can accept both quantitative and qualitative values, with a wide variety of applications in primitive data analysis.
Syntax - ax.scatter(x, y, marker)
3. Bar Chart – Graph with rectangular bars that usually compare different categories.
Syntax - ax.bar(x, height, width, bottom, align)
4. Histogram - Represents data as rectangular bars. It is used for continuous data.
Syntax – ax.hist(x, bins)
5. Pie chart - Divides the entire dataset into distinct groups.
Syntax – ax.pie(x, labels)
The pie chart can be customized further with the following parameters:
• explode - To get an elevated view of the selected slice.
• colors - To customize the colors of the plot.
• autopct - To add the percentage of the distribution in the pie chart.
• shadow - To add a shadow to the plot.
• startangle - To change the starting angle of the pie chart.
6. Line Chart – Drawn by interconnecting all data points using straight line segments in sequential order.
Line style – Matplotlib supports many line styles such as dashed lines, dotted lines, dash-dot etc.
Text annotations – Text can be added to describe the plots.

For saving the plot as an image, plt.savefig("filename.jpg", dpi=200) is used.

Machine Learning using sklearn
Scikit-learn (also referred to as sklearn) is a Python library widely used for machine learning. It is characterized by a clean, uniform and streamlined API.

The objective here is to introduce the usage of the scikit-learn library for the different stages of ML model building.

Follow the order:
• Loading the data
• Preprocessing data
• Training the model
• Testing the model
• Evaluating model performance
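The five scikit-learn stages (loading, preprocessing, training, testing and evaluating) can be sketched end to end. This is only an illustration under stated assumptions: the iris dataset bundled with scikit-learn, a standard scaler, a logistic regression model and a 70/30 split are choices made here, not something prescribed by these notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load the data (a small dataset bundled with scikit-learn)
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 2. Preprocess: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 4. Test the model on data it has not seen
y_pred = model.predict(X_test_scaled)

# 5. Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same fit / transform / predict pattern, which is what the "uniform API" remark refers to: swapping in another classifier changes only the line that creates the model.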

DATA VISUALIZATION USING PYTHON
Data Visualization
It is the concept of graphical representation of data or information using visual elements like graphs, charts, and maps.
o It helps in finding patterns and connections between variables
o Requires less effort from the reader to understand the visuals
o Condenses a large amount of information into a small space for quick analysis
o Provides relevant answers and clarity on certain questions swiftly

When a huge amount of data is given and a pattern is to be found, it is time consuming to do so manually or with any other method. Data visualization using plots can reduce the time and complexity of finding the pattern in the given data set.

Data Visualization Stakeholders:
There could potentially be two types of visualizations based on the types of stakeholders involved.
1. For self-consumption during data exploration, feature engineering, etc.
2. For presenting or communicating the insights (from the data) to a target audience, typically decision makers. This sort of visualization is usually performed to prepare the results/reports that may enable the target audience in decision making.

Types of Data collected
i. Temporal Data - Data with a time component attached to it.
ii. Geospatial Data - Data with a physical location as an attribute.
iii. Topical Data - Data concerned with topics.
iv. Network Data - Data in the form of nodes and links between nodes.
v. Tree Data - Data which is basically network data but with some hierarchy in it.

Two types of data
i. Qualitative/Categorical Data
• Binary
• Nominal
• Ordinal
ii. Quantitative/Numerical Data
• Discrete
• Continuous

Different kinds of plots that can be used for data visualization
• Box plot
• Scatter plot
• Line chart
• Bar graph
• Histogram
• Dist plot
• Pie chart
• Joint plot
• Pair plot
• Heat map

Outliers - Outliers are the extreme values present in the dataset. They affect the properties of data
like mean and variance which are used in model building. Hence, they may impact the accuracy of the model.

Quartiles – Divide the data points into four equal-sized groups, or quarters.

Inter-quartile Range – Also called the mid-spread, H-spread, or IQR, it indicates where the bulk (the middle 50%) of the data lies. With IQR = Q3 - Q1, the lower and upper limits are commonly calculated as

$$\text{Lower limit} = Q_1 - 1.5 \times IQR, \qquad \text{Upper limit} = Q_3 + 1.5 \times IQR$$

Any value that lies outside these limits is called an outlier.

Plot Type – Description
Box plot – A box plot gives a good indication of the distribution of data about the median. Boxplots are a standardized way of displaying the distribution of data based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum).
Scatter plot – Uses dots or markers to represent a value in the plane. The scatter plot is one of the simplest plots which can accept both quantitative and qualitative values, with a wide variety of applications in primitive data analysis.
Line chart – A line chart is drawn by interconnecting all data points using straight line segments. It is used to analyze historic variations and trends in data.
Bar chart – A bar chart is a graph with rectangular bars that compares different categories. Each bar represents a particular category, and the length of a bar indicates the total number of values or items in that category.
Histogram – It represents data as rectangular bars. Unlike the bar chart, it is used for continuous data. Each bar groups the numbers into intervals (bins) and the height of the bar is based on the number of values that fall into the corresponding interval.
Pie chart – It divides the entire dataset into distinct groups. The chart consists of a circle split into slices and each slice represents a group.
Dist plot – Depicts the variation in a data distribution. It represents the overall distribution of continuous data variables.
Joint plot – A joint plot is a combination of two univariate plots and one bivariate plot. The bivariate plot (in the center) helps in analyzing the relationship between two variables. The univariate plots describe the distribution of data in each variable as marginal plots.
Pair plot – Depicts pairwise relationships between all the variables in a dataset in a matrix format. Each row and column in the matrix represents a variable in the dataset.
Heat map – A graphical representation of data where similar values are depicted by the same colors. The colors vary based on the intensity of the results.

Network – It is a set of objects (called nodes or vertices) that are connected to each other. The connections between the nodes are called edges or links.

If the edges in a network are directed, it is called a directed network (arrows are drawn to indicate the direction). If all edges are bidirectional or have no direction, the network is an undirected network.

A Word Cloud is a visual representation of free-form text, which is like a collage. It is typically used to depict keyword metadata of websites, articles, reviews, feedback etc.

The frequency and significance of the words are depicted by the font, font size and color of the text in the cluster.

A choropleth map is a pictorial representation of data on a geographical map. The intensity of color in a region on the map corresponds to the respective values.
Data Visualization Using Python
The Python libraries used for data visualization are:

• Matplotlib
• Seaborn
• Plotly
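As a small illustration with the first of these libraries, the outlier limits based on the inter-quartile range can be computed with NumPy and compared with what a Matplotlib box plot draws. This is a sketch under assumptions: the sample values are made up, the limits use the common convention Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, and the "Agg" backend is set only so that the script runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # made-up sample

# Quartiles, inter-quartile range and the usual 1.5 x IQR limits
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]
print("limits:", lower_limit, upper_limit, "outliers:", outliers)

# A box plot marks the same points as individual "fliers" beyond the whiskers
fig, ax = plt.subplots()
box = ax.boxplot(values)
ax.set_title("Box plot of the sample")
plotted_outliers = box["fliers"][0].get_ydata()
```

Matplotlib's default whisker length is also 1.5 times the IQR, which is why the points it draws beyond the whiskers coincide with the values flagged by the manual calculation.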

EXPLORE MACHINE LEARNING USING PYTHON

Machine Learning
Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

Machine Learning techniques are classified into two types
• Supervised Learning – works on labelled data, mapping the input to the output.
• Unsupervised Learning – has no explicitly defined output. The idea is to discover knowledge or structure in the data. The task of discovering inherent clusters or groups in the data is known as Clustering.

Supervised learning is further classified as
1. Regression - When the output variable can take continuous numerical values, e.g., price of a car, delivery time, credit limit.
2. Classification - When the output variable takes categorical or discrete (non-continuous) values, e.g., whether an email is spam, whether a transaction is fraudulent etc.

Introduction to Regression
Regression is a supervised machine learning technique that helps in predicting continuous numerical values or quantities.

Regression Technique
Regression analysis is a statistical process for estimating the relationships between variables. It can be used to build a model to predict the value of the target variable from the predictor variables.
• y = f(X), where y is the target or dependent variable and X is the set of predictors or independent variables
• One predictor variable – Simple Linear Regression model
• Multiple predictor variables – Multiple Linear Regression model

SIMPLE LINEAR REGRESSION
Steps in working with the Regression Models
1. Creating regression models
2. Visualizing the speculated regression models
3. Analyzing the speculated models
4. Analyzing models
5. Finding the best fit model manually
6. Visualizing the best fit model

Best Fit Model – The goal of linear regression is to create a model that predicts the value accurately and consequently has the lowest sum of squared errors (SSE), that is, the lowest sum of squared differences between actual and predicted target values.
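The idea of comparing a speculated model with the best fit model can be made concrete with NumPy. This is a minimal sketch on made-up data; the helper name sse and the "speculated" slope and intercept are illustrative assumptions, and the best fit uses the standard least-squares formulas for a single predictor.

```python
import numpy as np

# Made-up data: x is the single predictor, y is the target
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

def sse(slope, intercept):
    """Sum of squared errors of the line y = slope * x + intercept."""
    predictions = slope * x + intercept
    return float(np.sum((y - predictions) ** 2))

# A "speculated" model chosen by eye
guess_sse = sse(slope=2.0, intercept=0.5)

# Least-squares estimates of the slope and intercept
best_slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
best_intercept = y.mean() - best_slope * x.mean()
best_sse = sse(best_slope, best_intercept)

print(f"speculated SSE = {guess_sse:.3f}, best-fit SSE = {best_sse:.3f}")
```

No other straight line through this data can have a smaller SSE than the least-squares line, which is what makes it the best fit model in the sense used above.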

Coefficient of Determination (R2)

SST – the sum of squared differences between actual and mean target values

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

SSR – the sum of squared differences between predicted and mean target values

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
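These sums can be computed directly with NumPy. The sketch below uses made-up data and np.polyfit for the least-squares line, and it takes the coefficient of determination to be R2 = SSR / SST, which is the usual definition and an assumption here since the notes do not spell the formula out.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictor values
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])   # made-up target values

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
y_pred = slope * x + intercept
y_mean = y.mean()

sst = np.sum((y - y_mean) ** 2)        # actual vs mean
ssr = np.sum((y_pred - y_mean) ** 2)   # predicted vs mean
sse = np.sum((y - y_pred) ** 2)        # actual vs predicted

r_squared = ssr / sst
print(f"SST={sst:.3f} SSR={ssr:.3f} SSE={sse:.3f} R2={r_squared:.4f}")
```

For a least-squares line with an intercept the three sums satisfy SST = SSR + SSE, so R2 can equivalently be written as 1 - SSE / SST; a value close to 1 means the line explains most of the variation in the target.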

Relation Between SST, SSR and SSE: SST = SSR + SSE

MULTIPLE LINEAR REGRESSION
This has multiple predictors and one dependent variable. Steps involved in creating a model:
• Visualizing the dataset
• Building a Multiple Linear Regression model
• Visualizing the Multiple Linear Regression model
• Finding the correlation between the predictors
• Finding the coefficient of determination
• Finding the adjusted R-squared (a higher value indicates a more valid model)

Multicollinearity - In a multiple regression model it is possible that one predictor can be linearly predicted from the others with a substantial degree of accuracy. In such a situation, the predictors are said to be highly correlated. In statistics, this phenomenon is called multicollinearity, or in other words collinearity between variables.

Note: The obtained best fit model is valid only if the predictor variables are linearly independent of each other. Predictors are linearly dependent if their correlation values are close to –1 or 1.

Variance Inflation Factor – used to determine if the predictor variables are independent of each other.

1. VIF = 1 -> no correlation between variables
2. 1 to 5 -> slightly correlated
3. Greater than 5 -> highly correlated

R2 values can be inflated by including more and more predictor variables.
The adjusted R2 is defined as

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where n is the number of observations and k is the number of predictor variables in the model.

Categorical variables: The variables which take labels as values. In Python, categorical predictor variables generally cannot be used directly to build a machine learning model; they are first converted to numbers, for example by one-hot encoding.
The get_dummies() function of the Pandas library can help us get one-hot encoding done easily.

Model performance on train and test data is poor (high RMSE) – Underfit

Model performance on train data is good (low RMSE, high R-squared) but on test data is poor (high RMSE) – Overfit

Classification
Classification is a supervised Machine Learning technique that helps in predicting categorical or discrete (non-continuous) outputs.
• Logistic Regression
• Decision Trees
• K-Nearest Neighbors (kNN)
• Support Vector Machines (SVM)

LOGISTIC REGRESSION
Logistic Regression is a supervised Machine Learning algorithm, primarily used for binary classification. It computes the probability of a sample belonging to each of the classes.

The probabilities are computed using the non-linear sigmoid or logistic function given below:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where z is a linear combination of the predictor variables.

Measuring Model Performance using Confusion Matrix – A confusion matrix helps in assessing how good a model is by comparing the actual target values with the predicted target values.
From the confusion matrix, precision, recall and F1-score can be derived.

Precision - The precision for a class A indicates how accurate the model is in identifying class A. It is the fraction of the instances predicted as class A that actually belong to class A.

Recall - The recall for a class A indicates how good the model is in fetching/retrieving instances of class A. It is the fraction of the actual class A instances that the model predicts as class A.

F1-score - This metric is the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall), and can indicate how good the model is in classifying instances of a particular class.

DECISION TREES
This is another kind of algorithm, which builds a model that can be visualized as a graph. A decision is taken on how to split the data at each node of the tree; this algorithm is called a Decision Tree.
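To see what "a decision at each node" looks like in practice, a small tree can be trained and printed as text. This is a sketch under assumptions: it uses scikit-learn's DecisionTreeClassifier on the bundled iris dataset, and max_depth=2 is chosen only to keep the printed tree short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed structure readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line with a comparison is a test on one attribute;
# lines containing "class:" are the points where the tree stops splitting
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Reading the printed rules from top to bottom reproduces the path a new sample takes through the tree until it reaches a class label.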

A decision tree is a tree-like structure in which:
• the root node and each internal node represent a "test" on an attribute of an instance in the dataset.
• the outcome of each test is represented by the corresponding branches.
• a node that does not branch further is called a leaf node and represents a class label.

Splitting the dataset - A subset which contains instances belonging to only one class label is called a pure (homogeneous) subset. The predictor attribute on which the dataset is split to obtain the maximum number of pure subsets is called the best attribute.

For creating a decision tree, the algorithm requires the following inputs:
1. Instances – the set of instances or samples for which class labels are already known.
2. Target_Attribute – the class label attribute.
3. Attributes_List – the list of predictor attributes.

Attribute selection measures compare different predictor attributes and rank them for the purpose of model building.

Three of the most commonly used attribute selection measures to induce a decision tree are:
• Information gain
• Gain ratio
• Gini Index

CLASSIFICATION USING k-NN
The k-Nearest Neighbors (k-NN) algorithm determines the target value of a new data point or instance by comparing it with the existing data points or instances that are closest to it.

Euclidean Distance
The Euclidean distance between two tuples X1 (x11, x12, x13... x1n) and X2 (x21, x22, x23... x2n) can be computed as:

$$d(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

Limitation: attributes with larger ranges contribute more to the Euclidean distance. So the numerical attributes should be normalized before they are used for computing the Euclidean distance.

SUPPORT VECTOR MACHINE (SVM)
It classifies data into one category or the other by using hyperplanes. There can be many planes that separate two classes of data, so it is required to find an optimal hyperplane.

When data is not separable by a line or a plane, SVM maps the data into a higher dimensional space, where it can be separated using a linear hyperplane. The mapping function used to transform the data into the higher dimensional space is called a 'kernel' function.

Multi-class classification - SVM can also be used for multi-class classification. This can be done in 2 ways:
1. One-vs-One Classification: It builds a binary classification model for each pair of classes.
2. One-vs-All Classification: It compares every class with the remaining classes, thereby building a model for every class.

SVM usually performs well where -

• The dataset has a small number of classes (preferably 2) in the target variable.
• The dataset is high dimensional.
• The dataset is balanced.
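The effect of a kernel on data that no straight line can separate can be seen in a small experiment. This is a sketch under assumptions: it uses scikit-learn's SVC on a synthetic two-class "concentric circles" dataset from make_circles, and compares a linear kernel with the RBF kernel; all parameter values are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two balanced classes arranged as concentric circles: not separable by a line
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same algorithm, two different kernels
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print(f"linear kernel: {linear_acc:.2f}, RBF kernel: {rbf_acc:.2f}")
```

The linear kernel can only draw a straight boundary through the plane, while the RBF kernel implicitly works in a higher dimensional space where the inner and outer circles become separable.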
Hyperparameters are model properties which guide the training process, i.e., they cannot be learnt from the training data. While building machine learning models, situations such as overfitting or underfitting are encountered. These can be controlled by tuning the hyperparameters of the model.
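Hyperparameter tuning can be sketched with a grid search over k, the number of neighbors in k-NN. The setup is an assumption made for illustration: scikit-learn's GridSearchCV with 5-fold cross-validation on the bundled iris dataset, with scaling applied first because k-NN relies on Euclidean distances.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale first so that no attribute dominates the Euclidean distance
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# k (n_neighbors) is set before training; it is not learnt from the data
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

A very small k tends towards overfitting (the prediction follows individual noisy points) and a very large k towards underfitting, which is why the value is chosen by validation and not by training accuracy.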
Ensemble methods – These are techniques that aim at improving the prediction accuracy by creating and combining multiple models instead of using a single model. Two commonly used ensemble methods are:
o Bagging - multiple models are trained using
the same algorithm on different subsets of
the training data. Once multiple models are
trained in this manner, they are aggregated
using maximum voting or simple
aggregation methods such as averaging.
o Boosting - Boosting is another ensemble
learning technique where the models are
built sequentially. Each new model is built
by taking into account the mistakes made
by the previous model in predicting target
value.
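Both ensemble methods can be tried side by side with scikit-learn. This is a sketch under assumptions: decision trees as the base models, BaggingClassifier for bagging, GradientBoostingClassifier as one example of boosting, and the bundled breast cancer dataset; none of these choices come from the notes themselves.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    # Baseline: a single decision tree
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: 50 trees, each trained on a different bootstrap sample of the
    # training data; their predictions are aggregated
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: trees are added sequentially, each one fitted to the errors
    # left by the trees built before it
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

On most datasets the two ensembles match or beat the single tree, but the size of the gain depends on the data and on the hyperparameters, so it is worth checking on a held-out test set as done here.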

Clustering Analysis
Unsupervised learning deals with historical data
which contains no labels. The aim of clustering is to
group similar records together and make sure that
the members of different groups are significantly
different from each other.
Clustering can be performed using several
algorithms and one of the widely used clustering
algorithms is the K-means algorithm.
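A minimal K-means sketch, assuming scikit-learn's KMeans and a made-up set of unlabelled 2-D points drawn around two well-separated centres:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up unlabelled data: two well-separated groups of 2-D points
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# K is chosen by the user; the algorithm finds the K cluster centres
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("cluster sizes:", np.bincount(labels))
print("cluster centres:")
print(kmeans.cluster_centers_)
```

No labels are passed to the algorithm: the grouping comes only from the distances between the points, and the cluster numbers 0 and 1 are arbitrary names for the two groups it finds.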
