Intro To Machine Learning New

The document provides an overview of machine learning models, contrasting them with classical econometrics, and detailing various types of machine learning including supervised, unsupervised, and reinforcement learning. It discusses data preparation techniques, the importance of data cleaning, and methods for dimensionality reduction such as principal components analysis. Additionally, it covers the concepts of overfitting and underfitting, the role of training, validation, and test datasets, and introduces natural language processing as a key application of machine learning.


FRM PART 1

MACHINE LEARNING MODELS

[email protected]

(+91) 8652653607
www.vardeez.com
LO 25.a: Discuss the philosophical and practical differences between machine learning techniques and classical econometrics.
What Is Machine Learning?
Machine learning (ML) is an umbrella term covering a range of techniques in which a model is trained to recognize patterns in data, for applications including prediction and classification.
In machine learning, the data decide what the model will include; no specific hypothesis from an analyst is tested as part of the process.

What Is Classical Econometrics?

Econometrics uses statistics to study how different economic factors are related to each other and to make predictions.
Here, an analyst chooses the model and the variables, while a computer algorithm is used to estimate the parameters and test their significance.

Why Machine Learning Is Better

It can handle large amounts of data and provides greater flexibility in managing that information.
It can also handle non-linear interactions between variables.
LO 25.g: Differentiate among unsupervised, supervised, and reinforcement learning models.
Supervised Learning
This is concerned with prediction: the model learns from labeled data and is trained to map inputs to outputs based on a set of examples.
When we use this: when we want to predict a value (e.g., a house price) or classify loans as "will pay" or "will default."
Algorithms for SL: logistic regression, decision trees, K-nearest neighbors.

Unsupervised Learning
This is concerned with recognizing patterns in data with no explicit target; the model learns the structure of the data by itself, and no labeled data are provided to it.
When we use this: when we want to cluster the data or find a small number of factors that explain the data.
Algorithms for UL: K-means clustering, principal components analysis, anomaly detection.

(Figure: example of labeled data with features such as size and location.)

DATA PREPARATION
Many machine-learning approaches require all the variables to be measured on the same scale; otherwise, the technique will not be
able to determine the parameters appropriately.
There are broadly two methods to achieve this rescaling:

1) Standardization: This involves subtracting the sample mean of each variable from all observations on that variable and dividing by its standard deviation:

x_standardized = (x − sample mean) / standard deviation

(Figure: the standardization calculation for bank A and the customers feature.)
2) Normalization: This process, also called the min-max transformation, creates a variable between zero and one that will not usually have a zero mean or unit variance:

x_normalized = (x − min) / (max − min)

(Figure: the normalization calculation for bank A and the customers feature.)
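As a rough illustration, here is a minimal Python sketch of both rescaling methods applied to a made-up feature vector (the numbers are hypothetical and are not the bank A figures referenced above):

```python
# Minimal sketch of the two rescaling methods described above, using NumPy.
# The feature values are made up for illustration.
import numpy as np

x = np.array([120.0, 85.0, 230.0, 150.0, 95.0])  # hypothetical feature values

# Standardization: subtract the sample mean and divide by the standard deviation.
x_standardized = (x - x.mean()) / x.std(ddof=1)

# Normalization (min-max): rescale so the values lie between zero and one.
x_normalized = (x - x.min()) / (x.max() - x.min())

print(x_standardized)
print(x_normalized)
```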
Data Cleaning
Data cleaning is an important part of machine learning and can take up to 80% of a data analyst's time.
Large data sets usually have issues that need to be fixed, and good data cleaning can make all the difference between a successful and an unsuccessful machine-learning project.

Several reasons for data cleaning include:


1. Inconsistent recording. For data to be read correctly, it is important that all data is recorded in the same way.
2. Unwanted observations. Observations not relevant to the task at hand should be removed.
3. Duplicate observations. These should be removed to avoid biases.
4. Outliers. Observations on a feature that are several standard deviations from the mean should be checked carefully, as they can
have a big effect on results.
5. Missing data. This is the most common problem. If there are only a small number of observations with missing data, they can be removed. Otherwise, one approach is to replace missing observations on a feature with the mean or median of the observations on the feature that are not missing. Alternatively, the missing observations can be estimated in some way from observations on other features (see the sketch after this list).
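The sketch below illustrates the two remedies from item 5 with pandas; the small table and its column names are hypothetical:

```python
# Minimal sketch of handling missing data: drop the rows, or impute with the median.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [52_000, 61_000, np.nan, 48_000, np.nan, 75_000],
    "age":    [34, 41, 29, np.nan, 52, 45],
})

# Option 1: drop the rows that contain missing values (fine if there are only a few).
df_dropped = df.dropna()

# Option 2: replace missing observations with the median (or mean) of each feature.
df_imputed = df.fillna(df.median(numeric_only=True))

print(df_dropped)
print(df_imputed)
```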
LO 25.d: Use principal components analysis to reduce the dimensionality of a set of features.
A popular statistical technique for dimension reduction in unsupervised learning models is principal components analysis (PCA).
The goal of PCA is to convey almost the same information using a small number of uncorrelated components (i.e., variables) as is provided by a large number of correlated variables.
Thus, in a machine learning model, PCA is used to reduce the number of features.

PCA is often applied to yield curve movements, producing a small number of uncorrelated components that describe the movements of the curve.

The total variance is equal to the following: (12.96)² + (5.82)² + (2.14)² + (1.79)² = 209.62

Based on a review of seven Treasury rates over a 10-year period (120 months), the first three observed components were responsible for 99% of the overall variation in yield movements, reflecting the high correlation between yield movements.
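A minimal scikit-learn sketch of the idea is shown below. The simulated 120 × 7 matrix of correlated "rates" is purely illustrative and is not the Treasury data set summarized above:

```python
# Minimal sketch of PCA for dimension reduction, using scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
level = rng.normal(size=(120, 1))                # a common factor across maturities
rates = level + 0.1 * rng.normal(size=(120, 7))  # 120 months x 7 correlated "rates"

pca = PCA(n_components=3)
components = pca.fit_transform(rates)            # the first three principal components

# Fraction of total variance explained by each component, and the cumulative sum.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```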
LO 25.e: Describe how the K-means algorithm separates a sample into clusters.
To identify the structure of a dataset, an unsupervised K-means algorithm can be used to separate dataset observations into clusters.
The value K represents the number of clusters and is set by an analyst.

The centers of the data clusters are called centroids and are initially randomly chosen.
Each data point is allocated to its nearest centroid and then the centroid is recalculated to be at the center of all the data points
assigned to it. This process continues until the centroids remain constant.
The Euclidean distance:
The Euclidean distance is the square root of the sum, over all the features, of the squared differences between the corresponding features of the two observations (e.g., two banks).

The Manhattan distance measure:

The Manhattan distance is the sum, over all the features, of the absolute differences between the corresponding feature pairs.
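Both distance measures are straightforward to compute; the NumPy sketch below uses two hypothetical banks, each described by three already-rescaled features:

```python
# Minimal sketch of the two distance measures between two feature vectors.
import numpy as np

bank_a = np.array([0.2, 0.8, 0.5])
bank_b = np.array([0.6, 0.1, 0.9])

euclidean = np.sqrt(np.sum((bank_a - bank_b) ** 2))  # square root of summed squared differences
manhattan = np.sum(np.abs(bank_a - bank_b))          # sum of absolute differences

print(euclidean, manhattan)
```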
Concept of Inertia
The goal of the K-means algorithm is to minimize the distance between each observed point and its centroid. A model's fit is better when the individual data points are close to their centroid. A lower inertia implies a better cluster fit. However, because inertia will always fall as more centroids are added, there is a limit to which adding more centroids adds value.

• Inertia, a measure of the distance (d) between each data point (j) and its centroid, is defined as the sum of the squared distances over all data points: inertia = Σⱼ dⱼ²

As an alternative approach, a silhouette coefficient can be used to choose K by comparing the distance between an observation and
other points in its own cluster to its distance to data points in the next closest cluster.
The highest silhouette score will produce the optimal value of K.
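The scikit-learn sketch below ties these ideas together on simulated two-feature data: for each candidate K it reports the inertia (which always falls as K rises) and the silhouette score (which should peak at the best K):

```python
# Minimal sketch of choosing K using inertia and the silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated groups of 50 points each, in two dimensions.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 3, 6)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    sil = silhouette_score(data, km.labels_)
    # Inertia always falls as K rises; the silhouette score points to the best K.
    print(k, round(km.inertia_, 1), round(sil, 3))
```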
LO 25.b: Explain the differences among the training, validation, and test data sub-samples, and how each is used.
1) The Training Set: The training set is employed to estimate the model parameters; this is the part of the data from which the computer actually learns how best to represent its characteristics.

2) The Validation Set: It is used to select between competing models. We compare different alternative models to determine which one generalizes best to new data.

3) The Testing Set: This is retained to determine the final chosen model's effectiveness.
A few concerns need to be understood for these data sets:
1) An obvious question is, how much of the overall data available should be used for each sub-sample?
Although there is no fixed rule for how much data should go to each of the sets above, a typical allocation is two-thirds of the data going to the training set, one-sixth to the validation set, and the remaining one-sixth to the test set.

2) What if the data set is small?

If the training sample is too small, this can introduce biases into the parameter estimation, whereas if the validation sample is too small, model evaluation can be inaccurate, making it hard to identify the best specification.

If the data set is relatively small, k-fold cross-validation may be utilized.

This technique combines the training and validation data into a single sample, with the combined data (n observations) allocated into k sub-samples.

Note : The larger the data set, the lower the risk of improper allocations.
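A minimal scikit-learn sketch of the typical 2/3 - 1/6 - 1/6 allocation, followed by a k-fold split for the small-data case, is shown below (the data are simulated):

```python
# Minimal sketch of train/validation/test splitting and k-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(600).reshape(300, 2)
y = np.arange(300)

# Typical allocation: 2/3 training, 1/6 validation, 1/6 test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=1/3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# For a small data set: combine training and validation data and use k-fold CV.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(fold, len(train_idx), len(val_idx))
```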
LO 25.c: Understand the differences between and consequences of underfitting and overfitting, and propose potential remedies for
each
Overfitting:
Overfitting is a situation in which a model is chosen that is "too large" or excessively parameterized.
Overfitting gives a false impression of an excellent specification because the error rate on the training set will be very low (possibly close to zero). However, when applied to other (test) data, the model's performance will likely be poor and the model will not be able to generalize well.

Overfitting is associated with lower bias and higher variance.

Underfitting:
Underfitting is the opposite problem to overfitting and occurs when relevant patterns in the data remain uncaptured by the model.

Underfitting is associated with higher bias and lower variance errors.

Bias: the difference between the actual values and the values predicted by the model.
Variance: the estimation error that arises because the model is overly sensitive to the particular training sample.
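The sketch below illustrates the trade-off on simulated data: a degree-1 polynomial underfits (high bias), while a high-degree polynomial drives the training error toward zero but typically performs worse on the validation data (high variance). The data set and the polynomial degrees are arbitrary choices for illustration:

```python
# Minimal sketch of underfitting vs. overfitting using polynomial fits.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=40)
x_train, y_train = x[::2], y[::2]    # half the points for training
x_val, y_val = x[1::2], y[1::2]      # the other half for validation

for degree in (1, 3, 12):
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coefs, x_val) - y_val) ** 2)
    # Training error falls with degree; validation error eventually rises.
    print(degree, round(train_err, 3), round(val_err, 3))
```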
LO 25.h: Explain how reinforcement learning operates and how it is used in decision-making.
Reinforcement learning
Reinforcement learning involves the creation of a policy for decision-making, with the goal of maximizing a reward.
It uses a trial-and-error approach. Examples: game playing, robotics, and finding the best investment strategy.

The key areas of reinforcement learning are known as states, actions, and rewards:
a) States (S): define the environment.
b) Actions (A): represent the decisions taken.
c) Rewards (R): maximized when the best possible decision is made.

Reinforcement learning has many applications in risk management.


For example, it is used to determine the optimal way to buy or sell a large block of shares, to determine how a portfolio should be
managed, and to hedge derivatives portfolios

A disadvantage of reinforcement learning algorithms is that they tend to require larger amounts of training data than other machine-
learning approaches.

To determine actions taken for each state, the algorithm will choose between the best action already identified (known as exploitation)
and a new action (known as exploration).
The probability assigned to exploitation and exploration is p and 1 – p, respectively.
As more trials are completed and the algorithm has learned the superior strategies, the value of p increases

The Q-value, Q(S, A), is the expected value of taking an action (A) in a certain state (S). The best action to take in any given state (S) is the value of A that maximizes Q(S, A).

The Monte Carlo method may be deployed to evaluate actions (A) taken in states (S) and the subsequent rewards (R) that may result. With the α parameter set at a number such as 0.01 or 0.05, the Q-value is updated after each trial as: Q_new(S, A) = Q_old(S, A) + α[R − Q_old(S, A)], where R is the total subsequent reward observed.

The temporal difference learning method, an alternative to the Monte Carlo method, assumes the best strategy identified thus far is the one that will be followed going forward and looks only one decision ahead.
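The sketch below illustrates exploitation versus exploration and the Q-value update for a single state with three possible actions. The reward distributions, the schedule for p, and the update rule Q ← Q + α(R − Q) are illustrative assumptions rather than the exact example from the reading:

```python
# Minimal sketch of exploitation vs. exploration with a simple Q-value update,
# for one state with three possible actions. All numbers are hypothetical.
import random

true_mean_reward = [1.0, 2.0, 1.5]   # hypothetical expected reward of each action
Q = [0.0, 0.0, 0.0]                  # estimated Q-value of each action
alpha, p = 0.05, 0.0                 # learning rate; p = probability of exploitation

for trial in range(1, 5001):
    if random.random() < p:
        action = max(range(3), key=lambda a: Q[a])   # exploitation: best action so far
    else:
        action = random.randrange(3)                 # exploration: a random action
    reward = random.gauss(true_mean_reward[action], 1.0)
    Q[action] += alpha * (reward - Q[action])        # move the estimate toward the observed reward
    p = min(0.95, trial / 5000)                      # exploit more as learning progresses

print([round(q, 2) for q in Q])  # the estimate for action 1 should end up highest
```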
LO 25.f: Be aware of natural language processing and how it is used.
Natural language processing (NLP): Sometimes also known as text mining, NLP is an aspect of machine learning concerned with understanding and analyzing human language, both written and spoken.
Examples: automated virtual assistants, and classifying newswire statements (e.g., as corporate news, educational news, and so on).
Benefits of NLP: NLP offers speed and document review free of the inconsistencies or bias found in human reviews.

Steps in NLP :
1) Capturing the language in a transcript or a written document; 2) Pre-processing the text; and 3) Analyzing it for a particular purpose.

Preprocessing text requires the following steps:


1) Tokenize: The document must be tokenized, which means identifying only the words (i.e., removing punctuation, symbols, and spacing) and converting them all to lowercase.
2) Removing stopwords such as “the,” “has,” and “a.” These words are designed to make sentences flow but have no other value.
3) Stemming, which means replacing words with their stems. For example, “arguing,” “argued,” and “argues” map to “argu.”
4) Lemmatization, which is replacing words with their lemmas. For example, “worse” maps to “bad.” This is a similar concept to
stemming, but the lemma will be an actual word.
5) N-grams may be considered, which are groups of words that have meaning when placed together as opposed to being considered
individually. For example, the trigram “exceed analyst expectations” may be more meaningful than the separate words “exceed,”
“analyst,” and “expectations.”
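The sketch below walks through the preprocessing steps using only the Python standard library; the sentence, stopword list, and crude suffix-stripping "stemmer" are simplified stand-ins for what a real NLP library would do (lemmatization is omitted because it requires a dictionary):

```python
# Minimal sketch of NLP preprocessing: tokenize, remove stopwords, stem, form n-grams.
import re

text = "The company has exceeded analyst expectations, and profits are growing."
stopwords = {"the", "has", "a", "and", "are"}

# 1) Tokenize: keep only the words, in lowercase.
tokens = re.findall(r"[a-z]+", text.lower())

# 2) Remove stopwords.
tokens = [t for t in tokens if t not in stopwords]

# 3) Stem (very crudely) by stripping common suffixes.
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [stem(t) for t in tokens]

# 5) Form bigrams (n-grams with n = 2) from neighbouring tokens.
bigrams = list(zip(tokens, tokens[1:]))

print(tokens)
print(stems)
print(bigrams)
```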
(Figure: worked example of tokenization, stopword removal, stemming, and lemmatization.)

NLP will use an inventory of sentiment words to assess whether documents such as corporate news releases should be considered positive, negative, or neutral.
