Intro To Machine Learning
LO 25.a: Discuss the philosophical and practical differences between machine learning techniques and classical econometrics.
What is Machine Learning?
Machine learning (ML) is an umbrella term that covers a range of techniques in which a model is trained to recognize patterns in data, to suit a range of applications including prediction and classification.
In machine learning, the data decide what the model will include, with no specific hypothesis from an analyst tested as part of the process. This is the key philosophical difference from classical econometrics, where the analyst specifies a model in advance and tests a hypothesis against the data.
Labeled Data
Labeled data contains both the input features and the known output (label) for each observation. Supervised learning requires labeled data; unsupervised learning works with unlabeled data.
Feature Scaling
1) Standardization : This involves subtracting the sample mean of each variable from all observations on that variable and dividing by its standard deviation, so that each scaled feature has a mean of 0 and a standard deviation of 1: z = (x − mean) / (standard deviation).
2) Normalization : This rescales each feature to the interval [0, 1] by subtracting the minimum observed value and dividing by the range (maximum minus minimum): (x − min) / (max − min). (The original showed this normalization calculation for bank A and the customers feature.)
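A minimal sketch of both scaling calculations in Python (the customer counts below are invented for illustration; they are not the figures from the reading's bank example):

import numpy as np

# Hypothetical feature: number of customers (in thousands) at five banks
customers = np.array([120.0, 85.0, 300.0, 45.0, 210.0])

# Standardization: subtract the sample mean, divide by the sample standard deviation
standardized = (customers - customers.mean()) / customers.std(ddof=1)

# Normalization (min-max scaling): rescale to the interval [0, 1]
normalized = (customers - customers.min()) / (customers.max() - customers.min())

print("standardized:", standardized.round(3))
print("normalized:  ", normalized.round(3))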
Data Cleaning
Data cleaning is an important part of machine learning that can take up to 80% of a data analyst’s time.
Large data sets usually have issues that need to be fixed (such as missing values, duplicate records, and outliers), and good data cleaning can make all the difference between a successful and an unsuccessful machine-learning project.
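A brief sketch of common cleaning steps with pandas (the column names, values, and the outlier rule are invented placeholders, not from the reading):

import pandas as pd
import numpy as np

# Invented raw data with typical problems: a duplicate row, a missing value, an outlier
raw = pd.DataFrame({
    "income": [55_000, 55_000, np.nan, 62_000, 9_900_000],
    "age": [34, 34, 41, 29, 38],
})

cleaned = (
    raw.drop_duplicates()   # remove duplicate records
       .dropna()            # drop rows with missing values
)
cleaned = cleaned[cleaned["income"] < 1_000_000]   # crude outlier rule (assumption)
print(cleaned)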
PCA is often applied to yield curve movements, producing a small number of uncorrelated components (principal components) that describe the movements of the curve.
Based on a review of seven Treasury rates over a 10-year period (120 months), the standard deviations of the first four components were 12.96, 5.82, 2.14, and 1.79, so the total variance is equal to the following:
(12.96)² + (5.82)² + (2.14)² + (1.79)² = 209.62
The first three components were responsible for about 99% of the overall variation in yield movements, a consequence of the high correlation between movements in yields of different maturities.
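A hedged sketch of the same technique using scikit-learn's PCA on simulated monthly yield changes (the data-generating process below is invented; only the 120-by-7 shape mirrors the reading's example):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Simulate 120 monthly changes in seven Treasury rates: one dominant
# "level" factor plus a smaller "slope" factor, so yields move together.
level = rng.normal(0, 0.12, size=(120, 1))    # common parallel shift
slope = rng.normal(0, 0.03, size=(120, 1))
maturities = np.linspace(-1, 1, 7)            # 7 points along the curve
changes = level + slope * maturities + rng.normal(0, 0.01, size=(120, 7))

pca = PCA()
pca.fit(changes)

# Share of total variance explained by the first three components
print(pca.explained_variance_ratio_[:3].sum())   # close to 1 for this simulated data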
LO 25.e: Describe how the K-means algorithm separates a sample into clusters.
To identify the structure of a dataset, an unsupervised K-means algorithm can be used to separate dataset observations into clusters.
The value K represents the number of clusters and is set by an analyst.
The centers of the data clusters are called centroids and are initially randomly chosen.
Each data point is allocated to its nearest centroid, and then each centroid is recalculated to be at the center of all the data points assigned to it. This process continues until the centroids remain constant.
The Euclidean Distance :
The Euclidean distance is the square root of the sum of the squared differences between each feature for one bank and the corresponding feature for the other bank, summed over all m features:
d(A, B) = √[(x_A,1 − x_B,1)² + (x_A,2 − x_B,2)² + … + (x_A,m − x_B,m)²]
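For concreteness, a short computation of this distance in Python (the standardized feature values for the two banks are invented):

import numpy as np

bank_a = np.array([0.5, -1.2, 0.3])    # standardized features for bank A (invented)
bank_b = np.array([1.1, -0.7, -0.4])   # standardized features for bank B (invented)

distance = np.sqrt(np.sum((bank_a - bank_b) ** 2))   # Euclidean distance
print(round(float(distance), 3))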
• Inertia, a measure of the distance (d) between each data point (j) and its centroid, is defined as the sum of squared distances over all n observations:
inertia = d(1)² + d(2)² + … + d(n)²
One common way to choose K is to plot inertia against K and select the point (the "elbow") beyond which increasing K produces only small reductions in inertia.
As an alternative approach, a silhouette coefficient can be used to choose K by comparing the distance between an observation and
other points in its own cluster to its distance to data points in the next closest cluster.
The highest silhouette score will produce the optimal value of K.
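A minimal sketch of K-means with silhouette-based choice of K, using scikit-learn on invented two-feature data (the blob locations and the range of K values tried are assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Invented data: three blobs of points in a two-feature space
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Try several values of K and keep the one with the highest silhouette score
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    score = silhouette_score(data, model.labels_)
    print(f"K={k}: inertia={model.inertia_:.1f}, silhouette={score:.3f}")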
LO 25.b: Explain the differences among the training, validation, and test data sub-samples, and how each is used.
1) The Training Set : The training set is employed to estimate model parameters; this is the part of the data from which the computer actually learns how best to represent its characteristics.
2) The Validation Set : This is used to select between competing models, comparing the alternatives to determine which one generalizes best to new data.
3) The Testing Set : This is retained to determine the final chosen model’s effectiveness.
A few concerns need to be understood for these data sets:
1) An obvious question is: how much of the overall data available should be used for each sub-sample?
Although there is no set allocation for how much a given sample should go to the respective sets above, a typical allocation is two-
thirds of the data going to the training set, one-sixth going to the validation set, and the other one-sixth going to the test set.
Note : The larger the data set, the lower the risk of improper allocations.
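A quick sketch of the two-thirds / one-sixth / one-sixth allocation in Python (the data set is a placeholder, and shuffling before splitting is an assumption rather than something the reading specifies):

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(1200, 5))   # placeholder data set: 1200 rows, 5 features

# Shuffle rows, then allocate 2/3 to training, 1/6 to validation, 1/6 to test
shuffled = rng.permutation(data)
n = len(shuffled)
train = shuffled[: 2 * n // 3]
validation = shuffled[2 * n // 3 : 5 * n // 6]
test = shuffled[5 * n // 6 :]

print(len(train), len(validation), len(test))   # 800 200 200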
LO 25.c: Understand the differences between and consequences of underfitting and overfitting, and propose potential remedies for each.
Overfitting :
Overfitting is a situation in which a model is chosen that is "too large" or excessively parameterized.
Overfitting gives a false impression of an excellent specification because the error rate on the training set will be very low (possibly close to zero). However, when applied to new (test) data, the model's performance will likely be poor and the model will not be able to generalize well. Potential remedies include choosing a smaller model with fewer parameters or obtaining more training data.
Underfitting :
Underfitting is the opposite problem to overfitting and occurs when relevant patterns in the data remain uncaptured by the model, so performance is poor even on the training set. Potential remedies include using a larger or more flexible model and adding relevant features.
Bias - The systematic difference between the actual and predicted values of the data; an underfitted model has high bias.
Variance - Estimation error, i.e., the sensitivity of the fitted model to the particular training sample used; an overfitted model has high variance.
Choosing a model therefore involves a trade-off between bias and variance.
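A small numerical sketch of under- and overfitting using polynomial trend lines of different orders (the data-generating process and degrees are invented; only the typical pattern of training versus test errors matters):

import numpy as np

rng = np.random.default_rng(7)

# Invented data: a quadratic signal plus noise
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.2, size=30)
x_test = np.linspace(0, 1, 30) + 0.5 / 30   # slightly shifted held-out points
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2 + rng.normal(0, 0.2, size=30)

for degree in (1, 2, 10):   # underfit, about right, overfit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE={train_err:.4f}, test MSE={test_err:.4f}")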
LO 25.h: Explain how reinforcement learning operates and how it is used in decision-making.
Reinforcement learning
It involves the creation of a policy for decision-making, with the goal of maximizing rewards.
It uses a trial-and-error approach. Examples: game playing, robotics, and finding the best investment strategy.
The key areas of reinforcement learning are known as states, actions, and rewards:
a) States (S): define the environment.
b) Actions (A): represent the decisions taken.
c) Rewards (R): maximized when the best possible decision is made.
A disadvantage of reinforcement learning algorithms is that they tend to require larger amounts of training data than other machine-
learning approaches.
To determine actions taken for each state, the algorithm will choose between the best action already identified (known as exploitation)
and a new action (known as exploration).
The probability assigned to exploitation is p and to exploration is 1 − p.
As more trials are completed and the algorithm has learned the superior strategies, the value of p increases.
The Q-value, Q(S, A), is the expected value of taking an action (A) in a certain state (S). The best action to take in any given state (S) is the value of A that maximizes Q(S, A).
The Monte Carlo method may be deployed to evaluate actions (A) taken in states (S) and the subsequent rewards (R) that may result. After a trial in which taking action A in state S produced a total subsequent reward of G, the Q-value is updated as:
Q_new(S, A) = Q_old(S, A) + α[G − Q_old(S, A)]
with the α parameter set at a number like 0.01 or 0.05.
The temporal difference learning method, an alternative to the Monte Carlo method, assumes the best strategy identified thus far is the one to be followed going forward and looks only one decision ahead. In standard notation, where S′ is the next state and A′ the best-known action there, G is replaced by the immediate reward plus the current estimate of the next state's value:
Q_new(S, A) = Q_old(S, A) + α[R + Q(S′, A′) − Q_old(S, A)]
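A compact sketch of these ideas on a one-state, multi-action problem (a "bandit"), where the total subsequent reward G equals the immediate reward R. The reward values and the schedule for raising p are invented; the update line uses the α-weighted form above:

import random

random.seed(3)

true_rewards = [1.0, 2.0, 1.5]   # invented expected reward per action
q = [0.0, 0.0, 0.0]              # Q-value estimate per action
alpha = 0.05                     # learning-rate parameter
p = 0.5                          # probability of exploitation

for trial in range(2000):
    if random.random() < p:
        action = max(range(3), key=lambda a: q[a])    # exploitation: best known action
    else:
        action = random.randrange(3)                  # exploration: random action
    reward = random.gauss(true_rewards[action], 1.0)  # noisy observed reward
    q[action] += alpha * (reward - q[action])         # Q_new = Q_old + alpha*(R - Q_old)
    p = min(0.95, p + 0.0005)                         # raise p as learning progresses

print([round(v, 2) for v in q])   # estimates should approach the true rewards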
LO 25.f: Be aware of natural language processing and how it is used.
Natural language processing (NLP), sometimes also known as text mining, is an aspect of machine learning that is concerned with understanding and analyzing human language, both written and spoken.
Examples : automated virtual assistants, and classifying newswire statements by topic (e.g., corporate, educational, and so on).
Benefits of NLP : NLP offers the benefits of speed and document review without the inconsistencies or bias found in human reviews.
Steps in NLP :
1) Capturing the language in a transcript or a written document; 2) Pre-processing the text; and 3) Analyzing it for a particular purpose.
Typical pre-processing steps include:
• Stop-word removal: dropping very common words (e.g., "the", "a", "of") that carry little meaning.
• Stemming: cutting words back to a common stem (e.g., "running" and "runs" both become "run").
• Lemmatization: mapping each word to its dictionary base form (lemma), taking context and part of speech into account (e.g., "better" becomes "good").
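A toy sketch of these pre-processing steps in plain Python (the stop-word list and the one-suffix stemming rule are deliberately tiny stand-ins for what a real NLP library provides):

# Toy pre-processing: lowercase, strip punctuation, remove stop words, crudely "stem"
stop_words = {"the", "a", "an", "of", "is", "and", "to"}

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t.strip(".,!?") for t in tokens]                       # strip punctuation
    tokens = [t for t in tokens if t not in stop_words]              # stop-word removal
    stemmed = [t[:-3] if t.endswith("ing") else t for t in tokens]   # naive stemming
    return stemmed

print(preprocess("The firm is reporting strong earnings and raising guidance."))
# ['firm', 'report', 'strong', 'earnings', 'rais', 'guidance']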
NLP will use an inventory of sentiment words to assess whether things like corporate news releases are considered positive, negative, or neutral.
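A minimal word-count sentiment sketch (the sentiment word lists below are invented placeholders; real applications use large curated dictionaries):

# Toy sentiment scoring: count positive vs. negative words in a release
positive_words = {"growth", "profit", "strong", "beat", "upgrade"}
negative_words = {"loss", "decline", "weak", "miss", "downgrade"}

def sentiment(text):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Strong profit growth despite a weak quarter."))   # positive (3 - 1)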