Summary - Data Analytics & Machine Learning
Types of algorithms:
1. supervised machine learning: labelled data, direct feedback, predict an outcome/future
2. unsupervised machine learning: no data labels, no feedback, search for hidden patterns
Unsupervised learning is machine learning that does not make use of labelled data.
Variable Types: Categorical: Nominal (e.g., hair color), Ordinal (ordered categories, i.e., indicating rank, e.g., age group); Numerical:
Discrete (e.g., goals scored), Continuous (e.g., time till dinner)
Encoding Categorical Variables: Most machine learning algorithms work almost
exclusively with numeric data. Therefore, we need to encode categorical features into
numerical features, i.e., convert categorical variables into numerical ones. There are two conversion methods:
Label Encoding: replace each categorical value with a numeric value between 0 and number of
classes - 1. (Drawback: it creates the appearance of a relationship that does not exist in reality;
e.g., 0 is smaller than 1, but the categories they stand for, such as BS and IS, have no such ordering.)
One-hot Encoding: for each category of a feature, we create a new column (dummy variable)
with a binary encoding to denote whether a particular row belongs to this category (the number
of columns equals the number of categories). (Drawback: dimensionality; the more categories,
the more columns, which takes up a lot of space and can cause memory issues.) drop_first: determines whether to get k-1 dummies out of k categorical
levels by removing the first level (the default is False).
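As an illustration, a minimal sketch of both encodings, assuming pandas and scikit-learn and a made-up "degree" column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical categorical feature
df = pd.DataFrame({"degree": ["BS", "IS", "MS", "BS"]})

# Label encoding: one integer per category (implies an ordering that may not exist)
df["degree_label"] = LabelEncoder().fit_transform(df["degree"])

# One-hot encoding: one dummy column per category; drop_first=True keeps k-1 dummies
dummies = pd.get_dummies(df["degree"], prefix="degree", drop_first=True)
print(pd.concat([df, dummies], axis=1))
```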
Data Transformation:
Scaling (data transformation) is the process of adjusting the range of a feature by shifting and
changing the scale of the data. Variables such as age and income can have a diversity of ranges
that result in a heterogeneous training dataset → scaling brings them to the same scale. The main techniques are:
1 Min-max scaling = (x - min)/(max - min) → feature values are in the range [0, 1];
2 Standard scaling = (x - mean)/standard deviation → especially useful for normally distributed data, which becomes
standard normally distributed; it is a process of centering (subtracting the mean so the new mean is zero)
and scaling (dividing by the standard deviation) → removes the mean and scales the data to unit variance;
3 Robust scaling = (x - median)/(75th quantile - 25th quantile) → advantage: not affected by outliers, i.e., it largely ignores outliers
(data points very different from the others, e.g., measurement errors).
Which one to choose? For normally distributed data, use standard scaling; otherwise use min-max
scaling. Both are sensitive to outliers.
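A minimal sketch of the three scalers, assuming scikit-learn and a small made-up age/income matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# made-up age and income values with one income outlier
X = np.array([[25, 30_000], [40, 55_000], [35, 48_000], [60, 250_000]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # (x - min) / (max - min): values in [0, 1]
print(StandardScaler().fit_transform(X))  # (x - mean) / std: zero mean, unit variance
print(RobustScaler().fit_transform(X))    # (x - median) / IQR: robust to the outlier
```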
Binning (discretization): makes linear models more powerful on continuous data by splitting a feature
up into multiple features (e.g., age groups for different ages); Interactions: if x1 and x2 are the
values of two features, x1*x2 represents the interaction between the two; Polynomials: a form of
regression analysis in which the relationship between the independent variable x and the dependent
variable y is modelled as an nth-degree polynomial in x. Using polynomial features together with a
linear regression model yields the classical model of polynomial regression.
Univariate nonlinear transformations: adding squared or cubed features can help linear models for
regression; log (applied as log(x + 1), since log is not defined at 0), exp, or sin can help capture nonlinear relationships. Such transformations can
improve inference, but they need to be applied using expert judgement.
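A minimal sketch of binning, polynomial features and a log transform, assuming scikit-learn (the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

rng = np.random.RandomState(0)
ages = rng.uniform(18, 80, size=(100, 1))   # hypothetical continuous feature

# Binning: split the continuous feature into 5 ordered age groups (one-hot encoded)
binned = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="uniform").fit_transform(ages)

# Polynomials: adds x, x^2, x^3 (and interaction terms when several features are given)
poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(ages)

# Univariate nonlinear transform: log(x + 1), since log is not defined at 0
logged = np.log1p(ages)
```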
Statistics vs. ML: number of observations – samples; data set size – sample size; variables –
features; dependent variable – label; coefficient – weight.
Applying the ML algorithm to a data set to infer the pattern between the inputs and output is
called “training” the algorithm. Once the algorithm has been trained, the inferred pattern can
be used to predict output values based on new inputs.
Training set: On this part of the data, we train the ML model. Test set: On this part of the
data, we test our model’s performance.
Why do we split the data into training and test data? Answer: to prevent overfitting.
Overfitting refers to learning a function that perfectly explains the training data that the
model learned from but does not generalize well to unseen test data. If done improperly, data
leakage can occur through which one can introduce biases into the data.
Problems with a basic split: For time-series data, set "shuffle=False" to maintain the time
order of the data, which is important for time series. For imbalanced data, set "stratify=<variable
name>" to ensure that both the training and test sets have, as far as possible, identical distributions
of the specified variable.
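A minimal sketch of the three split variants, assuming scikit-learn and a synthetic imbalanced dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = (rng.rand(1000) < 0.1).astype(int)   # imbalanced label: roughly 10% positives

# basic 80:20 split (shuffled by default)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# time-series data: keep the chronological order
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

# imbalanced data: preserve the class distribution of y in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
```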
Classification vs. Regression: Classification is the task of predicting a discrete class label.
Regression is the task of predicting a continuous quantity. Both use labelled variables to make
predictions.
No free lunch theorem: Each model is a simplification of reality. Simplification is based on
assumptions. Assumptions fail in certain situations. No model works best for all possible
situations!
Use of supervised machine learning in finance: Credit default prediction, derivative
pricing, Robo-advisory, Stock price prediction, asset allocation.
Linear Regression: Linear regression, or ordinary least squares (OLS), is a linear model, i.e.,
a model that assumes a linear relationship between the input variables (x) and the single
output variable (y).
Example: 30,000 (rows) × 25 (columns), with 1 column for the default status as a dummy variable. .pop
separates the dependent variable from the independent variables: now 30,000 × 24 for X
and 30,000 × 1 for Y. Split the data into a train and a test set 80:20. Now the training data is
X_train: 24,000 × 24 and Y_train: 24,000 × 1, and the test data is X_test: 6,000 × 24 and Y_test: 6,000 × 1. The ML
model learns on X_train and Y_train. Once it has learned, it applies y = f(X) to
X_test and produces y_pred. .fit trains the algorithm with the training data to determine the
weights. .predict lets the algorithm make predictions on the test data. The predictions are then
compared with the actual results, and accuracy can be calculated based on the achieved
performance.
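A minimal sketch of this .pop / split / .fit / .predict workflow; the data here are random stand-ins for the 30,000 × 25 default table, and LogisticRegression is used only as an illustrative classifier for the default dummy:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the 30,000 x 25 table described above
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(30_000, 24)), columns=[f"x{i}" for i in range(24)])
df["default"] = (rng.rand(30_000) < 0.2).astype(int)

y = df.pop("default")                 # .pop separates the label; df keeps the 24 features
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)           # .fit learns the weights from the training data
y_pred = model.predict(X_test)        # .predict applies y = f(X) to the unseen test data
print(accuracy_score(y_test, y_pred))
```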
Training the model in two steps: S1: Define a loss function that measures how inaccurate the model's
predictions are (residual sum of squares (RSS): the sum of the squared differences between
the actual and predicted values). S2: Find the parameters that minimize the loss: look at the difference
between each real data point and the model's prediction, square the differences to avoid negative
numbers and to penalize large differences, then add them up (and take the average, which yields the mean squared error).
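In code, with made-up numbers, the loss computation looks roughly like this:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (made up)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions (made up)

residuals = y_true - y_pred
rss = np.sum(residuals ** 2)    # residual sum of squares: the loss to be minimized
mse = np.mean(residuals ** 2)   # averaged version of the same idea (mean squared error)
```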
Strengths: easy to understand and interpret. Linear regression has no hyperparameters to set, but it also
has no way to control model complexity. Main weaknesses: prone to overfitting;
relies on the assumption of no multicollinearity; does not work well when there is a non-linear relationship
between the predicted and predictor variables; offers no complexity control. Main parameters: none.
Since linear regression faces the following challenges, regularized regression is used.
Challenges of OLS: 1 Interpretability: OLS cannot distinguish variables with little or no
influence; these variables distract from the relevant regressors. 2 Overfitting: OLS works
well when the number of observations m is much bigger than the number of predictors p.
If m ≈ p, overfitting results in low accuracy on unseen observations. If
m ≤ p, the variance of the estimates is infinite and OLS fails. As a remedy, one can identify only the
relevant variables by feature selection.
The main idea of regularized regression: fit linear models with least squares but impose
constraints on the coefficients. Regularization means explicitly restricting a model to avoid
overfitting. Simply put, it is a penalty mechanism that applies shrinkage to the model parameters.
There are three ways to do this:
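The three standard variants are ridge (L2 penalty), lasso (L1 penalty) and elastic net (both combined). A minimal sketch of the first two, assuming scikit-learn and synthetic data in which only one feature matters:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)   # only the first feature is relevant

ols = LinearRegression().fit(X, y)    # no constraint on the coefficients
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty can set irrelevant coefficients exactly to zero

print((lasso.coef_ != 0).sum(), "non-zero coefficients out of", X.shape[1])
```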
Shuffle-split Cross Validation: each split samples train_size many points for the training set
and test_size many (disjoint) points for the test set. This splitting is repeated n_iter times.
Shuffle-split works iteratively, whereas KFold just divides the dataset into k folds.
Cross Validation with Groups: used when there are groups in the data that are highly
related → ensure that the training and test sets contain, e.g., images of different people.
Example: logistic regression (C=10) with 10-fold cross-validation, i.e., the model is trained and
evaluated 10 times. Average accuracy = sum of the 10 accuracy scores / 10.
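A minimal sketch of k-fold and shuffle-split cross-validation, assuming scikit-learn and its built-in breast-cancer dataset as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(C=10, max_iter=5000)

# k-fold: the dataset is divided into 10 disjoint folds
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10))

# shuffle-split: repeatedly sample disjoint train/test sets of the given sizes
ss = ShuffleSplit(n_splits=10, train_size=0.8, test_size=0.2, random_state=0)
ss_scores = cross_val_score(model, X, y, cv=ss)

print(kfold_scores.mean())   # average accuracy = sum of the 10 scores / 10
```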
Grid Search(improve the model’s generalization performance by tuning its parameters):
The most commonly used method is grid search, which basically means trying multiple
possible combinations of the parameters.
Parameters vs. Hyperparameters: Parameters: learnt during training; internal to model;
e.g., node weights in a NN Hyperparameters: cannot be learnt but set beforehand; external to
model; e.g., learning rate, hidden layers.
Combining Cross Validation and Grid Search (Grid Search with Cross Validation): while the
method of splitting the data into a training, a validation and a test set is workable, it is quite
sensitive to how exactly the data is split. For a better estimate of generalization performance,
we can use cross-validation to evaluate the performance of each parameter combination. Use
GridSearchCV and specify the parameters you want to search over using a dictionary.
GridSearchCV will then perform all the necessary model fits. GridSearchCV uses cross-validation
in place of the split into a training and validation set that we used before. However,
we still need to split the data into a training and a test set to avoid overfitting the
parameters. With 10-fold CV and 6 values each for gamma and C → 6 × 6 × 10 = 360 models need to be
trained. A heat map can be used to visualize the accuracy scores. It is important to make sure that the
ranges for the parameters are large enough.
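A minimal sketch of grid search with cross-validation, assuming scikit-learn; the 6 × 6 grid of C and gamma values mirrors the 360-model example above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# keep a separate test set so the tuned parameters are evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid, cv=10)   # 6 * 6 * 10 = 360 model fits
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
print(grid.score(X_test, y_test))   # generalization estimate on the held-out test set
```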
Supervised Performance Metrics:
Regression: Mean Absolute Error(MAE): measures the average magnitude of the errors in
a set of forecasts, without considering their direction. Linear score, which means that all the
individual differences are weighted equally in the average. It gives an idea of how wrong the
predictions were. The measure gives an idea of the magnitude of the error, but no idea of the
direction (e.g., over- or underpredicting). Mean Squared Error (MSE): represents the
average squared difference between the actual values and the estimated values (residuals).
Root mean squared error (RMSE): the square root of the MSE. R^2: the "goodness of fit" of the predictions to the actual
values; in statistics, this measure is also called the coefficient of determination. Adjusted
R^2: measures how well terms fit a curve or line but adjusts for the number of predictors in the model.
For predictive accuracy → RMSE is the best choice. For explanatory purposes, i.e., indicating how well
the selected independent variable(s) explain the variability in the dependent variable → R^2
and adjusted R^2 are often used.
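A minimal sketch of these regression metrics, assuming scikit-learn and made-up predictions (n and p for the adjusted R^2 are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# adjusted R^2 penalizes the number of predictors p given the sample size n
n, p = 100, 5   # hypothetical values
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```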
Classification: here binary classification problems.
Confusion Matrix: True positives (TP) Predicted positive and are actually positive. False
positives (FP) = Type I error Predicted positive and are actually negative. True negatives
(TN) Predicted negative and are actually negative. False negatives (FN) = Type II error
Predicted negative and are actually positive. Accuracy (A) = (TP + TN)/(TP + FP + TN + FN)
or Accuracy (A) = (TP + TN)/Total. Precision (P) = TP/(TP + FP). Recall (R) = TP/(TP +
FN) Accuracy is the number of correct predictions made as a ratio of all predictions made.
This is the most common evaluation metric for classification problems and is also the most
misused. It is most suitable when there are an equal number of observations in each class
(which is rarely the case) and when all predictions and the related prediction errors are
equally important, which is often not the case. In the case of imbalanced data (data sets where
one of the two classes is much more frequent than the other), e.g., 99% non-fraudulent
transactions and 1% fraudulent ones, always predicting the majority class already yields 99% accuracy.
Imbalanced data is the norm, and it is rare
that the events of interest have equal or even similar frequency in the data. Precision:
percentage of positive instances out of the total predicted positive instances. Precision is also
known as positive predictive value (PPV). Precision is a good measure to determine when the
cost of false positives is high (e.g., email spam detection). Recall is the percentage of positive
instances out of the total actual positive instances. Recall is also known as sensitivity, hit rate,
or true positive rate (TPR). Recall is a good measure when there is a high cost associated with
false negatives (e.g., fraud detection, disease diagnostics). F1 score: harmonic mean of
precision and recall. F1 score = (2 * P * R)/(P + R). F1 score is more appropriate than
accuracy when unequal class distribution is in the dataset and it is necessary to measure the
equilibrium of precision and recall. High scores on both of these metrics suggest good model
performance.
Area under ROC curve (AUC): evaluation metric for binary classification problems.
Receiver Operating Characteristic (ROC): probability curve, and AUC represents degree
or measure of separability. It tells how much the model is capable of distinguishing between
classes.
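A minimal sketch of the classification metrics above, assuming scikit-learn and made-up labels and probabilities:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual classes (made up)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # predicted classes
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted probabilities for class 1

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # 2PR / (P + R)
print(roc_auc_score(y_true, y_prob))     # AUC uses scores/probabilities, not hard labels
```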
Model Selection Criteria: Simplicity, Training time, Presence of non-linearity in the data,
Robustness to overfitting, Size of the dataset, Number of features, Model interpretation
Computational linguistics, also known as natural language processing (NLP): the
subfield of computer science concerned with using computational techniques to learn,
understand, and produce human language content. Natural language processing (NLP) is a
branch of AI that deals with the problems of making a machine understand the structure and
the meaning of natural language as used by humans. Several techniques of machine learning
and deep learning are used within NLP.
Goals of NLP: 1 aiding human-human communication, (e.g., in machine translation (MT)); 2
aiding human-machine communication (e.g., with conversational agents); or 3 benefiting both
humans and machines by analyzing and learning from the enormous quantity of human
language content that is now available online.
NLP has many applications in the finance sectors in areas such as: 1 sentiment analysis 2
chatbots 3 document processing 4 risk management (liquidity risk management, credit default
modelling, etc.)
Automation: Automation using NLP is well-suited in the context of finance. It reduces the
strain that repetitive, low-value tasks put on human employees. It tackles the routine,
everyday processes, freeing up teams to finish their high-value work. In doing so, it drives
enormous time and cost savings.
Sources of text: A lot of information, such as sell-side reports, earnings calls, and
newspaper headlines, is communicated in text form, making NLP very useful in the
financial domain (e.g., the Refinitiv Transcripts database, which covers historical archives in the market).
Uses of text: parsing documents. It is unfortunately quite common for companies to
obscure machine-readable disclosure by inserting tables into documents in "picture
format" (.jpg, .png, etc.). A potential problem is that such tables cannot be easily read as text,
but the pictures can be parsed using optical character recognition (OCR).
Terminology: data set is often called corpus. Each data point, represented as a single text, is
called a document. Token is equivalent to a word.
Types of data represented as strings: Not every string is text data; there are four
kinds of string data: categorical data, free strings that can be semantically mapped to
categories, structured string data, and text data. Categorical data can easily be mapped to a
variable, e.g., balance sheet, cash-flow statement, income statement; free strings come from the
same sources with the same subcategories as above, but contain shortcuts and
misspellings such as "P&L statement" or "CF statement"; structured string data are data items with a
certain underlying structure, such as addresses and phone numbers; text data are phrases and sentences
that do not belong to any of the above groups.
NLP processing pipeline: consists of preprocessing (tokenization, stemming, lemmatization,
PoS tagging, named entity recognition, stop words removal), feature representation (BoW,
Co-occurrence matrix, TF-IDF; Word2vec, GloVe) and Inference (Supervised, unsupervised
and reinforcement).
Data preprocessing: 3 Python packages for data preprocessing: NLTK, TextBlob and spaCy.
Tokenization: splitting a text into meaningful segments (tokens), which can be words,
punctuation, numbers or other special characters that are the building blocks of a sentence.
Stop words removal: helps remove extremely common words that offer little value in
modeling. In finance, we do not always drop all stop words, because some of them can play
an important role in, e.g., differentiating sentiment.
Both lemmatization and stemming are methods for normalization that try to extract some
normal form of a word. These words would likely result in overfitting and poor generalization
performance, e.g., vocabulary contains singular and plural versions of some words, different
verb forms and a noun relating to the verb. Stemming: each word can be represented using its
stem. It is done by using a rule-based heuristic like dropping common suffixes. (E.g.,
connection, connections, connective, connected, connecting → connect). The process is referred
to as lemmatization when a dictionary of known word forms is used and the role of a word in the
sentence is important. Lemmatization is the process of converting inflected forms of a word
into its morphological root. For example, the lemma of analyzed and analyzing is analyze.
Lemmatization is computationally more expensive and advanced. The difference between the
two processes is that stemming can often create nonexistent words, whereas lemmas are
actual words.
Part-of-Speech(PoS) tagger: uses language structure and dictionaries to tag every token in
the text with a corresponding part of speech. Some common POS tags are noun, verb, adj…
Named entity recognition (NER): an optional step to locate and classify named entities in
text into predefined categories, e.g., names of persons, locations, …
Additional preprocessing methods: lowercasing (removes distinctions among the same
words due to upper and lower case), removal of non-alphanumeric characters (i.e., any
characters that are not letters or digits, such as ">"; filter() can be used to remove all non-
alphanumeric characters from a string), dependency parsing (extracting the dependency parse of
a sentence to represent its grammatical structure; it defines the dependency relationship
between headwords and their dependents), and coreference resolution (connecting tokens
that represent the same entity, e.g., "Tom has a headache. He did not sleep well." → Tom = He).
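A minimal sketch of tokenization, stop-word removal, stemming and lemmatization, assuming NLTK (the example sentence is illustrative; depending on the NLTK version, extra resources may need to be downloaded):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time downloads of the required NLTK resources
for res in ("punkt", "stopwords", "wordnet"):
    nltk.download(res)

text = "The company's connections to suppliers are not improving."
tokens = word_tokenize(text.lower())   # tokenization + lowercasing

# stop-word removal; note that in finance, negations such as "not" are often kept
filtered = [t for t in tokens if t.isalnum() and t not in stopwords.words("english")]

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])                  # rule-based, may create non-words
print([lemmatizer.lemmatize(t, pos="v") for t in filtered]) # dictionary-based morphological root
```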
Feature representation: Word embeddings convert textual data into numerical data. The
process of converting NLP text into numbers is called vectorization. Two main methods for
computing word embeddings are frequency-based/ count-based (count vectorization, TF-
IDF vectorization and Co-occurrence vectorization) and prediction- based/ learning- based
(pretrained models, e.g., Word2vec and GloVe and customized deep learning-based feature
representation).
In count-based models: the semantic similarity between words is determined by counting the
co-occurrence frequency. Bag of Words (BoW): documents are described by word
occurrences while ignoring the relative position information of the words, which means any
information about the structure of the sentence is lost. Although the resulting matrix can be
very large in memory, the amount of data can be reduced by using sparse matrices. The problem
of BoW is that word order is discarded; a solution could be n-grams. N-grams: consider the
counts of pairs or triplets (or more) of tokens that appear next to each other. N-grams are
representations of word or token sequences. They can offer invaluable contextual information
that complements and enriches unigrams. Co-occurrence matrix: how often things co-occur
in some environment. An alternative to BoW is TF-IDF (Term Frequency-Inverse
Document Frequency): it calculates word frequency scores that try to
highlight words that are more interesting. To get a complete representation of the value of
each word, multiply the TF at the sentence level by the IDF of the word across the entire dataset
(TF-IDF = TF × IDF). TF-IDF values can be useful in measuring the key terms across a compilation of documents and can serve as
word feature values for training an ML model. Higher values indicate
words that appear more frequently within a smaller number of documents, which signifies
relatively more unique terms that are important. Lower values indicate terms that appear in
many documents.
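A minimal sketch of BoW, n-grams and TF-IDF, assuming scikit-learn and a tiny made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["revenue increased this quarter",
          "revenue decreased this quarter",
          "the dividend was cut"]

# Bag of Words: word counts; word order is discarded (stored as a sparse matrix)
bow = CountVectorizer().fit_transform(corpus)

# n-grams: also count pairs of adjacent tokens to recover some word-order information
bigrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(corpus)

# TF-IDF: up-weights terms frequent in a document but rare across the corpus
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))
```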
Prediction-based word embedding techniques: in predictive models, the word vectors are
learnt by trying to improve on the predictive ability (minimizing the loss between the target
word and the context word). Word2Vec: king-man+woman=queen. GloVe: the distance
between king→queen is roughly the same as the one between man→woman, or
brother→sister.
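A minimal sketch of the king - man + woman analogy, assuming the gensim package and an internet connection to download a small set of pretrained GloVe vectors:

```python
import gensim.downloader as api

# downloads pretrained 50-dimensional GloVe vectors on first use
vectors = api.load("glove-wiki-gigaword-50")

# vector arithmetic: king - man + woman should land close to queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# analogous offsets: king -> queen is roughly parallel to man -> woman
print(vectors.similarity("king", "queen"), vectors.similarity("man", "woman"))
```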
Inference: Supervised NLP: Naïve Bayes is one of the most frequently used methods, as it can
produce reasonable accuracy using simple assumptions. Unsupervised NLP: Latent Dirichlet
Allocation (LDA) has been used for topic modelling – NLP practitioners build probabilistic
generative models to reveal likely topic attributions for words, as an unsupervised NLP
method.
Topic modelling (unsupervised ML): provides methods for automatically organizing,
understanding, searching and summarizing large electronic archives by discovering the
hidden themes in the collection, annotating the documents according to these themes and
using annotations to organize, summarize, search and form predictions. LDA tries to find
groups of words (the topics) that appear together frequently, e.g., wordcloud.
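A minimal sketch of LDA topic modelling, assuming scikit-learn and a tiny made-up set of headlines:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stocks rally as earnings beat expectations",
        "bond yields fall on weak inflation data",
        "tech stocks lead the market rally",
        "central bank holds interest rates steady"]

vec = CountVectorizer(stop_words="english").fit(docs)
X = vec.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# print the top words of each discovered topic
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [words[i] for i in topic.argsort()[-4:]])
```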
Sentiment analysis: Robo-readers are automated programs used to analyze large quantities
of text like news articles and social media. In this way, robo-readers are being used by
investors to examine how views expressed in text relate to future company performance.
Robo-readers often look to analyze sentiment polarity, i.e., how positive, negative or neutral a
particular phrase or statement is regarding a target. Sentiment provides invaluable predictive
power, both alone and when coupled with structured financial data, for predicting stock price
movements for individual firms and for portfolios of companies.
Text curation: uses the Financial PhraseBank database and presents the data (cross-sectional data,
not time-series data) in a text document format. The sentiment of each sentence has been
labeled as positive, negative or neutral. The sentiment classes are provided from an
investor's perspective and may be useful for predicting whether a sentence has a
corresponding positive, negative or neutral influence on the respective company's stock price.
A supervised ML model is trained, validated and tested using these data. The final ML model
can be used to predict the sentiment classes of sentences present in similar financial news
statements.
Text cleansing involves removing punctuation, numbers and spaces that may not be
necessary for model training, or incorporating appropriate substitutions for potentially
extraneous information present in the text.
Stop words: not removed because some of them (e.g., not, more, very and few) carry
significant meaning in the financial texts that is useful for sentiment prediction. Some words
like a, an, the can be removed. However, overall to avoid confusion no words are removed.
Document term matrix (DTM): The last step of text preprocessing is using the final BoW
after normalizing to build a document term matrix (DTM). It is a matrix that is similar to a
data table for structured data and is widely used for text data. Each row belongs to a
document and each column represents a token. The number of rows of DTM is equal to the
number of documents in a sample dataset.
Unsupervised ML is used to draw inferences from data sets consisting of input data without
labeled responses. There are two types of unsupervised ML: dimensionality reduction and
clustering. Main challenge: Algorithm performance evaluation, i.e., whether the algorithm
learned something useful.
Dimensionality reduction: process of reducing the number of features or variables while
preserving data information and overall model performance. The most frequently-used
techniques for dimensionality reduction include Principal component analysis (PCA), Kernel
principal component analysis (KPCA) and t-distributed stochastic neighbor embedding (t-
SNE). PCA is a linear algorithm that forces the new variables to be linear
combinations of the original features. Nonlinear algorithms such as KPCA and t-SNE can
capture more complex structures in data. Dimensionality reduction can be used in portfolio
management, yield curve construction and interest rate modelling as well as speed and
accuracy enhancement of a trading strategy.
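A minimal sketch of PCA on a hypothetical panel of asset returns, assuming scikit-learn (the data are random placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical panel of daily returns: 250 days x 30 assets
rng = np.random.RandomState(0)
returns = rng.normal(scale=0.01, size=(250, 30))

X = StandardScaler().fit_transform(returns)
pca = PCA(n_components=3).fit(X)        # keep 3 linear combinations of the original features

print(pca.explained_variance_ratio_)    # share of total variance captured by each component
factors = pca.transform(X)              # reduced 250 x 3 representation
```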
Clustering algorithms (focus on minimizing dissimilarity, i.e., distance between data
points) allows us to discover hidden structures in data. The goal of clustering is to find a
natural grouping in data so that items in the same cluster are more similar to each other than
to those from different clusters, seek to learn, from the properties of the data, an optimal
division or discrete labeling of groups of points based on similarity. Three typical techniques
are k-means clustering, hierarchical and affinity propagation clustering. Clustering can be
used in portfolio construction, investor classification and risk management. For example,
Pairs trading is a non-directional, relative value investment strategy that seeks to identify two
companies or funds with similar characteristics such as Audi and Mercedes-Benz whose
equity securities are currently trading at a price relationship that is out of historical trading
range. This will entail buying the undervalued security while short-selling the overvalued
security, all while maintaining market neutrality. Another example is investor classification to
determine the investor’s ability and willingness to take risk.
Under k-means clustering, we need to tell the model how many groups we want
(a hyperparameter). It is a centroid-based/distance-based algorithm, which tries to find cluster
centers that are representative of certain regions of data. The algorithm assigns each data
point to the closest cluster center and then sets each center as the mean of data points that are
assigned to it. The algorithm is finished when the assignment of instances to clusters do not
change. It divides a set of N samples X into K disjoint clusters S, each described by the mean
of samples in the cluster. The means are commonly called the cluster centroids and they are
not in general points from X, although they live in the same place. The k-means algorithm
aims to choose centroids (center of cluster calculated as arithmetic mean) that minimize the
inertia (a measure of how internally coherent clusters are), also known as the within-cluster
sum-of-squares criterion. Main strengths: simplicity, a wide range of applicability, fast
convergence, and linear scalability to large data while producing clusters of an even size; it is
most useful when the exact number of clusters k is known beforehand. Main weaknesses:
the number of clusters has to be tuned as a hyperparameter, there is no guarantee of finding a global
optimum, it is sensitive to outliers, and it can only capture relatively simple shapes. The
optimal number of clusters in the data can be found by 'knee finding' or 'elbow finding'.
If k increases, average distortion will decrease, each cluster will have fewer constituent
instances, and the instances will be closer to their respective centroids. However, the
improvement in average distortion will decline as k increases. The elbow method plots the
value of the cost function produced by different values of k. Median (here: medoid) is
preferred over the mean in the presence of outliers. K-medoids is an approach to overcome
extreme values in a dataset that can disrupt a clustering solution significantly. K-medoids uses
median/medoids as the center point of a cluster which means that the center of a cluster must
be one of the observations in that cluster.
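A minimal sketch of k-means and the elbow method, assuming scikit-learn and synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# elbow method: inertia (within-cluster sum of squares) for increasing k
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # the improvement flattens after the "elbow"

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```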
Hierarchical clustering involves creating clusters that have a predominant ordering, i.e., a
hierarchy so that we do not need to specify the number of clusters. Two types of hierarchical
clustering: agglomerative and divisive hierarchical clustering. Agglomerative hierarchical
clustering is a bottom-up approach, where each observation starts in its own cluster and pairs
of clusters are merged as one moves up the hierarchy. Divisive one is a top-down approach
where all observations start in one cluster and splits are performed recursively as one moves
down the hierarchy. Hierarchical clustering can be visualized by dendrogram. Main
strengths of agglomerative hierarchical clustering are: no need to pre-specify the number of
clusters; results can be visualized. Main weaknesses: it fails at separating complex shapes in
data structure; the choice of both the distance metric and the linkage criterion is often arbitrary. Main
strength of divisive hierarchical clustering: whereas bottom-up methods make clustering decisions
based on local patterns without taking the global distribution into account, top-down clustering
benefits from complete information about the global distribution when
making top-level partitioning decisions. Main weakness: sensitivity to initialization, because of
the many possible ways of dividing the data into two clusters in the first steps.
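A minimal sketch of agglomerative clustering and a dendrogram, assuming scikit-learn, SciPy and matplotlib with synthetic data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# agglomerative (bottom-up) clustering with Ward linkage
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# dendrogram visualizing the merge hierarchy
dendrogram(linkage(X, method="ward"))
plt.show()
```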
Affinity propagation clustering also does not need to select number of groups and creates
clusters by sending messages between pairs of samples until convergence. Then a data set is
described using a small number of exemplars, which are identified as those most
representative of other samples. The messages sent between pairs represent the suitability for
one sample to be the exemplar of the other, which is updated in response to the values from
other pairs. This updating happens iteratively until convergence, at which point the final
exemplars are chosen and hence the final clustering is given. In comparison with k-means, it
describes a dataset using a small number of exemplars. These are members of the input set
that are representative of clusters. The centroid in k-means clustering does not have to be one
of the data points, while the exemplar in affinity propagation clustering is one of the data
points. Main strengths: choose the number of clusters based on data provided; algorithm is
fast. Main weaknesses: high time and memory complexity; only appropriate for small and
medium-sized data sets; it may converge only to suboptimal solutions and can fail to converge.
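A minimal sketch of affinity propagation, assuming scikit-learn and synthetic data; note that the exemplars are actual data points:

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

ap = AffinityPropagation(random_state=0).fit(X)   # no need to specify the number of clusters
print(len(ap.cluster_centers_indices_), "clusters found")

exemplars = X[ap.cluster_centers_indices_]   # exemplars are members of the input set
```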
Presentation:
1. How to speak when AI is listening: Pi-audio analysis → advantage of being able to easily implement the tool myself.
2. ChatGPT was trained at too low a cost, and the bias in the training data was removed by Kenyan workers paid low wages; won't end up like the Amazon one.
3. Jockeys vs. horses: showed and discussed that we should bet on the business (the horse) rather than on the charisma of the CEO.
4. The Loughran-McDonald dictionary is easily accessible and it is kept up to date.
5. Argentina example: hedge fund managers sometimes go to surprising lengths and try to seize assets of countries, such as vessels and planes. When a country defaults, investors sometimes go to extra lengths and try to ....