
Artificial Intelligence
Introduction to Machine Learning
2024
Lecturer
Mindaugas Bernatavičius

Today you will learn

Introduction to Machine Learning
Multiple regression
Polynomial regression
Different sklearn regressors
Multi-output regression
Multivariate regression
Logistic regression
Support Vector Machines
Decision Trees
K-Nearest Neighbors

Multiple regression
We already tried solving a multiple linear regression problem with the forest fire area prediction.
Video explaining in detail what it is: https://www.youtube.com/watch?v=zITIFTsivN8
Let’s try solving another problem.
People have always wanted to know how long they will live. Insurance companies use such predictions to calculate, on aggregate, guaranteed profits from life insurance (insurance in general is priced using mathematical expectations based on the probabilities that certain events will happen).
Implement the following tutorial: https://www.enjoyalgorithms.com/blog/life-expectancy-prediction-using-linear-regression ( https://archive.ph/575il )
Answer: what are the new things you have learned? (lr.score() is R2 - the coefficient of determination; pd.get_dummies() is a one-hot encoder; an ordinal encoder could be preferred so as not to explode the feature set; a scatter plot of residuals for regression; the pycountry_convert lib)
Answer: which model would be more predictive for Lithuania - one trained with only data for Lithuania, or one trained with all the data? Which is better - less specific data but more of it, or more specific data but less of it? Does it depend on the sample size?
Answer: can we determine the most predictive feature? How about a pair of features? Would getting the coefficients help (lr.coef_)? Does scaling have any effect on judging importance based on feature coefficients? How about the feature that positively or negatively correlates with life expectancy? See: https://stats.stackexchange.com/questions/422769/feature-importance-for-linear-regression - if we were able to reduce the number of predictive features we might be able to train the model faster and visualize it better (3D scatter plot).
Implement: detect which single feature gives the most predictive model by applying LinearRegression() to each column and seeing which R2 is the highest. Is it the same feature that correlates with life expectancy the most?
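A minimal sketch of that single-feature search. It assumes a DataFrame df has already been loaded, with numeric feature columns and a target column named "Life expectancy" as in the tutorial dataset (both names are assumptions):

from sklearn.linear_model import LinearRegression

X = df.drop(columns=["Life expectancy"])   # df is assumed to be loaded already
y = df["Life expectancy"]

scores = {}
for col in X.columns:
    lr = LinearRegression().fit(X[[col]], y)
    scores[col] = lr.score(X[[col]], y)    # R2 of the single-feature model

best = max(scores, key=scores.get)
print(best, scores[best])
# Compare with the plain correlations to answer the question above:
print(X.corrwith(y).abs().sort_values(ascending=False).head())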
Implement: would Lasso, ElasticNet or Ridge regression be more accurate? Is the variance between runs bigger than the variance between models?
Implement: don’t forget plotting! Plot the model’s predictions against the data. How can we do it if it’s multivariate? See: https://stats.stackexchange.com/questions/73320/how-to-visualize-a-fitted-multiple-regression-model
Research: is such data available for your country? Maybe better data?

Polynomial regression
What if your data is more complex than a straight line? Surprisingly, you can use a
linear model to fit nonlinear data. A simple way to do this is to add powers of
each feature as new features, then train a linear model on this extended set of
features. This technique is called Polynomial Regression.
We use Scikit-Learn’s PolynomialFeatures() class to transform our training data, adding the square (second-degree polynomial) of each feature in the training set as a new feature.
The PolynomialFeatures(degree=n) constructor accepts the degree of the polynomial function - for cubic functions you need to use 3 (or more); anything below 3 would underfit / have a lot of bias if the true function that explains the data is indeed cubic. The degree n is yet another hyperparameter.
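A minimal sketch on synthetic quadratic data (the data and values are illustrative, not from the slides):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)  # true function is quadratic

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # adds x^2 as a second feature
lr = LinearRegression().fit(X_poly, y)
print(lr.coef_, lr.intercept_)            # approximately [1.0, 0.5] and 2.0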
The added polynomial features increase the size of the model’s internal state, i.e. make it more powerful.
Naming confusion: you will sometimes hear people saying “polynomial linear regression” - this is the correct full name, as the model is linear with respect to the weights that are learned. In fact, polynomial regression is considered a special case of multiple linear regression, see: https://stats.stackexchange.com/questions/92065/why-is-polynomial-regression-considered-a-special-case-of-multiple-linear-regres and https://en.wikipedia.org/wiki/Polynomial_regression
Problems being solved:
https://iq.opengenus.org/polynomial-regression-using-scikit-learn/
https://prutor.ai/implementation-of-polynomial-regression/
https://github.com/karthikeyanthanigai/Polynomial-regression...
https://www.kaggle.com/rahulkadam0909/polynomial
https://www.askpython.com/python/examples/polynomial-regression-in-python
https://codekarim.com/sites/default/files/ML0101EN-Reg-Polynomial-Regression-Co2-py-v1.html
Exercise: take the forest fire dataset we have and apply polynomial regression. See
if a better model is produced. Does the polynomial model take longer to train
(longer to predict)?



Different sklearn regressors
Many ML algorithms are able to train regression models to solve regression
problems.
Scikit offers many “regressors” (regression models) based on many algorithms. Ref: https://scikit-learn.org/stable/supervised_learning.html and https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
We will talk about many of them as we go through all the ML algorithms like KNN, SVM, DecisionTree and so on.
The questions we will try to answer later on (note: you can think about how you would answer these questions yourself):
how can an algorithm be used to create models for both classification and
regression?
which regressor/classifier/clusterizer is best in which circumstance?
The algorithm behind scikit’s LinearRegression() is OLS (ordinary least squares): https://www.kaggle.com/general/22793 - it simply solves an equation. You can’t choose an error function to optimize against using this method.
There is a commonly used stochastic gradient descent algorithm for regression - this one is more tunable (we can change the loss function, max iterations and so on): https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
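A short sketch contrasting the closed-form OLS solver with the tunable SGD regressor on synthetic data (the choice of huber loss is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)          # solves the normal equation directly
sgd = SGDRegressor(loss="huber",            # the loss is configurable here
                   max_iter=5000, tol=1e-4).fit(X, y)
print(ols.coef_, sgd.coef_)                 # both approximate [2.0, -1.0, 0.5]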



Multi-output regression
Multi-output regression involves predicting two or more numerical variables (not [x1, x2, …, xn] → y, but [x1, x2, …, xn] → [y1, …, yz], typically with n > z).
Unlike normal regression, where a single value is predicted for each sample, multi-output regression requires specialized machine learning algorithms that support outputting multiple variables for each prediction.
Simple examples:
Predicting ranges of flat prices (size, room count, kitchen size → min price, expected price, max price).
Predicting a coordinate given an input, e.g. predicting x and y values.
Time series forecasting that involves predicting multiple future values of a given variable.
Object bounding box regression (we will see this later).
In multioutput regression, typically the outputs are dependent upon the input and
upon each other. This means that often the outputs are not independent of each
other and may require a model that predicts both outputs together or each output
contingent upon the other outputs.
Some scikit models are inherently multi-output:
LinearRegression (and related)
KNeighborsRegressor
DecisionTreeRegressor
RandomForestRegressor (and related)
For other models (like LinearSVR) we have two approaches (see the sketch below):
Direct Multioutput - an independent model for each numerical value to be predicted (the MultiOutputRegressor() wrapper). Useful when all ys are independent of each other. This would be determined by tests, or in a crude way - just try both the direct and chained multioutput approaches.
Chained Multioutput - a sequence of dependent models (RegressorChain()). Useful when the ys are dependent on each other. order=[0,1] would first predict the 0-th output, then the 1st output, whereas order=[1,0] would first predict the last output variable and then the first output variable in our test problem. Question: can you come up with an idea of how we can determine the sequence of models to apply (the order parameter)?
For more information and examples: https://machinelearningmastery.com/multi-output-regression-models-with-python/
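A sketch of the two wrapper strategies around a model that is not natively multi-output (LinearSVR here), on a synthetic two-target problem:

from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

X, y = make_regression(n_samples=300, n_features=5, n_targets=2, random_state=0)

direct = MultiOutputRegressor(LinearSVR(max_iter=10000)).fit(X, y)       # one model per target
chained = RegressorChain(LinearSVR(max_iter=10000), order=[0, 1]).fit(X, y)  # target 0 feeds target 1
print(direct.predict(X[:2]))
print(chained.predict(X[:2]))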
In the example we discussed before, if we wanted to predict the min and max flat price (so a range), in the chained case we would use the predicted min price as an input when predicting the max price.
Multivariate regression
Multivariate regression pertains to a problem where multiple dependent variables are predicted from multiple independent variables.
General form: y1, y2, ..., ym = f(x1, x2, ..., xn).
So essentially it’s just a multiple, multi-output regression problem.
See: https://stats.stackexchange.com/questions/2358/explain-the-difference-between-multiple-regression-and-multivariate-regression
Logistic regression
Classification problem - predict a discrete value (indicating membership in a specific class) based on known / historical data (supervised).
We distinguish between 4 classification problems:
Binary (binomial) classification - predict if this set of features describing a thing belongs to class A or B.
Multi-Class (multinomial) classification - predict if this set of features describing a thing belongs to class A or B or C (exclusive) or …
Multi-Label classification - predict if this set of features describing a thing belongs to class A and B and C, or A and C, or …
Imbalanced classification - predict if this set of features (X’s) describing a thing belongs to class A or/and B (...Z) when the dataset is skewed. Of course, imbalance is a completely different criterion for categorizing classification problems, so we can have binary imbalanced and multilabel imbalanced problems, etc.
Examples of binary: comment → positive vs. negative; email → spam vs. not spam; tumor → benign or malignant; transaction → fraudulent or not.
Examples of multi-class: given data about an animal, predict its kind - lion, zebra, dog (and other biological classifications). Credit scoring (1-5): x = salary, asset valuation (collateral), liabilities, salary / monthly liabilities ratio → 1-5 credit risk category.
Examples of multi-label: predict whether a certain comment should have #cooking, #travel, #yolo (stackoverflow, twitter).
There are multiple models to solve this problem, one of the most popular is
logistic regression.

Logistic regression
Logistic Regression (also called Logit Regression) is commonly used to estimate the
probability that an instance belongs to a particular class (what is the probability
that this email is spam). If the estimated probability is greater than 50%, then
the model predicts that the instance belongs to that class (called the positive
class, labeled “1”), and otherwise it predicts that it does not (i.e., it belongs
to the negative class, labeled “0”). This makes it a binary classifier (by
default).
Once again, we need to understand 3 things for each “model”: error, learning (training), predicting (inference).
Predicting:
Like the Linear Regression model, the Logistic Regression model computes a weighted sum of the input features (+ a bias term).
Instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.
The logistic is the sigmoid function - so the output is the sigmoid of the weighted sum: p̂ = σ(θᵀx), where σ(t) = 1 / (1 + e⁻ᵗ).
The results are interpreted as probabilities that the item belongs to class 1 or class 0.
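A plain-numpy sketch of this inference step (the weights are illustrative, not learned from any dataset):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([1.5, -2.0])   # learned weights (illustrative values)
b = 0.3                     # bias term
x = np.array([0.8, 0.1])    # one sample's features

p = sigmoid(w @ x + b)      # estimated probability of the positive class
print(p, int(p > 0.5))      # probability and the resulting hard prediction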
Error / cost:
The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
The cost function - log loss - calculated over the whole training set is the average cost over all training instances: J(θ) = -(1/m) Σ [y log(p̂) + (1 - y) log(1 - p̂)].
Equivalently, we usually maximize the log-likelihood function (LLF) (there are other formulations, like cross-entropy loss) and the process is called “maximum likelihood estimation”.
Learning / training:
The bad news is that there is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no Normal Equation like the one the OLS algorithm solves).
The good news is that this cost function is convex (question: what is a convex function, can you give examples?).
Gradient Descent (or any other optimization algorithm (liblinear, lbfgs)) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough) since the cost function is convex.
We will talk about gradient descent in depth in future slides. For now let’s just define it as an optimization algorithm for minimizing an arbitrary function by calculating the contributions to the total error by each model weight and adjusting the weights in the direction that minimizes the error by some magnitude (number), called the learning rate.
Regularization also applies to logistic regression and can improve model
performance on unseen data by preventing overfitting.

There are a few libraries that implement logistic regression solvers - of course we are going to use scikit.
Always remember that you could implement it with numpy (“from numpy-scratch”) to understand it more deeply.
In scikit we use LogisticRegression() to create the model; otherwise the API is very similar to LinearRegression().

Demo: minimal working example of univariate logistic regression (a sketch follows below).
Demo: simple but more realistic dataset for univariate logistic regression.
Demo: visualization of a simple solution.
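A minimal working sketch of univariate logistic regression in scikit-learn (toy labels chosen by hand for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(10).reshape(-1, 1)               # single feature
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])   # mostly separable labels

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))                           # hard 0/1 predictions
print(clf.predict_proba(X[:3]))                 # class probabilities
print(clf.score(X, y))                          # accuracy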

For a general introduction see these resources:
For more information see this playlist: https://www.youtube.com/playlist?list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
This article: https://realpython.com/logistic-regression-python/
This article: https://machinelearningmastery.com/logistic-regression-for-machine-learning/

The 4th thing - besides error, prediction and learning - that we need to know about ML models is performance evaluation metrics.
Binary classification has four possible types of results (arranged in a confusion matrix):
True negatives: correctly predicted negatives (zeros)
True positives: correctly predicted positives (ones)
False negatives: incorrectly predicted negatives (zeros)
False positives: incorrectly predicted positives (ones)
Let’s understand the confusion matrix better: https://www.youtube.com/watch?v=Kdsp6soqA7o

Sometimes confusion matrices can be very similar, with tradeoffs (ebola vs. finding potential protestors against the Chinese gov. and/or a “vatnik detector” (what features would you use for that?)).
Other indicators of classification “accuracy”:
Ratio of the number of correct predictions (TN + TP) to the total number of predictions (P + N) = (TN + TP) / (P + N) ⇒ ACC (accuracy)
Precision (Positive predictive value: PPV) - ratio of true positives to the sum of true and false positives ⇒ TP / (TP + FP) ⇒ TP / All Positives measured
Negative predictive value (NPV) - ratio of true negatives to all negatives ⇒ TN / (TN + FN) ⇒ TN / All Negatives measured
Sensitivity (or recall, or true positive rate) - ratio of the number of true positives to the number of actual positives ⇒ TP / (TP + FN)
Specificity (or true negative rate) - ratio of the number of true negatives to the number of actual negatives ⇒ TN / (TN + FP)
We choose the indicator that is most important for us based on the problem at hand - for example in medicine, the FN rate might need to be minimized even if the FP rate increases in that case.

Demo: scikit confusion matrix, see: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Demo: scikit confusion matrix visualization (metrics.plot_confusion_matrix is deprecated).
Demo: scikit classification_report, see: https://scikit-learn.org/stable/modules/...
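A sketch of these demos on hand-made labels: the confusion matrix, its display object (ConfusionMatrixDisplay replaces the deprecated metrics.plot_confusion_matrix; plotting requires matplotlib), and the classification report:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
print(cm)                               # scikit's layout for binary: [[TN, FP], [FN, TP]]
ConfusionMatrixDisplay(cm).plot()       # visual version of the same matrix
print(classification_report(y_true, y_pred))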

model.score() is the accuracy. Note that a random number generator will be able to guess correctly approx. 50% of the time. For a 3-class classification problem it will be right about 33% of the time, and so on. So don’t be surprised if a simple model produces 80% accuracy, and also remember that 60% accuracy for binary classification is really not good. In regression problems the minimal benchmark is the population average - the model needs to perform better than just guessing the average.
You need to remember how scikit prints the confusion matrix, as it is not the standard textbook layout!
Let’s figure out the metrics in the classification_report():
See: https://datascience.stackexchange.com/a/64443
See: https://muthu.co/understanding-the-classification-report-in-sklearn/
Best video on the f1 score: https://www.youtube.com/watch?v=8d3JbbSj-I8
Accuracy vs. f1 score: https://deepai.org/machine-learning-glossary-and-terms/f-score - “The accuracy has the advantage that it is very easily interpretable, but the disadvantage that it is not robust when the data is unevenly distributed, or where there is a higher cost associated with a particular type of error.”
Note: there are many more metrics (see the next slides), a few of which we will take a look at in the future. However, it is important to note that they are used in different circumstances (remember the medicine example). To get an intuitive understanding of when to use which, you should play around with different classification scenarios and see how the metrics differ:

Ref: https://www.slideshare.net/RebeccaBilbro/learning-machine-learning-with-yellowbrick
We will discuss the ROC and AUC measurements in the future, when we talk about neural networks and classification.

Ref: https://en.wikipedia.org/wiki/F-score#Diagnostic_testing

Linear Separability
Logistic regression can only reach ACC, F1, selectivity and sensitivity of 1 if the categories are linearly separable (in all dimensions).
The absence of linear separability does not preclude the usage of this technique, but the model will not reach ideal metrics.
There is a range of techniques that allow testing for linear separability, ranging from simple inspection of the data for low-volume datasets and visualization, to advanced techniques like fitting SVMs, perceptrons and linear programming, see: https://www.sciencedirect.com/science/article/abs/pii/S0925231201005872 and https://stats.stackexchange.com/a/94537/162267 . A good article: https://www.tarekatwan.com/index.php/2017/12/methods-for-testing-linear-separability-in-python/
Why is this useful? If the dataset is not linearly separable, logistic regression will not be optimal.
Exercise: create a dataset by hand that would not be linearly separable.
Data in 1D, 2D, 3D can be non linearly separable. If the data is not linearly separable in the N-th dimension, that does not mean that it’s not separable in the (N+1)-th dimension.
Tuning
C - a larger C means weaker regularization, or weaker penalization related to high values of 𝑏₀ and 𝑏₁ (model parameters). Different values of 𝑏₀ and 𝑏₁ imply a change of the logit 𝑓(𝑥), different values of the probabilities 𝑝(𝑥), a different shape of the regression line, and possibly changes in other predicted outputs and classification performance. A high value of C tells the model to give high weight to the training data, and a lower weight to the complexity penalty. A low value tells the model to give more weight to this complexity penalty at the expense of fitting to the training data. Basically, a high C means "trust this training data a lot", while a low value says "this data may not be fully representative of the real world data, so if it's telling you to make a parameter really large, don't listen to it".
Other important tuning parameters: solver, penalty (default=’l2’), tol - https://stats.stackexchange.com/a/255380/162267
Exercise: can you improve the model w/o adjusting the C value? Try different options.
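A short sketch of the effect of C on a toy 1D problem (the data is illustrative): weaker regularization lets the learned weight grow.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (X[:, 0] + np.random.default_rng(0).normal(scale=0.5, size=50) > 0).astype(int)

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    print(C, clf.coef_[0][0], clf.score(X, y))  # larger C -> larger weight, sharper sigmoid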

Solvers

Solvers are just implementations of different optimization algorithms that minimize the error function w.r.t. the model’s learnable parameters (𝑏₀ and 𝑏₁).
See: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
See: https://towardsdatascience.com/dont-sweat-the-solver-stuff-aea7cddc3451
See: https://stackoverflow.com/questions/38640109/...
L1 is synonymous with Lasso regularization (sparse regression and classification), L2 is synonymous with Ridge.
Note: you would need to study each of these solvers to understand how each of them learns the weights, as these are precisely the learning algorithms for logistic regression. Example: Newton's Method is important because it's an iterative process that can approximate solutions to an equation with incredible accuracy. The same is true of gradient descent, but it is not a Newton-type optimization method. More on them: https://www.baeldung.com/cs/gradient-descent-vs-newtons-gradient-descent

Demo:
Implement the following tutorial: https://heartbeat.comet.ml/logistic-regression-in-python-using-scikit-learn-d34e882eebb1
Another explanation of the columns: https://github.com/DaliniPillai/Student_Pass_Fail_Data_Machine_Learning/blob/main/Student_Pass_Fail_Data_Machine_Learning.ipynb
While implementing it, see if you notice anything new / not yet covered.
Add the traditional evaluation metrics to your implementation.
Find the smallest amount of data that would still make the model useful (how would you define a useful vs. useless model)?
Will the model become less useful if you introduce asymmetry between classes (10%, 20% and so on)?
Can you determine which variable is more predictive: self_study or tuition (the problem of feature importance)? How would you do that? What if one of the features was only predictive with 50% accuracy (in a binary classification problem)? Could it be dropped? Is the symmetry of classes somehow related to variable importance?

Exercise - random guesser.
We have mentioned that for binary classification a random number generator guessing the classes would be right around 50% of the time.
When the class count is n, the random guesser should be 100/n % accurate (if the classes are balanced).
Prove that this is correct or incorrect. You do not need a model - the random guesser is a dummy model itself (a stand-in for a real model).
Use either your own generated data or the make_classification() function.
Then check if the same holds for a dataset with a 50/50 distribution (student pass/fail dataset).
Print the confusion matrix - is there anything interesting in it? What should the distribution be, 25% in each section?
IMPORTANT: if you have an imbalanced classification problem these simple calculations should be critically evaluated - if 98% of your samples are “NOT SICK”, then just by always guessing “NOT SICK” you will achieve 98% accuracy. That’s why accuracy is not the only metric we are interested in with classification problems.
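One way to sketch this exercise is with scikit's DummyClassifier, which is exactly a stand-in "random guesser":

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.5, 0.5],
                           random_state=0)
dummy = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)  # guesses uniformly at random
print(dummy.score(X, y))                      # ~0.5 for 2 balanced classes
print(confusion_matrix(y, dummy.predict(X)))  # roughly 25% of samples in each cell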

Multi-class classification
Ref: https://scikit-learn.org/stable/modules/multiclass.html and https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/
Binary classification models (classic logistic regression, SVM) do not support multi-class classification natively and require meta-strategies.
The One-vs-Rest (OvR) strategy splits multi-class classification into one binary classification problem per class: [c1, c2, c3] → [c1 | c2+c3] & [c2 | c1+c3] & [c3 | c1+c2].
One-vs-One (OvO) splits multi-class classification into one binary classification problem per pair of classes: [c1, c2, c3] → [c1 | c2], [c1 | c3], [c2 | c3], i.e. N*(N-1)/2 problems (vs. N for OvR). So if we had [c1, c2, c3, c4] → [c1 | c2], [c1 | c3], [c1 | c4], [c2 | c3], [c2 | c4], [c3 | c4].
There are obvious combinatorial implications when comparing OvR and OvO which can be relevant for performance. But there are also machine-learning-specific implications; for example, OvO can theoretically be less prone to class imbalance effects (consult your preferred framework's documentation for details on whether it accounts for class imbalances in these meta-strategies). E.g.: [10%, 45%, 45%] → [10% | 45%+45%] & [45% | 10%+45%] & [45% | 45%+10%]. Question: could we provide evidence that this is the case?
Note in scikit: for multiclass, the training algorithm uses the one-vs-rest (OvR)
scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss
if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’
option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
Problem 1: figure out how to create a multiclass problem with make_classification() and see how good a model you can train.
Problem 2: OvR is simple to set up with scikit; let’s use the same data and train an OvO model: OneVsOneClassifier(). Create a loop to measure the differences. (A sketch for both problems follows below.)
Problem 3: it is common to use pictures for multiclass classification - we will talk about how pixels in pictures are treated as features (either way they are just numbers indicating color or brightness); for now let’s implement this: https://realpython.com/logistic-regression-python/ - experiment with regularization, and tell what new thing you have learned (StandardScaler() - a scikit object that will standardize all the features in a dataframe. We will fully introduce it later).
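A sketch for problems 1-2: a synthetic 3-class dataset with an explicit OvR wrapper next to an explicit OvO wrapper (parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # 3 binary problems
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # 3*(3-1)/2 = 3 pairwise problems
print(ovr.score(X, y), ovo.score(X, y))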

Multi-label classification
Definition: one item in the dataset can belong to multiple categories at once.
Example problems: a picture with multiple items in it; text classification - a stack overflow question that can be tagged with different categories; transcriptase sequences and corresponding resistance information (see: https://academic.oup.com/bioinformatics/article/29/16/1946/200578 ). Another example: https://stackoverflow.com/questions/61977692/how-to-use-multinomial-logistic-regression-for-multilabel-classification-problem
There are two general ways to solve this problem: 1st - meta-strategies that transform / reduce the problem, used when an algorithm is not natively multi-label adapted; 2nd - models natively supporting multi-label.
3 meta-strategies:
Binary Relevance - independently training one binary classifier for each label, and predicting based on a positive result for a specific class.
Classifier Chain - chained binary classification; the outputs of all previous classifiers (positive or negative for a particular label) are input features to subsequent classifiers.
Label Powerset - creates one binary classifier for every label combination present in the training set: [[A, B], [B, C], [C]] → classifiers for [A, B], [B, C], [C], but not for [A, C], [A], [B].
Adapted, native models: ML-kNN, Decision Trees, Neural Networks. More: https://en.wikipedia.org/wiki/Multi-label_classification#Adapted_algorithms
More: https://en.wikipedia.org/wiki/Multi-label_classification & https://www.analyticsvidhya.com/blog/2017/08/introduction-to-multi-label-classification/
Note on scikit: LogisticRegression() does not natively support multilabel; you can wrap it in the MultiOutputClassifier() wrapper (which fits one classifier per target - the binary relevance method). Another option is ClassifierChain(), which uses the classifier chain methodology. Label powerset is supported by another package - scikit-multilearn's LabelPowerset() wrapper: http://scikit.ml/api/skmultilearn.problem_transform.lp.html - which also offers a BinaryRelevance() wrapper.
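A sketch of the two scikit wrappers from the note above, on synthetic multi-label data:

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain

X, Y = make_multilabel_classification(n_samples=500, n_classes=4, random_state=0)

br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)            # binary relevance
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0).fit(X, Y)  # classifier chain
print(br.predict(X[:2]))     # each row is a 4-label 0/1 vector
print(chain.predict(X[:2]))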
Practice problem 1: https://www.kaggle.com/roccoli/multi-label-classification-with-sklearn/notebook - you can find the data here: https://www.kaggle.com/stackoverflow/statsquestions
Practice problem 2: https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5
Practice problem 3: https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff
Additionally we need to note that there is such a thing as a multi-label confusion matrix: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.multilabel_confusion_matrix.html and https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Imbalanced classification - classifying data when the examples of some class(es) are more abundant than others.
A commonly encountered interview question and real-world job task for all learning algorithms - classical ML and DL.
Imbalanced classes are common: credit card transaction classification into potentially fraudulent and not will be highly imbalanced (few fraudulent, majority good). Disease detection (rare positives). Spam (abundant compared to normal emails). Search keywords and the websites that serve them (a minority are relevant - thank god for search engines). Unfit candidates for jobs.
There are several ways to deal with imbalances in classes; they can be split into two categories, data-based and model-based:
Downsampling the majority (data). Problem: which items to drop? Those that do not help in deciding on the boundary.
Upsampling the minority (simple copying; advanced: SMOTE - Synthetic Minority Oversampling Technique) (data)
Hybrid: SMOTE + undersampling of the majority (non-confusing outlier elimination) (data)
Cost-sensitive models that are configured to adjust for imbalance (class_weight) (model)
Tree-based models (model)
Other techniques: https://elitedatascience.com/imbalanced-classes
Practice: create an MWE with the make_classification() function of an imbalanced dataset and see how it performs w/o tuning for imbalances (a sketch follows at the end of this slide).
Demo: imbalanced classification with credit card data: https://www.kaggle.com/mlg-ulb/creditcardfraud
This dataset is preprocessed with the PCA dimensionality reduction technique. We will cover it in the unsupervised learning section of the course - for now let’s just say that PCA reduces the number of features (x) to however many synthetic features - called principal components - by performing an orthogonal linear transformation of our feature space. This means that the features mean nothing to us and intelligent feature engineering is not possible - we can’t come up with new meaningful features; we could only make synthetic features by performing some mathematical operations on the numbers in each feature column, but they would not be meaningful.
We will use standardization - zero mean and std of 1 - to put all the feature columns on the same scale!
We will use the stratify parameter when splitting the data - it provides a stratified sample, i.e. splits the data in a way that the distributional imbalance is preserved in the test and train datasets. Example: working and non-working males and females.
We will improve our knowledge of k-fold CV. Use it to understand whether the model is prone to overfitting (remember it is also used in tuning and candidate model selection) - if it overfit, the CV loss would be much bigger than the training loss; if they are close, then we don’t need to regularize more. We will also use StratifiedKFold().
Since this is an imbalanced classification problem we need to choose how to handle the imbalance. We will use the sensitivity setting class_weight="balanced". This will cause the model to be much more sensitive to the correct and incorrect individual predictions for the undersampled class when it is training.
Regularization (C=30) is also a very important parameter - trusting the data more when classes are not balanced seems to yield better results.
We will also introduce the PR curve, the ROC curve and the area under the ROC curve - auROC - as performance evaluation metrics, see: https://www.youtube.com/watch?v=4jRBRDbJemM . There is a difference between the PR curve and the ROC curve, explained here: https://stats.stackexchange.com/a/7210/162267 and https://www.biostat.wisc.edu/~page/rocpr.pdf
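The sketch for the make_classification() practice item above: the same logistic regression with and without class weighting on a 95/5 imbalanced dataset (parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)  # stratified split

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, plain.predict(X_te)))     # compare the minority-class recall
print(classification_report(y_te, weighted.predict(X_te)))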

Class imbalance is a problem that affects quite a few ML pipeline steps:
Imbalanced classification
SMOTE - Synthetic Minority Oversampling Technique, ref: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
SMOTE works by selecting examples that are close in the feature space and belong to the same undersampled category, and constructing a new example that is within the bounds determined by the examples close to it. Specifically:
a random example from the minority class is first chosen.
k of the nearest neighbors for that example are found (typically k=5).
a randomly selected neighbor is chosen.
a synthetic example is created at a randomly selected point between the two examples in feature space.
synthetic instances are generated as a convex combination of the two chosen instances a and b.
a convex combination simply means that the new point is within the bounds between a and b, see: https://en.wikipedia.org/wiki/Convex_combination
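A sketch using the imbalanced-learn package (pip install imbalanced-learn), which provides the SMOTE implementation commonly used alongside scikit-learn:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))     # imbalanced: roughly 1800 vs. 200

# k_neighbors is the k from the steps above (default 5)
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # balanced after synthetic oversampling of the minority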
Note: it is suggested in the original SMOTE paper that a combination of SMOTE and under-sampling of the majority performs better than plain under-sampling or plain SMOTE.
Disadvantages:
synthetic examples are created without considering the majority class, possibly resulting in ambiguous examples if there is a strong overlap between the classes. Question: can you explain why this is the case?
the tuning process is more complex when using SMOTE, as you have more parameters to tune. Question: can you check the article at the top and identify how many?
SMOTE can generate points that simply can’t exist in reality. This is important when we have >3D data or discrete data (e.g. number of rooms, label-encoded days of the week).
Variations:
Borderline-SMOTE - oversample difficult instances, providing resolution only where it’s required.
SVM-based SMOTE - use an SVM for generation, not KNN.
Adaptive Synthetic Sampling (ADASYN) - more synthetic data is generated for minority class samples that are harder to learn, compared to those minority samples that are easier to learn.

Imbalanced classification.
SMOTE criticism - many say it does not work. The main ideas are here: https://arxiv.org/abs/2201.08528#
https://datascience.stackexchange.com/questions/106461/why-smote-is-not-used-in-prize-winning-kaggle-solutions/108363#108363
https://stats.stackexchange.com/questions/585173/is-smote-any-good-at-creating-new-points
https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he
Support Vector Machines
General introduction
One of the most popular ML methods for regression and classification, and even outlier detection (unsupervised).
SVMs are well suited for classification of complex (large feature space) small or
medium-sized datasets.
SVMs are considered to be large margin classifiers (LMC) because they choose the solution with the largest margin from all the possible solutions (picture at the top). By definition, on a linearly separable dataset you can draw >= 1 lines separating the categories; LMC models like SVM will choose the line that ensures the widest margin.
LMCs do not care about the data arrangement outside the margin; they pay attention to the items that are the hardest to classify - the ones closest to the other category, the “outliers”. This has an additional implication that SVMs are memory efficient.
The margin (aka “street”) is defined as the largest area between two support vectors.
Support vectors are the >= 2 points from each category closest to the items of the other category in Euclidean space.
The middle of the margin is the hyperplane used as a decision boundary.
SVMs are sensitive to feature scaling - you must at least try scaled and unscaled data. It is almost always a good idea to scale the features when using SVMs. They are also sensitive to changes in the support vectors.
Why the largest margin? Because it means that the decision hyperplane is farthest away from the support vectors and not too close to either class, so there is the lowest probability of misclassification for datapoints that lie even farther away than the support vectors (picture in the next slide).
Disadvantage: SVMs do not output probabilities when predicting. They need to be calculated using five-fold cross-validation, ref: https://scikit-learn.org/stable/modules/svm.html#scores-probabilities
Refs:
General concepts overview: https://www.youtube.com/watch?v=iEQ0e-WLgkQ
See playlist: https://www.youtube.com/playlist?list=PLblh5JKOoLUL3IJ4-yor0HzkqDQ3JmJkc

Compare the black and the red boundaries.

Classification of SVMs, based on behavior
Hard margin classifiers do not permit misclassifications. Two disadvantages: the classes must be linearly separable, and these models are highly sensitive to outliers. With scikit we achieve this by setting the regularization hyperparameter C to a very high value (hundreds or thousands), so that in those rare cases when your data is linearly separable you will be able to have a very precise model.
Soft margin classifiers permit misclassifications. Classes can overlap. This is a more common and more flexible model. With scikit we achieve this by setting the regularization hyperparameter C to a very low value close to 0.

Classification of SVMs, based on data
Linear SVM classifiers are only able to separate classes using a line.
Non-linear SVM classifiers can fit more complex curves separating the classes (like Logistic Regression with polynomial features - Polynomial Logistic Regression).
Obviously the same multi-class, multi-label and imbalanced classification problems are also relevant here and can be grounds for categorizing SVMs.
Scikit and SVMs:
With scikit-learn we can instantiate many SVM algorithms with slight differences. The main categories to think about are linear and non-linear problems (classification).
SVC() and NuSVC() are similar, but accept slightly different sets of parameters and have different mathematical formulations.
LinearSVC() is another (faster than SVC with a linear kernel) implementation with a linear kernel (it does not accept the kernel param - assumed to be linear; it can be used with poly features).
The SVM decision function depends on a subset of the training data - the support vectors - properties of these can be found in the attributes: support_vectors_, support_ and n_support_
SGDClassifier(loss="hinge") - mostly useful for online classification or out-of-core training when the data does not fit into RAM.
Demo: LinearSVC(), SVC(kernel="linear"), and SGDClassifier() on the iris dataset. We will also introduce a function to plot the decision boundary (hyperplane). You can find an example of it here: http://www.cse.chalmers.se/~richajo/dit866/backup_2019/lectures/l6/Nonlinear%20classification%20toy%20example.html
You can see time complexity and other characteristics in the above table.
For non-linearly separable problems we have a few approaches (basically 2 - transform the features with polynomial features, or apply the kernel trick):
Polynomial features - a model-independent approach that can be used with Linear Regression, Logistic Regression and so on, as well as with SVMs. However, at a low polynomial degree it can’t deal with complex datasets, and at a high polynomial degree it creates a huge number of features, making the model slow(er).
Kernel trick - makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them. No combinatorial explosion. This is implemented by the SVC() class, i.e.: SVC(kernel="poly", degree=3). The kernel trick should be remembered!
Proximity landmarks / similarity features approach. This essentially creates projections at some or all datapoints by applying some function to those datapoints, transforming the original features so that they become almost or completely linearly separable. This can be accomplished with scikit using SVC(kernel="rbf", gamma=5, C=0.001) (the Gaussian RBF kernel). gamma acts like a regularization hyperparameter: if your model is overfitting, you should reduce it; if it is underfitting, you should increase it. Note, the kernel trick is still used, just in a different way than with the poly kernel.
Which one to choose? Advanced kernels - there is no need to generate polynomial features if you can use the kernel trick (unless the problem requires it).
Demo: non-linear classification with the make_moons() and make_circles() functions using polynomial features and the kernel trick (see the sketch below).
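A sketch of the make_moons() part of that demo: explicit polynomial features + LinearSVC vs. the kernel trick via SVC (the coef0=1 setting and other values are illustrative choices, not from the slides):

from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC

X, y = make_moons(n_samples=300, noise=0.15, random_state=0)

poly_svm = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                         LinearSVC(C=10, max_iter=10000)).fit(X, y)   # explicit features
kernel_svm = make_pipeline(StandardScaler(),
                           SVC(kernel="poly", degree=3, coef0=1, C=5)).fit(X, y)  # kernel trick
print(poly_svm.score(X, y), kernel_svm.score(X, y))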
Other kernels exist but are used much more rarely. Some kernels are specialized for specific data structures. String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string subsequence kernel or kernels based on the Levenshtein distance).
Regression with SVMs using scikit: SVC - the last C is for classification; SVR is for regression. The theory will be explained in the slides later.

Support Vector Machines
Same 3 things: error, predicting, learning.
Predicting: a linear SVM classifier model predicts the class of a new instance x by simply computing the decision function wᵀx + b = w1x1 + ⋯ + wnxn + b - w is the weight vector, x is the feature vector (xi being the individual features) and b is the bias term (which was the intercept in linear regression). If the result is positive, the predicted class ŷ is the positive class (1), and otherwise it is the negative class (0).
Error: there is no loss function for hard margin SVMs - as misclassifications do not exist, there is no error to minimize. However, with scikit we always use a loss function, as there is no separate hard margin SVM implementation. So in general you should remember that hinge loss is used for each prediction in soft margin SVMs, ref: https://stats.stackexchange.com/questions/74499/what-is-the-loss-function-of-hard-margin-svm . The total loss / cost is the sum over all examples and is very similar to the logistic regression loss - take a look at the term: y * cost1 + (1 - y) * cost0.
Learning: there are several ways to solve the optimization problem and find the support vectors with both hard and soft margins: quadratic programming and/or solving the “dual problem”. Both of these are outside the scope of this course and many, many ML courses. It’s a specific topic requiring a lot of research and advanced mathematics. We mention this for you to know the terminology that would be key in starting the journey to a deeper understanding of SVMs.

SVMs for regression
To use SVMs for regression instead of classification, the trick is to reverse the
objective: instead of trying to fit the largest possible street between two classes
while limiting margin violations, SVM Regression tries to fit as many instances as
possible on the street while limiting margin violations (i.e., instances off the
street).
The width of the street is controlled by a hyperparameter, ϵ.
To tackle nonlinear regression tasks, you can use a kernelized SVM model.
In scikit SVR() is used for regression and supports the kernel trick, so no need to
generate polynomial features (which has benefits even for plotting).
The LinearSVR() class scales linearly with the size of the training set (just like
the LinearSVC() class), while the SVR() class gets much too slow when the training
set grows large (just like the SVC() class).
Ref: https://stats.stackexchange.com/a/287855/162267
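A sketch of SVM regression on synthetic data, with epsilon as the "street width" hyperparameter (all values illustrative):

import numpy as np
from sklearn.svm import SVR, LinearSVR

rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((100, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

linear_svr = LinearSVR(epsilon=0.1, max_iter=10000).fit(X, y)   # scales linearly
kernel_svr = SVR(kernel="rbf", epsilon=0.1, C=10).fit(X, y)     # kernelized, fits the nonlinearity
print(linear_svr.score(X, y), kernel_svr.score(X, y))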

Outlier detection
OneClassSVM() can be used to find abnormal data - outliers (unsupervised).
Ref: https://scikit-learn.org/stable/modules/outlier_detection.html

Decision Trees
Intro
Versatile ML models that can perform both classification and regression tasks, even multioutput tasks.
Very powerful, can easily overfit. Very few assumptions about the data are made (not biased, like linear models are) - a non-parametric model: there is no predetermined number of parameters, unlike for linear models (w1x1 + ... + wnxn + b), where the parameter count is dictated by the feature space of the data. Due to being potentially unbound, it requires somewhat more tuning, especially regularization - restricting the degrees of freedom - when training. Read more about parametric and non-parametric models (also, don’t confuse the models with non-parametric statistics; the concepts are only vaguely similar): https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
Very little data preparation is required: no need for scaling + centering (standardization and normalization). One should not need to encode text data to numbers either, but some implementations require that (scikit and xgboost, see: https://datascience.stackexchange.com/a/52103 )
Very explainable and easy to understand (at least how the prediction is made) - considered to be “white box” models (unlike Random Forests and Neural Networks, considered “black box”). They provide nice, simple classification rules that can even be applied manually if need be (e.g., for flower classification). You can use one as a decision making framework or similar.
A fundamental component of random forests (amongst the most powerful ML algorithms available).
Due to all these advantages they are among the most liked ML algorithms (needs confirmation) and also among the most popular: https://customerthink.com/top-machine-learning-algorithms-frameworks-tools-and-products-used-by-data-scientists/

Classification
How does it work? We use DecisionTreeClassifier() with scikit, which when fitted creates a tree-like structure of chained decisions based on features from the dataset. You can think of them as a bunch of if-else statements whose order and conditions are learned automatically.
Demo: iris dataset classification and tree visualization: https://mljar.com/blog/visualize-decision-tree/
Reading the decision tree: start at the root node; if the petal length is <= 2.45 it is “setosa”, if not you interpret the right subtree. You continue with the questions until you reach a leaf node - that is your category for this particular example from the sample (class=virginica or similar). A node’s samples attribute counts how many training instances it applies to. The value attribute tells you how many training instances of each class this node applies to. The gini attribute measures its impurity: a node is “pure” (gini=0) if all training instances it applies to belong to the same class - this is the gini impurity measure.
To estimate the probability that an instance belongs to a particular class k: find the leaf node for your example and check the class values, [0, 49, 5] → [0/54, 49/54, 5/54] → [0%, 90.7%, 9.3%], or use the predict_proba() function.
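A sketch of the iris demo: export_text() prints the learned rules in exactly the "reading the tree" form described above, and predict_proba() gives the per-class leaf probabilities:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))  # readable if-else rules
print(tree.predict_proba(iris.data[:1]))                    # class probabilities for one sample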
Demo: drawing the decision boundary, aka the decision surface. You can find examples: https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html and https://github.com/ageron/handson-ml2/blob/master/06_decision_trees.ipynb

Learning, Error and Prediction:
Learning: the CART algorithm - Classification and Regression Tree. It produces only binary trees: nonleaf nodes always have two children (i.e., questions only have yes/no answers). Other algorithms such as ID3 can produce decision trees with nodes with >2 children. CART works by first splitting the training set into two subsets using a single feature k and a threshold / value t of that feature (e.g., “petal length ≤ 2.45 cm”). How does it choose k and t? It searches for the pair (k, t) that produces the purest subsets weighted by their size - that is the cost function the algorithm tries to minimize. Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters control additional stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes). Additional details: https://www.youtube.com/watch?v=7VeUPuFGJHk
As you can see, the CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each subsequent level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a solution that’s reasonably good but not guaranteed to be optimal. Unfortunately, finding the optimal tree is known to be an NP-Complete problem: it requires O(exp(m)) time, making the problem intractable even for small training sets. This is why we must settle for a “reasonably good” solution. You can also use entropy, not gini impurity, to split the tree, albeit they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees (tree balance is important for inference time complexity - O(log2 n) vs. O(n) - so in rare cases when you want to predict fast you might look into entropy-based node splitting).
Error: purest-subset calculation - the loss function should evaluate a split based on the proportion of data points belonging to each class before and after the split. For regression trees: sum of squared errors, see https://stats.stackexchange.com/a/512343/162267
Predicting: just traverse the tree with the data point you want to classify.

Computational Complexity
Inference: O(log2 m), where m is the node count. So, very scalable, as decision trees are in general quite well balanced.
Training: the algorithm compares all features (or fewer if max_features is set) on all samples at each node, so O(n x m log2 m). For small training sets (less than a few thousand instances), Scikit-Learn could speed up training by presorting the data (set presort=True; note that this parameter was removed in newer scikit-learn versions). Doing that slows down training for larger training sets.
Tuning - increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
max_depth - easy to understand; does not allow splitting indefinitely. Don’t want to overfit - restrict the tree; want more power - allow the tree to grow.
min_samples_split - the minimum number of samples a node must have before it can be split.
min_samples_leaf - the minimum number of samples a leaf node must have.
min_weight_fraction_leaf - the same, expressed as a fraction of the total number of weighted instances.
max_leaf_nodes - the maximum number of leaf nodes.
max_features - the maximum number of features that are evaluated for splitting at each node.
Remember: decision trees are highly stochastic - random - so if the random seed parameter is not specified you will get different trees. They are also stochastic w.r.t. the hyperparameters.

Regression
How does it work? The tree predicts a value per leaf node. The predicted value for each region is always the average target value of the instances in that region. The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE. Also see: https://www.youtube.com/watch?v=g9c66TUylZ4
Demo: California housing dataset.
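A sketch of that demo (fetch_california_housing() downloads the dataset on first use; max_depth=6 is an illustrative regularization choice):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

housing = fetch_california_housing()
X_tr, X_te, y_tr, y_te = train_test_split(housing.data, housing.target, random_state=0)

reg = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_tr, y_tr)  # regularized via max_depth
print(reg.score(X_te, y_te))   # R2 on held-out data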

Limitations:
Sensitive to rotations in the data, since the decision boundaries are orthogonal. It is advisable to use PCA when a simple linearly separable problem does not separate due to rotation.
Sensitive to small variations in the training data - remove a single datapoint that was responsible for a single split and you will get a different tree.
Greedy algorithm limitations.

Advanced usecases:
Decision trees and causal inference: https://www.youtube.com/watch?v=IEj8uzIG7C8&t=1s
Uplift modeling: https://www.aboutwayfair.com/tech-innovation/modeling-uplift-directly-uplift-decision-tree-with-kl-divergence-and-euclidean-distance-as-splitting-criteria
Uplift modeling: https://link.springer.com/content/pdf/10.1007/s10115-011-0434-0.pdf
Uplift modeling: https://proceedings.mlr.press/v67/gutierrez17a/gutierrez17a.pdf
… https://www.diva-portal.org/smash/get/diva2:1328437/FULLTEXT01.pdf

Exporting the learned decision tree as efficient code (learned rule extraction):
https://mljar.com/blog/extract-rules-decision-tree/
https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
https://github.com/papkov/DecisionTreeToCpp
K-Nearest Neighbors
K-nearest neighbors (k-NN or KNN) is a non-parametric (again, the number of parameters learned (not hyperparameters) grows with the training data and is not set or fixed in place), supervised ML algorithm, first developed in 1951 and later expanded on.
KNN belongs to a subcategory of nonparametric models - instance-based. Models based on instance-based learning are characterized by memorizing the training dataset, and lazy learning is a special case of instance-based learning that is associated with zero cost during the learning process.
Analogy: can be remembered by the proverb “tell me who your friends are and I will
tell you who you are”.
Can be used for classification and regression tasks and in scikit it is inherently
multiclass and multilabel.
The output depends on whether k-NN is used for classification or regression:
In k-NN classification output is class membership by plurality vote of its
neighbors, with object being assigned to the class most common among its k nearest
neighbors. If k = 1, then the object is simply assigned to the class of that single
nearest neighbor.
In k-NN regression the output is the average of the values of the k nearest neighbors.
Once again: learning, error and inference:
learning: there is no explicit training / learning step or process.
error: no error function, just distance metric between neighbors and k - how many
get a vote.
inference: find k-nearest neighbors of the sample that we want to classify, by
distance. Assign the class label by majority vote or value by averaging in
regression tasks.
See: https://www.youtube.com/watch?v=HVXime0nQeI
Demo: simple KNN model with scikit.
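A minimal sketch of such a demo (the iris dataset and k=5 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit() does not really "train" anything - it just stores the data;
# all the work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))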
Introduction to Machine Learning
K-Nearest Neighbors
Advantages:
No training step or process.
Simple to understand and tune
Easily implemented from scratch: https://www.youtube.com/watch?v=ngLyX54e1LU
Main advantage of memory-based/instance-based approach - immediately adapts as we
collect new training data.
No inertia in the training process because there is no training process (no
partial_fit(), warm_start, etc, considerations).
Disadvantages:
If features represent different physical units / come in different scales,
normalizing the training data can improve accuracy dramatically.
Computational complexity for classifying new samples grows linearly with the number
of samples (assuming constant feature count, in general O(nm)). But the algorithm
can be implemented using efficient data structures such as KD-trees.
Space complexity: can't discard training samples since no training step - storage
space can become a challenge for large datasets. Other models that have internal
state that represent the data they have been trained on have a compressed version
of knowledge about that data. KNN does not have an internal state so the full
dataset needs to be always available. Other models only require a new X sample to
classify X, but KNN requires the entire dataset on each prediction.
Sensitive to useless features. If you are using a KNN based pipeline, it would be
useful to implement feature importance check step or at least combinatorially try
to pass different feature sets to KNN and see what works (you can use decision
trees, random forests for the purpose if not specific techniques - you will only
need to use them once to determine which features are most important, after that
you would only need the KNN model).
Suggestion for mini-research: all of these statements can be experimentally
verified.
Suggestion for mini-research: feature-importance cross reference. Can you prove or
disprove that all models will see the same features as important or not?
Introduction to Machine Learning
K-Nearest Neighbors
Tuning
KNN is sensitive to the local structure of the data. Useful technique: assign
weights to the contributions of the neighbors, so that the nearer neighbors
contribute more to the average than the more distant ones. For example, a common
weighting scheme consists in giving each neighbor a weight of 1/d (d is distance to
the neighbor). Implemented in scikit with weights: {'uniform', 'distance',
callable}. If the neighbors have similar distances, the algorithm will choose
the class label that comes first in the training dataset.
How to choose K? Start with a small number, then try and see. The range is usually
between 3 and 7, but can grow into the 10's or more.
How to choose the distance metric? Euclidean is the common choice; Minkowski is the
same as Euclidean when p=2, and Manhattan when p=1. Two more well-known distance
metrics are cosine and Hamming, in addition to the previous ones (and some less
known ones, like Mahalanobis). See: https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/
How to choose the algorithm?
Brute Force may be the most accurate method due to the consideration of all data
points.
For small data sets, Brute Force is justifiable; however, as the data grows, the
KD-tree / Ball-tree are better alternatives due to their speed and efficiency, see more:
https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4 and a nice explanation of kd-
tree: https://www.youtube.com/watch?v=Glp7THUpGow .
Note that you can use "auto", which will attempt to select the appropriate algorithm
automatically - this is described here:
https://scikit-learn.org/stable/modules/neighbors.html#choice-of-nearest-neighbors-
algorithm . This is a decision based on dimensions, not on dataset size (D < 15).
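A sketch combining the tuning knobs above in one constructor call (the values are illustrative, not recommendations):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=7,            # K: start small, then try and see
    weights="distance",       # nearer neighbors get weight 1/d
    metric="minkowski", p=1,  # p=1 -> Manhattan, p=2 -> Euclidean
    algorithm="kd_tree",      # or "ball_tree", "brute", "auto"
)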

Introduction to Machine Learning


K-Nearest Neighbors
Some more advanced topics:
Neighborhood Components Analysis (NCA) is a distance metric learning algorithm
that can improve KNN by learning a lower-dimensional representation of the data
we use for classification (it will become clearer what it does after we learn PCA).
See: https://scikit-learn.org/stable/modules/neighbors.html#neighborhood-
components-analysis
Demo: let’s see if we can improve classification with NCA (a simple demo is in the
scikit docs).
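A sketch along the lines of the scikit docs demo (the digits dataset, n_components and k are illustrative choices):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# NCA learns a linear transformation to a lower-dimensional space
# in which KNN classification tends to work better.
nca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
]).fit(X_train, y_train)
print(nca_knn.score(X_test, y_test))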
MLkNN - if you have a multilabel problem, you can use scikit … or you can use an
optimized version from scikit-multilearn:
http://scikit.ml/api/skmultilearn.adapt.mlknn.html . If you want to know a bit more
about the differences between the two, you can take a look here:
https://stackoverflow.com/questions/57901145/multilabel-classification-ml-knn-vs-
knn
Introduction to Machine Learning
GridSearch
GridSearch applies to all ML models that can exhibit performance differences based
on how well their hyperparameters are chosen during the training process. It
is a hyperparameter tuning technique and is often used as a part of the AutoML
pipeline. AutoML refers to the use of machine learning algorithms to automate
various stages of the machine learning process, including data preparation, feature
engineering, model selection, and hyperparameter tuning.
Scikit offers the GridSearchCV class that accepts a model and a dictionary of parameter
names mapped to collections of values to try. For example, { 'n_estimators': [10, 100,
1000] } tells scikit to perform cross-validated training of the model
with n_estimators set to 10, 100 and 1000 respectively. There are a few things
to know about GridSearchCV:
it evaluates all possible combinations of hyperparameters in the dictionary, so
{ 'n_estimators': [3, 10], 'max_features': [2, 4, 8] } gives 2 × 3 = 6 combinations,
and if the cross validation size is 5 then there will be 5 * 6 = 30 fits. Be mindful
of this for obvious performance reasons.
it is/can be done in parallel with n_jobs
you can use ranges, like: np.arange(1, 20) to generate parameters to try
if you specify the parameter grid as a list of dictionaries, these are treated as
separate experiment sets or groups,
if you have a pipeline you can specify parameters for each estimator by prepending
its name: ('knn', KNeighborsC...), → {'knn__n_neighbors': [3, 5, 9, 13]}
gs.best_params_ - returns which params were best and gs.best_estimator_ returns the
constructor signature of best estimator with parameters.
if you get values at the edge (min/max) of the grid you need to retry with bigger /
smaller params, as it is possible that the optimal parameter lies beyond the edge.
In general you should notice, after some exposure to ML/DL models, that a change in a
hyperparameter leads to gradual improvement and then gradual decline in
performance; it is often not sudden. Hence it might be useful to explore the
hyperparameter values around the best values found by the GridSearch if they were at the edge.
it can be used to optimize data transformers, like scalers, polynomial feature
generators and so on.
NOTE: GridSearchCV finds not only the best hyperparameters for the model, but also
for pipeline steps like data transformers. So it is not strictly correct to say
that this technique is only useful for model hyperparameter search.
Demo: finding the right parameters with GridSearch:
https://machinelearningknowledge.ai/knn-classifier-in-sklearn-using-gridsearchcv-
with-example/
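A minimal GridSearchCV sketch over a KNN pipeline (the step name "knn" and the parameter values are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

param_grid = {
    "knn__n_neighbors": np.arange(1, 20),     # a range as a parameter list
    "knn__weights": ["uniform", "distance"],  # 19 * 2 = 38 combinations
}
gs = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1).fit(X, y)
print(gs.best_params_)     # the best hyperparameter combination
print(gs.best_estimator_)  # the refitted estimator with those parameters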
In addition to GridSearch you should take a look at RandomizedSearchCV and
HalvingRandomSearchCV

Introduction to Machine Learning


Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Introduction to Machine Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1ohujjVgHtgl2CgZFptjHEl0Rb87pisOZ.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Linear Algebra with Numpy
01
02
Linear Algebra Roadmap
Python Crash Course
00
Linear Algebra Intro
03
Linear Algebra in Computer Graphics
04
Linear Algebra in ML / DL
05
06
07
Linear Algebra in SE / CS
//
//
Linear Algebra Intro
Linear algebra touches on many fields in modern science and engineering and is used
in many disciplines and human endeavours.
Linear algebra is often misconceived as being merely about matrices,
vectors, eigenvectors and calculations on them; however, the broader picture of
how these can be used is much more interesting and important.
Definition: L.A. is a branch of mathematics concerning linear equations and linear
maps, their representations and linear transformations of them. Ref:
https://en.wikipedia.org/wiki/Linear_algebra
Linear algebra can be understood from two perspectives: algebraic and geometric.
Ref: https://www.youtube.com/watch?v=kjBOesZCoqc
Ref: https://www.youtube.com/watch?v=ZKUqtErZCiU
… we also can remember the difference between conceptual math and
numeric/computational math.

Recommended courses:
https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
https://www.youtube.com/playlist?list=PLHXZ9OQGMqxfUl0tcqPNTJsb7R6BqSLo6
https://www.youtube.com/playlist?list=PLybg94GvOJ9En46TNCXL2n6SiqRc_iMB8
Python Crash Course
Linear Algebra Intro
Usage in general. In his classical book on the topic titled “Introduction to Linear
Algebra“, Gilbert Strang provides a chapter dedicated to the applications of linear
algebra. In it, he demonstrates specific mathematical tools rooted in linear
algebra. Briefly they are:
Matrices in Mechanical Engineering, such as a line of springs.
Graphs and Networks - analyzing networks with adjacency matrices (computer science)
Markov Matrices, Population, and Economics, such as population growth.
Linear Programming, the simplex optimization method.
Fourier Series: Linear Algebra for functions, used widely in signal processing.
Linear Algebra for statistics and probability, such as least squares (OLS) for
regression.
Computer Graphics, such as the various translations, rescalings and rotations of
images.
Physics: Einstein's relativity with tensors.
Electronics and circuits.
Civil Engineering (?), Epidemiology (?)
Usage in datascience. Ref: https://www.youtube.com/watch?v=X0HXnHKPXSo
Code vectorization
Image recognition / image filters (edge detection, bluring)
Dimensionality reduction
Graph theory
Linear algebra is at the heart of google “50billion-dollars-just-sitting-not-
invested-in-their-bank-account” business: https://www.youtube.com/watch?
v=qxEkY8OScYY
Python Crash Course
Linear Algebra with Numpy
This is all great of course, but how can numpy help? Numpy is very efficient at
performing all the operations required in these applications, so if you work in any
of these areas numpy can be used. But what if you do not yet have a job where you can
apply these skills? You can still use numpy alongside any online tutorial to verify
its results, extend the examples shown and learn linear algebra in a more
entertaining way - not just writing down matrices with pen and paper.
Python Crash Course
Linear Algebra with Numpy
Numpy provides powerful methods for manipulating multidimensional arrays which is
exactly what is needed for L.A.
LA applications with numpy:
https://colab.research.google.com/drive/1k9OVrpvNnxt9JDQBdfZzDlk12QohbAQM?
usp=sharing
Specific functions from LA package:
https://github.com/derekbanas/NumPy-Tutorial/blob/master/NumPy%20Tut.ipynb
We will implement most of the practical demonstrations of matrices with numpy from
this video: https://www.youtube.com/watch?v=rowWM-MijXU

Additional resources:
https://www.youtube.com/watch?v=kZwSqZuBMGg
https://www.youtube.com/watch?v=tVQZvJwi-ec
Python Crash Course
Linear Algebra Roadmap (TBD)
Research on all the practical usages of LA. How is it used in Google’s Page Rank
algorithm? How is it used in computer networking? Biology? Computer graphics? Prime
/ trick your brain to understand that this is important. You don’t have to learn
any subject the way you learned it at school.
Matrices as transformations
Dot products as repeated transformations
Matrix multiplication
Determinants …
Eigenvectors and Eigenvalues and how they are used in dimensionality reduction.
Eigenvector of a matrix is a vector that is only scaled by the transformation
represented by that matrix.
Eigenvalue is the scaling factor by which the transformation scales the eigenvector
(2 would be 2x).
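A small numpy check of these definitions: A @ v should equal λ * v for each eigenpair (the matrix is an illustrative choice):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):  # columns are the eigenvectors
    print(np.allclose(A @ v, lam * v))           # True: only scaled, not rotated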

Python Crash Course
Linear Algebra in Computer Graphics
Ref: https://www.youtube.com/watch?v=vQ60rFwh2ig
DEMO: A few simple object rotations implemented using Pygame.
Additional research: recreate the same demo with FPS display - compare FPS when
numpy is used and when it is not used. When single matrix multiplication is used
and when it is not used.
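A sketch of the rotation idea behind that demo - rotating 2D points with a single matrix multiplication in numpy:

import numpy as np

theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2D rotation matrix

points = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one point per row
rotated = points @ R.T  # rotate all points in one vectorized operation
print(rotated)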
Python Crash Course
Linear Algebra in ML / DL
PCA / Dimensionality reduction
Synaptic weight calculations / Forward propagation algorithms.
Backprop
TBD
Python Crash Course
Linear Algebra in SE / CS
Googles Pagerank Algorithm
Python Crash Course
Summary
The basis of linear algebra is vectors [1, 2, … , n]. Physics, maths and CS each define them slightly differently.
Vectorization of features is the basis of ML/DL (most of ML and ~all DL models
accept numbers).
Vector addition: physical intuition - two forces acting on an object together: [a1
+ b1 , a2 + b2]
Matrices - transformations (of space of vectors).
[ 1 5 ] [5]
[ 5 6 ] [3]
Matrix multiplication - is a combination of transformations. Translation @ rotation
→ M
[ 7 5 ] | [ 1 5 ] [5]
[ 5 6 ] | [ 5 6 ] [3]
Dot product interpretation - similarity metric of two vectors. Input two vectors
and you get a scalar indicating how closely two vectors align, in terms of the
directions they point.
Dot product is not the same as matrix multiplication, but there is a special case:
https://math.stackexchange.com/a/4189120/438982
Determinant: when you apply a matrix to a space by how much is the area of the
space scaled.
Eigenvector is the vector that is only scaled by a matrix (application: PCA).
Prerequisites: Determinant, Matrices as transformations, Vectors, … .
Eigenvalue is the scaling factor by which the matrix scales its eigenvector.
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 11rPGxYZsSW9pbwTaNeUkPW7mjdT4I7bH.pptx ---


Artificial Intelligence
Generative Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Process
01
02
Implementation
Generative Deep Learning
00
Introduction
Further explorations
03
DeepDream is an artistic image modification technique that uses the representations
learned by convolutional neural networks.
It was first released by Google in the summer of 2015, as an implementation written
using the Caffe deep-learning library (this was several months before the first
public release of TensorFlow).
It quickly became an internet sensation thanks to the trippy pictures it could
generate, full of algorithmic pareidolia artifacts, bird feathers, and dog eyes—a
byproduct of the fact that the DeepDream convnet was trained on ImageNet, where dog
breeds and bird species are vastly overrepresented. More:
https://en.wikipedia.org/wiki/DeepDream
Analogy: DeepDream is sometimes compared to a child watching clouds and trying
to interpret random shapes: DeepDream over-interprets and enhances the patterns it
sees in an image.
This process was dubbed "Inceptionism" (a reference to InceptionNet, and the movie
Inception).
Original google blog: https://ai.googleblog.com/2015/07/deepdream-code-example-for-
visualizing.html
Introduction
Generative Deep Learning
The DeepDream algorithm is almost identical to the convnet filter-visualization
technique, consisting of running a convnet in reverse: doing gradient ascent on the
input to the convnet in order to maximize the activation of a specific filter in an
upper layer of the convnet.
DeepDream uses this same idea, with a few simple differences:
maximize the activation of entire layers rather than that of a specific filter,
thus mixing together visualizations of large numbers of features at once.
start not from blank, slightly noisy input, but rather from an existing image—thus
the resulting effects latch on to preexisting visual patterns, distorting elements
of the image in an artistic fashion.
input images are processed at different scales (called octaves), which improves the
quality of the visualizations. They have a kind of fractal-ish pattern to them where
the pattern reappears at different levels (more on fractals:
http://jasser.nl/about/fractal-geometry/ - fractals are structures that have self-
similarity at different scales, an example of recursive generation in the physical
world)
Introduction
Generative Deep Learning
In short:
forwarding an image through the network
calculating the gradient of the image with respect to the activations of
particular layers of the CNN.
image is then modified to increase these activations, enhancing the patterns seen
by the network, and resulting in a dream-like image.
this approach would work, but an improvement that was discovered is that we can
perform the aforementioned image modification at different scales called octaves.
This will allow patterns generated at smaller scales to be incorporated into
patterns at higher scales and filled in with additional detail (detail
reinjection).
Process
Generative Deep Learning
Step by step:
Load the original image - anything you want.
Define a number of processing scales and other parameters - these are the "octaves"
we mentioned before, from smallest to largest.
Define layer settings - layer_name : its importance (from the pretrained NN -
you need to know the architecture well to do it).
Import model - inception_v3.InceptionV3(weights="imagenet",
include_top=False)
Create the output dict: - (layer.name, layer.output) for layer in
[model.get_layer(name) for name in layer_settings.keys()]
Define Deep Dream loss
Set up the gradient ascent loop for one octave
Compute the octaves/shapes
Resize the original image to the smallest scale
Run the training loop, iterating over different octaves/shapes
For every scale, starting with the smallest (i.e. current one):
Run gradient ascent (we are running several hundred iterations for each shape)
Upscale image to the next scale
Calculate lost detail: lost_detail = same_size_original -
upscaled_shrunk_original_img
Reinject the detail that was lost at upscaling time: img += lost_detail
Stop when we are back to the original size.
Loss has to increase during training, see image on the side. Why does this work?
The only way the loss can increase is by systemic distortion of the image.
Because it is systemic, it is not just noise - it is something that we want to
achieve.
There is no loss that will tell you that the image is distorted in a desirable way;
you will need to evaluate that yourself.
Process
Generative Deep Learning
Question for discussion - could we create a loss function (or just a validation
function to validate the intermediate deep dreams) that would detect random noise
vs. desired deep dream artifacts?
In the field of signal processing there are functions that can detect noise (and
its type: white, brown, etc.). Hypothesis: if we were to combine the noise
detection function and the loss function we should be able to detect that the
dreaming is going in the right direction - if the loss increases but the amount of
random noise does not!
In general the noise should increase even if the deep dream is very pleasant.
Although there might be some upper bounds.
Process
Generative Deep Learning
Any pre-trained network (from transfer learning) would work for this.
You can use all available in keras: https://keras.io/api/applications/
VGG16, VGG19, Xception, ResNet50, and so on, all available with weights pretrained
on ImageNet. You can implement DeepDream with any of them, but your base model of
choice will naturally affect your visualizations, because different architectures
result in different learned features.
The convnet used in the original DeepDream release was an Inception model, and in
practice, Inception is known to produce nice-looking DeepDreams, so we’ll use the
Inception V3 model that comes with Keras.
The InceptionV3 architecture is quite large. For DeepDream, the layers of interest
are those where the convolutions are concatenated. There are 11 of these layers in
InceptionV3, named 'mixed0' through 'mixed10'. Using different layers will result in
different dream-like images.
Deeper layers respond to higher-level features (such as eyes and faces), while
earlier layers respond to simpler features (such as edges, shapes, and textures) -
they also train slower.
Implementation
Generative Deep Learning
Lower layers result in geometric patterns, whereas higher layers result in visuals
in which you can recognize some classes from ImageNet (for example, birds or dogs).
We will define which layers we will use (we have to look up the InceptionV3 model
layer names, and we can even become experts in the model itself by studying its
architecture) and we define their contributions!
Implementation
Generative Deep Learning
List of more original concepts for this topic:
Octave / rescaling mechanism
Border effect avoidance (tiling and other means)
During detail reinjection the dreamt-up features are not destroyed
Implementation
Generative Deep Learning
As with the deep dreaming process, we will not train the network, just update the
generated image.
An additional counter-intuitive idea is gradient ascent (not descent) - we maximize
the loss, see:
https://www.tensorflow.org/tutorials/generative/deepdream#calculate_loss … in
short: once we have calculated the loss for the chosen layers, all that is left is
to calculate the gradients with respect to the image and add them to the original
image.
Adding the gradients to the image enhances the patterns seen by the network. At
each step, you will have created an image that increasingly excites the activations
of certain layers in the network.
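A minimal gradient-ascent step modeled on the TF tutorial linked above (compute_loss is a placeholder for the layer-activation loss defined earlier; the normalization and learning rate follow the tutorial's spirit):

import tensorflow as tf

def gradient_ascent_step(img, compute_loss, learning_rate=0.01):
    with tf.GradientTape() as tape:
        tape.watch(img)            # img is a plain tensor, not a variable
        loss = compute_loss(img)   # e.g. mean activation of the chosen layers
    grads = tape.gradient(loss, img)
    grads /= tf.math.reduce_std(grads) + 1e-8  # normalize the gradient scale
    return loss, img + learning_rate * grads   # ascent: add, don't subtract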
Implementation
Generative Deep Learning
Our current model produces a "burned" image coloring; when run for many iterations
the patterns are quite interesting, but the discoloration does not allow the
patterns to be meaningful. Need to TSHOOT why that is. Possible scenarios: big
gradients that make the pixels go to 0 (white); noise introduced for some reason.
There are many implementations on the internet: blogs, articles, videos. Find if
those implementations use the same process. Compare and analyze.
Implement TF version: https://www.tensorflow.org/tutorials/generative/deepdream -
create deep dreams from your own photos (your city, your house, your pets, etc.)
The TF version mentioned above implements a more advanced tile-based approach. You
can try that as part of the further explorations option - what are the benefits of
it.
How is detail reinjection performed, create an MWE.
Implement gradient ascent.
TSHOOT localized burned pixels.
Further explorations
Generative Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Generative Deep Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1geg1eeR4L0ECnxOHDTXlqWb01eivBo5B.pptx ---


Artificial Intelligence
AutoML
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
AutoML tools
01
02
Auto-Sklearn
AutoML
00
What is AutoML
Google Automl Tables
03
AutoML (Automated Machine Learning) is the process of automating the end-to-end
process of applying machine learning to real-world problems. Because we think about
ML (and all of its branches) as automation - this is meta-automation, or automation
of automation. AutoML involves automating tasks such as: data pre-processing,
feature selection/engineering, model selection, hyperparameter tuning, model
evaluation, causal discovery. Great presentation: https://www.youtube.com/watch?
v=SEwxvjfxxmE

CASH - Combined Algorithm Selection and Hyperparameter optimization - vs. AutoML
(strong vs. weak AutoML). Auto-sklearn is more CASH (according to some).

AutoML is designed to help non-experts in machine learning to build high-quality
models without requiring extensive knowledge of the underlying algorithms or
programming, with low-code solutions. AutoML has become popular in recent years as a
way to democratize machine learning and make it more accessible to a wider audience
(so that a business person would be able to just feed his data and get some
results). It has been used in a variety of applications, including image
classification, natural language processing, and time series forecasting.

AutoML tools typically work by searching for the best combination of data pre-
processing techniques, feature selection methods, algorithms, and hyperparameters
for a given machine learning problem. The search can be performed using techniques
such as grid search, random search, Bayesian optimization, and genetic algorithms.
The objective is to find a model that performs well on a validation dataset while
minimizing the risk of overfitting.

Areas of research:
NAS - Automated Neural Architecture Search (notably Auto-Pytorch, AutoKeras).
HPO - Automated Hyperparameter Optimization
Meta Learning - learning to learn, the science of systematically observing how
different machine learning approaches perform on a wide range of learning tasks,
and then learning from this experience, or meta-data, to learn new tasks much
faster than otherwise possible.
Multi-Objective DL - multiple objectives like accuracy and performance to be
determined automatically.
What is AutoML
AutoML
//
AutoML tools
AutoML

//
AutoML tools
AutoML

Ref: https://www.automl.org/automl/auto-sklearn/
Ref: https://arxiv.org/abs/2007.04074#
Ref: https://neptune.ai/blog/a-quickstart-guide-to-auto-sklearn-automl-for-machine-
learning-practitioners
Classifiers:
https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/
components/classification
Regressors:
https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/
components/regression
Data preprocessors:
https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/
components/data_preprocessing
Feature preprocessors:
https://github.com/automl/auto-sklearn/tree/master/autosklearn/pipeline/
components/feature_preprocessing
When learning, you can try to find better model-hyperparameter combinations than the
AutoML library does. This is not that easy. How much of a time restriction do you
have to impose on the AutoML meta-model to still beat it easily?
Auto-Sklearn
AutoML
import autosklearn.classification
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y,
random_state=1)

# let the meta-model search for 60 seconds
model =
autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=60)
model.fit(X_train, y_train)

print("Accuracy: ", sklearn.metrics.accuracy_score(y_test, model.predict(X_test)))


AutoSklearnClassifier() parameters for limiting the search space.
Auto-Sklearn
AutoML
Short intro: https://www.youtube.com/watch?v=tWbiOuHae0c
Short demo: https://www.youtube.com/watch?v=7Z7-L1nIoyI
Deprecated in 2022 due to low adoption rate.
Vertex AI for Tabular is the replacement:
https://cloud.google.com/vertex-ai/docs/beginner/beginners-guide#tabular
End-2-end demo (with deployment and prediction with curl):
https://www.youtube.com/watch?v=ZZadMQTKJXk
Google Automl Tables
AutoML
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

AutoML
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1CyLXQpK42HVFF7FmMzGqWie0p7rWLtXS.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Numpy basics
01
02
Exercises and Project Work
Python Crash Course
00
Numpy definition
Numpy definition
NumPy is a scientific computing library containing many mathematical, array and
string functions that are very useful and much faster than pure Python.
Along with all the basic math functions you'll also find functions for linear algebra,
statistics, random number generation (distributions), Fourier transforms, etc.
It is centered around single and multi dimensional array objects.
Used by numerous other Python Data Science libraries sitting on top of Numpy
(Scipy, Pandas). Some even consider numpy to be foundational knowledge for DS in
Python ecosystem and consider numpy to be THE library that pushed Python into the
prominence of the scientific and data communities (it has a lot to do with
performance since both these fields use big data).
Ref: https://numpy.org/ and https://github.com/numpy/numpy
Let’s see basic operations before diving into performance and advanced cases.
Python Crash Course
Numpy basics
Installing numpy is simple, just do it with pip (available on Colab already)
Importing numpy uses the convention import numpy as np (don’t be evil: import
tensorflow as np :D )
Version np.__version__: 1.19.5
Numpy at its core provides a fast and powerful multidimensional array object
along with some powerful functions around it! Every array has a shape.
When to use numpy: in general when we need to process large arrays fast (this
usecase includes even webapps, scripts, etc.) or when need some functionality that
numpy provides like: linear algebra, stats functions, random distributions, etc.
(this is mainly for data science projects).
Creating arrays:
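A sketch of the common array-creation routines (values are illustrative):

import numpy as np

a = np.array([1, 2, 3])       # from a Python list
b = np.zeros((2, 3))          # 2x3 array of zeros (np.ones() is analogous)
c = np.arange(0, 10, 2)       # like range(): [0 2 4 6 8]
d = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced values from 0 to 1
e = np.random.default_rng(42).random((2, 2))  # random values in [0, 1)
print(a.shape, b.shape, c, d, e, sep="\n")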
Python Crash Course
Numpy basics
Creating arrays continued.
Python Crash Course
Numpy basics
Printing arrays: notice that the number of brackets on the sides indicates the
number of dimensions: [[[ -- 3 brackets - 3 dimensions
Additionally summarization / truncation can be switched off using print options,
see: https://stackoverflow.com/questions/2891790/how-to-pretty-print-a-numpy-array-
without-scientific-notation-and-with-given-pre
Python Crash Course
Numpy basics
Array operations: element wise addition, division. Element wise multiplication and
dot product.
See dot product visualization: https://gfycat.com/ajarselfassuredgoldfish-matrix-
multiplication-literature-subject
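A sketch contrasting element-wise operations with the dot product:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)  # element-wise addition: [5 7 9]
print(a / b)  # element-wise division
print(a * b)  # element-wise multiplication: [ 4 10 18]
print(a @ b)  # dot product: 1*4 + 2*5 + 3*6 = 32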
Python Crash Course
Numpy basics
Additional operations
Python Crash Course
Numpy basics
Universal functions: functions for performing mathematical operations - function
that operates on ndarrays in an element-by-element fashion, supporting array
broadcasting and other standard features, list:
https://numpy.org/doc/stable/reference/ufuncs.html#available-ufuncs
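A small example of ufuncs operating element-by-element:

import numpy as np

x = np.array([1.0, 4.0, 9.0])
print(np.sqrt(x))           # [1. 2. 3.]
print(np.exp(np.zeros(3)))  # [1. 1. 1.]
print(np.add(x, 1))         # same as x + 1, with broadcasting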
Python Crash Course
Numpy basics
//
Python Crash Course
Numpy basics
Indexing and slicing - access elements efficiently.
A common way to reverse the array is by omitting the start and end and then stepping
backwards: a[::-1]
<U7 - little endian unicode dtype:
https://numpy.org/doc/stable/reference/arrays.dtypes.html
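A sketch of indexing and slicing, including the reverse idiom above:

import numpy as np

a = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]
print(a[2], a[-1])  # single elements: 2 and 9
print(a[2:7:2])     # start:stop:step -> [2 4 6]
print(a[::-1])      # reversed view: [9 8 ... 0]

m = np.arange(12).reshape(3, 4)
print(m[1, 2])      # row 1, column 2 -> 6
print(m[:, 0])      # the first column of every row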
Python Crash Course
Indexing and slicing continued.
Numpy basics
Python Crash Course
Numpy basics
Python Crash Course
Iterating over the array.
The .flatten() function performs row based flattening (row by row) by default.
Numpy basics
Python Crash Course
Iterating over the array continued.
nditer() is different than .flatten() - .nditer() is just for iteration,
while .flatten() returns a flattened array.
The array will be writable only if special flags are passed
Numpy basics
Python Crash Course
Reshaping the array.
Ravel is similar to flatten, but does not create a copy when it is not needed.
Specifying -1 when reshaping means that you do not know how big that dimension will
be; it's called "one unknown dimension"
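A sketch of reshape(), ravel() vs. flatten() and the -1 trick:

import numpy as np

a = np.arange(12)
m = a.reshape(3, 4)      # a view with a new shape
print(m.reshape(-1, 6))  # -1: numpy computes that dimension itself (here 2)

f = m.flatten()  # always a copy
r = m.ravel()    # a view when possible, no copy...
r[0] = 99        # ...so this modifies `a` as well
print(a[0])      # 99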
Numpy basics
Python Crash Course
Splitting the array
np.split(x, [4, 7]) → split the array at index 4 and at index 7
Numpy basics
Python Crash Course
Image manipulation
Images are N-dimensional arrays: RGB (255, 0, 0), 3-channel images, 3D array.
Grayscale (0.0 - 1.0) - 1 channel, 2D array.
In terms of np.shape: (X, Y, 1) - grayscale, (X, Y, 3) - RGB.
Numpy basics
Python Crash Course
Array views - shallow copy. A view points to the same underlying data, so any
modification through the view will be reflected in the original.
We can change the data in the original through the view.
Reshaping a view does not reshape the original. We can have multiple shapes.
Calling reshape() returns a new view; assigning to the .shape property instead changes the shape in place.
Numpy basics
Python Crash Course
Deep copies. Not efficient to make them, but when you need them you need them.
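A sketch of the view vs. deep copy behavior described above:

import numpy as np

a = np.arange(5)
v = a.view()  # shares the underlying data
v[0] = 100
print(a[0])   # 100 - the change is visible through the original

c = a.copy()  # deep copy: independent data
c[1] = 200
print(a[1])   # still 1 - the original is untouched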
Exercises and Project Work
Python Crash Course
Please complete:
This quiz: https://www.w3schools.com/python/numpy/numpy_quiz.asp
These exercises: https://www.w3schools.com/python/numpy/numpy_exercises.asp (can be
done part by part)
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
https://docs.google.com/spreadsheets/d/1DmSKXClV4xOkmz-
GjKq6ew4EF0cXed6uda5PZYhtEGU/edit?usp=sharing
Additional information

--- Content from 1Pdgi5ty31dp63H4oWvk79pQNHTeVZh2E.pptx ---


Artificial Intelligence
Scikit Tips and Advanced Topics
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Advanced Topics
01
02
//
Scikit Tips and Advanced Topics
00
Tips
//
03
04
//
05
06
07
//
//
08
//
//
09
//
Know your Data (toy, real and generated):
https://scikit-learn.org/stable/datasets.html
Know your GridSearch and friends (AutoML):
https://scikit-learn.org/stable/modules/classes.html#hyper-parameter-optimizers
Know your Estimator API, know which one is most appropriate for which situation:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Know your DummyModels (strategy == uniform) & random guesser creation:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.dummy
Know your Scalers - to scale the data automatically:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
Know your Pipeline - simple decorator to wrap data transformations and model
training steps into one single variable
Know your Incremental learning ("models that can be trained further", partial fit):
https://stackoverflow.com/questions/49841324/what-does-calling-fit-multiple-times-
on-the-same-model-do
Know your Model Persistence:
https://scikit-learn.org/stable/modules/model_persistence.html
Tips
Scikit Tips and Advanced Topics
Estimators:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Tips
Scikit Tips and Advanced Topics
Unsupervised Neural Networks? RBM for feature extraction
This can be considered an advanced topic in ML or DL, but the nice thing is we can
implement examples with scikit and learn about it with this simple library then
progress to more advanced implementations (like reimplementing it with Keras).
Explanation and terminology:
https://scikit-learn.org/stable/modules/neural_networks_unsupervised.html
Tutorial with scikit:
https://scikit-learn.org/stable/auto_examples/neural_networks/plot_rbm_logistic_cla
ssification.html

GPU support:
Experimental array API: https://scikit-learn.org/stable/modules/array_api.html
Hummingbird: https://github.com/microsoft/hummingbird and small tutorial:
https://www.youtube.com/watch?v=GbC1BujV-J4
Alternatives like the praised NVIDIA RAPIDS cuML lib:
https://docs.rapids.ai/api/cuml/stable/estimator_intro.html#Linear-regression-and-
R^2-score … they also present themselves as a window to distributed GPU based ML
and analytics: https://developer.nvidia.com/blog/scikit-learn-tutorial-beginners-
guide-to-gpu-accelerated-ml-pipelines/
Advanced Topics
Scikit Tips and Advanced Topics
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Scikit Tips and Advanced Topics


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 10bVPeanz7d_9AZYyH81sK1qNWDHwTrsJ.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Organizing larger programs
Distributing packages
01
02
03
Package plugins (optional)
Python Crash Course
00
Modularity
06
Git for version control
07
Module management with pip and venv
08
Version management with pyenv
09
Project management with pipenv (optional)
04
Standard library modules
10
Project management with poetry (optional)
05
Logging and config files
11
Distributable applications (optional)
Modularity
Creating larger programs with Python requires understanding of how modules work.
A broad discussion about modularity is a prerequisite for a discussion of project
architecture and organization.

Modular programming refers to the process of breaking a large, unwieldy programming
task into separate, smaller, more manageable subtasks or modules. Individual
modules can then be cobbled together like building blocks to create a larger
application. However, the benefit lies in the ability to generate many projects from
those modules - reusability. +Understandability. +Replaceability.

Functions, modules (python) and packages (python) are all constructs in Python that
promote code modularization (and reusability/dry-ness):
The most fine-grained modularity level in Python is a function. As a piece of
reusable code it replaces repeating logic in our python script by extracting
that logic, putting it into one place with a name and then giving us the ability to
reuse it.
Functions can be grouped into classes (optional).
A group of functions (and variables, constants, classes) defined in a file with
a .py extension is called a module. Modules allow us to separate project
functionality into parts that are each responsible for one part of the application.
("Like belongs with like")
A group of modules is called a package and is usually associated with a directory
containing >= 1 modules.
Libraries and products/projects are the highest level we care about (threads,
processes, microservices, applications … dynamic modularity - modularity during the
execution of said application. We are talking about modularity in terms of code
(static)).

Since we already covered functions quite extensively, we will jump into modules
next. The important thing is that you should think about the new mechanisms we are
about to learn as extensions of previous mechanisms - generalization.
Python Crash Course
Modularity
We can say that there are 3 types of modules in python:
written in Python itself (files that we create with .py)
written in C and loaded dynamically at run-time (re, numpy)
built-in modules in the interpreter (itertools, time, sys, os)
A module can be imported and used by calling the functions inside - as a library
using the import statement. Or it can be run.
Import does not make all the names automatically available in the calling module's
symbol table - you still need to call them explicitly. It only adds the imported module
name to the calling module's symbol table. You can check the symbol table with
an empty call to dir()
Module code is executed on import, but only once - on first import. All top level
statements are run (def is also a statement so it is run, remember why we can’t use
mutable data as default parameters? See: https://florimond.dev/en/posts/20... ).
To recognize whether the module needs to be imported or executed we use the
__name__ check. When a module is imported, the name check allows us not to execute
code inside the module and just make the functions / classes available. But when we
launch it directly, i.e.: python blah.py, it will be executed. It is recommended to
make all your python modules (so all .py files) importable, since testing is easier
that way.
Ref: https://www.w3schools.com/python/python_modules.asp ;
https://docs.python.org/3/tutorial/modules.html

Demo:
Creating a module
Importing via repl
Recognizing if we want to call or import the module with the dunder __name__
Importing into the app (another file)
Aliases (from functions import median as med, average as avg)
Go to definition (win: ctrl + click or ctrl + b) and back (win: ctrl + alt + ←)
dir(), locals(), globals()
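A minimal sketch of such a module (the file and function names are illustrative):

# functions.py - importable and runnable
def median(values):
    """Return the middle value of a sorted copy of values."""
    s = sorted(values)
    return s[len(s) // 2]

if __name__ == "__main__":
    # Runs only when executed directly (python functions.py),
    # not when another module does `import functions`.
    print(median([3, 1, 2]))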
Python Crash Course
Modularity
What happens when we import modules? The interpreter searchers for <module_name>.py
file in a list of directories:
The directory from which the input script was run or the current directory if the
interpreter is being run interactively (REPL)
The list of directories contained in the PYTHONPATH environment variable, if it is
set. (The format for PYTHONPATH is OS-dependent but should mimic the PATH
environment variable.)
An installation-dependent list of directories configured at the time Python is
installed (sys.path)
This implies that if you have a module import error you need to inspect the sys
path and probably append it.

You can always check where the modules resides: using the module_name.__file__
dunder field.
You can use various ways to import: from <module_name> import <name> as
<alt_name>[, <name> as <alt_name> …] see: https://realpython.com/python-modules-
packages/
You can also import inside a function / conditionally - then the module will be
imported only when the function is called. It is NOT reimported every time you call
the function. This can sometimes be beneficial for small performance gains when the
module is rarely needed; however, most style guides recommend importing modules at
the top.
Python Crash Course
Modularity
The difference between script, module and application?
Python is sometimes called a scripting language, but that is a contextual term. You
can write small scripts that automate tasks with python. However python is much
more than batch or bash scripts are in windows or linux, so it should not be
considered as “only a scripting language”. The presence of complex libraries,
frameworks and entire ecosystems of tools prove that.
Since it is recommended to make every python file importable, your scripts should
be also - that means script ~~ file.
Python Crash Course
Modularity
Packages are directories that hold one or more related modules - it’s a special
kind of module that can hold other modules.
Packages allow for a hierarchical structuring of the module namespace using dot
notation. In the same way that modules help avoid collisions between global
variable names, packages help avoid collisions between module names.
We have 2 types of packages, ref: https://stackoverflow.com/questions/37139786/is-
init-py-not-required-for-packages-in-python-3-3
Regular packages → contain __init__.py file, you should prefer these. It contains
package initialization code, often empty.
Namespace packages → do not contain an __init__.py file; these have a specific
use case and serve for package nesting
Python Crash Course
Modularity
__init__.py is useful for library developers to hide internal structure of the
library and make import easier (example: requests lib imports get, post and other
methods in the __init__ file). Additionally __all__ = [ … ]
Sibling imports / Subpackages
use ..
example: from ..cacl.functions import print_hi

If you will try to import / launch files directly from non-project-root then you
will need to use os.getcwd(), sys.path.append() and knowledge of file system paths.
Python Crash Course
# src.__init__.py
from src.cacl.functions import *
from src.math.my_math import *

from src.cacl.functions import print_hi


from src.math.my_math import print_hi_with_exclamation

__all__ = [ 'print_hi', 'print_hi_with_exclamation' ]

# main.py
from src import print_hi
from src import print_hi_with_exclamation
Modularity
When writing modular code we need to document it well.
We use docstrings inside the function for documentation.
This allows us to read the documentation of an imported function using help()
Docstring conventions: https://www.python.org/dev/peps/pep-0257/
Sphinx uses it w/ some additional syntax (used by scrapy, bs4, etc.).
Also noticeable: https://about.readthedocs.com/
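A small sketch of a docstring being read back with help():

def average(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

help(average)           # prints the signature and the docstring
print(average.__doc__)  # the raw docstring string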

To be able to launch a file w/o constantly writing python <filename.py> we can use
shebang in *nix (read: “unix-like”) type systems.
And from Python 3.3 it should also work with windows.
Python Crash Course
Modularity
Python Crash Course
Modularity
Python Crash Course
Modularity
Executable directory
Putting a __main__.py file at the root of the directory makes it an executable
directory
Python will execute it if launched with python dirname , and when launching the
directory is put at the beginning of the sys.path
Executable zip
We can zip executable directories and distribute them
Python knows how to launch them (conceptually similar to java jar)
Executable package
if you put a __main__.py inside a package python will execute it if launched with
python -m package_name
example (note, this is not an executable package, only a well-known example of
calling a python module directly):
$ echo "{"\"a\"":{},"\"b"\":{}}" | python -m json.tool

See more on that: https://stackoverflow.com/questions/4042905/what-is-main-py


Python Crash Course
Organizing larger programs
If you create python programs while not working with a framework you have to
choose the project structure
If you use a framework (like scrapy, django) you can't easily dictate the project
structure (RT*M). It is better to work with the framework than fight it.
This is one of the ways to organize your program.
It is common to have /logs directory and/or /config for configuration files.
Example structure (often encountered):
name/ → not a package, just a folder
README.<rst / md> → readme file, short overview, short selling pitch for the
project, more in the /docs (reStructuredText)
docs/ → documentation, various formats, see:
https://github.com/graphql-python/graphene/tree/master/docs
setup.py / pyproject.toml → setup script to be able to launch the program and
package it (setuptools)
requirements.txt (?) → packaged that need to be installed with pip to run
the project: pip install -r requirements.txt
src/ → all the source code packages and subpackages
name/ → actual package, can have same name as application:
https://github.com/pallets/flask/tree/main/src/flask
tests/ → test code
unit/
conftest.py
integration/
e2e/ → multiple types of tests
logs/ → logs for actions, errors and exceptions (sometimes logs
are kept in the root directory), i.e.: geckodriver.log
config/ → configuration files with passwords and so on
(sometimes configs are kept in the root directory)
LICENSE → license file (GPL2, MIT, BSD)
app.py → possible program entry point (can be inside
src/)
Examples from OSS: https://github.com/maxcountryman/flask-login/tree/main ,
https://github.com/psf/requests
Python Crash Course
Organizing larger programs
Illustrating sibling package import with tests
PyCharm automatically adds the root project folder to sys.path, so the test/test_x.py
file will resolve all the imports successfully, i.e. from src.x.y import z
If, however, you would launch the tests from /tests directory or from the root
directory - the imports would not work.
To fix that we need to always somehow import the root project directory into the
sys.path.
Approach no. 1: Append ../ to the path if tests is directly inside the root. This will
only launch successfully from the tests/ directory.
Approach no. 2: We shouldn't hardcode absolute paths into the code, so we need to
dynamically compute the necessary path.
Some discussions:
https://docs.python-guide.org/writing/structure/
https://stackoverflow.com/questions/193161/what-is-the-best-project-structure-for-
a-python-application
https://realpython.com/python-application-layouts/
https://dev.to/codemouse92/dead-simple-python-project-structure-and-imports-38c6
Python Crash Course
import os
print(os.getcwd())
print(os.path.dirname(os.path.realpath(__file__)))

import sys
# importing project root will allow all directories to be searched
# ... even if you try to launch the tests from tests directory: python .\test_x.py
# ... or from the root directory: python .\tests\test_x.py
# ... or from anywhere: python .\tests\abc\def\test_x.py
sys.path.append(r"C:\Users\Mindaugas\Desktop\Projects\CAAI\Proj")
# sys.path.append(r"../")
# sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__),
r'..\')))
# sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__),
"..\\..\\")))#
from src.some.string_returner import return_string
print(return_string())
Package plugins (optional)
This is quite a fuzzy topic in many programming language ecosystems - how do you
develop a system of independent modules that can be added separately to an
application after it already has been developed for weeks, months or years? How do
you need to modify your application to make this happen?
Often when creating a Python application or library you’ll want the ability to
provide customizations or extra features via plugins.
Because Python packages can be separately distributed, your application / library
may want to automatically discover all plugins available.
Define extension points that the plugins can implement.
Everything in this topic revolves around importlib package.
The package will use discovery techniques to load the packages at runtime.
namespace packages and pkgutil mechanism → core package treats subpackage as
extension point, commonly called plugin. The core package scans the subpackage at
runtime to see which plugins have been configured. The plugins can exist in
different directories on the path, but the folder structure of the plugin must
match the folder structure of the core package.
setuptools entry points mechanism → a setup.py file is defined where a set
of extension points are defined. At runtime the core package iterates over the
extension points and calls the extensions when needed.
3rd party, library based or custom mechanisms → the most flexible, but non-
standardized.
Important note: there is a PEP 518 standard (aka: pyproject.toml) that replaced
setuptools packaging mechanism and it became the de facto packaging method. In the
future it is recommended to take a look into that.

There are more ways to implement extendability in your python project. Think for
yourself: how would you create a plugin system for the users of your
framework / app?
https://packaging.python.org/guides/creating-and-discovering-plugins/
https://stackoverflow.com/questions/932069/building-a-minimal-plugin-architecture-
in-python
https://www.computerandnet.com/2021/09/collection-of-python-plugin-frameworks.html
(existing plugin frameworks in python)
https://github.com/mitsuhiko/pluginbase
https://www.youtube.com/watch?v=cbot48lckOs
Python Crash Course
Distributing packages
Steps
Ensure unique package name - one that does not exist in pypi.
Create a new project in the IDE of your choice.
Create the following project structure:

<lib_name>/
setup.cfg
pyproject.toml
<... other files like readme …>
<lib_name>/
__init__.py
<code_file>.py

Add configuration to pyproject.toml


Add configuration to setup.cfg (be careful with non-ascii chars, you will need to
configure locale correctly for that: https://github.com/pypa/setuptools/issues/1062
)
Add code to <code_file>.py
Run local install: pip install .
We can create our own libraries in python to reuse code between projects.
The following is a good starting point: https://towardsdatascience.com/how-to-
package-your-python-code-df5a7739ab2e (free link provided by the author through his
youtube video: https://www.youtube.com/redirect?event=video_description . This
tutorial uses setup.cfg)
This https://packaging.python.org/tutorials/packaging-projects/ uses pyproject.toml
(modern approach, less popular but gaining).
Python Crash Course
def eprint(string: str) -> None:
print(string, flush=True)
[metadata]
name = mindaugas_test_library
version = 0.0.1
author = Mindaugas Bernatavicius
author_email = [email protected]
description = For generating A E S T H E T I C ascii art
long_description = file: README.md
long_description_content_type = text/markdown
url = https://github.com/MindaugasBernatavicius/notes
classifiers =
Programming Language :: Python :: 3
License :: OSI Approved :: MIT License
Operating System :: OS Independent

[options]
packages = find:
python_requires = >=3.7
include_package_data = True
[build-system]
requires = [ "setuptools>=54", "wheel" ]
build-backend = "setuptools.build_meta"
Distributing packages
Steps (.cont):
Run pip list to see if the package was installed like any other package
To uninstall package from local file system: pip uninstall <package name>
Test the package (you can delete the .egg-info and build dirs)
Run pip install build and python -m build
If the build succeeds (see screenshot) run: pip install twine
… and python -m twine upload --repository testpypi dist/*
… you will need credentials (2FA and API tokens need to be used)
Test it in another project by installing it:
.. pip install -i https://test.pypi.org/simple/ mindaugas-test-library==0.0.1
After confirming that the package works you can publish it to pypi:
… python -m twine upload --repository pypi dist/*
If installation fails - need to troubleshoot.
… for example if “__init__.py” filename is incorrect only dist-info
… will be generated. Updating existing lib is not possible
… even after deleting it in pypi (“names are forever” in pypi).
… You need to change build number. See :
https://stackoverflow.com/questions/21064581/how-to-overwrite-pypi-package-when-
doing-upload-from-command-line and
https://www.reddit.com/r/Python/comments/35xr2q/
howto_overwrite_package_when_reupload_to_pypi/

Python libraries can be held in github/gitlab and installed from there.


Python Crash Course
from mindaugas_test_library.enhanced_print import eprint

eprint("Hi")
Distributing packages
Take note that static files sometimes need to be included with the distributed
library, so make sure that the locally installed package we just created has them:
Python Crash Course
Distributing packages
Note that we did not install transitive dependencies automatically with the package
we created, but there are ways to do that - see the sketch below and:
https://stackoverflow.com/a/48314070/1964707
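A minimal sketch of declaring runtime dependencies in setup.cfg so that pip installs them together with our package (the requests pin below is a made-up example, not a real dependency of this library):

[options]
install_requires =
    requests>=2.21.0

With this in place, pip install . would pull requests in automatically.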
Python Crash Course
Distributing packages
How about distutils, setuptools (setup.py) which many modules (tinygrad and others)
are still using?
See this:

And this: https://stackoverflow.com/a/25372045/1964707


Python Crash Course
Standard library modules
Python Crash Course
The Python standard library is rich in functionality:
https://docs.python.org/3/library/
It is worth learning it better and better as you progress in your career.

Important ones:
sys → doing work with python interpreter and runtime:
https://docs.python.org/3/library/sys.html, sys.argv, sys.path, sys.exit()
os → doing work with the operating system:
https://docs.python.org/3/library/os.html, os.environ, os.getcwd(), os.getpid()
io → work with input output: https://docs.python.org/3/library/io.html,
io.open(), readline()
time → work with time: https://docs.python.org/3/library/time.html ,
time.sleep(), time.time()
re → standard regex module: https://docs.python.org/3/library/re.html ,
re.match(), re.split()
pickle → the popular binary serialization tool
https://docs.python.org/3/library/pickle.html , pickle.dump(), pickle.load()
statistics → standard package for simple statistics
https://docs.python.org/3/library/statistics.html statistics.mode(), median(), etc.
logging → standard logging package … will see soon.
… math, random, datetime, argparse, csv, pprint, functools, decimal, marshal and so
on.
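A minimal sketch exercising a few of the modules listed above (standard library only, nothing to install):

import sys, os, time, re, statistics

print(sys.argv)                      # CLI arguments; argv[0] is the script path
print(os.getcwd(), os.getpid())      # current working directory and process id
start = time.time()
time.sleep(0.1)                      # pause for 100 ms
print(f"slept {time.time() - start:.2f}s")
print(re.split(r"[,;]", "a,b;c"))    # ['a', 'b', 'c']
print(statistics.median([1, 3, 7]))  # 3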

For statistics, datascience, machine learning, deep learning we usually use


specialized 3rd party packages.
statsmodels, numpy, pandas, scikit-learn, keras, pytorch, tensorflow, fastai,
pytorch-lightning
… same for web apps, mobile app, iot.
Logging and config files

For large projects we need to know how to log errors and notices so that the
application operator / administrator can effectively troubleshoot, fix
and / or report issues (to maintainers). print() calls are ephemeral (not saved),
contaminate the command line, and are limited in features.
In many projects we are logging to at least two files (or more): error.log and
<some-name>.log (statspagescrapper.log) files.
Python comes with the builtin modules for logging so there is no need to install
anything else for many applications (but: structlog module).
What is there to know about logging: levels, how much to log and standard-ish log
structure.
5 standard logging levels: DEBUG, INFO, WARNING (default), ERROR, CRITICAL, see:
https://www.youtube.com/watch?v=W1vOdzHCa-I there is an RFC for syslog standard:
https://datatracker.ietf.org/doc/html/rfc5424 .
How much should you log? The maximum amount of logging is up to the developer and
depends on how advanced the application is. Start by logging all handleable
exceptions and possibly I/O calls (calling an external service, the database,
getting console args). The general rule - log all significant events (lifecycle
events) while providing the ability to turn on different logging levels for more
extensive logging.
Refs: https://realpython.com/python-logging/ and
https://docs.python.org/3/howto/logging.html
DEMO: logging, logging to file, logging attributes
(https://docs.python.org/3/library/logging.html#logrecord-attributes), dedicated
loggers.
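A minimal sketch of those demo ideas (logging to a file, record attributes in the format string, a dedicated per-module logger); the main.log filename is an arbitrary choice:

import logging

# root logger configuration: output file, level and record attributes
logging.basicConfig(
    filename="main.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

# a dedicated logger per module, instead of logging through the root logger
logger = logging.getLogger(__name__)
logger.info("Application started")
try:
    1 / 0
except ZeroDivisionError:
    logger.exception("Handled an exception")  # logs the traceback at ERROR level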
Python Crash Course
Logging and config files

There is an approach to app development called “12 factor apps” that is gaining
popularity. They have a chapter about logging: https://12factor.net/logs
Python Crash Course
Logging and config files

There are also alternative ways of logging configuration when you delegate the
configuration to the app: https://docs.python-guide.org/writing/logging/
We will take a look at configuration options next.
Python Crash Course
Logging and config files

Often in a larger project we want to provide a simplified, externalized way of


configuring the behavior - we don't want to change the code every time when a
simple change in the configuration and a restart would suffice.
A configuration is something that the program reads, adjusting its behavior based
on the values in it.
It makes it simpler for people who are not programmers to operate the software -
imagine you have a scraping script that needs to be reconfigured when something on
the site changes - you can hire a person without programming knowledge to operate
it if the configuration options are clear, documented and simple to use.
In general there are basically 4 ways of configuring a python program:
File based configuration - including python code files, yml, json, .ini and other
text files.
External configuration: database or some network service / API (... or even a Web
UI, Desktop GUI)
Environment variables - for example, os.environ['HOME'], .env and python-dotenv
package.
CLI parameters - they can usually override other 3 ways of configuring the
application, enabling faster testing cycles
Let’s brainstorm what can be configured, to get a better idea of why externalized
configuration might be useful:
scrape url(s) or external service urls
scrape selectors for a specific url,
port number on which the server will run
logging level(s) / turn on debug logging?
max thread count for multithreaded application
delay between requests made
ML model used when several are applicable for a given data
file format to output serialized ML model weights
Let’s try to replace the logging config we had (python config / configuration with
code) with file based configuration
Refs:
https://stackoverflow.com/questions/5055042/whats-the-best-practice-using-a-
settings-file-in-python
https://docs.python.org/3/library/configparser.html
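A minimal sketch of file based configuration with the builtin configparser; config.ini and its keys are made up for illustration:

import configparser

# config.ini might contain:
# [scraper]
# url = https://example.com
# delay_seconds = 2
# log_level = INFO

config = configparser.ConfigParser()
config.read("config.ini")

url = config["scraper"]["url"]
delay = config["scraper"].getint("delay_seconds")                  # typed accessor
log_level = config["scraper"].get("log_level", fallback="WARNING")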
Python Crash Course
Logging and config files

What do 12 factor apps say about config files?


With .env files the app can automatically instantiate env vars, for example (a
minimal sketch of reading them follows the file below).

db_server=localhost
db_port=3363
db_user=root
db_pass=gfdg78746565@$545
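A minimal sketch of reading the .env file above with the third-party python-dotenv package (pip install python-dotenv):

import os
from dotenv import load_dotenv

load_dotenv()                         # reads .env from the current directory into os.environ
db_server = os.environ["db_server"]
db_port = int(os.environ["db_port"])  # env var values are always strings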

Python Crash Course


Git for version control
Git is a code version control (eco)system (VCS) … (ecosystem: GUI tools, CLI, SaaS
like github, gitlab (local option), bitbucket)
Everyone who is programming in any capacity could benefit from using git.
For us it’s useful because we will want to host portfolio.
Learning ML/DL we might not have much time to learn git, however the basics are
easily accessible and useful.

Why do we use git (...ecosystem)?


It allows us to create versions of our code and switch between those versions. There
are two ways of versioning: chronological (commits) and parallel (branches). With
git we can return to any commit we made, essentially having a time machine through
project history. We can migrate code from one branch to another as well.
Git allows multiple developers to work on the same code base (usually working on
different features at the same time), permitting and encouraging them to integrate
code from different developers often (“if it hurts do it more often …”).
As a backup (if gitlab, github).
Allows easy integration with infrastructure and build tools (Jenkins, Jira and so
on).
Popularity - comes with its perks: tooling, community, documentation.
DVCS - work offline, all history available since the repo is on your machine not on
a server.
What are the disadvantages:
Git does not deal well with binary files - you can keep some project specific
images in git, but it’s not ideal for that (gitlfs).
Quite complex to master (contested), but not necessarily complex to learn the
basics.

Git history: https://www.youtube.com/watch?v=4XpnKHJAok8


Python Crash Course
Git for version control
Installation: https://git-scm.com/downloads + we need to configure email and
username for git to use.
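The one-time configuration, in the same CLI style as the workflow below (the values are placeholders):
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --list → verify the current configuration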

Git works by creating a separate database in our project .git - it’s a file based
database.
Git does not store diffs in the database, but we can think like that for the time
being to make it simpler (more: https://stackoverflow.com/questions/10398744/does-
git-store-diff-information-in-commit-objects ). The project directory will only
contain the files at any particular moment, but the database will remember all the
history from the first to the last commit and everything in between.
Advanced intro of git internals: https://www.youtube.com/watch?v=P6jD966jzlk
There are common workflows / work cycles with git and we will take a look at one
now.
Python Crash Course
Git for version control
There are many Git GUI tools, however I would recommend using the CLI, as once you
learn the CLI you will be able to choose any GUI tool you like. If you learn GUI-
first, then your knowledge might be a bit limited.
Workflow:
git init → if you are the initial founder of the project
git clone → if you join the existing project
git --version → check the version
git status → continuously runnable command. State of the repo.
git add <f/d/. > → stages the changes.
git rm --cached . → unstages the changes. (-r for multiple files)
git reset → remove change from staging, not file
git commit -m ‘’ → creates a point in time to return if needed
git update-ref -d HEAD → revert the first commit (rarely used)
git log → shows the history of the current branch

Exercise: create simple git repo locally with single commit. Send git log output.
Python Crash Course
Git for version control
Workflow (continued):
git branch -a → list branches
git branch <name> → create branch
git checkout <name> → switch to branch
created branch will have the same code as the parent branch initially
changes made in a branch belongs to that branch
changes need to be committed or stashed before switching branches
stash is like a temporary storage where you can put unfinished work
file committed to a branch belongs to that branch (a new file does not prevent
switching)
git stash → stash current uncommitted changes (to not lose
them!)
git stash -- filename.ext → git stash a single file w/o any message
git stash push -m "more" → add more to the stash
git stash list → list stashes
git stash pop 0 → apply the changes and drop the stash (if conflict
happens, drop will not be automatic)
git stash drop 0 → drop w/o applying
git stash apply 0 → apply the stash w/o dropping it
git stash show -p 1 → check what is inside the stash before applying it
git merge <branch> → merge the branch (have to be on branch into which
the merge is performed)
git branch -d <name> → delete the branch (-D force delete)
git branch -rd origin/<name> → delete branch reference locally if you deleted
branch first on github
git branch -m old new → rename branch from old to new
git push -d origin <name> → delete remote branch:
https://stackoverflow.com/a/2003515/1964707
git checkout <commit-hash> → get back to the past
git checkout <branch-name> → back to the future
git diff HEAD^ HEAD → diff current head with the previous commit
git diff sha1 sha2 → diff commits (use chronological order)
git diff main ticket1 → compare branches (before merging for example)
git show <commit-hash> → show extended details of the commit (including
changes made as a diff)
git revert <commit-hash> → undoes changes that are made by this specific
commit (rem / add / code /files )
Python Crash Course
Git for version control
Common gotchas:
Don’t commit passwords in tracked files. Create an .example file that users of your
project will check out and enter passwords into.
Use .gitignore to indicate which files are not to be tracked by git (like config
files with passwords).
Use .gitkeep to keep track of project directories that are kept empty (only to
indicate project structure) in git (/log, /tmp, /bin).
Do not create git repos inside other git repos - git repos are not nestable (we can
use git submodules for that, it’s an advanced topic).
Never commit your library directories e.g.: venv

Common workflows / situations:


When you are doing some work on a branch and, without finishing it, you need to
switch to other work. You can create a separate branch (call it “quickfix”,
let's say) and then commit everything to that branch. Switching back to the old
branch, it would not contain the changes in the “quickfix” branch. Some call it
“branch-in-flight” - this can be used instead of “git stash”.
A branch per attempt: if you don’t know whether your code solution is going to work
or not, you can branch out of main/master trying different solutions w/o discarding
any of them. Might be an alternative to having large piles of code from each attempt
commented out, although in reality most developers would comment out attempts. Can
be scaled to architectural decisions, which can’t be commented out.
You work on one ticket, you make some progress and see that you are blocked. You
can branch-out, work another ticket and then return back to the first branch once
the requirement is clear.
If you ever feel unsure about the local changes you are doing with git (deleting
branches, changing history). Before doing anything just create a copy of the whole
folder and try it in the copy first. Useful when learning git as well.
Python Crash Course
Git for version control
Github:
A service built on top of git that serves many different purposes: hosting service,
project management tools, social network, sponsorship platform (like Patreon).
Even though git is decentralized, github is commonly a central repository where the
de facto project code is kept.
Often also includes all the product artifacts: releases, documentation, issues,
discussions - especially for OSS.
What we need to know with github: commits, releases / tags, adding comments, adding
screenshots to readme, deleting repos.
For solo developers - this is the core of your personal brand as a developer (cv +
linkedin + github (dev: personal projects + OSS contributions) … + kaggle (data
roles)).
Let's push our code to github and see its usage.
git remote -v → check which remotes are configured
git remote add origin <github url>
git push --set-upstream origin main | git push origin main

Exercise: create github account and push code (from the previous exercise) to it.
Send POW.
… you need to create a token (classic) to push code to github (password auth for
pushing has not been supported for years)
… use personal access token: https://docs.github.com/en/authentication/keeping-
your-account-and-data-secure/managing-your-personal-access-tokens#using-a-personal-
access-token-on-the-command-line
Python Crash Course
Git for version control
Github (cont.):
Opensource workflow:
OSS projects are hosted in git.
We need to fork them and create a PR (pull request) if we want to contribute.
After the PR is submitted it will be reviewed and either rejected or approved.
Python Crash Course
Git for version control
Question from students - git rebase:
With the rebase command, you can take all the changes that were committed on one
branch and replay them on a different branch
Rebase is used to make git history cleaner and more understandable.
There are advantages of using rebase, but there are disadvantages and cases when
you should not use rebase:
Do not use rebase on a public branch that multiple people use - it might cause
problems (do not rebase into master or from master)
Some teams do not permit a rebase at all - check when you come into a new company
Why we use rebase - squashing (history cleanup, a smaller log and so on) and rebasing
from master (pulling w/o merge) when it's safe. The last part should arguably be the
default: https://stackoverflow.com/a/4675513/1964707
Python Crash Course
Git for version control
Demo squash:
create main, add file, create ticket1 branch, add couple commits then
git log --oneline → a less verbose version of git log
git merge-base ticket1 main → get original branching point from main
git rebase -i <sha>
choose the commit you want others to be squashed into
leave a single commit message (all can be left as well)
git reflog

Disadvantages:
Since commits are squashed you can’t travel back to each one
Python Crash Course
Git for version control
Demo rebasing from master:
create main, add file, create ticket1 branch, add couple commits then
switch to main and create yet another file (we are simulating the situation when a
teammate comes in and says that there is an important new piece of code,
potentially a “hotfix”, in master that we want to incorporate right away, so
that we can see whether our code is compatible with that new hotfix and there are no
issues)
don’t forget to commit in main
switch to the other branch
check the log
git rebase main
check the log and the merge base!

… you can compare this flow with the merge


git merge main
and see the difference (no merge commit)
Python Crash Course
Git for version control
Reverting the rebase:
Ref: https://stackoverflow.com/questions/134882/undoing-a-git-rebase
Python Crash Course
Git for version control
Short and good video about rebase: https://www.youtube.com/watch?v=0chZFIZLR_0
Python Crash Course
Git for version control
SQ: Using git rebase to solve merge conflicts.
This is a practice sometimes seen “in the wild”.
This is an advanced topic, so do not worry about it until you know the git basics
very well.

Recap, tips and tricks:


Never, ever commit any config files containing sensitive information to git. If you
have committed such files, never just replace the passwords or other sensitive
information and commit on top - the old values remain in the repository history.
You can imitate “a team” working on a project in GIT with just github and your
computer. Just clone the repo in another directory and open another IDE / text
editor and start changing things in parallel.
In a lot of cases people do not test whether the project launches correctly when
cloned from git. The readme.md file has to contain the launch procedure; if it is
not correct, this will be treated as a mistake. This is very important for our PP1
and for future job interview processes - always test that the described launch
procedure actually works!
Python Crash Course
Git for version control
Some git humour:
https://www.youtube.com/watch?v=3mOVK0oSH2M
https://www.youtube.com/watch?v=SCPVDpyApgQ
Python Crash Course
Module management with pip and venv
Pip
Default package manager for python
Included in the python installation since versions 3.4
Although python is a “batteries included” language we still need external packages;
PyPI is our friend for that.
Pip does not uninstall transitive dependencies - use pip list or pip show to inspect
them, and uninstall using pip-autoremove: https://stackoverflow.com/a/27713702/1964707
command in cmd: python venv\Scripts\pip_autoremove.py requests
Or use pipdeptree to check which module depends on which
Commands / demo, for more see: https://realpython.com/what-is-pip/
pip help, install, uninstall → common commands to know well
pip search → in the process of being deprecated:
https://stackoverflow.com/a/66816629/1964707 )
python -m pip install --upgrade pip → execute pip to upgrade pip
pip show aesthetic-ascii-mindaug → extended information about a package
(required-by and requires are important!)
pip install -r requirements.txt → file with prod requirements,
requests>=2.21.0, <3.0 to specify range that does not break the app
pip freeze > requirements.txt → create a list of project dependencies (works
with cmd on windows)
pip install --upgrade -r req....txt → upgrade to newer version when available
requirements_dev.txt → dev requirements (not for production; pytest,
for example). Below is how to add prod reqs to it!
requirements_lock.txt → locked dependencies and versions (not
essential if you use reqs.txt well)
pip uninstall -r requirements.txt -y → uninstall all with reqs file
pip list | find /c /v "" (or wc -l)
pip list --local
pip -V → check which pip is called (libs will be
installed there)
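As promised above, a minimal sketch of a requirements_dev.txt that pulls the prod requirements in through pip's nested -r syntax (the pytest pin is just an example):

-r requirements.txt
pytest>=7.0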
Python Crash Course
Module management with pip and venv
Python Crash Course
Venv
Builtin python module for managing project-specific dependencies: for example
requests v1 is used by your project, but you want to create another project that
uses requests v2. You don’t want to update it globally because the other project
might fail. Projects become self contained - deletion of the project folder then
deletes all dependencies leaving the computer clean.
Same version of python interpreter used to create the environment is going to be
used in the environment.
Never commit venv to VCS, commit requirements{__dev/lock}.txt files.
commands
python -m venv <name> → creates the venv; pycharm does that by itself & you can
inherit all the global packages if you want
<name>\Scripts\activate.bat
deactivate
rmdir <name> /s | rm -r <name> → delete including subdirs
python -m venv venv --system-site-packages → inherit system packages
Version management with pyenv

Pyenv
manage different python versions and easily switch between them
usually used on a development machine
don’t confuse w/ pyvenv, a venv predecessor, now deprecated
useful to test application against multiple versions of python
Installation for mac and linux: https://github.com/pyenv/pyenv
Installation for windows: https://github.com/pyenv-win/pyenv-win#installation
More: https://realpython.com/intro-to-pyenv/...

Pyenv important commands


pyenv install -l
pyenv install 3.7.4
pyenv versions
pyenv local 3.7.4 | pyenv global 3.7.4
Python Crash Course
Project management with pipenv (optional)

Pipenv
a new possible replacement/improvement for pip
aims to combine pipfile (a replacement for the requirements.txt file that uses toml
format and store dev, test and prod dependencies in separate sections of the same
file), pip and virtualenv into one command on the command-line.
some literature still recommends to use pip for beginner python developers - the
workflow is easier
more: https://pipenv.pypa.io/en/latest/ … however, to start, better use this
documentation: https://docs.python-guide.org/dev/virtualenvs/
see this for an extended community discussion:
https://stackoverflow.com/a/41573588/1964707

Pipenv usage demo:


pip install pipenv → install pipenv globally (you can obviously
choose the version of pip that will do the installation using pyenv global)
mkdir test_project → create a project directory
pipenv --python 3.6.8 → create a project with a specific python version
pipenv install → create virtual environment
pipenv install requests → install a specific package into the venv (do not
need activation)
mkdir .venv → if you want to use a local venv just pre-create a
venv folder in the project directory
pipenv install requests --dev → install requests as a dev dependency (by default
there are only dev and non-dev packages, but see the categories)
pipenv install --categories aws .. → install a dependency into a custom category
pipenv uninstall requests → uninstall requests
pipenv --venv → check where the project venv is (C:\Users\
Mindaugas\.virtualenvs\TestDelete-***\Scripts is where the interpreter resides)
pipenv shell → activate venv
exit → deactivate venv
pipenv --rm + del|rm Pipfile* → remove the virtual environment (it is kept
separately from the project by default)
Python Crash Course
Project management with poetry (optional)

Poetry
a new replacement/improvement for pip and pipenv
a tool for dependency management and packaging in Python
allows you to declare libraries your project depends on and it will manage
(install/update) them for you
offers a lockfile to ensure repeatable installs / dependable builds, and can build
your project for distribution
poetry can be used for both application creation and library creation (it is
sometimes said that pipenv focuses more on applications rather than libs).

Installation + PATH
https://python-poetry.org/docs/#installation
remember poetry is not prepackaged with python and is a global tool if installed
the default way (global to all users or to your user) and there are important
implications to that:
it will not be managed by pyenv and will be bound to your system python.
it does create virtual environments, but it does that outside of the project (which
is something I dislike, because if you delete the project the dependencies will not
be deleted automatically. Note, this is subject to change).
because it is bound to your system python, it will declare your global python
version as the minimal supported version [tool.poetry.dependencies]
Commands (windows):
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content
| py -
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content
| py - --uninstall (don’t forget to remove pypoetry from PATH)
Python Crash Course
Project management with poetry (optional)

Workflow for a trial


create a new folder
choose python version using pyenv
create a venv and configure vscode to use it
pip install poetry
poetry init
… use it

Commands
poetry new <project-name> → create a new project with poetry
poetry init → initialize poetry into the project that was
already started (I usually use this approach)
poetry install → install all dependencies (usually run after
cloning the project), does not remove packages
poetry install --sync → if you want packages removed from
pyproject.toml to be uninstalled and newly added packages installed
poetry add <package-name> → to add a new package to the dependency list
of our app
poetry remove <package-name> → uninstall the package
poetry update → updates the packages, but respects the
version constraints in pyproject.toml
poetry show --tree → show dependencies as a tree
poetry add requests → add dependency
poetry add pytest --group test → add dependency to group test (group
names can be chosen at will, no need to specify the group when uninstalling)
poetry show --only main → list non-development dependencies (only the
main dependencies)
poetry run python app.py → run the app (you can encounter specialized
commands, e.g. poetry run flask run - they require additional configuration)
Python Crash Course
Distributable applications (optional)
Raw modules (python files via gitgists, or direct download)
Executable Zips (covered)
Docker containers
Creating installable software and .exe files: PyInstaller and py2exe, see:
https://packaging.python.org/en/latest/overview/#bringing-your-own-python-
executable
Python Crash Course
Practical Project 1 (PP1)
We have learned quite a lot of Python (and some additional concepts) in this part -
we want to solidify our knowledge by practicing!
To complete this practice assignment you need to create a python web scraper
project.
Requirements:
Scrape one or more websites that are publicly available - it can be a minimal
project or you can go as in-depth as you like.
It must scrape in either of 2 ways (or use both ways) with navigation either - in
depth (e.g.: items page + item page) or in breadth (pagination) - this is the
minimal requirement to get a positive grade. You don’t have to implement both, but
that would be good practice.
You can use any library / framework, but if you use scrapy, bs4, selenium,
requests-html please use recommended python project structure.
It must contain a config file - you decide what parameters need to be configurable;
simple options: url, selectors, port, logging level, how much time to wait before
accessing the next page, whether to print scraped information to the console or to a
file or both, etc.
It must log errors to a centralized file - at least one log file, for example
main.log.
Code is hosted in github (can be private, but please invite the teacher as a
collaborator to verify the project) with at least 3 commits, containing a readme
file with launch instructions (document how to launch the project easily) and
requirements.txt (or equivalent).
Optional requirements (future lectures will cover these topics):
Write some unit/integration tests (pytest)
Incorporate excel file processing
Save values to a database
Anything else you might think of or want to try…
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1K85NPFcshE8nwBol24vUPjF0wm-b7cg5.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Classes and objects
Class methods
01
02
03
Class variables
Python Crash Course
00
The essence of object-oriented programming
04
Static methods
05
Inheritance
06
07
Multiple inheritance
Encapsulation
08
Composition
09
Polymorphism
The essence of object-oriented programming
Python Crash Course
Object-oriented programming is one of several programming paradigms (a separate
axis from declarative vs. imperative):
Procedural (structured) programming: splitting the program into functions and using
if / switch and other blocks. There are no classes / objects that hold data and
methods together. This is what we have done so far.
Functional programming: functions as variables, immutable data structures, pure
functions, declarativeness (SQL is a declarative language, so is Regex; declarative
style is contrasted with imperative style: map() + filter() vs. a for() loop).
Object-oriented programming: everything is modeled with objects, as in the "real
world" (objects have properties and methods). To produce an object we use a
template called a class. There are 4 (5?) main OOP principles, which we will
review shortly.
Today, many languages do not fall neatly into these 3 categories (or the categories
of declarative vs. imperative style). Many languages are hybrid / multiparadigm
languages, e.g.: Python - for loops (procedural style), classes/objects (oop),
closures/lambda/map (functional).
These paradigms are for humans (mainly productivity) - computers / CPUs do not care
(maybe compilers and interpreters do). Creators of OOP thought that huge enterprise
apps will be much easier to understand in OOP style (although that is not where the
benefits came from - garbage collection was much more productivity enhancing).
The essence of object-oriented programming
Python Crash Course
What is a class? A class holds data and methods, i.e., it encapsulates properties
and behavior under a single name. So we can simply treat a class as a "container"
or a grouping of separated data and the methods related to it. That is the first
way a class can be defined - static classes fit this definition, but not the
second one.

A class can also be a template for objects. From a class we can produce many
objects that share the same structure (structure = properties + methods) -
conforming to the same shape - but with different values of that data. E.g.: an
Employee class. Every employee will have an id, first name and last name - but the
concrete data will be different and often unique (or at least specific) to each
object. Classes are often explained with the cookie-cutter analogy.

Spectrum: from a dumb class to an object template.


The essence of object-oriented programming
Python Crash Course
A "self-respecting" programming language should let you use the 4 OOP features.

Encapsulation


the object holds methods and data inside itself and can control how the data is
accessed (private, public).
we often simply talk about setter and getter (accessor / mutator) methods as a use
of encapsulation.
Inheritance
a code-reuse principle: the child class inherits data and methods from the parent
class and no longer needs to declare them itself. Supports the D.R.Y. principle in
programming (functions also make the code DRY-er). Bottom-up.
we can also think through the prism of abstract modeling, where inheritance is
defined through classifications of objects, but students often just get lost that
way (Animal → Cat → Tiger, Vehicle → Car, Vehicle → Motorbike, Person → Employee →
Teacher). Top-down.
IS-A relationship / test. Employee is a Person.
Polymorphism
(subtyping) substitutability between the child and the parent class. Impossible
without inheritance.
a child object is passed into a method even though the method is declared as
accepting the parent - polymorphic functions.
a child object is placed into an array / some collection with a type declaration
even though the declared type is the parent - polymorphic collections.
Composition / aggregation
an object holds another object as a field in order to fulfill its functionality.
HAS-A relationship, with the classic Car → Engine example.
More examples: Person → Fullname { firstname, middlename, lastname, initials }
(a class, not a string)
There are other principles in the literature, like Abstraction (interfaces, ABCs?),
Aggregation (as opposed to Composition), Association - but these are either very
technically similar (like Composition, Aggregation, Association) or too general.
Abstraction is not included because this principle is not specific to OOP; it is a
very general principle of all programming! The same way we would not associate DRY
or Decomposition with OOP only!
Classes and objects
Python Crash Course
Declaring classes in Python (PascalCase). Template:

class ClassName:
    pass

You can use pass if you do not want to write out the whole class at once.


To create an object from a class we call the constructor. The constructor
initializes the object's properties, allocates memory for the object, and returns a
reference to the object, assigned to a variable through which we will be able to
access the object. If no constructor is declared inside the class, a default
constructor is implied.

object_name = ClassName()

such a constructor can be called and is provided by the python interpreter


for free; it does not need to be written out.
after creating an object, we can dynamically assign properties and their values to
it - such properties are unique to each object and are called "instance
properties":

def __init__(self, age):


    self.age = age

however, if we want to assign property values at object creation time - as is
usually most convenient - we have to define the __init__ method. It is our first
"double underscore" (or "dunder") method, sometimes called init, the constructor,
or dunder init.
the variable self points to the object produced from the class in which the methods
are called (similar to this in other programming languages).
Class variables
Python Crash Course
These are variables that belong to the class and are shared by all objects produced
from that class.
In other languages they are called static variables (although Python also has
static variables).
They are contrasted with instance properties / variables (instance == object).
Class properties are initialized when the class is created, so they cannot depend
on things that are initialized later, e.g., instance properties.
Instance properties belong to each object individually. Their values are unique to
that object - just as your first and last name are assigned to you and another
person will have a different first and last name, so the instance methods and
properties of an object may differ from those of all other objects (of course, they
may coincide, but they do not have to).
Class properties belong to the class, i.e., the template from which those objects
are made. Class properties are shared by all objects created from that class. If we
changed such a property, the change would be reflected in all objects produced from
that class. If we changed it through the object name, e.g., emp1.a = 5, a new
variable would be created at the instance level / scope.
A class can hold both class and instance variables at the same time (they are not
exclusive, they can coexist).
Class variables
Python Crash Course
Class variables
Python Crash Course
A Python class variable exists at the class level:
print(Classname.__dict__)
print(objectName.__dict__)
… the name lookup rules, however, are: if the interpreter does not find the
variable inside the object, it looks for that variable inside the class. This fits
the general Python name resolution order - LEGB (local, enclosing, global,
builtin), see:
https://realpython.com/python-scope-legb-rule/#class-and-instance-attributes-scope
class variable - local scope
object_name.class_var_name → local scope (object) to enclosing scope (class)

x = 5

def test():
    # in this case we would reach even the built-in scope
    print(abs)

test()

x = 5

def test():
    # x will be searched in local scope (function),
    # ... then enclosing scope (which is also global scope)
    print(x)

test()
Class variables
Python Crash Course
This is important to understand, because if we try to change a class property we
have two options: change it through the class name or through the object name. If
we change it through the object name, a new variable will be created in the
object's scope (fields can be assigned to objects dynamically).

So, if we want raise_amount to be overridable for a single object as an exception -
so that we could apply an exception to one object - we will keep self.raise_amount
in the class. This way it is easy to implement a "default case".
Class variables
Python Crash Course
In a situation where we cannot think of a reason to override the class variable for
one or a few objects, we will use the class name (a minimal sketch follows below):
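A minimal sketch of both situations, using an Employee / raise_amount example (the names and values are illustrative):

class Employee:
    raise_amount = 1.04              # class variable, shared by all instances

    def __init__(self, name, salary):
        self.name = name             # instance variables, unique per object
        self.salary = salary

    def apply_raise(self):
        # self.raise_amount falls back to the class variable
        # unless it was overridden on this particular object
        self.salary = self.salary * self.raise_amount

emp1 = Employee("Jonas", 1000)
emp2 = Employee("Ona", 1200)
emp1.raise_amount = 1.10             # creates a NEW instance variable on emp1 only
Employee.raise_amount = 1.05         # changes the shared class variable (emp2 sees it)
print(emp1.raise_amount, emp2.raise_amount)   # 1.1 1.05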
Class methods
Python Crash Course
Class methods are created with the @classmethod decorator.
This way, the first variable passed into such a method will be the class, not the
object self.

A method marked with the @classmethod decorator can also be called on an object,


but this is usually not done (it is confusing)
@classmethods are sometimes used as alternative constructors, in situations where
we want to create objects from data we receive from files, external services, etc.
This is used, for example, in the datetime.py library:
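A minimal sketch of an alternative constructor, in the spirit of datetime.date.fromtimestamp(); the from_string format here is an assumption for illustration:

class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    @classmethod
    def from_string(cls, emp_str):
        # cls is the class itself, so this also works correctly for subclasses
        name, salary = emp_str.split("-")
        return cls(name, int(salary))

emp = Employee.from_string("Jonas-1000")   # alternative constructor
print(emp.name, emp.salary)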
Static methods
Python Crash Course
Instance methods automatically receive the object they are called on as a
parameter, and that is how they are recognized.
Class methods receive the class and are also recognized by the @classmethod
decorator.
Static methods receive nothing automatically, but they are declared inside the
class and are recognized by the @staticmethod decorator.
We use them when the method's logic is logically related to what our class does or
represents in the program, but inside the method we are independent of both class
and instance variables (we use neither self nor cls).
Static methods are useful https://www.journaldev.com/18722/python-static-method -
useful for utility methods (simplest example: a Calculator class, sketched below).
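A minimal sketch of that Calculator example:

class Calculator:
    @staticmethod
    def add(a, b):
        # no self, no cls - the method only groups related logic under the class name
        return a + b

print(Calculator.add(2, 3))   # called on the class, no object needed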
Static methods
Python Crash Course
To summarize:
If no objects will be created from our class, we can leave everything as
@staticmethod or @classmethod.
If we will be creating objects, the data will differ from object to object (we have
multiplicity). In that case we make that data non-static (i.e., instance).
If a method does nothing with instance or class variables, it can be static (the
most important things in a class are the variables - data / fields).
A calculator is often made as a static class, or some Util class with methods that
simply compute something.
If we need to preserve uniqueness between objects (e.g., Person with names), we
create instance properties!
Properties (data) dictate what the methods will be.

Common examples of classes with static methods (dumb classes):


Util: // XYZUtil:
DbUtil:
Validator:
Processor: StringProcessor
SomeXCalculator:

Common examples of classes from which we create objects (data classes, object


templates):
Person (we initialize people information from a DB)
Employee (we initialize personnel information from a DB or a CSV file)
Article (we scrape delfi data and create a "list of article objects")
City / Item / Product / Offer / Doctor / Appointment and other objects meant to
represent problem domains / spheres (domain objects / domain models).
All classes whose objects will be numerous - many Employees, many Cities, Items,
Products, ShoppingCarts - will be classes that use non-static properties (and,
correspondingly, methods).
etc.


Inheritance
Python Crash Course
Inheritance
a code-reuse principle: the child class inherits data and methods from the parent
class and no longer needs to declare them itself.
supports the DRY (don't repeat yourself) principle in programming (no boilerplate,
bottom-up design).
we can also think through the prism of abstract modeling (more of a top-down
design), but students often just get lost that way (Animal → Cat → Tiger or
Person → Employee → Teacher). IS-A relationship.
Books often start talking about abstract categories that real-world things belong
to in order to explain inheritance. However, a more productive way is often to
simply demonstrate how code gets reused (DRY).
Inheritance
Python Crash Course
Method resolution order (MRO):

We customize the inheriting / child class - we change a class property and override


the constructor, reusing the parent constructor:

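A minimal sketch of this (the class names and values are illustrative):

class Employee:
    raise_amount = 1.04

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

class Developer(Employee):
    raise_amount = 1.10                    # override the class property

    def __init__(self, name, salary, language):
        super().__init__(name, salary)     # reuse the parent constructor
        self.language = language

dev = Developer("Jonas", 1000, "Python")
print(Developer.__mro__)                   # Developer -> Employee -> object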
Inheritance
Python Crash Course
Let's make one more class:
We can check whether an object is an instance of a given class with the
isinstance() function.
We can check whether two classes are related by inheritance with the issubclass()
function.
We can find out the name of the class an object belongs to in various ways:
https://stackoverflow.com/questions/510972/getting-the-class-name-of-an-instance

Multiple inheritance
Python Crash Course
Python is one of the few languages that support multiple inheritance.
Ref: https://www.programiz.com/python-programming/multiple-inheritance
In that case the MRO: https://www.programiz.com/python-programming/multiple-
inheritance#resolution uses C3 linearization to resolve the MRO:
https://stackoverflow.com/questions/55692832/method-resolution-order-mro
The MRO explains why you may need to use ClassName.__init__() vs. super().__init__(),
see: https://stackoverflow.com/a/42413830/1964707
Let's look at the types of inheritance in general:

Multiple inheritance
Python Crash Course
If you subclass (verb) a class that has multiple inheritance, your super() can in
fact delegate to a sibling, not a parent! When using multiple inheritance, refer to
the super constructors using the class names directly in the entire inheritance
chain: Employee.__init__()
The Diamond Problem:
https://en.wikipedia.org/wiki/Multiple_inheritance#The_diamond_problem
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"{self.name} is {self.age} years old"

class Employee(Person):
    def __init__(self, name, age, rate, num_of_hours):
        print('Employee%s' % super().__init__)
        super().__init__(name, age)
        self.rate = rate
        self.num_of_hours = num_of_hours

    def show_finance(self):
        return self.rate * self.num_of_hours

class Student(Person):
    def __init__(self, name, age, scholarship):
        super().__init__(name, age)
        self.scholarship = scholarship

    def show_finance(self):
        return self.scholarship

# class Employee(Person):
#     def __init__(self, name, age, rate, num_of_hours):
#         Person.__init__(self, name, age)
#         self.rate = rate
#         self.num_of_hours = num_of_hours
#
#     def show_finance(self):
#         return self.rate * self.num_of_hours
#
#
# class Student(Person):
#     def __init__(self, name, age, scholarship):
#         Person.__init__(self, name, age)
#         self.scholarship = scholarship
#
#     def show_finance(self):
#         return self.scholarship

class WorkingStudent(Employee, Student):

    def __init__(self, name, age, rate, num_of_hours, scholarship):
        Employee.__init__(self, name, age, rate, num_of_hours)
        Student.__init__(self, name, age, scholarship)

    def show_finance(self):
        # self.num_of_hours is set by Employee.__init__ above
        return self.rate * self.num_of_hours + self.scholarship

print(WorkingStudent.__mro__)
os4 = WorkingStudent("Monica", 24, 9.5, 70, 500)
Inheritance
Python Crash Course
Summary:
How do we know when to use inheritance? First of all, we have to make sure we
should be using OOP at all. For small programs it is sometimes unnecessary. OOP was
created as a layer of abstraction for managing large projects. For scripting and
small programs it would be entirely sufficient to represent data with collections
(e.g., a list of dicts) and the logic with "free floating" / global functions.
If we already see that we have lots of different data (domain objects) that needs
to be represented - the classic example is data taken from tables about Employees,
Account, ShoppingCartItem, Item, Auction, SalesEvent, Event, Building.
For existing projects - use whatever they use and whichever way they use it. Even if
it means learning to do things incorrectly.
And once we have decided to use OOP, inheritance is applied either "bottom-up" or
"top-down". Top-down - it will become clear while planning. Bottom-up - you are
simply writing the application and you see that there are 3 classes ("the rule of
3") that are very similar (Employee, Customer, GuestUser) and all have id, name,
surname - so you can save code by making a superclass Person that will hold the
common properties (id, name, surname); the specific properties and the methods that
process them will live in the separate classes.
Encapsulation
Python Crash Course
Encapsulation is a class implementation principle where the object's data and
methods are encapsulated, i.e., placed inside it.
Changing and reading the data is done in a controlled way, using special methods
and access modifiers.
This is often associated with getter and setter (accessor and mutator) methods in
other languages.
In Python, decorators are used to implement getters and setters:
The property decorator lets us declare a method that we can call like a property
with the dot operator.
The deleter is used to destroy / null out the data inside the object.
More:
https://www.geeksforgeeks.org/getter-and-setter-in-python/
https://stackoverflow.com/questions/2627002/whats-the-pythonic-way-to-use-getters-
and-setters

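A minimal sketch of the @property / setter / deleter decorators described above:

class Person:
    def __init__(self, name):
        self._name = name            # "protected" by convention

    @property
    def name(self):                  # getter: person.name
        return self._name

    @name.setter
    def name(self, value):           # setter with validation: person.name = "X"
        if not value:
            raise ValueError("name cannot be empty")
        self._name = value

    @name.deleter
    def name(self):                  # deleter: del person.name
        self._name = None

p = Person("Jonas")
p.name = "Petras"    # calls the setter
print(p.name)        # calls the getter
del p.name           # calls the deleter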
Encapsulation
Python Crash Course
Python also has access modifiers. There are 3: public, protected and private:
vars with public access modifiers can be accessed anywhere inside or outside the
class. They are written normally: self.name
protected members are written with a single underscore: self._name. By convention
they are not accessed from outside the module, although the interpreter allows it;
@property is often used to expose them in a controlled way.
private variables can only be accessed inside the class - written with a double
underscore: self.__name
Ref: https://stackabuse.com/object-oriented-programming-in-python/#accessmodifiers
In general, the private / public distinction is not as important here as in Java /
PHP or other OOP languages. Some books / tutorials on the internet do not even
mention it. By comparison, in other languages any decent tutorial would have a
sizable discussion about it.
Private methods exist as well.
"In python, nothing is truly private", see:
https://stackoverflow.com/questions/70528/why-are-pythons-private-methods-not-
actually-private


Composition
Python Crash Course
An object holds another object inside itself as a field in order to fulfill its
functionality, delegating part of it.
HAS-A relationship, with the classic Car → Engine example (a car has an engine).
Passing one object into another from the outside, through a constructor or a setter
method, is called Dependency Injection (DI). It is a creational design pattern.
Depending on whether the dependency is passed through the constructor or a setter,
2 DI types are distinguished:
setter injection
constructor injection
A single class can have either of the DI types, or both.
The dependency can be called the "dependent object". The Engine belongs to the Car.
Ref: https://realpython.com/inheritance-composition-python/#composition-in-python
Simple variable injection is usually not considered composition in the literature.
Composition
Python Crash Course
Sometimes we distinguish between aggregation and composition.
Composition: when a class creates a dependent object inside of itself (inside a
constructor).
Aggregation: when a class receives a dependent object externally (dependency
injection).
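A minimal sketch of both flavors with the classic Car → Engine example:

class Engine:
    def start(self):
        return "vroom"

class CarComposition:
    def __init__(self):
        self.engine = Engine()       # composition: the car creates its own engine

class CarAggregation:
    def __init__(self, engine):
        self.engine = engine         # aggregation: constructor injection (DI)

car = CarAggregation(Engine())       # the dependency is passed in from the outside
print(car.engine.start())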
Polymorphism
Python Crash Course
Many-formedness (from Greek).
Substitutability between the child and the parent class (polymorphism through
inheritance).
a child object is placed into a collection with a type declaration even though the
specified type is the parent (polymorphic collection).
a child object is passed into a method even though the method accepts the parent
(polymorphic function)
This is called subtype polymorphism.
Polymorphism is harder to demonstrate in dynamically typed languages, because there
you can put any type into any collection and function parameters let everything
through.
… but now we know the definition and usage of this concept.
Ref: https://www.geeksforgeeks.org/polymorphism-in-python/
---
Polymorphism always goes "down the inheritance hierarchy": if a class inherits
something from a parent class, we can rely on that in polymorphic collections as
well as polymorphic functions and use the inherited methods and properties.
Some sources distinguish dynamic and static polymorphism (method overloading) and
explain polymorphism as the process by which the required method is resolved
(bound) or called. My explanation, as you can see, is somewhat different, but it
comes down to the same thing - after all, the substitutability between the parent
and the child class (polymorphism through inheritance) relies on the methods being
inherited by the child class, so it can stand in for the parent in all situations.
Good discussion: https://softwareengineering.stackexchange.com/questions/335704/how-
many-types-of-polymorphism-are-there-in-the-python-language
Polymorphism
Python Crash Course
Substitutability in functions - in input and output parameters:

Polymorphism
Python Crash Course
Duck typing is an alternative to type hinting (... or static types).
Duck typing: if it quacks like a duck …
In python we can pass objects to functions, and as long as the object being passed
has what the function uses, there is no issue.
We lose the duck typing capabilities if we restrict the types with type hinting
(mypy). But we regain some of that flexibility from polymorphism.
So duck typing is (arguably) a non-inheritance based polymorphism - it does not
pass the isinstance() test, but allows for liberal parameter passing and
containment of various types inside collections (a minimal sketch follows below).
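A minimal sketch of duck typing - no inheritance and no isinstance() check:

class Duck:
    def quack(self):
        return "quack"

class Person:
    def quack(self):
        return "I'm quacking!"

def make_it_quack(thing):
    # no type check: anything with a .quack() method is accepted
    print(thing.quack())

make_it_quack(Duck())
make_it_quack(Person())   # works - an unrelated class with the same "shape"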
Summary
Python Crash Course

Magic methods
Python Crash Course
The plus (+) operator in Python is overloaded - adding two numbers gives different
semantics (with the same syntax) than adding two str objects.
What happens if we use the addition symbol between two objects we wrote ourselves?
We can define that with dunder - double underscore - methods.
We have already seen the dunder init method - the constructor.
Other methods:
__repr__() → a representation of the object for logging and debugging purposes,
meant for developers. It is recommended to return an expression that would recreate
the object via a constructor call.
__str__() → a human readable, reader-friendly representation of the object('s
internals) (like toString() in other languages).

__add__() → used less often, defines the behavior of the + operator

__len__() → len(object), class Team: ???, class Flight: ???


in python libraries we will see dunder methods being reused
Most commonly used: __init__, __repr__, __str__ .
Use them carefully - it is often better to have a dedicated function whose name
says what it does than to overload operators. E.g., rather than using __ge__
between two employees to see which one is "greater", it is better to have a custom
function compareBySalary() or similar. Customizable operator overloading is one of
the distinguishing marks of languages that trust programmers (C++), although
overloading itself is widespread - in most languages we will see "+" or "/" working
with several types.
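A minimal sketch of __repr__, __str__ and __add__ together (the Money class is made up for illustration):

class Money:
    def __init__(self, amount):
        self.amount = amount

    def __repr__(self):
        # developer-oriented: an expression that would recreate the object
        return f"Money({self.amount})"

    def __str__(self):
        # human-readable representation
        return f"{self.amount} EUR"

    def __add__(self, other):
        return Money(self.amount + other.amount)

total = Money(10) + Money(5)   # calls __add__
print(total)                   # calls __str__ -> 15 EUR
print(repr(total))             # -> Money(15)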

Magic methods
Python Crash Course
Ref: https://levelup.gitconnected.com/python-dunder-methods-ea98ceabad15
When we start supporting a single comparison operation, it is advisable to support
them all or most.
Magic methods
Python Crash Course
SQ: where we can see __repr__ in action:
in the REPL
in colab, when you do not print(custom_obj)

Object comparison
Python Crash Course
Objects are compared with the __eq__ dunder method. This method decides the
semantics of object comparison, i.e., what it means for two of our objects to be
equal.
By default the objects' id()s are compared, but we often want to compare whether
objects are equal by their values (or greater, smaller, etc.)
Ref: https://stackoverflow.com/questions/1227121/compare-object-instances-for-
equality-by-their-attributes
Ref: https://www.kite.com/python/answers/how-to-compare-two-objects-in-python
Object comparison
Python Crash Course
Remember the question - which two operations are important for sorting: comparison
and swap.
Once we know how to compare objects, we can start talking about collections of
them, and then about operations on object collections.
Let's look at these operations:
sorting → requires the __eq__ and __lt__ dunder methods (for a plain .sort(),
without a key)
searching → requires the __eq__ dunder method, more variations:
https://stackoverflow.com/questions/9542738/python-find-in-list . When we define
__eq__ we can use the in operator or the list.index() method for searching.
Digression: set search is much faster than list search, because a set uses hashing.
But a set does not allow duplicate values.
mapping → same as simple variables
filtering → same as simple variables
Remember that, depending on how the data is represented, with OOP we can perform
operations on various object properties:
those that are simple scalar variables (int, str and the like),
those that are collections: lists, dicts (the sum or the average of a list - an
employee's salary, the number of completed tasks, a student's grades, etc.)
those that are internal objects related through an aggregation / composition
relationship.
those that are various combinations of the three things above (a minimal sketch of
sorting and searching follows below).
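A minimal sketch of __eq__ and __lt__ enabling sorting and searching:

class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def __eq__(self, other):
        return self.salary == other.salary   # equality by value, not by id()

    def __lt__(self, other):
        return self.salary < other.salary    # enables plain .sort() / sorted()

staff = [Employee("A", 1200), Employee("B", 900), Employee("C", 1500)]
staff.sort()                                 # uses __lt__
print([e.name for e in staff])               # ['B', 'A', 'C']
print(Employee("X", 900) in staff)           # True - the in operator uses __eq__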
Object comparison
Python Crash Course
Define __hash__() if you want to use your objects as keys in dict or as values in
set.
Rule: if __eq__() returns True for two objects they should have the same __hash__()
value.
If objects are hashable (i.e., have a __hash__() method defined) they should be
immutable. Why? Because when you add a hashable object into a dictionary it is
placed into a particular position by its hash value, and if you change the internal
object properties later, the hash will change and you will not be able to retrieve
the key, because the hash will be different due to the changed property values.
OOP criticism
Python Crash Course
https://www.youtube.com/watch?v=goy4lZfDtCE
https://www.youtube.com/watch?v=QM1iUe6IofM
https://www.youtube.com/watch?v=tD5NrevFtbU
Course plan
Here you can get familiar with the course plan
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
You will find the tasks, past slides, etc.

--- Content from 1OPUBfBRNFVXl2p8wvq7NVUn3TuSOviMh.pptx ---


Artificial Intelligence
Neural Networks for Tabular Data
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


What is tabular data
01
02
Tabular data types
Neural Networks for Tabular Data
00
Structure of Part 6
Problems with Tabular data
03
04
Process of working with Tabular data
05
06
07
Deep learning advantages
Classical ML for tabular data
08
Current Architectures
Summary
09
Resources
6.1 - Introduction to DL for tabular data
6.2 - Regression (one-hot encoding, dropout) and Classification (multiclass and
binary)
6.3 - Fast.ai and categorical embeddings
Structure of Part 6
Neural Networks for Tabular Data
Tabular data (TD) are the type of data you might see in a spreadsheet, CSV file,
SQL database, a Pandas DataFrame[1].
They are usually arranged in rows (examples, instances) and columns (features,
attributes). Many of the datasets that companies want to extract value from are
this type of dataset — e.g. sensor readings, clickstreams, purchase histories,
customer management databases — rather than just images or just text.

Some call this structured data or relational data. I would argue that we should call it structured only if it complies with 1NF from relational theory; however, this is not an established definition.
What is tabular data
Neural Networks for Tabular Data
[1] Even this can be argued against as this tweet shows:
https://twitter.com/math_rachel/status/990375128314736640
Good / bad loans, overpriced / underpriced flats, etc.
We have seen many tabular datasets already!
CC Fraud
California and Boston house price prediction
Wildfire area prediction
Adult dataset for 50K salary prediction
What is tabular data
Neural Networks for Tabular Data
Every column can be of a different category of data. Most general categories -
quantitative (numerical) vs. qualitative (categorical) data. See:
https://studyonline.unsw.edu.au/blog/types-of-data . Google’able term: datatypes in
statistics.
Tabular data types
Neural Networks for Tabular Data
In statistics, there are four data measurement scales: nominal, ordinal, interval
and ratio.
Nominal - simply labels w/o quantification, names of things (cat, dog), like hair
color.
Dichotomous (2 or N possibilities)
Non-overlapping (south, north)
Sometimes we might want to split categories: wagon - passenger and cargo; maybe the model would work better if we split cargo wagons into heavy cargo vs. light cargo.
Ordinal
implies order
no scale to compare - what is the difference between happy and very happy?
does not have a mean, but does have mode (most frequent) or median (middle of
sorted set).
more examples: https://www.graphpad.com/support/...
Interval
order and scale - the difference between 50cm and 60cm is the same as between 60cm and 70cm
[TODO] no true zero - 0 does not mean absence (20 degrees C is not twice as hot as 10 degrees C; don't believe that? Convert to Fahrenheit).
impossible to calculate ratios, but central tendency and dispersion (stddev) can be calculated
examples: time of day is an interval measure, duration (a time interval) is not; temperature is interval data. 2x 10:00 is not 20:00, but 2x 1h = 2h (duration is ratio data).
Ratio - numbers
both descriptive and inferential statistics can be applied
examples: weight, height, income, duration, speed
Tabular data types
Neural Networks for Tabular Data
These data scales can be understood as concentric circles:
N > O > I > R
…. can they?
Tabular data types
Neural Networks for Tabular Data
https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-
ratio/
Tabular data types
Neural Networks for Tabular Data
https://www.mymarketresearchmethods.com/data-types-in-statistics/
Tabular data types
Neural Networks for Tabular Data
Cardinality is the measure of uniqueness in the set.
High cardinality - highly unique values, most values differ.
Low cardinality - highly repeating, most values are the same.

In the relational database world we know that indexes should be added on columns with higher cardinality - ideally unique and normally distributed values - to provide maximal partitioning of the data
Cardinality is sometimes useful when choosing which encoding to use (one-hot encoding on high-cardinality data might not be optimal)
Tabular data types
Neural Networks for Tabular Data
Note: tabular data can contain text, images (image url), audio and other types of
data or references to some other type of data. To deal with this we have 2 options:
eliminate that data if we don’t know how to feed it to the model / model it.
multi input model
Example: dermoscopy image together with tabular (meta)data.
Ref: https://towardsdatascience.com/integrating-image-and-tabular-data-for-deep-
learning-9281397c7318
Tabular data types
Neural Networks for Tabular Data
During the 2010s, deep learning revolutionized computer vision and natural language
processing, but plain old tabular datasets have proved a tougher nut to crack.
In general, deep learning (neural networks stacked in many layers, sometimes
hundreds or thousands of them) is effective because it can learn deep hierarchical
representations of data.
Language and the visual world have structure that can be analysed at hierarchical levels (words, phrases / edges, corners) and at higher levels (sentences, grammar / objects, relationships between objects). Images are "structured" data because they have local structure (nearby pixel values tend to be highly correlated), so that, for example, convolution operations can model them well - this is not the case with tabular data, where one row can be completely different from another and a group of records can be very diverse.
Before deep learning started to be effective in the 2010s, language processing and image analysis relied on hand-crafted features that reflected certain properties of the data; today, models like BERT (for language) and DenseNets (for image analysis) are able to learn very informative representations of the data, removing the need for feature engineering.
In addition, images and language data have local structure which lends itself well to certain types of operations, such as convolutions, which are implemented in all standard neural network libraries.
For tabular data, there is generally speaking no local or hierarchical structure
(although there could be, in specific cases). For this reason, many people think
that deep learning is irrelevant to tabular data. Instead, past experience seems to
indicate that versions of decision tree ensembles (random forests, gradient
boosting etc.) are the most reliable methods for tabular data.
But we have a desire not to multiply approaches, but to reduce them - deep learning provided the hope of reducing the model zoo; however, if we can't integrate tabular data, then there is no reduction.
Problems with Tabular Data
Neural Networks for Tabular Data
** Note: the lack of a SOTA architecture means that AutoML for DL on tabular data is more complicated.

Problems with Tabular Data


Neural Networks for Tabular Data
Researchers have proposed a few potential benefits of using DL for tabular data:
It might turn out to work better, especially for very large datasets. (The very
interesting Applying Deep Learning to AirBnB Search paper mentions that AirBnB uses
gradient boosting for small to medium sized problems and DL for large ones.)
Deep learning unlocks the possibility to train systems end-to-end with gradient
descent, so that image and text data can be plugged in without changing the whole
pipeline. Uniformity - have one single model with multiple layers learning from
different input types and connecting them together to make a prediction (e2e
training). Image + metadata = DL can be a good choice.
It is easier to use deep learning models in an online mode (with data points arriving one by one, "streaming" rather than all at once), because most tree-based algorithms need global access to the data to determine split points. Gradual learning without full retraining is easy with DL.
DNNs can learn embeddings of categorical features: entity embeddings. We will talk about them extensively later on. This is one of the most powerful reasons to use NNs for tabular data. Could we train a classical ML model on the embedding layer of a NN?
Also, we can often create a more powerful ensemble model by including a DNN. So using DNNs for tabular data gives us more freedom for ensemble learning.
However, I can see at least one drawback:
Deep learning models are often complex and rely on extensive hyperparameter
optimization, which is much less of a problem in random forests and gradient
boosting — they often perform quite well without much parameter tuning!
Problems with Tabular Data
Neural Networks for Tabular Data
Ref: https://www.youtube.com/watch?v=WPQOkoXhdBQ
Process of working with Tabular data
Neural Networks for Tabular Data
Missing data?
Just drop those data points with missing values (deletion).
Nominal / ordinal - just make up a label "missing" (increases cardinality), predict the value from the other columns with an ML/DL model ("predictive imputation"), or use the most frequent value.
Numeric (interval or ratio) - impute the median and then add another column recording whether the value was missing (imputed) or not. This increases the size, but in DL we can represent it with compressed sparse matrices.
Timeseries - interpolation - the average of the two (or many) surrounding values. A sketch follows the references below.
Ref: https://medium.com/@danberdov/dealing-with-missing-data-8b71cd819501
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/
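The sketch mentioned above - median imputation plus a missingness indicator - could look like this (the DataFrame and its column names are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["Vilnius", None, "Kaunas"]})

# numeric: impute the median and keep a "was missing" indicator column
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# nominal: treat missingness as its own label
df["city"] = df["city"].fillna("missing")
print(df)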

How to deal with categorical data?


Many ML/DL algorithms are unable to operate on categorical or label data directly.
Hence, they require all input variables and output variables to be numeric. This
means that categorical data must be converted to a numerical form. NNs want to
multiply matrices, they need numbers!
Encoding the values!
Preparation
Neural Networks for Tabular Data
Encoding data
Dealing with nominal / ordinal data ("labels")? → Multiple options!
Dealing with text? → Multiple options!
Dealing with numerics? → Normalization / standardization (no encoding)!
Dealing with dates? → Multiple options!
Let's take a look at how to deal with each particular type, one by one.
Preparation
Neural Networks for Tabular Data
Nominal / ordinal data needs to be encoded!
Types of encodings: Link1 ; Link2
Encoding based on frequency (frequency encoding)
Encoding ordinally (integer / label encoding)
Learning feature embeddings (... only obtained during the training process, so we still need to use the other encoding techniques).
Encoding based on correlation with the target (WoE)
Mean / target encoding
Ordered integer encoding
One-hot encoding (this one is potentially the most popular)
Preparation
Neural Networks for Tabular Data
Frequency encoding
Works by counting how many items each category has and then replacing, at every position of the column, the category with its frequency in the original column
Disadvantage: loses information if multiple categories have the same frequency (imagine if the blue and red values had the same number of items).

Encoding ordinally (ordinal / integer / label encoding)

Simply assigning a number to each category and replacing the categorical column with those numbers
Disadvantage: imposes order artificially if the underlying data is only nominal
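A minimal pandas sketch of both techniques on a made-up color column:

import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# frequency encoding: replace each category with its count
freq = s.map(s.value_counts())            # red -> 3, blue -> 2, green -> 1

# integer / label encoding: assign an arbitrary integer per category
labels = s.astype("category").cat.codes   # blue -> 0, green -> 1, red -> 2

print(pd.DataFrame({"value": s, "freq": freq, "label": labels}))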
Preparation
Neural Networks for Tabular Data
WoE - weight of evidence
Tells the predictive power of an independent variable in relation to the dependent
variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers.
Link1 and Link2
https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-
encoder-a0f1733a4c69
Preparation
Neural Networks for Tabular Data
Mean / Target encoding
Disadvantage: as with frequency encoding

Ordered integer encoding

One-hot encoding
Performs well in many situations
Typically used, popular
Disadvantage: with high-cardinality data it produces a lot of columns, which slows down learning significantly. Useless features also multiply, potentially harming the precision of inference
Preparation
Neural Networks for Tabular Data
Demo: one-hot and ordinal encoding, with the adult dataset as an example of a diverse dataset.
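One possible version of that demo, on a toy stand-in for two categorical columns of the adult dataset (the sparse_output argument assumes scikit-learn >= 1.2; older versions call it sparse):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"],
                  "education": ["Bachelors", "HS-grad", "Masters"]})

onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
print(onehot.fit_transform(X))   # one 0/1 column per category value

ordinal = OrdinalEncoder()
print(ordinal.fit_transform(X))  # one integer-coded column per feature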
Preparation
Neural Networks for Tabular Data
Entity embeddings are vectors in Euclidean space that lie close together when the underlying features they represent are "close" to each other. They are learned during model training - adjusted during backpropagation.
Two big advantages: they produce more precise models, and even high-cardinality columns can be represented with a small feature vector (less memory than one-hot encoding, which is especially relevant for high-cardinality columns).
Applications: word embeddings in NLP, collaborative filtering, encoding categorical features.
Short video overview: https://www.youtube.com/watch?v=186HUTBQnpY
Paper: https://arxiv.org/pdf/1604.06737.pdf
Article: https://medium.com/@apiltamang/learning-entity-embeddings-in-one-breath-
b35da807b596
We are going to talk more about it when we reach the last lecture of part 6 and NLP
part of the course.
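A minimal Keras sketch of an embedding layer for a single high-cardinality categorical column (the sizes are arbitrary placeholders):

import tensorflow as tf

n_categories, emb_dim = 1000, 16  # 1000 category codes squeezed into 16 dimensions
cat_in = tf.keras.Input(shape=(1,), dtype="int32")
emb = tf.keras.layers.Embedding(n_categories, emb_dim)(cat_in)  # learned lookup table
flat = tf.keras.layers.Flatten()(emb)
out = tf.keras.layers.Dense(1)(flat)
model = tf.keras.Model(cat_in, out)
model.compile(optimizer="adam", loss="mse")
# after training, model.layers[1].get_weights()[0] holds the learned embedding vectors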
Preparation
Neural Networks for Tabular Data
Embeddings came from NLP but are now also used for categorical data.
Preparation
Neural Networks for Tabular Data
Summary for encoding categorical values:
Preparation
Introduction to Deep Learning
We talked about how to encode different kinds of qualitative data (nominal and
ordinal).
What if the data contains text - one or more sentences, maybe even more?
Stacking DL models - NLP for tabular data using one model for just the text column and others for the numeric ones. But even when doing that, the problem does not disappear completely: we still need to feed "text" to a neural network.
Encoding w/ Universal Sentence Encoder
The Universal Sentence Encoder encodes text into high dimensional vectors that can
be used for text classification, semantic similarity, clustering and other natural
language tasks. The model is trained and optimized for greater-than-word length
text, such as sentences, phrases or short paragraphs.
https://tfhub.dev/google/universal-sentence-encoder/1
https://www.tensorflow.org/hub/tutorials/
semantic_similarity_with_tf_hub_universal_encoder see this
https://amitness.com/2020/02/tensorflow-hub-for-transfer-learning/

Featurize with pretrained NLP model (BERT) - take the embeddings from the
pretrained model, see: https://towardsdatascience.com/nlp-extract-contextualized-
word-embeddings-from-bert-keras-tf-67ef29f60a7b
Perform term frequency - inverse document frequency (TF-IDF) - a numerical expression of word importance - sometimes recommended.
Stacking w/ specialised text models (create specialised model used in prep).
A simple Count vectorization approach (scikit has that available)
We will talk about preprocessing of text data in the future part on NLP. For now,
just remember that dealing with text might not be as simple as it appears!
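A minimal scikit-learn sketch of the TF-IDF and count vectorization options on made-up sentences:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the flat is near the park", "noisy flat near the airport"]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)    # sparse matrix: documents x weighted terms
print(tfidf.get_feature_names_out())

counts = CountVectorizer()
X_counts = counts.fit_transform(texts)  # plain word counts instead of weights
print(X_counts.toarray())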
Preparation
Introduction to Deep Learning
For numerical data everything is as usual - normalizing / standardizing the data makes gradients more evenly distributed, helping convergence and reliability across datasets, so we usually do that.
Should we normalize encoded categorical (nominal and ordinal) data? The only case where it could potentially matter is when you use ordinal encoding for many "labels" (1 … 200). Then your standardized features and the encoded feature would differ by 2 orders of magnitude. Keep that in mind. But in general you can standardize / normalize.
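A minimal sketch of that scaling step, with made-up values for a numeric column and an ordinal-encoded column with labels 1…200:

import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[18.0], [35.0], [62.0]])
ordinal_labels = np.array([[1.0], [87.0], [200.0]])  # ordinal-encoded "labels"

scaler = StandardScaler()
print(scaler.fit_transform(ages).ravel())            # mean 0, std 1
print(scaler.fit_transform(ordinal_labels).ravel())  # now on the same scale as the rest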
Preparation
Introduction to Deep Learning
Dates - not only is this feature almost always transformed, but usually there is also a lot of feature engineering that can be done.
See: https://stackoverflow.com/questions/46428870/how-to-handle-date-variable-in-
machine-learning-data-pre-processing

Let's think about unix timestamps vs. split dates: 2001.06.16 → | 2001 | 6 | 16 | vs. 992649600. When using timestamps we are losing information about the month (unless you add it back), and the cyclical nature of time is no longer inferable.
Feature engineering tips (a sketch follows the list):
sometimes adding the day of the week is helpful for the model.
mark holidays / non-working days
When in doubt you can always just try creating a model without a particular column
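A minimal pandas sketch of such date feature engineering on made-up dates:

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2001-06-16", "2001-12-24"])})
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["dayofweek"] = df["date"].dt.dayofweek            # Monday = 0 ... Sunday = 6
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
print(df)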
Preparation
Introduction to Deep Learning
Our NNs will now have multiple layers and many neurons. They can "remember" stuff, so it is important to validate the models we train appropriately.
Not only is choosing the hyperparameters (activation function, layer count, etc.) important - shuffling, k-fold cross-validation and holdout are also imperative.
Do you remember what (k-fold) cross-validation is?
Train on k-1 folds, hold one as validation, average the scores.
https://www.youtube.com/watch?v=fSytzGwwBVw
https://www.datarobot.com/wiki/training-validation-holdout/
Helps us prevent hitting a biased training sample
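A minimal scikit-learn sketch of k-fold cross-validation (the diabetes dataset is only a convenient stand-in):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y, cv=5)  # 5 folds, R2 by default
print(scores, scores.mean())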
Preparation
Introduction to Deep Learning
Start with low-capacity network - start small to get a baseline. Few hidden
neurons.
Choose output activation function for classification:
Design
Introduction to Deep Learning
For regression, match the loss to the target distribution: uniform or Gaussian / normal - RMSE; Poisson regression - Poisson loss; if the output is "zero-inflated", use Tweedie (could we improve the forest fire area predictions even with the 0's kept?).
Sorting + binning helps see the target distribution; there are tests to determine what the distribution actually is (Shapiro–Wilk test, Kolmogorov–Smirnov test and others).
More: http://proceedings.mlr.press/v80/imani18a/imani18a.pdf
Scikit Tweedie regressor and keras tweedie loss:
https://datascience.stackexchange.com/...
Design
Introduction to Deep Learning
And the fact that we should adapt the loss functions using the target distribution
is very interesting.
We compared classification models not only using auROC or F1 score, but also the confusion matrix, to tell us which class of data we are most incorrect on. This informs our analysis of the residuals and can tell us how to tune our network.
We can do similar things for regression models by tracking how the distribution of targets is learned (rather than simply tracking RMSE or R2). Different models can have similar RMSE, yet the errors may reveal weakness in learning different parts of the distribution. If, for example, our model makes most of its mistakes predicting the lowest flat prices, we can say that this is a systematic (rather than random) error.
Making mistakes in a systematic way (on high/low flat prices) informs us that we might need more samples of low-priced flats, that we might want to look at feature engineering that helps predict low prices, or even have an ensemble model dedicated to predicting low-priced houses.
TODO :: implement tracking of distribution learning.
Design
Introduction to Deep Learning
What about hidden activation functions? So many to choose from!
ELU - how a negative value should be squashed is determined by the hyperparameter alpha. This means you can determine it in the tuning process
Design
Introduction to Deep Learning
Dropout: a percentage of neurons are turned off, but only during training.
This technique prevents overfitting. If you don't see overfitting, don't use it.

Additionally, dropout mitigates neuron death, as the image on the side explains (we could confirm that by training ReLU NN models).
Analogy with a sommelier.
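A minimal Keras sketch of dropout between dense layers (the layer sizes and the 0.3 rate are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.3),  # 30% of activations zeroed, during training only
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")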
Design
Introduction to Deep Learning
Ideally, before training a model, we would establish a benchmark using well-known techniques / models (multiple ones even). It would be used as a comparison.
Each model should be optimized (specific preprocessing, feature engineering) for a better benchmark. Automated HP tuning is also an option.
This is usually not done, but if you wanted the absolute best model, as in competitions, you would use multiple models anyway.
Training
Introduction to Deep Learning
** The table on the side does not indicate that models will always perform in this order. Linear models do not always perform better than KNN, nor XGB over other ensemble models. This is just an example of a table in which you might track your benchmarking results for model selection.
Tuning
Various options exist:
Random search
Grid search
Bayesian optimization
Genetic algorithms
AutoML
Random search
RandomizedSearchCV from sklearn.
Grid search
GridSearchCV from sklearn.
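A minimal sketch of random search with scikit-learn (the estimator, parameter grid and dataset are placeholders; GridSearchCV has the same shape with param_grid instead):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    n_iter=5, cv=3, random_state=0)          # 5 random combinations, 3-fold CV each
search.fit(X, y)
print(search.best_params_, search.best_score_)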
Training
Introduction to Deep Learning
Batch size:
Too small - underfitting and slow
Too large - more RAM, may never find the global minimum
… but that is old news.
The 1% rule - a new thing! (or find a better value with grid search).

Weight decay (L2 regularization, we will learn about it later):


Training
Introduction to Deep Learning
A complex loss surface due to high dimensionality is often encountered in complex tabular datasets.
With only linear regression in one variable, our loss depends on W and b.
A large NN with many parameters - as is usual for tabular data - will have a far more complex loss surface.
Training
Introduction to Deep Learning
How to find the optimal learning rate - the same as for all other data; tabular is no exception.
Get a baseline - the LR range test (LRRT) - a method for discovering the largest learning rate values that can be used to train a model without divergence.
One-cycle policy (integrated in fast.ai, custom in keras)
Binary search for the learning rate is also used.
Training
Introduction to Deep Learning
How many epochs to fit our OCP to? We can use binary search for that.
What are the risks when choosing epoch count: too many - overfit, too few -
underfit.
Training
Introduction to Deep Learning
Targeting aspects of the distribution during training. Do we capture the distribution of our target values with our predictions? This is more than just an RMSE score!
Loss tracking and generalization error (helps predict performance on real world
data).
Assessment
Introduction to Deep Learning
R2 score for regression: R-squared explains to what extent the variance of one
variable explains the variance of the second variable. So, if the R2 of a model is
0.50, then approximately half of the observed variation can be explained by the
model's inputs.
Assessment
Introduction to Deep Learning
Classification assessment with a confusion matrix.
Additional techniques - investigate which classes are confused with which others (by eliminating certain classes or just inspecting the data - would a human also be confused?)
Assessment
Introduction to Deep Learning
Compare with other models
If we are underperforming compared with linear models - something is very wrong.
If we are in the middle - we can consider tuning the model.
Performing worse than a tree model can mean discontinuities (multimodal distributions), which we could overcome by stacking with a tree-based model, by drawing manual boundaries where discontinuities are observed with a separate model on each split, or by ensembling specialized models.
Comparison in our example: Link
Comparison in our example: Link
Assessment
Introduction to Deep Learning
Width (what happens when we increase it)
Depth (what happens when we increase it)
Very small vs. very large NNs
Remember: huge NNs have chaotic loss surfaces, hence we need small LRs and must continuously test the learning rate with the LR range test. If we change the size of our NN we need to revisit LR, BS, EPOCH_COUNT
Skip connections to learn linear predictors.
Caution on batch norm (what is batch norm?)
Other tuning options
Introduction to Deep Learning
Thus far we only normalized the inputs.
Batch norm is a technique of normalizing the outputs of intermediate layers in a NN
so that the statistical distribution of inputs to each layer would not change
between iterations or epochs, see: https://www.youtube.com/watch?v=tNIpEZLv_eg and
https://www.youtube.com/watch?v=DtEq44FTPM4
Why? 3 reasons are generally given:
Speeds up training. How? Similarly to input normalization - by making the loss surface more uniform / symmetrical. https://arxiv.org/abs/1805.11604 However, nobody has proven that, so nobody knows the answer (the original paper indicated reduced internal covariate shift).
Decreases the importance of initial weights (suboptimal starts)
Regularizes the model a bit (introduces randomness into the network)
See explanation on why it works: https://www.youtube.com/watch?v=nUUqwaxLnWs
Question / research: are all layers with different activation functions affected by
batch norm the same way? ReLU [0, +inf) vs. sigmoid (0, 1)? Or is the effect
different?
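A minimal Keras sketch of one common placement - batch norm between a dense layer and its activation (placing it after the activation is also done in practice):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),  # normalize the pre-activation outputs
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")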
Other tuning options
Introduction to Deep Learning
//
Other tuning options
Introduction to Deep Learning
//
Summary
Introduction to Deep Learning
Other models need to be employed - classical ML for tabular data
Ref: https://www.youtube.com/watch?v=TeNgD82RFL4
Apprentice

Journeyman

Master
Resources
Introduction to Deep Learning
//
Resources
Introduction to Deep Learning
DANets: Deep Abstract Networks for Tabular Data Classification and Regression :
https://arxiv.org/abs/2112.02962
Applying Deep Learning To Airbnb Search https://arxiv.org/abs/1810.09591
Resources
Introduction to Deep Learning
New (2020.12 initial commit) pytorch-based framework for tabular data:
https://github.com/manujosephv/pytorch_tabular
Resources
Introduction to Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information
--- Content from 1XjEQk4ZRhth3PDtM8W5LRvIrHuvddOT2.pptx ---
Artificial Intelligence
Recommender Systems
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Types of recommender systems
01
02
Popularity based
Recommender Systems
00
Definition and usecases
Content based
03
04
Collaborative filtering
Deep learning models
07
08
Pitfalls of recommenders
06
Hybrid approach
09
Advanced models
10
Impact of recommender systems
05
Knowledge based recommenders
11
Summary, Further explorations
In recent years people expanded the definition of a recommendation system to
include much more than “recommend products to users”:
google maps, that recommends certain routes
gmail, that recommends how to reply to an email
marketing application that recommends users that you should send the campaign email
to because they were receptive before (campaign targeting)
We are not going to use this definition - these are different machine learning tasks. We are sticking to the classical one (see the side image): product recommendations (movies, music) based on predicted likability (there is usually no likability to route selection).
We note that recommendation systems have a bad reputation - they are created to maximize profit for the company at the expense of the customer (by trying to "hook" him into buying or into spending time on the platform). However, we need to understand that the system can also be used for user comfort and personalization (like the Twitter or Facebook feed). They bring comfort to the user too (for example, saving time when searching for items to buy, movies to watch, music to listen to). Filtering / differentiating!
Facebook / Instagram also recommend content and wrap users into an "informational bubble / echo chamber" (intentionally) - a common criticism.
Definition and usecases
Recommender Systems
Not all recommender systems are the same, there are even types of them, namely:
Popularity based (very general) - recommend the most popular items
Knowledge based - recommend items based on users answers to some questions (in a
tutorial website)
Content based (general) - recommend items similar to those that user bought / liked
/ wishlisted / watched till end by item metadata
Collaborative filtering based (specific to the user) - recommend items based on the preferences of other users similar to you
Hybrid approaches - mix and match all of the above, depending on the use case or the inability to apply some specific method
"Advanced" models, like: CARS
We will talk about user preference identification later, but most of these systems require explicit user feedback. And you should now know why so many systems are so annoyingly intrusive when asking for feedback.
Let’s define the recommender types more precisely.
Types of recommender systems
Recommender Systems
These simply recommend the most popular items to users. Popularity-based systems
are simplest of all and have minimal computational requirements. Metrics: most
viewed, most bought, “most buzz in social media”, most commented.
Advantages: simplest to implement, simple to scale, simple to understand, easy to
cache.
Disadvantages: they do not make personalized recommendations based on a specific user's likes & behaviors, and they tend to be less accurate than content-based or collaborative filtering based systems. A sudden introduction of a new product could potentially hurt your brand if the product is not good / lacks quality. Technical problem: new items (the cold start / novelty problem for items, as opposed to cold start for users) might never get recommended (some people solve it with a "rising stars" section (popularity based on a short time window) or similar). You also need to adjust the stock in case a recommendation becomes popular.
In short: just recommend movies that are popular and/or received the best average
review.
Popularity based
Recommender Systems
Here we have a single user, that has interacted with the system - watched some
movies, read books, listened to some music and evaluated those (like / dislike).
From this we know what type of product the user likes.
We also know the properties of these products we offer. So we can recommend items
from the same categories, type or having similar properties.
These types of recommendation systems assume that something intrinsic in the consumable itself is attractive, so other items with the same properties will also be attractive. No ML needed, just appropriate tagging / feature engineering. You also need to incentivize users to evaluate the products more (or judge by latent feedback). The disadvantage is that it is hard to recommend complementary items or novel items the user has not tried yet. Assumption: feature engineering captures the features by which we can judge what the user likes (is it really the movie director, or is it the main actress? Maybe the length?)
In short: if the user likes/has watched Tarantino movies, he probably will
like/will watch most of them.

Additionally, it is not even necessary for the user to interact with the system - we might know what kinds of movies are liked in the geographical region the user's IP is from; that would also be a content-based recommender (market segmentation based recommendation). We will also discuss latent likability / feedback factors later.
Content based
Recommender Systems
With collaborative filtering we have many users in the system and we essentially
extrapolate like this:
if the user has similar tastes in movies (he likes the same movies and dislikes the
same movies) as another user
we will recommend other unseen movies that the other users with similar tastes
liked.
The matrix of users and products - the interactions matrix - can be very large and very sparse (most users have not evaluated most movies). This matrix can be approximated by two other matrices: a user factor matrix and an item factor matrix. Recommendation is then just multiplication using the latent factor matrices, as the sketch below shows.
This type of recommender does not require market segmentation (no user profile of
inherent properties - age, income, etc.) of users or metadata about the items.
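A deliberately naive numpy sketch of the factorization idea - truncated SVD on a toy ratings matrix, treating unrated cells as zeros (real systems use methods such as ALS that fit only the observed entries):

import numpy as np

# toy user x item ratings matrix (0 = not rated)
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

# rank-2 approximation: R ~ U (users x k) @ V (k x items)
U_full, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
U = U_full[:, :k] * s[:k]   # user factor matrix
V = Vt[:k, :]               # item factor matrix
predicted = U @ V           # predicted scores, including the unrated cells
print(np.round(predicted, 1))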
Collaborative filtering
Recommender Systems
Compared to content-based filtering: item-similarity vs. user-similarity
Collaborative filtering
Recommender Systems
Knowledge based recommenders can be thought of as partially automated
recommendations.
They explicitly ask the user about his preferences.
These systems are often used when building a user-product matrix or an item similarity matrix is too difficult. Difficulties arise when the system is rarely used - for example, how would you recommend a house to a person who buys one or two in his lifetime? Learning without iteration is close to impossible. This is usually done via a survey / wizard, or by just leaving the possibility to filter content on the site.
Questioning users can be done after the fact, or prior (imagine a salesperson approaching you and asking what you are looking for in a house you want to buy).
Many online teaching / publishing platforms use this approach.
Knowledge based recommenders
Recommender Systems
A hybrid approach can be constructed in two major ways:
creating a system that will use one of the 3 approaches depending on the situation: not many users (impossible to use collaborative filtering) but a new user has rated several movies - content-based recommender. No info at all - add a 2-question survey at the beginning of registration, or recommend by country / IP / gender if they are known, or by what is most popular.
“synchronous” hybrid: use all 3 when possible and combine the results. Research
suggests that this approach might be the best. This is similar to an ensemble
model.
A hybrid approach is (or at least was) used by the youtube recommendation system
Hybrid approach
Recommender Systems
There are different algorithms to implement either content or collaboration based
recommenders, from computing vectors and their similarities algebraically, to ML
approaches (KNN) and of course deep learning approaches.
For DL approaches we simply use a neural net to predict how likely the user is to evaluate the item favorably - predicted user rating vs. actual user rating would be the metric.
The advantages of DL models include being more precise.
The disadvantages: they have to be trained, they might not be explainable, and they can be slower than some other approaches.
The distribution of total evaluations should be taken into account (Tweedie), so the classification problem would be of the asymmetric type.
This kind of recommender might be considered closer to content-based filtering, although it is content + knowledge.
Deep learning models
Recommender Systems
Skewness and sparsity
The matrix of users and products offered is sparse and skewed (few interactions in the user-item space). There can be hundreds of users and thousands of items, and most of the matrix values are zero - so computations are costly and memory is wasted. This is because:
most customers interact with 1%-2% of your products.
most products are reviewed by the few people who just like reviewing / buying products and are not necessarily representative of the population of buyers or of any specific buyer.
This contributes to sparsity. Skewness is due to:
a few users being very prolific while most are very quiet
some products being very popular while others are not (the fact that a user likes an item that all / most people like does not tell us much about his preferences).
You can't just naively take all reviews for all products.
Pitfalls of recommenders
Recommender Systems
Cold-start (ramp-up)
When a new item or a new user is added to the system, the system does not have any
possibility to judge the future behavior of the user or the product based on
information from the past.
We might also consider the problem when system is completely fresh (new system),
but this is not what is usually meant by cold-start problem in the literature.

Pitfalls of recommenders
Recommender Systems
Popularity based - cold start for new item, but not for new user.
Content based - cold-start problem for new user (because we know the item
properties)
Collaborative filtering - both.
Lack of explicit feedback - inferred preference!
We must sometimes rely on implicit features / implicit user behaviors when evaluating user preferences. It is common for less than 1% of people to evaluate a watched video via a thumbs up.
Pitfalls of recommenders
Recommender Systems
Lack of explicit feedback - inferred preference (cont.)!
How to overcome this problem - implicit data, examples:
Assume that if the user watched the whole video - he likes it
Assume that if the user listened to a song 2 times in a row (within 30 min) - he likes it.
If the meal was finished, then assume the use likes it.

"Likes it" does not mean that positive emotions were produced - just that the user wants more of that thing, or of similar or complementary things. Even with negative news that makes the user angry, we can still use the term "likes it".

Even though we can overcome the lack of explicit data with implicit data, explicit data should still be the ground truth from the ethical standpoint (and, when possible, deep learning based approaches should use explicit evaluations for their loss calculation). Think: should you recommend a video to a user if it gives him a strong negative response but the user is addicted to that negative response?
Another thing: serving the user vs. educating the user - should the system inject "counter recommendations"? How can a company / system choose - what can it optimise in this situation? Time spent on the platform might be higher if the user is exposed to content he/she does not "like".
Pitfalls of recommenders
Recommender Systems
Context aware recommender systems (CARS) take into account the context of the experience and the recommendation. This is especially important when the recommender uses user ratings / evaluations.
There are also 3 ways of injecting context into a recommender in CARS:
contextual pre-filtering (next slides),
post-filtering,
modeling - deep learning models can pay attention to some context data.
see: https://link.springer.com/chapter/10.1007/978-1-4899-7637-6_6 (the author is Lithuanian, at the University of Minnesota: https://scholar.google.com/citations?user=oWCSRZ0AAAAJ&hl=en ; workshop: https://www.youtube.com/watch?v=ZrMxfbZhLT8 ).
Advanced models
Recommender Systems
Ref: https://medium.com/@andresespinosapc/the-basics-of-context-aware-
recommendations-5dd7a939049b

This is an active area of research, also see: https://arxiv.org/pdf/2007.15409.pdf


Advanced models
Recommender Systems
Note that engineering recommender systems is not only about the model and the
algorithm to generate the recommendations. It also requires a lot of knowledge and
understanding of the problem domain - recommendations for food are not the same
thing as a recommendation for a movie.
Modern recommendation systems are complex, and now you will be better able to infer what is happening with the recommendations you are seeing.
If you want to research recommendations you can do that on your own. Try:
searching on youtube for something you do not watch: playdoh video / lego video. Do
recommendations start flowing after a single search? For how long are you seeing
those recommendations? If you searched 3x for playdoh videos - does the quantity
and duration of recommendations increase?
enumerate all the places where various systems are placing recommendations?
can you find multiple types of the recommendations (recommended for you, frequently
bought together, etc.)?
Modern systems can distinguish organic vs. fake/generated traffic / interactivity with the content. For example, if you asked a bunch of your friends on Facebook to like a particular video without watching it - can that hurt?
Advanced models
Recommender Systems
Analysts estimate that already 35% of what consumers purchase on Amazon and 75% of what they watch on Netflix come from product recommendations based on recommendation algorithms. So on Amazon people are searching for items, while on Netflix they are more passively consuming the content. TikTok could be argued to be a very advanced recommender system.
Search (user is active) vs. Recommendation (user is passive).
Recommendation systems not only exploit users by tempting them to buy more products
& services customized to their tastes, but also keep them engaged for a longer time
to show them more ads and get more clients. See:
https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-
with-consumers
Impact of Recommendation Systems
Recommender Systems
We discussed the types of recommender systems - need to know them all and be able
to describe them.
Summary
Recommender Systems
** Serendipity ("a happy accident") - an item that was not searched for but is "liked" immensely.
Explore other advanced models for recommendation systems - what is the current SOTA?
How would you evaluate the effectiveness of a recommendation system - an important question: user time spent using the system, items bought / money spent. Is it possible for a recommender to increase the wrong metric - more time spent, fewer items bought?
How would you bootstrap a recommender system from the start of a business (recommenders are often implemented only once the business has already grown somewhat, rather than from scratch)? Random → Popularity → Content → Collab?
How would you do it in different domains: news articles, handbags, power tools,
cars.
Further explorations
Recommender Systems
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Recommender Systems
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1VsPhY46R2nnL3kPX8iw9vzhjgbr6QUlk.pptx ---


Artificial Intelligence
Introduction to Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Applications of neural networks by type
01
02
Applications of neural networks by datatype
Introduction to Deep Learning
00
Types of neural networks
NNs and current AI revolution
03
04
Deployment options
05
06
07
Deep Learning Platforms
Limitations and problems of NNs
08
Who’s who in AI / ML / DL?
Important papers in DL
09
Glossary, recap, practical project 5
Now that we are familiar with the structures of deep NNs and the processes by which they interact - the internals - we can go one step up and see how NNs can be arranged into different types of networks.
We'll talk about the types of NNs and which problems each type solves best - starting from the simplest and moving to more complex architectures. List:
Perceptron
Fully Connected, Feed Forward NN (FCFFNN or FCFFANN), MLP
Sparse NNs
Convolutional NN (CNN)
Deconvolutional NN
Recurrent NN (RNN)
Transformers (encoder-decoder architecture + attention)
Generative Adversarial NN
Deep Reinforcement Learning NN
There are other types or subtypes, more advanced and specific:
Residual neural networks
Modular Neural Networks
Kohonen Self Organizing Neural Networks
Etc.
Types of neural networks
Introduction to Deep Learning
Perceptron: simplest and oldest model of Neuron, as we know it.
Takes some inputs, sums them up, applies an activation function and passes the result to the output layer.
It can solve a linear regression problem if it has a linear or no activation function, and a linearly separable binary classification problem if it has a binary step or sigmoid activation function.
Perceptron is usually used to classify the data into two parts. Therefore, it is
also known as a Linear Binary Classifier.
https://www.youtube.com/watch?v=s8pDf2Pt9sc - explains the limitations of SLP
https://datascience.stackexchange.com/questions/18286/limitations-of-perceptron
Types of neural networks
Introduction to Deep Learning
A multilayer perceptron (MLP) is a class of feedforward artificial neural network
(ANN). The term MLP is used ambiguously - sometimes loosely, to mean any feedforward ANN,
sometimes strictly to refer to networks composed of multiple layers of perceptrons
(with threshold activation); Multilayer perceptrons are sometimes colloquially
referred to as "vanilla" neural networks, especially when they have a single hidden
layer.
An MLP consists of at least three layers of nodes: an input layer, a hidden layer
and an output layer. Except for the input nodes, each node is a neuron that uses a
nonlinear activation function. MLP utilizes a supervised learning technique called
backpropagation. Its multiple layers and non-linear activation distinguish MLP from
a linear perceptron. It can distinguish data that is not linearly separable.
Types of neural networks
Introduction to Deep Learning
Fully connected feed forward neural network
Fully connected - neurons at layer L1 are connected to all neurons in next layer
L2.
Feed Forward - no cycles in the layers, information flows one way.
Having multiple choices of layering, inputs, outputs, weights and activation functions gives us the ability to create a large variety of NN architectures.
An MLP is sometimes classified as a type of FCFFNN, but it is not a Deep NN (DNN) unless it has >1 hidden layer.
Can approximate any function, so potentially could solve any problem (if many
layers are allowed) that can be modeled as a supervised problem (see: universal
approximation theorem).
Types of neural networks
Introduction to Deep Learning
Width, length and even depth of NN varies. The time complexity and time it takes to
train a big neural network increases in proportion to these 3 dimensions and the
combination of them. If we minimize the connections, we can predict and train
faster, thus we have sparse neural networks.
Sparsity of NN
When a NN is not fully connected it’s called a sparse NN.
An example of sparse NN: ELM - extreme learning machine.
Normal neural networks can be made “sparse-like” during training with a
regularization technique called dropout
One interesting article came out recently about sparse NNs:
https://arxiv.org/abs/2102.01732
Note: CNNs are not a type of sparse NN; the illustration below is just there to illustrate depth, width and height
Types of neural networks
Introduction to Deep Learning
Convolutional Neural Network
Designed for specific problems, like image classification and visual data.
Input (2D) → Convolution → Pooling (for downsampling) → FC layer → Output (categ)
Filters in Convolution
Convolution is just image feature extraction by the application of filters (aka kernels).

Pooling (aka subsampling or downsampling)


Sum, max, avg
2 by 2 max pool, stride of 2
After that, a fully connected layer learns
Types of neural networks
Introduction to Deep Learning
Convolutional Neural Network (cont.)
In summary:
Take an input image, which is a two-dimensional matrix, typically with three color
channels.
Next, we use a convolution layer with multiple filters to create a two-dimensional
feature matrix as output for each filter.
Then, we pool the results to produce a downsampled feature matrix for each filter
in the convolution layer.
Next, we typically repeat the convolution and pooling steps multiple times using
previous features as input.
Then, we add a few fully connected hidden layers to help classify the image.
Finally, we produce our classification prediction in the output layer.
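A minimal Keras sketch of that exact sequence (input size, filter counts and layer sizes are arbitrary):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(28, 28, 3)),        # 2D image, 3 color channels
    tf.keras.layers.MaxPooling2D((2, 2)),                   # 2x2 max pool, stride 2
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),  # repeat conv + pool
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),           # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),        # classification output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")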
Types of neural networks
Introduction to Deep Learning
Deconvolutional neural network
These networks perform the inverse of a convolutional network.
Rather than taking an image and converting it into a prediction value, these
networks take an input value and attempt to produce an image instead.
https://mdfarragher.medium.com/here-are-the-mind-blowing-things-a-deconvolutional-
neural-network-can-do-2fc99e008fe4
Types of neural networks
Introduction to Deep Learning
Recurrent Neural Network (RNN)
All of the networks thus far were feed-forward - data flows in one direction during the inference phase.
Recurrent NNs have feedback loops in them. They operate efficiently on sequences of input of varying length (sentences, time series data).
An RNN uses knowledge of its current state as input for its prediction - like a short-term memory. This property makes RNNs effective when working on data that accumulates over time.
Example:
Task: predict what the user will type next on the keyboard (autocomplete).
We could do it w/o ML/DL → simply store most common words and suggest them.
How to think about it: the prediction depends on the current letter pressed, and then it needs to be updated after the user presses another letter.
Task: missing genomic sequence / error detection in sequence.
There are two problems with RNNs: vanishing and exploding gradients. These problems are addressed by subtypes of RNNs - gated & LSTM RNNs (they differ by the number of gates they have). We will learn more about them in future parts of the course.
Types of neural networks
Introduction to Deep Learning
Transformers
The Transformer Neural Network is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It was proposed in the 2017 paper "Attention Is All You Need". It is the current state-of-the-art technique in the field of NLP.
It's classified as an encoder-decoder network.
The main advantage over other techniques (LSTM RNN) is the ability to do parallel
computations (... thus run on a GPU).
Also introduces the concept of “attention”. The attention-mechanism looks at an
input sequence and decides at each step which other parts of the sequence are
important.
Types of neural networks
Introduction to Deep Learning
Generative adversarial neural networks - GANs
The generator generates fake data.
The discriminator tries to detect whether data is fake.
Over many cycles each gets better and better at its goal, making the other party increasingly better as well.
Example: https://www.youtube.com/watch?v=xkqflKC64IM
Question: where could these networks be used? https://machinelearningmastery.com/impressive-applications-of-generative-adversarial-networks/
Types of neural networks
Introduction to Deep Learning
Reinforcement learning neural networks
Action, reward, penalty, state.

Example: self-driving cars, robots.


This type of NN tries to maximize the reward; its inputs are the state of the world, its outputs are actions.
A distinction between deep reinforcement learning and plain reinforcement learning also exists.
Types of neural networks
Introduction to Deep Learning
Simulations of neural networks: https://www.youtube.com/watch?v=3JQ3hYko51Y
Types of neural networks
Introduction to Deep Learning
Spiking Neural Networks
Some call them 3rd-generation ANNs; they are more neuromorphic (i.e. "brain-like") than traditional NNs.
Main objective: make predictions with networks that require less energy.
Problem: requires spiking neuromorphic hardware (FPGA’s) as explained here:
https://youtu.be/PeW-TN3P1hk?si=Cb899A2vyvOvyNMm&t=850
Good intros:
https://www.youtube.com/watch?v=9dYZXQl4ozk
https://arxiv.org/pdf/1804.08150.pdf
https://www.baeldung.com/cs/spiking-neural-networks
Framework: https://www.nengo.ai/keras-spiking/
Types of neural networks
Introduction to Deep Learning
Applications by NN type
FCFFNNs
Regression, classification
Simple image recognition
CNN applications
Applications of neural networks by type
Introduction to Deep Learning
RNN applications

GAN applications

DRL applications
Tabular data (often DL is overkill here; use it when there are lots of features and loads of data)
Classification - good vs. bad loan / spam vs. ham
Regression - house price (numeric continuous)
Clustering - marketing segments
Anomaly detection - spiked negative comments for product.
Success story: predicting patient mortality (tabular data).
Textual data (RNNs)
Predicting article categories (classification).
NLP on written data: the author's emotions from comments.
Autocomplete (gmail uses that).
Translation between languages.
Headlines generated based on the content of the article.
Applications of neural networks by datatype
Introduction to Deep Learning
Image data
Object recognition, age regression, emotion detection, gender classification.
Pixelized image segmentation - convolutional encoder decoder.
Image captioning - CNN + RNN.
Image resolution enhancement.
GANs to fill the missing part of the image. Seeing how people will look when old.
Synthetic celebrities.
GANs are also used for image generation based on text (caption).
Audio data
CNN + RNN to recognize songs (vehicles / animals): Shazam
CNN + RNN speech-to-text transcription
CNN + RNN encoder-decoder nets for real-time translation
Speech synthesis: dilated causal convolutional neural network. We can synthesize our own voice as well.
Applications of neural networks by datatype
Introduction to Deep Learning
Video data
Similar things to image, but also prediction
Automatic sign language translation
CNN encoder-decoder for video restoration and colorization:
https://www.youtube.com/watch?v=h7GX3wEfxcg https://www.youtube.com/watch?
v=ELmVmJEt4L4
Video generation: deep fakes and deep dreaming.

Future: Cars, robots, medicine, law, smart cities.


Self driving taxis, robots in the warehouse doing repetitive tasks.
Smart buildings, smart cities that adapt to demand for power and resources.
Agriculture: autonomous tractors, trains.
Applications of neural networks by datatype
Introduction to Deep Learning
First Perceptrons were invented in 1957.
We needed to wait ~60 years for the current explosion in AI. Why?
The period between the 1970s and 2000s is called the AI winter (it was on an on-off basis).
3 trends brought ML/DL back:
Data abundance: data doubles every 2 years.
Powerful machines: Moore's law, datacenters (semiconductors, SSDs, etc.)
ML algorithms: Transformers (GPT-1), GPU-using algorithms (thanks, gamers!).
Future trends:
NNs and current AI revolution
Introduction to Deep Learning
DL as a Service/Product (DLaaS) - provide training data, trained models, servers.
You provide: new data.
Providers: GCP, AWS, MS Cognitive Services, IBM Watson, ChatGPT (“jipity”, LOL)
Pros (simple, quick, inexpensive to try) and Cons (remote, generic, can become
expensive).
DL as a Platform - models, servers. You provide: training data and new data for
prediction.
Providers: Azure ML, MS Cognitive Services.
Pros (simple, quick, inexpensive to try) and Cons (remote, generic, can be
expensive, need data).
DL as a Framework - you provide model, servers, training and new data.
TensorFlow, Keras, mxnet, CNTK, Theano, (Py)torch, caffe2, fast.ai.
Pros (custom, local, private, even open source) and Cons (labor intensive,
complex).
The level of this course.
DL from scratch. Python, C++, Java. Same cons as before, just even more pronounced.
—--------------------------------------------------
DL in non-server environments
Mobile phones: https://developer.android.com/ndk/guides/neuralnetworks
IoT: https://www.tensorflow.org/lite/examples https://arxiv.org/pdf/1810.01109.pdf
Deep Learning Platforms
Introduction to Deep Learning
Azure
https://azure.microsoft.com/en-us/services/cognitive-services/#api
You can get an account for free, but you used to need a non-virtual credit (not debit) card. This requirement was lifted in 2021; now you can try Azure for free.
Here is a video of how custom vision api works: https://www.youtube.com/watch?
v=uizfGkm3NR0
You will need an azure account.
Deep Learning Platforms
Introduction to Deep Learning
Azure
Deep Learning Platforms
Introduction to Deep Learning
Amazon Web Service
AWS Rekognition Custom Labels: https://aws.amazon.com/rekognition/custom-labels-
features/
Deep Learning Platforms
Introduction to Deep Learning
Amazon
https://www.youtube.com/watch?v=6SWI3DdaRpU
Deep Learning Platforms
Introduction to Deep Learning
We have talked about the ML pipeline before - DL pipeline is pretty much the same.
Let’s create our own deep learning app from the model we have developed already.
We will use the simple Flask microframework for this task, plus a bit of HTML and CSS.
Don't worry if you don't understand everything in this demo that relates to creating web applications - it is a separate domain of knowledge, not related to AI/ML/DL, and can be learned later.
The purpose of the demo is just to show that it is possible.
Code: github./DeepLearningCourse/05_Introduction_to_Deep_Learning/Lecture1/
PerceptronFlaskApp
Frameworks like Keras and PyTorch offer their own operationalization / deployment options, but essentially it is the same principle, for example:
https://pytorch.org/tutorials/intermediate/flask_rest_api_tutorial.html
Very good idea - create interactive portfolio projects
Deployment options
Introduction to Deep Learning
Training:
Data labeling is an expensive process.
Deep learning networks are more accurate, and improve in accuracy as more neuron
layers are added. Additional layers are useful up to a limit of 9-10, after which
their predictive power starts to decline. Today most neural network models use a
deep network of between 3-10 neuron layers.
Understandability (black box model)
Monopolies on data:
Big companies have monopolies on data or can obtain data much more easily.
Big efforts to democratize AI/ML/DL (open.ai).
Social and political issues:
Jobs, UBI, meaninglessness.
Morality:
The famous “who will the car put in danger” problem - the trolley problem. How did comma.ai solve it? https://youtu.be/iwcYp-XT7UI?t=6292
https://github.com/commaai/openpilot/blob/devel/SAFETY.md
Surveillance: clearview.ai
Will human life improve with increasing automation? It has improved thus far (short
term vs. mid term).
https://www.faception.com/ - still active; claims to classify people into categories such as “terrorist”. https://www.crunchbase.com/organization/faception
More problems: https://towardsdatascience.com/the-limitations-of-machine-learning-a00e0c3040c6
Limitations and problems of NNs
Introduction to Deep Learning
Elon Musk, Lex Fridman, Peter Thiel, Andrew Ng - for general statements, insights.
They do not agree, so don’t assume whatever one of them thinks is the right answer. Most of
the time “we just don’t know” what will happen in the future.
Additionally, I like: George Hotz (comma.ai, tiny corp/tinygrad), Jeremy Howard (fast.ai)
For more names in the AI field: see Lex Fridman’s podcast - he has interviewed many people from deep learning.
Google, Jeff Dean
Andrej Karpathy, Tesla (left 2022)
Ian Goodfellow, Invented GANs
Yann LeCun, invented CNNs, Facebook’s chief AI scientist
Jerome Pesenti, VP of AI at Facebook
Geoffrey Hinton, father of deep neural networks
Ilya Sutskever, OpenAI
François Chollet - creator of Keras, good author
Aurélien Geron - great author
Jim Keller - on AI hardware
Who’s who in AI / ML / DL?
Introduction to Deep Learning
Reading papers is part of what ML/DL engineers do. Sometimes more, sometimes less.
Learning Internal Representations by Error Propagation
Link:
https://web.stanford.edu/class/psych209a/ReadingsByDate/02_06/PDPVolIChapter8.pdf
The paper that first showed how to train a NN and introduced backpropagation.
A Fast Learning Algorithm for Deep Belief Nets
Link: https://www.cs.toronto.edu/~hinton/absps/ncfast.pdf
Geoffrey Hinton’s paper that started the whole DL revolution.
The Unreasonable Effectiveness of Data
Link: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf
Peter Norvig et al. showing that data is more important than algorithms (in some instances at least).
Deep Learning: A Critical Appraisal
Link: https://arxiv.org/ftp/arxiv/papers/1801/1801.00631.pdf
Gary Marcus on the current and future limits of DL
ImageNet Classification with Deep Convolutional Neural Networks
Link: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; state-of-the-art image classification.
More great papers, mainly classical:
https://www.marktechpost.com/2018/04/26/top-deep-learning-papers-2018-edition/
https://www.quora.com/What-are-some-of-the-seminal-papers-on-deep-learning
http://deeplearning.net/reading-list/
https://www.linkedin.com/pulse/key-ai-papers-learning-internal-representations-error-john-rekesh/
Important papers in DL
Introduction to Deep Learning
Inputs - Source data fed into the neural network, with the goal of making a
decision or prediction about the data. Inputs to a neural network are typically a
set of real values; each value is fed into one of the neurons in the input layer.
Training Set - set of inputs for which the correct outputs are known, used to
train the neural network.
Outputs - Neural networks generate their predictions in the form of a set of real
values or boolean decisions. Each output value is generated by one of the neurons
in the output layer.
Neuron/perceptron - The basic unit of the neural network. Accepts an input and
generates a prediction. Each neuron accepts part of the input and passes it through
the activation function. Common activation functions are sigmoid, tanh and ReLU.
Activation functions help generate output values within an acceptable range, and
their non-linear form is crucial for training the network.
Weight Space / Parameters - Each neuron is given a numeric weight. The weights,
together with the activation function, define each neuron’s output. Neural networks
are trained by fine-tuning weights, to discover the optimal set of weights that
generates the most accurate prediction.
Forward Pass - The forward pass takes the inputs, passes them through the network
and allows each neuron to react to a fraction of the input. Neurons generate their
outputs and pass them on to the next layer, until eventually the network generates
an output.
Error Function - Defines how far the actual output of the current model is from the
correct output. When training the model, the objective is to minimize the error
function and bring output as close as possible to the correct value.
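A minimal NumPy sketch tying these glossary terms together - inputs, weights, an activation function, a forward pass through a single neuron, and an error function (all numbers are made up for illustration):

import numpy as np

def sigmoid(z):                       # activation function
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])        # inputs
w = np.array([0.4, 0.1, -0.6])        # weights (parameters)
b = 0.2                               # bias

output = sigmoid(np.dot(w, x) + b)    # forward pass of one neuron
target = 1.0
error = (output - target) ** 2        # squared-error function
print(output, error)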
Glossary, recap, practical project 5
Introduction to Deep Learning
Ref: https://missinglink.ai/guides/neural-network-concepts/complete-guide-artificial-neural-networks/
Backpropagation - In order to discover the optimal weights for the neurons, we
perform a backward pass, moving back from the network’s prediction to the neurons
that generated that prediction. This is called backpropagation. Backpropagation
tracks the derivatives of the activation functions in each successive neuron, to
find weights that bring the loss function to a minimum, which will generate the
best prediction. This is a mathematical process called gradient descent.
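A toy illustration of the gradient descent idea on a single weight (the loss function here is made up; real backpropagation applies the same chain-rule logic across all layers):

# minimize loss(w) = (w - 3)^2; its derivative is 2 * (w - 3)
w, lr = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)   # gradient of the loss w.r.t. the weight
    w -= lr * grad       # step against the gradient
print(w)                 # converges toward 3, where the loss is minimal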
Bias and Variance - When training neural networks, like in other machine learning
techniques, we try to balance between bias and variance. Bias measures how well the
model fits the training set—able to correctly predict the known outputs of the
training examples. Variance measures how well the model works with unknown inputs
that were not available during training. Another meaning of bias is a “bias neuron”
which is used in every layer of the neural network. The bias neuron holds the
number 1, and makes it possible to move the activation function up, down, left and
right on the number graph.
Hyperparameters - a setting that affects the structure or operation of the neural
network. In real deep learning projects, tuning hyperparameters is the primary way
to build a network that provides accurate predictions for a certain problem. Common
hyperparameters include the number of hidden layers, the activation function, and
how many times (epochs) training should be repeated.
Epoch: 1 Epoch = 1 Forward pass + 1 Backward pass for ALL training samples.
Batch Size = Number of training samples in 1 Forward/1 Backward pass. (With
increase in Batch size, required memory space increases)
Number of iterations = Number of passes i.e. 1 Pass = 1 Forward pass + 1 Backward
pass (Forward & Backward pass not counted differently) Example : If we have 1000
training samples and Batch size is set to 500, it will take 2 iterations to
complete 1 Epoch.
Iteration - 1 forward pass + 1 backward pass over a single batch (as defined above).
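A quick sketch of that arithmetic (the names are illustrative):

import math

n_samples, batch_size = 1000, 500
iterations_per_epoch = math.ceil(n_samples / batch_size)
print(iterations_per_epoch)  # 2 iterations to complete 1 epoch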
Glossary, recap, practical project 5
Introduction to Deep Learning
Ref: https://missinglink.ai/guides/neural-network-concepts/complete-guide-artificial-neural-networks/
More: https://developers.google.com/machine-learning/glossary
More: https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html
Our goal in this part was an overview of deep learning
Your task would be this:
Find whichever dataset you want for polynomial multiple regression or non-linearly separable classification (you can use/generate the same scikit/Python dataset or find a real one).
Train & tune a deep neural network using two frameworks (choose from TF/keras,
pytorch, fast.ai)
Describe in a few sentences (not fewer than 5) which framework you liked best and
why.
Provide the results as a collab notebook or github link with notebook in teams for
easy verification.
** Please include decision boundary / regression line plotting (visual model
verification).
** Not necessary, can increase your mark but will not decrease it if not complete.
Glossary, recap, practical project 5
Introduction to Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/
Python Crash Course
Detailed course plan
Slides, tasks and so on
Additional information
--- Content from 1_x6WI5oluB5ipYjoO4Vcfj0ZYGDV_jGM.pptx ---
Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Algorithm classification, Big O
Datastructure classification
01
02
03
Datastructure definition
Python Crash Course
00
Algorithm definition
04
Examples of algorithms and datastructures
05
Learning algorithms and datastructures
06
Design patterns
07
Design principles
08
Microbenchmarking
Algorithm definition
What is an algorithm?
Steps to take to wash your hair.
To drive to work.
Steps that a robot takes to cut out ventilation holes.
Definition: sequence of steps to solve a problem or complete a task.
In CS/SE algorithms additionally have clear beginning and end, sequence, input and
output.
But what is a program? Also - a sequence of steps to solve a problem… And so is a
function?
So what is the difference?
Algorithms are designed, not written - whereas programs are written.
Algorithms are invented, discussed, proposed in the design phase of SDLC. Programs
- in the implementation phase.
Algorithms are analyzed, programs are tested (a priori analysis and a posteriori
testing).
Algorithms are language, OS, platform independent, program - are written in a
particular language.
Important: distinguish between an algorithm and its implementation (RSA is proven to be secure, but an implementation might have bugs).
An algorithm is conceptually the size of a function. Algorithms are usually encapsulated in one or a few functions. Commonly: ~10-100 LOC.
Python Crash Course
Algorithm classification, Big O
Algorithms can be represented as:
Text, words and sentences
Pseudocode (case analysis: SWED interview)
Flowcharts
Start, end - oval
Input / Output (external memory) - parallelogram
Branch / loop - rhombus
Operation in memory - rectangle
Implementation in code (function, program)
Python Crash Course
Algorithm classification, Big O
We can classify algorithms in many ways (like balls in a tray - by color, by size
or by size and color):
By problem solved: sort, search, mathematical / numeric (number generation, square
root, fast inv square root, fib), graph, tree algs, etc.
By implementation technique: recursive / iterative (left_pad()), divide and conquer (binary search, merge sort), dynamic programming (defined as optimization of recursive algorithms), greedy, bruteforce, backtracking (maze solving, next best move in chess).
By performance metrics / complexity (time, space): O(n^2), O(n) and so on.
For example reversing a string has at least 10 possible implementations:
https://trycatchblog.com/sinhadroid/top-10-ways-to-reverse-a-string/

If we have many ways to do things, which way do we choose? Can we decide which is
better? Yes.
There are qualities we want to achieve, things we value about the algorithm that
makes it better
Faster is better, less memory used is better, simpler is better, not requiring I/O
is better, less energy consumed is better.
How do we decide which algorithm is “faster”?
If we count the seconds, then we will need to perform the analysis on a single computer!
We do not count seconds, we perform complexity analysis - let’s turn to that next.
Python Crash Course
All the qualities / desirable characteristics:
Complexity:
Time → a time function, not time itself (ms, microseconds). A simple statement takes 1 unit of time; we take the sum of these at the end and pay attention to the most significant terms.
Space → a space function, not space itself (MB, KB). Same as time - sum, then take the most significant term.
Stability - is initial order preserved or destroyed. Why is that important? When
sorting objects we might want to preserve the order between equal objects w/
regards to properties that we are not comparing against. Why? So that we might sort
again with the same algorithm, but by different property (id and name).
Qualities we care about in programs, not in algorithms (usually): I/O (net (latency, bandwidth, packet counts), disk (r/w speed, latency)), power consumed (in general the faster the program, the less CPU it uses (but this becomes untrue when multithreading is involved)), CPU (registers, cache size and misses*), processor count (can we parallelize), stack size. In practice we might need to choose a slower algorithm in case we don’t have much memory, or we might choose an algorithm that performs the least number of operations to preserve memory.
Time complexity is usually the most important (when memory is not very constrained, which is usually the case in web development and data science, unless you do it on IoT devices). We perform asymptotic complexity analysis to understand the complexity class of the algorithm.
Asymptotic analysis works because for large N we don’t care about constants or lower-order terms - they will be completely dominated by the most significant terms.
* For example the naive matrix multiplication algorithm vs. a cache-friendly matrix multiplication algorithm.
Algorithm classification, Big O
Python Crash Course
Algorithm classification, Big O
Digression - backpropagation (Geoffrey Hinton) is an algorithm used to update
weights in a Neural Network!
We can classify it as a graph algorithm, because a neural network is a DAG (directed acyclic graph).

Other algorithms in deep learning: optimizers (adam, nadam, adagrad, etc.), forward
propagation.
Machine learning: how a kNN algorithm learns is described by an algorithm; how k-means does clustering - also. And so on.
Python Crash Course
Example:
Algorithm classification, Big O
Python Crash Course
Example: https://www.bigocheatsheet.com/
Algorithm classification, Big O
Python Crash Course
Case analysis addUpToN(n) - linear vs. constant time.
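A minimal sketch of the two variants (assuming the case analysis compares an iterative sum against the closed-form Gauss formula):

def add_up_to_n_linear(n):
    total = 0
    for i in range(1, n + 1):   # n additions -> O(n)
        total += i
    return total

def add_up_to_n_constant(n):
    return n * (n + 1) // 2     # one expression -> O(1)

assert add_up_to_n_linear(1000) == add_up_to_n_constant(1000) == 500500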
Algorithm classification, Big O
Python Crash Course
Classification:

Ref: https://adrianmejia.com/most-popular-algorithms-time-complexity-every-programmer-should-know-free-online-tutorial-course/
Heuristics:
Algorithm classification, Big O
Python Crash Course
Student question:
Why not measure in seconds?
Answer:
We can use seconds, but we need good isolation. Measuring in seconds is called
timing the function.
However seconds depend on computer hardware.
They also depend on the load on the computer - the time it takes for the function
to finish can vary between runs. If the variation between runs is bigger than the
variation between algorithms we can’t decide.
Data dependence - we want a measurement that abstracts the dependence on data.
… so we measure the proportion between input size increase and the increase in
steps taken.
Exercise:
Algorithm classification, Big O
Python Crash Course
Answer to the exercise question:
A great video: https://www.youtube.com/watch?v=bwA9i6wjfhw (Vladimir Agafonkin - Fast by default: algorithmic performance optimization in practice). Visualization of an O(log n) algorithm - intuitively we think that O(log n) algorithms are somewhere between O(1) and O(n), however:
Algorithm classification, Big O
Python Crash Course
Datastructure definition
A data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, implemented in a particular way so that operations like storing, organizing, grouping and deleting can be performed more efficiently (faster, more intuitively, in a less power-hungry way, etc.).
In short: a way to arrange data in memory (+ appropriate algorithms).
Real life data structures (real world analogies):
- Phone book, english dictionary (alphabetical sorting), book index.
- Geographical map (graph)
- List, table;
We talk about DSs abstractly - abstract data types. In that sense they are
programming language independent (as opposed to implementation of the DS which is
concrete and programming language dependent).
What do we do when we implement datastructures?
Create relationships between data nodes and some rules (expressed as algorithms) on how the datastructure adds, removes and updates data.
Often using OOP, but not always (e.g., in the C programming language).
Datastructures are not:
Databases, tables in the database
Data modeling examples
Python Crash Course
Why are there so many datastructures (see next slide)?
Specialization: a BST/hashtable is great for search; a (D)LL is better than an array when inserting into the middle or beginning; an array is great when we need to get values by their position (iterate over them).
Some are used constantly (arrays), some in very specific applications (B-tree), DAG
(blockchain, AI)
Datastructure classification
Link: https://en.wikipedia.org/wiki/List_of_data_structures
Python Crash Course
Examples of algorithms and datastructures
Simple algorithms:
sum(arr), min(arr), max(arr), avg(arr), stddev(arr)
swap(a, b)
findFirst(arr, val), findAll(arr, val) (linear search)
reverse(array or string)
more advanced level: addUpTo(n) (formula vs. iterative vs. recursive), binary
search, bubble sort or other simple sorting algorithm.

Sorting algorithms
Slow: bubble sort / insertion sort / selection sort:
bubble sort is not a useful algorithm, so why learn it? (i) It is simple; (ii) it has 2 simple optimizations that students can sometimes discover themselves, which allows introducing the concept of optimization; (iii) it connects the important concepts of swapping two values and comparing them, which are fundamental to sorting algorithms; (iv) we can easily use it when discussing the difference between software engineering and computer science. There are also a few interesting problems, like: implementing a counter to count how many iterations / swaps were made, or how to sort custom objects.
Fast: merge, quicksort, timsort

Binary Search vs. Linear Search
How does it work? Does Python use it internally?
Examples: https://www.geeksforgeeks.org/python-program-for-binary-search/
More for interview prep: https://towardsdatascience.com/10-algorithms-to-solve-before-your-python-coding-interview-feb74fb9bc27
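A minimal sketch for contrast with linear search (Python's standard library also ships binary search in the bisect module):

def binary_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2    # halve the remaining range each step -> O(log n)
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1        # discard the left half
        else:
            hi = mid - 1        # discard the right half
    return -1                   # not found

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # 3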
Python Crash Course
Examples of algorithms and datastructures
Memoization / dynamic programming - algorithms to enhance recursion
Ref: https://www.youtube.com/watch?v=wI5sjJQotyI
Ref: https://www.youtube.com/watch?v=DnKxKFXB4NQ
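A minimal sketch of memoization using the classic Fibonacci example (functools.lru_cache does the caching for us):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)  # each subproblem is computed only once

print(fib(100))  # instant; naive recursion would take astronomically long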
Python Crash Course
Examples of algorithms and datastructures
Linked List:
How does it work? Does Python use it internally?
Ref: https://realpython.com/linked-lists-python/
https://www.youtube.com/watch?v=34ky600VTN0 (Why Linked Lists vs Arrays isn’t a
real choice)
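A minimal sketch of a singly linked list (each node stores a value and a pointer to the next node):

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = Node(1, Node(2, Node(3)))  # 1 -> 2 -> 3
node = head
while node:                       # O(n) traversal
    print(node.value)
    node = node.next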

Trees:
BST: https://qvault.io/python/binary-search-tree-in-python/
Invert binary tree (classic problem): https://medium.com/@theodoreyoong/coding-short-inverting-a-binary-tree-in-python-f178e50e4dac
B-tree (balanced trees): https://www.youtube.com/watch?v=UzHl2VzyZS4

Hashtables (the most important of these):
O(1) average-case lookup
Dicts in Python
Python uses open addressing to resolve hash collisions
Python Crash Course
Examples of algorithms and datastructures
Graphs - represent relationships
Maps, Facebook (bidirectional, directed) vs. Twitter/Substack(?) (unidirectional,
directed) - social graphs. Computer networks, streets.
Directed / undirected, unidirectional / bidirectional, weighted / unweighted,
cyclical / acyclical, looped / loopless.
Simple usage: a simple recommendation engine for friends in facebook / followers in
twitter.
Adjacency list vs. adjacency matrix - most commonly lists are used due to graphs
being generally sparse and matrix representation requiring a lot of memory for
storing the absence of a relationship (zeros). See these two articles for pros and
cons: https://www.programiz.com/dsa/graph-adjacency-matrix
More: https://runestone.academy/runestone/books/published/pythonds/Graphs/Implementation.html and https://www.geeksforgeeks.org/graph-and-its-representations/
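A minimal sketch of an adjacency list as a dict of sets (memory grows with the number of edges, not with |V|²):

graph = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"C"},
}

def add_edge(g, u, v):
    # undirected edge: store it in both directions
    g.setdefault(u, set()).add(v)
    g.setdefault(v, set()).add(u)

add_edge(graph, "B", "D")
print(graph["D"])  # {'C', 'B'}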
Python Crash Course
Datastructure definition
An example of two graphs that are less and more connected. We can invent the
measure of connectedness:

We can connect roads (edges) and intersections (vertices) and calculate the travel path when there is a single unidirectionally connected edge (introducing asymmetry S→F vs. F→S). Also we can have a weighted graph signifying how long it takes, averaged over 5-minute intervals, to travel through that edge.
Python Crash Course
Examples of algorithms and datastructures
Graph tasks:
Python Crash Course
Examples of algorithms and datastructures
Graphs
The part of data science that deals with graphs is called Network Science. It grew out of Graph Theory.
Historically: the Euler circuit theorem (Königsberg bridge problem).
NetworkX is a python library for working with graphs.
It can visualize networks
And answer some questions for analysis
See: https://networkx.org/documentation/latest/tutorial.html
However for massive amounts of data we would need Apache Spark, Neo4J
NetworkX does not scale to gigabytes of data.
Apache Spark uses graph frames for working with graphs.
https://spark.apache.org/graphx/
There are more tools and examples:
for visualizing graphs: https://towardsdatascience.com/visualizing-networks-in-python-d70f4cbeb259
analyzing: https://towardsdatascience.com/visualising-graph-data-with-python-igraph-b3cc81a495cf
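A minimal NetworkX sketch (a made-up social graph, just to show the flavor of the API):

import networkx as nx

g = nx.Graph()
g.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Alice"), ("Carol", "Dave")])
print(nx.degree_centrality(g))               # who is most connected?
print(nx.shortest_path(g, "Alice", "Dave"))  # ['Alice', 'Carol', 'Dave']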
Python Crash Course
Microbenchmarking
Although we care about the scalability of algorithms, we do care about time in
implementation of the algorithms (functions or programs).
This can be done in many ways in python, ref:
https://stackoverflow.com/a/2866456/1964707
Notes:
Do not overcomplicate the code between t0 and t1 - we want to time just the part of
the algorithm that varies, we do not care about the initial allocation for example,
because it is the same for both algorithms.
Be fair - you should time both variations up to the point where they achieve the same result. If the first variation creates a string and the second one just a list, and you time that but only later convert the list to a string, then you are timing it unfairly.
You can also visualize the workings of the algorithms in https://pythontutor.com/
Python Crash Course
import time

############## 1: build the reversed string by concatenation ##############
mystr = "Mindaugas"

t0 = time.time()
for _ in range(10000):
    finalstr = ""
    for i in range(len(mystr) - 1, -1, -1):
        finalstr += mystr[i]  # each += creates a new string object
t1 = time.time()

print(t1 - t0)
print(finalstr)

############## 2: collect the chars in a list, join once ##############
mystr = "Mindaugas"

t0 = time.time()
for _ in range(10000):
    mylst = []
    for i in range(len(mystr) - 1, -1, -1):
        mylst.append(mystr[i])
    final = ''.join(mylst)  # join inside the timed region to keep the comparison fair
t1 = time.time()

print(t1 - t0)
print(final)
# Joining only here, after t1, would give variant 2 an unfair advantage.
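The stackoverflow answer linked above also mentions the standard library's timeit module; a minimal hedged sketch of the same measurement with it:

import timeit

def reverse_concat(s):
    out = ""
    for ch in reversed(s):
        out += ch
    return out

# time 10000 calls, like the manual loop above
print(timeit.timeit(lambda: reverse_concat("Mindaugas"), number=10000))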

Learning algorithms and datastructures


Why learn algorithms?
Great way to learn a new language.
Wirth’s Law, also: https://tonsky.me/blog/disenchantment/
European-style interviews don’t concentrate on them, but BIGTECH (FAANG) interviews do.
Competitive programming
Better understanding of existing software, especially advanced
FUN!
Python Crash Course
Learning algorithms and datastructures
Programming challenges:
hackerrank: https://www.hackerrank.com/domains/tutorials/30-days-of-code
codewars: https://www.codewars.com/
leetcode: https://leetcode.com/problemset/all/
codechef: https://www.codechef.com/problems/school/
topcoder (real tasks): https://www.topcoder.com/challenges
codeforces: https://codeforces.com/

Video (“video first!”):
Algorithms: https://www.youtube.com/watch?v=0IAPZzGSbME
Data Structures: https://www.youtube.com/watch?v=92S4zgXN17o
HackerRank: https://www.youtube.com/watch?v=shs0KM3wKv8
Python Crash Course
Learning algorithms and datastructures
Books: choose by customer reviews and comments on Amazon!
Internet:
https://www.quora.com/What-are-some-amazing-computer-science-algorithms
https://cacm.acm.org/
https://github.com/gzc/CLRS

Programming competitions:
Top competitions: https://www.quora.com/Which-are-the-best-coding-competitions
https://www.youtube.com/watch?v=MVLSQB5Durg → How to Ace Top Programming
Competitions
https://www.youtube.com/watch?v=xAeiXy8-9Y8 → How To start
List of competitive programming competitions: https://clist.by/
Guide for beginners: https://github.com/Errichto/youtube/wiki/How-to-practice%3F
Python Crash Course
Design patterns
Since we are DS / ML / DL students it is quite uncommon to learn design patterns. However, a quick mention will not hurt too much.
DPs are standard solutions to object oriented design problems, they mostly make
sense in OO languages (see, for an exception:
https://stackoverflow.com/questions/4112796/are-there-any-design-patterns-in-c +
functional languages, like “functional strategy”).
GoF DPs are the ones to know, but there are way more:
https://en.wikipedia.org/wiki/Software_design_pattern#Classification_and_list
Frameworks and libraries use design patterns internally.
Design pattern among other concepts (in the layers of the abstraction cake):
Architectural patterns - high / application level solutions to problems (MVC,
Microservices): https://martinfowler.com/eaaCatalog/ and
https://learn.microsoft.com/en-us/azure/architecture/patterns/
Design patterns - intermediate level (Singleton).
SOLID design principles
OOP principles - inheritance, polymorphism, composition interfaces / abstract
classes … / data structures
Algorithms (similar to the level of functions) - “you don’t need algorithms”
means: “you will not need to create custom ones” (creation vs. usage). Usage will
be unavoidable.
Idioms / syntax mechanism - low level language specific constructs (foreach(),
class X {}).
Groups of design patterns:
Creational. Recommended: Dependency Injection, Singleton (not really for python),
Factory (method) / (Fluent) Builder. Advanced: Object Pool - when you don’t want to
invoke GC.
Structural. Recommended: Facade (requests), Decorator.
Behavioral. Recommended: Iterator, Strategy (simplified in functional programming
with functional strategy pattern).
Concurrency. Recommended: Thread pool (useful when you want to call a throttled API w/o being blocked).
Sources: https://www.youtube.com/@ArjanCodes (probably the best resource on design patterns)
https://refactoring.guru/design-patterns/python
https://sourcemaking.com/design_patterns/abstract_factory/python/1
https://python-patterns.guide/gang-of-four/iterator/
Python Crash Course
Design principles
Design principles are also not often taught to DS / ML / DL students.
SOLID: https://towardsdatascience.com/solid-coding-in-python-1281392a6a94
Single-Responsibility Principle (SRP) → this can be applied to functions, classes, even programs (Unix design philosophy).
Open-Closed Principle (OCP) → new functionality should not touch old code (add new functionality by just adding new code).
Liskov Substitution Principle (LSP) → child classes should be usable wherever parent classes are expected (subtypes must be substitutable for their base types).
Interface Segregation Principle (ISP) → better to have more, smaller interfaces; separate your interfaces: FileManager vs. Reader / Writer.
Dependency Inversion Principle (DIP) → bubbleSort([List of Sortable]) vs. bubbleSort([List of Person]).
Some like it shorter: keep your classes small and create interfaces!
Recommended video: https://www.youtube.com/watch?v=pTB30aXS77U , code for this:
https://github.com/ArjanCodes/betterpython/tree/main/9%20-%20solid
Python Crash Course
We have types of programming:
Mobile
Web: Fe / Be / FS
Desktop
Dataroles: Data / ML / DL (Quants?)
Ops/Infra: Devops / Secops / Admin / Netops
Game dev
System (OS, Webserver, RDBMS, language/compilers)
Low level / electronics: embedded, drivers
Even more succinctly: Applications, Systems, Data, Ops, Games (keep in mind that
this is an arbitrary differentiation, imprecise model)

Language syntax is important to all
Language idioms are important to all
Algorithms are important for all (to varying degree)
Datastructures are important for all (to varying degree)
Design principles (OOP and not only) (only application and game dev) ← (... if you
are struggling with this, then don’t go to DP’s)
Design patterns are important for many, although mostly for application developers
(there are patterns in other spheres)
Architectural patterns (Cloud, Integration, MVC, MVVM, DDD, ActiveRecord vs. Data
Mapper) mostly involve Web Be context, but could be used elsewhere:
https://martinfowler.com/eaaCatalog/ and
https://learn.microsoft.com/en-us/azure/architecture/patterns/
Summary
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/
Python Crash Course
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1ytEfyefdv8LhwALSjq0buSzJjVNNUUm_.pptx ---


Artificial Intelligence
Natural Language Processing
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Generating Shakespeare with Char RNN
01
02
TimeDistributed wrapper
Natural Language Processing
00
Beyond classification: Text Generation
Stateful RNNs
03
04
Beyond classification: Translation
BLEU score
07
08
Encoder - decoder
06
Google translate
09
Tokenization revisited
Attention mechanisms
10
Transformers
05
Brief history
11
Further explorations, PP10
Let’s move beyond classification (a special case of which is sentiment analysis) to
text generation and translation.
We also transition from treating text corpora as BOWs to treating it as a sequence
of words with semantic relationships.
BOWs (BO-ngrams) vs. sequences - are the two fundamental ways we look at language
data.
Text generation requires a "language model" (some researchers don’t describe BOW as a language model) - BOW is not enough.
Consider the statement "In 1998, the president of the United States addressed the nation in his famous speech _". A language model would need to understand this statement sufficiently well, as well as what we require from it. It should have some understanding of the US / presidents / history and some other concepts in order to fill in the blank.
Check if GPT-3 would generate correct answer.
GPT-4?
ChatGPT?
Beyond classification: Text Generation
Natural Language Processing
In a famous 2015 blog post titled “The Unreasonable Effectiveness of Recurrent
Neural Networks,” Andrej Karpathy showed how to train an RNN to predict the next
character in a sentence. This Char-RNN can then be used to generate novel text, one
character at a time. Here is a small sample of the text generated by a Char-RNN
model after it was trained on all of Shakespeare’s work:

It is important to note that thus far we have only worked with word-level models. Another level of model is the character-level model. These models are more expensive to train (slower and require more memory). However they could potentially be the future as better algorithms and more powerful hardware are developed. (TODO: sub-word models, byte-level models, super-word level)
As A. Karpathy wrote in the same article: "Currently it seems that word-level
models work better than character-level models, but this is surely a temporary
thing." See: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ . Why is that?
I could guess that some forms of words can be rare in text (“Mokinys apsimetė
ąžuolu”), but a character level model might be able to learn the form itself from
other examples (“Silkę užkando obuoliu”) and adapt it even if the identical word
did not exist in the entire corpus - morphology.
Additionally, punctuation could be learnable via a character-level model (cf. “Let's Eat Grandma”, “skųsti negalima tylėti”).
Generating Shakespeare with Char RNN
Natural Language Processing

Generating Shakespeare with Char RNN


Natural Language Processing
Let’s use the “tinyshakespeare” dataset and create a text generation model.
We will use a char-based model, as it is something we have not seen before.
To preprocess text char by char we can use the Keras tokenizer, for example as sketched below:
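A minimal sketch of the standard char-level usage (the local file path is a hypothetical copy of the corpus):

from tensorflow import keras

text = open("tinyshakespeare.txt").read()          # hypothetical local copy of the corpus
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([text])
[encoded] = tokenizer.texts_to_sequences([text])   # sequence of character ids (starting at 1)
max_id = len(tokenizer.word_index)                 # number of distinct characters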

… in modern times tokenization became a much more complex problem with the rise of LLMs. As mentioned, there are more levels than just word and character. This is visible by just looking at the new tokenizers Keras provides: https://keras.io/api/keras_nlp/tokenizers/
Generating Shakespeare with Char RNN
Natural Language Processing
TimeDistributed() is a wrapper type in keras (just like Bidirectional() - also
wrapper).
TimeDistributed(Dense()) applies the same Dense() layer to every timestep in the input sequence. Every input should be at least 3D, and the dimension at index one of the input will be considered to be the temporal dimension: [[], [], []] → [seq1, seq2, seq3] → the output will be generated for each timestep independently, so we will have output: [out_4_seq1, out_4_seq2, out_4_seq3].
It is a wrapper used for seq-to-seq problems (can be used for others: one-to-one,
one-to-seq). It is applicable to Dense() layer or Conv1D() - these are most
commonly used, but is applicable to any keras.layers.Layer.
There are two key points to remember when using the TimeDistributed wrapper:
Input must be (at least) 3D. This often means that you will need to configure your
last LSTM layer prior to TimeDistributed wrapped Dense layer to return sequences
(return_sequences=True).
The output will be 3D. This means that if your TimeDistributed wrapped Dense layer
is your output layer and you are predicting a sequence, you will need to resize
your y array into a 3D vector.
If we were to take this example model.add(TimeDistributed(Dense(1))) it would mean
the following:
apply a dense layer for each input sequence.
the single output value in the output layer is key. It highlights that we intend to
output one time step for each time step in the input.
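A minimal sketch of a char-level seq-to-seq stack using the wrapper (hyperparameters and max_id are illustrative, not the course's exact model):

from tensorflow import keras

max_id = 39  # illustrative: number of distinct characters
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),            # keep 3D output for the wrapper
    keras.layers.TimeDistributed(
        keras.layers.Dense(max_id, activation="softmax")),   # one prediction per timestep
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")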
TimeDistributed wrapper
Natural Language Processing
When should you use TimeDistributed?
If your data is dependent on time, like time series data or the frames of a video, then TimeDistributed(Dense()) is said to be more effective than a simple Dense layer (could we create an MWE proving that?).
TimeDistributed(Dense()) applies the same dense layer to every time step during
GRU/LSTM Cell unrolling. That’s why the error function will be between the
predicted label sequence and the actual label sequence.
Using return_sequences=False, the Dense layer will get applied only once in the
last cell. This is normally the case when RNNs are used for classification
problems. (seq-to-one)
If return_sequences=True, then the Dense layer is used to apply at every timestep
just like TimeDistributedDense.
TimeDistributed might be hard to understand right away and easy to forget if you
are not using it often. See this answer for more information:
https://stackoverflow.com/a/42758532/1964707 also
https://stackoverflow.com/questions/47305618/what-is-the-role-of-timedistributed-layer-in-keras/47309453#47309453
TimeDistributed wrapper
Natural Language Processing
For a char-level model we return the probability of the next character at each timestep of the sequence, so we use a Dense(count_of_all_possibilities) layer + softmax + categorical cross-entropy - as if we were solving a classification problem.
TimeDistributed wrapper
Natural Language Processing
Until now, we have used only stateless RNNs: at each training iteration
(batch_size) the model starts with a hidden state full of zeros, then it updates
this state at each time step, and after the last time step, it throws it away, as
it is not needed anymore. What if we told the RNN to preserve this final state
after processing one training batch and use it as the initial state for the next
training batch? This way the model can learn long-term patterns despite only
backpropagating through short sequences. This is called a stateful RNN.
So for long sequences we know we can use:
LSTMs / GRUs over SimpleRNNs
Stateful RNNs
First, note that a stateful RNN only makes sense if each input sequence in a batch
starts exactly where the corresponding sequence in the previous batch left off. So
the first thing we need to do to build a stateful RNN is to use sequential and
nonoverlapping input sequences (rather than the shuffled and overlapping sequences
we used to train stateless RNNs). When creating the Dataset, we must therefore use
shift=n_steps (instead of shift=1) when calling the window() method. Moreover, we
must obviously not call the shuffle() method.
Stateful RNNs
Natural Language Processing
When training stateful RNNs, the state needs to be reset after each epoch; this can be performed with the Keras callback API, for example:
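A minimal sketch (assuming `model` was built with stateful=True and `dataset` yields the sequential, non-overlapping batches described above):

from tensorflow import keras

class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()   # drop the hidden state carried across batches

# model.fit(dataset, epochs=50, callbacks=[ResetStatesCallback()])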

Also, after this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training. To avoid this restriction, create an identical stateless model and copy the stateful model’s weights to it. Does this affect inference performance? We can test it - create the necessary batches and pass them through the same network and through a new network with the weights copied (it shouldn’t make a difference, but it’s good to think about how to prove / test it).
Stateful RNNs
Natural Language Processing
Sometimes called NMT (neural machine translation). Let’s take a look at a simple
neural machine translation model that will translate English sentences to German.
To understand how difficult natural language translation is, think about the naive approach - map words between two languages 1-1 and, when you have a sentence, just convert each word using that mapping. This totally ignores the grammar of the language; for some very simple sentences it would be fine, but in general it would not be:
This car is better -> Šis automobilis yra geresnis.
Contrary to popular belief, Lorem Ipsum is not simply random text. -> Priešingai
populiariam tikėjimui ... (semi-good although most people with some proficiency in
Lithuanian would say "Priešingai populiariam įsitikinimui". We have to use
different synonyms that are context-dependent).
So essentially NMT is learning the grammar of the language and even broader context dependencies (in the semantic realm).
And because of the difficulty of the task, neither
level 0: naive models, nor
level 1: rule-based translation models solve it well. We need
level 2: statistical learning / neural networks.
Beyond classification: Translation
Natural Language Processing
Let's try that same sentence now: "Gardu. Nepigu. Interjero sprendimai pritaikyti
greitam pavalgymui, bet ne ilgesniam pasisėdėjimui su draugais."
Is targeting a specific language better than being a jack-of-all-trades model?
Note: https://www.lrt.lt/naujienos/mokslas-ir-it/11/1085161/lietuviska-masininio-vertimo-sistema-pralenke-google-microsoft-ir-kitus-technikos-milzinus (question:
does google translate use one or many models for translation). Currently LLMs are
multilingual, although trained on English mostly.
Try it: https://translate.tilde.com/#/ (another company that does something with
NLP in Lithuania: Tildė, TokenMill, Semantika.lt (VDU))
https://www.deepl.com/en/translator
Beyond classification: Translation
Natural Language Processing
Most of us were introduced to machine translation when Google came up with the
service. But the concept has been around since middle of last century.
Research work in Machine Translation (MT) started as early as the 1950s, primarily in
the United States. These early systems relied on huge bilingual dictionaries, hand-
coded rules, and universal principles underlying natural language.
In 1954, IBM held the first ever public demonstration of machine translation. The
system had a pretty small vocabulary of only 250 words and it could translate only
49 hand-picked Russian sentences to English. The number seems minuscule now but the
system is widely regarded as an important milestone in the progress of machine
translation.
The paper is an interesting read still: http://www.hutchinsweb.me.uk/GU-IBM-2005.pdf
Soon, two schools of thought emerged:
Empirical trial-and-error approaches, using statistical methods, and
Theoretical approaches involving fundamental linguistic research (rule-based
approaches)
In 1964, the Automatic Language Processing Advisory Committee (ALPAC) was
established by the United States government to evaluate the progress in Machine
Translation. ALPAC did a little prodding around and published a report in November
1966 on the state of MT. Below are the key highlights from that report:
It raised serious questions on the feasibility of machine translation and termed it
hopeless
Funding was discouraged for MT research in the report
It was quite a depressing report for the researchers working in this field
Most of them left the field and started new careers
Not exactly a glowing recommendation!
A long dry period followed this miserable report. Finally, in 1981, a new system
called the METEO System was deployed in Canada for translation of weather forecasts
issued in French into English. It was quite a successful project which stayed in
operation until 2001.
Brief history
Natural Language Processing
The world’s first web translation tool, Babel Fish, was launched by the AltaVista
search engine in 1997.
And then came the breakthrough we are all familiar with now – Google Translate. It
has since changed the way we work (and even learn) with different languages.
Brief history
Natural Language Processing
In short: launched in 2006, using statistical models rather than rule-based models; in 2016 it started using neural machine translation (SMT -> NMT), which was an offshoot of the Google Brain project (Jeff Dean and Andrew Ng). It was a first demonstration of zero-shot translation in MT: the engine was able to translate Korean⇄Japanese having only been trained on Korean⇄English and Japanese⇄English.
https://en.wikipedia.org/wiki/Google_Translate
https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation
https://arxiv.org/abs/1609.08144 (... our model consists of a deep LSTM network
with 8 encoder and 8 decoder layers using attention and residual connections + many
more optimizations or accuracy, training and inference speed)
Let's see a video about how Google Translate works: https://www.youtube.com/watch?v=AIpXjFwVdIE
A much longer video about Google Translate, with a few important points at the beginning: https://www.youtube.com/watch?v=nR74lBO5M3s Note, this video talks about language recognition, which is an important ML application. We did not train a model to recognize a language, but it is important for Google's translation product since, as you have probably seen, it tries to guess which language you are using. How would you phrase this problem - is it a regression, a classification, a summarization, a translation problem?
The application fails to translate Monty Python's "The funniest joke in the world": "Wenn ist das Nunstück git und Slotermeyer? Ja! Beiherhund das Oder die Flipperwaldt gersput!" with a FATAL ERROR (it did so in 2017 and still did in 2021; it does not in 2022).
Google translate
Natural Language Processing
The most common metric used in NMT is the BiLingual Evaluation Understudy (BLEU)
score, which compares each translation produced by the model with several good
translations produced by humans: it counts the number of n-grams (sequences of n
words) that appear in any of the target translations and adjusts the score to take
into account the frequency of the produced n-grams in the target translations.
Can we compare to some superhuman translation score? We could, potentially, by measuring comprehensibility: translate instructions and see whether people perform those instructions more correctly when they are translated by many professional interpreters vs. by NMT. However this raises issues, like: where does translation end and enhancement through rephrasing begin (does the translator add anything)? Would this enhancement count as “beyond translation”?
With the BLEU score a superhuman translation would never be recognized.
More on this: https://www.youtube.com/watch?v=DejHQYAGb7Q
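A minimal sketch with NLTK (toy sentences, just to show the mechanics; bigram weights chosen so the toy example produces a non-zero score):

from nltk.translate.bleu_score import sentence_bleu

references = [["the", "cat", "sat", "on", "the", "mat"],
              ["there", "is", "a", "cat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
# bigram BLEU: geometric mean of 1-gram and 2-gram precision
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))  # closer to 1.0 is better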
BLEU score
Natural Language Processing
Encoder-decoder is a sequence-to-sequence (some authors use: encoder: sequence-to-
vector + decoder: vector-to-sequence) architecture (takes in sequences for
predict() and outputs sequences). It came into view in 2014:
The classic paper on this is: https://arxiv.org/abs/1409.3215
Video introduction: https://www.youtube.com/watch?v=_i3aqgKVNQI
Sequence-to-sequence learning (Seq2Seq) is about training models to convert
sequences from one domain (e.g. sentences in English) to sequences in another
domain (e.g. the same sentences translated to French).

"the cat sat on the mat" -> [Seq2Seq model] -> "le chat etait assis sur le tapis"

They are used for a variety of NLP tasks, such as:
text summarization,
speech recognition,
DNA sequence modeling (DNA to RNA)
machine translation,
free-form question answering (generating a natural language answer given a natural language question)
image captioning (CNN → RNN)
In general, it is applicable any time you need to generate text.
NOTE: the encoder-decoder architecture can be applied to model types other than RNNs; here is a discussion of a CNN encoder-decoder: https://www.youtube.com/watch?v=1icvxbAoPWc
Encoder - decoder
Natural Language Processing
Additional examples:
Speech Recognition
Name/Entity/Subject Extraction to identify the main subject from a body of text
Relation Classification to tag relationships between various entities tagged in the
above step
Chatbot skills to have conversational ability and engage with customers
Text Summarization to generate a concise summary of a large amount of text
Question Answering systems
Image Captioning (Encoder CNN, Decoder RNN)
In the simplest terms, an encoder-decoder is a neural network architecture consisting of two connected RNNs (any flavor: LSTM, GRU):
Encoder - decoder
Natural Language Processing
Let’s discuss a specific case: english to french translation: english sentences are
fed to the encoder, and the decoder outputs the French translations.
French translations are also used as inputs to the decoder, but shifted back by one
step. In other words, the decoder is given as input the word that it should have output at the previous step (regardless of what it actually output) - so-called teacher forcing.
For the first word, it is given the start-of-sequence <SOS> token. The decoder is
expected to end the sentence with an end-of-sequence <EOS> token.
Note that the English sentences are reversed before they are fed to the encoder.
For example, “I drink milk” is reversed to “milk drink I.” This ensures that the
beginning of the English sentence will be fed last to the encoder, which is useful
because that’s generally the first thing that the decoder needs to translate.
In the general case, input sequences and output sequences have different lengths
(e.g. machine translation) and the entire input sequence is required in order to
start predicting the target. This requires a more advanced setup, which is what
people commonly refer to when mentioning "sequence to sequence models" with no
further context. Here's how it works:
A RNN layer (or stack thereof) acts as "encoder": it processes the input sequence
and returns its own internal state. Note that we discard the outputs of the encoder
RNN, only keeping the state. This state will serve as the "context", or
"conditioning", of the decoder in the next step.
Another RNN layer (or stack thereof) acts as "decoder": it is trained to predict
the next characters of the target sequence, given previous characters of the target
sequence. Specifically, it is trained to turn the target sequences into the same
sequences but offset by one timestep in the future, a training process called
"teacher forcing" in this context. Importantly, the encoder uses as initial state
the state vectors from the encoder, which is how the decoder obtains information
about what it is supposed to generate. Effectively, the decoder learns to generate
targets[t+1...] given targets[...t], conditioned on the input sequence.
Note: there are many variations of encoder-decoder models, some with recurrent_dropout, some with TimeDistributed wrappers, some with Bidirectional wrappers. There is a huge number of ways to build encoder-decoder architectures, so don’t put yourself into a box.
Encoder - decoder
Natural Language Processing
N.B.: Because the hidden state of the encoder is used as a conditioning mechanism for decoder output generation, we call this conditional language modeling - give me the highest-probability sequence given condition x.
Encoder - decoder
Natural Language Processing
Pictorial representation of the previous process.
Encoder - decoder
Natural Language Processing
In inference mode, i.e. when we want to decode unknown input sequences, we go
through a slightly different process:
Encode the input sequence into state vectors.
Start with a target sequence of size 1 (just the start-of-sequence <SOS>
character / word).
Feed the state vectors and single char/word target sequence to the decoder to
produce predictions for the next character/word.
Sample the next character/word using these predictions (use argmax for example).
Append the sampled character/word to the target sequence.
Repeat until we generate the end-of-sequence <EOS> character/word or we hit the
character/word limit.
Note: language models used in translation can be character- or word-level models; that is why we used the words interchangeably.
Encoder - decoder
Natural Language Processing
Used mainly in context of encoder-decoder architectures, mainly for translation.
Teacher forcing is applied during training.
Teacher forcing is the technique where the target word is passed as the next input
to the decoder.
Ref: https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c
Model TF: https://www.tensorflow.org/text/tutorials/nmt_with_attention#training
Model Pytorch (actually uses randomized teacher forcing!):
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#training-and-evaluating
Teacher forcing
Natural Language Processing
Suppose you train an Encoder–Decoder model, and use it to translate the French
sentence “Comment vas-tu?” to English. You are hoping that it will output the
proper translation (“How are you?”), but unfortunately it outputs “How will you?”
Looking at the training set, you notice many sentences such as “Comment vas-tu
jouer?” which translates to “How will you play?” So it wasn’t absurd for the model
to output “How will” after seeing “Comment vas.”
Unfortunately, in this case it was a mistake - the model tried to complete the sentence as best it could. By greedily outputting the most likely word at every step, it ended up with a suboptimal translation. Can we give the model a chance to go back and fix mistakes it made earlier?
One of the most common solutions is beamsearch: it keeps track of a short list of
the k most promising sentences (say, the top three), and at each decoder step it
tries to extend them by one word, keeping only the k most likely sentences. The
parameter k is called the beam width (“strypo plotis”).
For example, suppose you use the model to translate the sentence “Comment vas-tu?”
using beam search with a beam width of 3. At the first decoder step, the model will
output an estimated probability for each possible word. Suppose the top three words
are “How” (75% estimated probability), “What” (3%), and “You” (1%). That’s our
short list so far. Next, we create three copies of our model and use them to find
the next word for each sentence. Each model will output one estimated probability
per word in the vocabulary. The first model will try to find the next word in the
sentence “How,” and perhaps it will output a probability of 36% for the word
“will,” 32% for the word “are,” 16% for the word “do,” and so on. Note that these
are actually conditional probabilities, given that the sentence starts with “How.”
The second model will try to complete the sentence “What”; it might output a
conditional probability of 50% for the word “are,” and so on. Assuming the
vocabulary has 10,000 words, each model will output 10,000 probabilities.
Good explanation: https://www.baeldung.com/cs/beam-search
Beamsearch
Natural Language Processing
Next, we compute the probabilities of each of the 30,000 two-word sentences that
these models considered (3 × 10,000). We do this by multiplying the estimated
conditional probability of each word by the estimated probability of the sentence
it completes. For example, the estimated probability of the sentence “How” was 75%,
while the estimated conditional probability of the word “will” (given that the
first word is “How”) was 36%, so the estimated probability of the sentence “How
will” is 75% × 36% = 27%. After computing the probabilities of all 30,000 two word
sentences, we keep only the top 3. Perhaps they all start with the word “How”: “How
will” (27%), “How are” (24%), and “How do” (12%). Right now, the sentence “How
will” is winning, but “How are” has not been eliminated.
Then we repeat the same process: we use three models to predict the next word in
each of these three sentences, and we compute the probabilities of all 30,000
three-word sentences we considered. Perhaps the top three are now “How are you”
(10%), “How do you” (8%), and “How will you” (2%). At the next step we may get “How
do you do” (7%), “How are you ” (6%), and “How are you doing” (3%). Notice that
“How will” was eliminated, and we now have three perfectly reasonable translations.
We boosted our Encoder–Decoder model’s performance without any extra training,
simply by using it more wisely (additional perf. cost for beam search exists
however).
Beamsearch is an advanced topic, so it is only useful to mention it in passing. If you were to implement it, a good starting point would be this tutorial: https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt and https://forums.fast.ai/t/text-generation-and-beam-search/37675 . Keras now implements a BeamSearch sampler (it was needed for LLMs): https://keras.io/api/keras_nlp/samplers/beam_sampler/
Additional sources:
1st. https://www.youtube.com/watch?v=Er2ucMxjdHE
2nd. https://www.youtube.com/watch?v=RLWuzLLSIgw
3rd. https://www.youtube.com/watch?v=gb__z7LlN_4
Written: https://www.katnoria.com/nlg-decoders/
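A toy, self-contained sketch of the bookkeeping described above (next_word_probs() is a stand-in for the decoder; a real NMT model would return probabilities conditioned on the encoder state):

import numpy as np

VOCAB = ["how", "what", "you", "are", "will", "do", "<eos>"]

def next_word_probs(prefix):
    # stand-in for the decoder: pseudo-probabilities, deterministic within a run
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    p = rng.random(len(VOCAB))
    return p / p.sum()

def beam_search(k=3, max_len=4):
    beams = [([], 1.0)]                        # (sentence so far, probability)
    for _ in range(max_len):
        candidates = []
        for sent, prob in beams:
            if sent and sent[-1] == "<eos>":   # finished sentences stay as they are
                candidates.append((sent, prob))
                continue
            for word, p in zip(VOCAB, next_word_probs(sent)):
                candidates.append((sent + [word], prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep top k
    return beams

for sent, prob in beam_search():
    print(" ".join(sent), round(prob, 4))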
Beamsearch
Natural Language Processing
Applicability - find the most likely sequence:
Genomic sequences (???) - is there ever a need to predict the most likely genomic sequence? DNA → RNA transcription prediction? Protein synthesis maybe? In general we could apply this anywhere there is an input sequence and a predictable output sequence (predictable with some probability).
Decryption?
Language translation - most likely sentence.
Interesting detail: LLMs also support sampling strategies like beam search, and it seems imperative to implement them in new frameworks like tinygrad: https://github.com/tinygrad/tinygrad/issues/3921 . More sampling strategies exist
and we might mention them in the next part when talking about Transformers:
https://huggingface.co/blog/how-to-generate
Beamsearch
Natural Language Processing
Research what the current consensus is on char-level models vs. word-level
models. Earlier the view was that word-level models are better (Karpathy's
prediction, 2015).
Does stateful=True have any implications on recurrent_dropout? Can we use it with
SRNN?
Research the usage of Bidirectional and TimeDistributed wrappers or
BeamSearchDecoder for encoder-decoder architecture.
Further explorations
Natural Language Processing
For this part take any task we have seen: text generation, text classification
(sentiment analysis), translation and create a notebook with a fully completed
example.
Note: you DO NOT need to provide an accuracy metric for translation (but if you
want one, the BLEU score is what you should look at) or text generation. You do
need it for classification.
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 10 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the Colab notebook (double-check the share options of the
notebook) when finished for review and evaluation.

More advanced / bonus project: implement a custom transformer. Reference material:


Ref (“attention is all you need” paper reading and implementation):
https://www.youtube.com/watch?v=8Ht3ATIEwDs
Ref (A. Karpathy): https://www.youtube.com/watch?v=kCc8FmEb1nY
Practical Project 10
Natural Language Processing
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Natural Language Processing


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1QZ1-wqAWxDKRTIeclYOL7INC6hlhfMIH.pptx ---


Artificial Intelligence
Reverse Image Search
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


t-SNE visualizing proximate images
01
02
Timing the KNN
Reverse Image Search
00
KNN for RIS
Additional libraries for NN search
03
04
Further explorations
We will use KNN to calculate the nearest neighbors

NearestNeighbors(n_neighbors=5, algorithm='brute',
metric='euclidean').fit(feature_list)

How about the distance metric? We said we were going to prefer the cosine distance
metric, not euclidean. Actually we can imitate the cosine distance by normalizing
the data and then using euclidean distance, see:
https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-
learn-kneighborsclassifier
The output values will not be cosine-like, but the neighbor rankings will match.
Of course, I would invite you to verify it, not just believe it blindly!
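A minimal sketch of that trick (the feature matrix here is random, standing in for real CNN features):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

feature_list = np.random.rand(1000, 2048)  # hypothetical: 1,000 images x 2048 features

# L2-normalize each vector; on unit vectors, euclidean distance is a
# monotonic function of cosine distance, so the neighbor rankings match.
normalized = normalize(feature_list, norm='l2')
knn = NearestNeighbors(n_neighbors=5, algorithm='brute',
                       metric='euclidean').fit(normalized)
distances, indices = knn.kneighbors(normalized[:1])
print(indices)  # indices of the 5 nearest images to the first one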
KNN for RIS
Reverse Image Search
t-SNE visualizing proximate images
Reverse Image Search
t-SNE (t-distributed stochastic neighbor embedding, pronounced: tee-snee) is a
dimensionality reduction technique that is similar to PCA but has nicer mathematical
properties. It is very useful and often used in the industry to visualize higher
dimensional data projected into a low dimensional graph / cluster.
STUDY TIP: better learn the internals of PCA first and then try to understand t-SNE,
as it is a bit more complicated and a bit less important than PCA.
Refs:
https://www.datacamp.com/community/tutorials/introduction-t-sne
https://www.youtube.com/watch?v=NEaUSP4YerM
https://www.youtube.com/watch?v=RJVL80Gg3lA - more technical explanation
What is perplexity? The video explains it as a measure of density between
points that affects how the final t-SNE plot is calculated. A tunable
parameter; there are some indications that higher values produce clearer shapes,
usually it varies between 5 and 50. Getting the most from t-SNE may mean analyzing
multiple plots with different perplexities (a quick sketch follows the refs). See:
https://distill.pub/2016/misread-tsne/
https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html
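A quick sketch of such a perplexity sweep with sklearn (random features stand in for real CNN features here):

import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(500, 2048)  # hypothetical: 500 images x 2048 CNN features

# A single perplexity value can be misleading (see the distill.pub link above),
# so compare several projections side by side.
for perplexity in (5, 30, 50):
    embedded = TSNE(n_components=2, perplexity=perplexity,
                    init='pca', random_state=42).fit_transform(features)
    print(perplexity, embedded.shape)  # (500, 2) -> scatter-plot each of these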
t-SNE visualizing proximate images
Reverse Image Search
Comparing t-SNE vs. PCA.
What if you got the picture below in a job interview and were asked to explain it?
There are two properties that are desirable:
Similar items should be close together - strong clustering.
Dissimilar clusters of items should be further apart.
Ref: https://suneeta-mall.github.io/2022/06/09/feature_analysis_tsne_vs_umap.html
t-SNE visualizing proximate images
Reverse Image Search
In a real production RIS app, which operation will we perform most often? It will be
the operation of finding the nearest neighbors of the image uploaded by the user.
This can be optimized!

In general 4 metrics to time and optimize:


Time it takes for the CNN to converge (training time metric)
Time it takes for the KNN index to be fit (training time metric)
Time it takes to get the N closest neighbours (inference time metric - client
impacting)
Time it takes for the CNN to return the feature vector / extraction (inference time
metric - client impacting)

Optimizations:
Standard features - 1024 (mobilenet) or 2048 (resnet); we can establish a benchmark
KNN with the standard features that the popular architectures produce. The feature
space is big.
PCA - what are the values for the metrics we care about after performing PCA (100
features)?
Bruteforce, BallTree, KDTree - how fast is the KNN with these variations? (see the
timing sketch below)
Use approximate NN libraries (see next slide)
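A minimal timing sketch for the inference-time metric (random features stand in for real ones; 'kd_tree' degrades in high dimensions, which is part of what you would measure):

import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

feature_list = np.random.rand(2000, 512)  # hypothetical: 2,000 images x 512 features
query = feature_list[:1]

for algorithm in ('brute', 'ball_tree', 'kd_tree'):
    knn = NearestNeighbors(n_neighbors=5, algorithm=algorithm,
                           metric='euclidean').fit(feature_list)
    start = time.perf_counter()
    distances, indices = knn.kneighbors(query)
    print(f'{algorithm}: {time.perf_counter() - start:.5f}s per query')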

Timing the KNN


Reverse Image Search
Aggregated comparison: https://github.com/erikbern/ann-benchmarks
Annoy
https://github.com/spotify/annoy
https://markroxor.github.io/gensim/static/notebooks/annoytutorial.html
https://stackoverflow.com/questions/57039214/how-to-use-the-spotifys-annoy-library-
in-python
Falconn
https://github.com/FALCONN-LIB/FALCONN
NMSLib
https://github.com/nmslib/nmslib
https://nmslib.github.io/nmslib/index.html

Additional libraries for NN search


Reverse Image Search
You can explore other libraries to see which one offers the best performance.
Evaluate different experiments (standard features, PCA, PCA + different KNN algs,
PCA + different ANN libs).
Further explorations
Reverse Image Search
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Reverse Image Search


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1JRS8Dpty4JXQL5XTkAuss75GXczyriVN.pptx ---


Artificial Intelligence
Computer vision and image classification
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


ConvNets
01
02
Convolutions
Computer vision and image classification
00
FCFFNNs for computer vision, drawbacks
Convolution layers
03
04
CNN Architecture
05
06
07
Pooling layers
Summary and questions
What’s next
We will use the famous MNIST dataset of handwritten digits.
It’s a grayscale 28x28(x1) image set.
If we learn it now, you will gain the benefit of saving time when reading books on
ML/DL as many of them use this dataset, so it will be much easier to understand
books, courses, presentations.
Also we will learn how to actually perform image classification and why FCFFNNs are
not enough.
FCFFNNs for computer vision, drawbacks
Computer vision and image classification
Drawback 1 - parameter explosion:
due to the high interconnectivity we have an enormous count of parameters.
We want to avoid bottlenecking, so the intermediate layers might have more neurons
than the input layer.
Drawback 2 - no location invariance:
Remember how we feed data into FCFFNNs? As a single dimensional tensor. So we need
to flatten the image.
This means that we are losing location invariance, as each layer will learn a
location dependent classification. So when the dog that we want to classify is in
another location, our NN might not be able to recognize it. Remember, most of the
important data for neural networks is at the center of the image, so the neurons
that interact with it might be considered "more important" by the network; if the
location of the object is not at the center, the network will not know what to do
with it. The network will not classify an image with a cat if the cat is not at the
center.
FCFFNNs for computer vision, drawbacks
Computer vision and image classification
Demo:
Let’s see how a fully connected network works for image classification.
Also, we will see how to upload a custom image and then classify it.
Link1
Main lesson: if we try to classify an image with inverted colors, the NN fails.
Images fed at inference time have to have a similar “quality” to the images
which the network saw during training time. How similar - is an interesting CV
question, which may be answered using classical techniques.
FCFFNNs for computer vision, drawbacks
Computer vision and image classification
Convolutional Neural Networks (CNN, ConvNets) replace our FCFFNNs for image related
tasks.
They have similarities to animal perception system and a proven empirical record of
performing well for image related tasks.
Similarities with the perception system include:
2D image structure preservation (flattening is not immediate).
Local receptive fields (ability to “focus” or “specialize”) - a defined segmented
area that is occupied by the content of input data that a neuron within a
convolutional layer is exposed to during the process of convolution. The LeNet
paper introduced this notion.
Layering for extracting more and more abstract features (same as DNN).
Focus/specialization + Layering = progressive generalization. Pixels → Lines →
Edges → Shape → Fragments → Object.
Additionally they are (partially) sparse NNs and have way fewer parameters compared
to similarly performing FCNNs (this could be proven!)
ConvNet have neurons arranged in 3 dimensions: width, height, depth and their input
and output is a volume - each convnet layer transforms an input 3D volume to an
output 3D volume with some differentiable function that may or may not have
parameters.
ConvNets
Computer vision and image classification
ConvNets have at least two kinds of layers we have not encountered before:
Convolution → focus.
Pooling → subsampling (reducing the size), sparsity
Let’s go first through convolutions and then through pooling layers.
ConvNets
Computer vision and image classification
What is a convolution? A matrix operation applied to some input.

Kernel weights are trained, i.e. found during training. During that process they
specialize in detecting certain features.
Different kernels contain different numbers, hence extract different features.
Applying a kernel to an image gives us the feature map (aka convolution result).
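To make that concrete, a minimal sketch with a toy 5x5 "image" and a hand-picked 3x3 vertical-edge kernel (note: scipy's convolve2d flips the kernel, whereas CNN "convolutions" are technically cross-correlations; for learned kernels the distinction does not matter):

import numpy as np
from scipy.signal import convolve2d

image = np.arange(25, dtype=float).reshape(5, 5)   # toy grayscale image
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)       # Sobel-like vertical-edge detector

# 'valid' keeps only positions where the kernel fits fully inside the image,
# so the 5x5 input shrinks to a 3x3 feature map.
feature_map = convolve2d(image, kernel, mode='valid')
print(feature_map)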
Convolutions
Computer vision and image classification
How is the convolution operation applied?
In a sliding window fashion, giving location invariance (a feature is recognized at
the top or bottom of the picture).
We have a vertical stride and a horizontal stride. They are hyperparameters, so
tunable (overlapping vs. non-overlapping strides).
The kernel size is also a hyperparameter (a small size is usually more efficient:
3x3, not 9x9. It can even be non-square: 9x6).
Multiplication is performed on every channel separately (so 3 times in RGB):
Animation
The kernel will not stride outside the image; zero padding might be needed (we’ll
talk about it a bit later).
Convolutions
Computer vision and image classification
The input of a convolution is an image (1st layer); the output is also an image-like
thing (a feature map).
So we can chain them.
Chaining two or more convolutions gives us layers of convolutions.
One convolution layer is composed of a stack of feature maps (multiple feature maps).
Convolution layers are sparse and reduce the number of parameters (1 “neuron”
connected to multiple “pixels”: many input values connected to 1 output value), i.e.
the same kernel is applied across the entire receptive field - parameter sharing.
Convolution layers
Computer vision and image classification
Calculation of how the layer weights size decreases due to “weight sharing” - because
the same kernel travels through the entire image and only kernels have trainable
weights, we get a sparse NN (thus solving the parameter explosion problem).
Arithmetic for the learnable synaptic weights (NOT ALL PARAMETERS, JUST TRAINABLE):
see the worked example below.

Note: Conv layers have bias parameters that would change the calculation, but DNNs
also have them, so the relative difference might not be affected.
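A worked example of that arithmetic (the layer sizes are hypothetical: 32 filters of size 3x3 on a 28x28x1 input):

# Conv2D trainable parameters: kernel_h * kernel_w * in_channels * n_filters + biases
conv_params = 3 * 3 * 1 * 32 + 32        # = 320 (one bias per filter)
# A Dense layer with 32 units on the same flattened image, for comparison:
dense_params = 28 * 28 * 1 * 32 + 32     # = 25,120
print(conv_params, dense_params)         # weight sharing shrinks the layer ~78x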
Convolution layers
Computer vision and image classification
Lastly, it is common for the output from the convolution to pass through an
activation function! Commonly ReLU is used, but certainly you should try others to
see which performs best!
Let’s take a look at an animation that summarizes what I have explained
https://www.youtube.com/watch?v=f0t-OCG79-U
Convolution layers
Computer vision and image classification
Input to the pooling layer is the image-like feature map (array of pixels) /
matrix.
Performs reductions / aggregations: max, avg. Ref:
https://pytorch.org/docs/stable/nn.html#pooling-layers
We say: “max pooling operation”, also applied in a sliding window fashion.
The resulting matrix is much smaller, with the important features highlighted.
Has size and stride. With conv., overlap is desirable; with pooling - don’t overlap
(stride >= pooling dimensions).
Although similar to a conv layer there is a big difference: there are no weights or
biases, just the operation.
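A minimal sketch of 2x2 max pooling with stride 2 on a toy 4x4 feature map - pure aggregation, nothing to train:

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 5, 6]], dtype=float)

# Split into non-overlapping 2x2 blocks and take the max of each block.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6. 4.]
               #  [7. 9.]]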
Pooling layers
Computer vision and image classification

Pooling does not reduce the count of the feature maps, but reduces their size.
Pooling layers
Computer vision and image classification
Input is an image, output is a feature map of the Convolution + Pooling layer
groups.
Outputs between layers are also feature maps, smaller and smaller due to
pooling, but deeper and deeper due to the number of feature maps. The conv. part of
this NN performs feature extraction!
The result of the CNN layers is fed to a FCFFNN, which spits out probabilities (for
classification).
CNN Architecture
Computer vision and image classification
Source: https://www.youtube.com/watch?v=m8pOnJxOcqY
CNN Architecture
Computer vision and image classification
The famous feature extraction image shows that convolutions learn to extract
different levels of abstraction at each layer.
An image set rich in details or with a lot of variation between images (a diverse
dataset) would need more convolutions to learn more features.
CNN Architecture
Computer vision and image classification
Before creating a CNN, let’s see how convolutions and pooling operations work
We will see feature extraction and subsampling capabilities, some well known
filters.

Demo: Detecting Corners and Edges w/ Convolution


CNN Architecture
Computer vision and image classification
Summary:
Features that make the learning faster: sparsity (weight sharing in kernels),
pooling layers.
Features that make learning more precise: dimension preservation, local receptive
fields through learnable kernels.
Convolution is a mathematical operation (not a thing, but a process) of applying a
kernel to an image and obtaining the feature map: image + kernel → feature map.
A good summary provided here:
https://cs231n.github.io/convolutional-networks/#layers-used-to-build-convnets

Questions:
What is a convolution?
What is a kernel in the context of CNNs?
What do the pooling layers do?
Are convolution layers fully connected? → no
Do convolution layers have trainable parameters / weights? → yes
Do convolution layers have biases? → yes (can have)
Summary and questions
Computer vision and image classification
Training a ConvNet
Optimizing a ConvNet
What’s next
Computer vision and image classification
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
https://docs.google.com/spreadsheets/d/1DmSKXClV4xOkmz-
GjKq6ew4EF0cXed6uda5PZYhtEGU/edit?usp=sharing
Additional information

--- Content from 1PCb15063vNrH9caTx3im97STOk8OvCni.pptx ---


Artificial Intelligence
Introduction to Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


PyTorch installation
01
02
PyTorch tensors
Introduction to Deep Learning
00
PyTorch intro
PyTorch GPU
03
04
PyTorch nn package
05
06
07
PyTorch autograd and optimizers
PyTorch graph visualization
08
Fastai
Swift
09
Summary
What is PyTorch? The most popular Deep Learning library, after Keras and
TensorFlow, by Facebook.
API is quite similar to Keras’s (in part because both APIs were inspired by Scikit-
Learn and Chainer), so once you know Keras, it is not difficult to switch to
PyTorch, if you ever want to (and vice versa).
PyTorch’s popularity grew exponentially in 2018, largely thanks to its simplicity
(no static graph building step or session concept) and excellent documentation,
which were not TensorFlow 1.x’s main strengths.
However, TensorFlow 2 is arguably just as simple as PyTorch, as it has adopted
Keras as its official high-level API and its developers have greatly simplified and
cleaned up the rest of the API. The documentation has also been completely
reorganized, and it is much easier to find what you need now.
Similarly, PyTorch’s main weaknesses (e.g., limited portability and no computation
graph analysis) have been largely addressed in PyTorch 1.0.
PyTorch intro
Introduction to Deep Learning
Pytorch is GPU and CPU aware
Pytorch is like coding python - you can use same debuggers and all the same
libraries
Dynamic computation graph - no need to define the graph and instantiate some
session to execute it on

Imperative execution (training loop is visible)


Inbuilt models
Easy extensibility for writing new neural network layers
For example C++ interop: https://pytorch.org/tutorials/advanced/cpp_extension.html
NEW: mobile https://pytorch.org/get-started/mobile/
PyTorch intro
Introduction to Deep Learning
Similarities with TF (mostly comparing to TFv1, not modern TF): both operate with
Tensors, which are like numpy arrays, but can be operated on in a GPU for massively
parallel operations.
Differences between TF and Pytorch
There is some interoperability, for example you can use Tensorboard with Pytorch:
https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html
PyTorch intro
Introduction to Deep Learning

Has a great install options configurator: https://pytorch.org/get-started/locally/

For running on a GPU you will need full cuda install


PyTorch installation
Introduction to Deep Learning
The Tensor is a fundamental datatype in PyTorch, just like in TF. Tensors are like
numpy arrays (with all the usual functionality, like slicing, indexing, reductions,
linear algebra and so on) but also have the ability to be executed on the GPU for
massively parallel operations - “device aware” tensors.

Pytorch supports 8 CPU and 8 GPU tensor types (TF has 13).
Demo: simple tensor operations and conversions between numpy and torch tensors
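A minimal sketch of what such a demo covers (assuming torch is installed):

import numpy as np
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
print(a @ a.T, a.sum(), a[0, 1])        # matmul, reduction, indexing - numpy-like

b = torch.from_numpy(np.ones((2, 2)))   # numpy -> torch (shares the same memory)
c = a.numpy()                           # torch -> numpy (CPU tensors only)
print(b.dtype, c.shape)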
PyTorch tensors
Introduction to Deep Learning
CUDA support - the standardized parallel computing platform and API allowing to run
computations on GPU.
PyTorch GPU
Introduction to Deep Learning
GPU executions are async by default
PyTorch GPU
Introduction to Deep Learning
The autograd library performs automatic differentiation (see:
https://en.wikipedia.org/wiki/Automatic_differentiation ).
Differentiation is the process of calculating the derivatives of functions and is
used by the gradient descent process (during backpropagation).
Note: the automatic differentiation package in Tensorflow is called tf.GradientTape.
Demo: we now know enough about Pytorch to solve a simple linear regression problem
(a sketch follows below).
Autograd is used by pytorch optimizers - optimizers are just “minimizers of the
error function w.r.t. model weights”.
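A minimal sketch of that linear regression, assuming made-up data for y = 2x + 1:

import torch

x = torch.linspace(0, 1, 100).unsqueeze(1)
y = 2 * x + 1 + 0.05 * torch.randn_like(x)     # synthetic noisy targets

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.1)

for _ in range(500):
    loss = ((x * w + b - y) ** 2).mean()       # MSE error function
    opt.zero_grad()
    loss.backward()                            # autograd fills w.grad and b.grad
    opt.step()                                 # optimizer minimizes loss w.r.t. weights
print(w.item(), b.item())                      # should approach 2.0 and 1.0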
PyTorch autograd and optimizers
Introduction to Deep Learning
Just like TF has the Estimator API and/or Keras on top of it, Pytorch has the
torch.nn module
PyTorch nn package
Introduction to Deep Learning
Pytorch differs from Keras in that it does not hide: tensor creation, loading data
and the model onto the GPU, and the training loop. Hence we have to add additional
code to adapt the program to both contexts - when a GPU is available and when not.
We perform a check on whether a GPU is available in code.
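The standard device-agnostic pattern looks like this (a minimal sketch):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(10, 1).to(device)   # move the model to GPU if present
batch = torch.randn(32, 10).to(device)      # ... and the data too
output = model(batch)
print(output.device)                        # cuda:0 or cpu, same code either way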
PyTorch device agnostic code
Introduction to Deep Learning
We can use HiddenLayer: https://github.com/waleedka/hiddenlayer
Demo: using hiddenlayer

Also integrates nicely with Tensorboard:


https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html
Additional tools for visualization:
https://stackoverflow.com/questions/52468956/how-do-i-visualize-a-net-in-pytorch
PyTorch graph visualization
Introduction to Deep Learning
What is fastai?
It’s a DL framework and a company of people teaching and doing DL.
Fastai is an easy-to-use, brilliant library built on top of Pytorch and developed
by Jeremy Howard and Rachel Thomas, providing tools in four main areas:
Vision
Tabular data (our next part will be about tabular data), including collaborative
filtering (recommender systems)
Textual data
Collaborative Filtering (under tabular)
Medical (NEW as of 2021)
Layered, often we work with Learner and DataBlock objects
The main advantage of fastai: they track new DL research and implement the newest
best practices fast, like lr finding, one-cycle learning and so on, transfer
learning.
Fastai
Introduction to Deep Learning
https://github.com/fastai/fastai/blob/master/README.md#installation
In Colab it comes preinstalled; however, the version is 1.X - we need to use 2.X
(this is not true anymore as of 2022-08)
Fastai
Introduction to Deep Learning
With fast.ai we can very quickly create and configure deep learning models.
We are feeding the data to a model wrapper called the Learner.
DataBunch / Data objects control how the data is fed into the Learner + have a lot
of data transformation options.
Fastai
Introduction to Deep Learning
Additional things that fastai can do:
Notably it has a lot of data preproc utils
Fastai
Introduction to Deep Learning
An interesting development evolved in 2018-2020 - a push to start looking for
alternatives to Python in DL, mainly coming from Google, with the most prominent
option being Apple's Swift language.
The two notable proponents of these changes were Jeremy Howard (Kaggle, Fast.ai)
and Chris Lattner (LLVM, Swift) - both brilliant programmers.
TL;DR - the project was terminated in 2021. Let’s discuss this a bit.
Swift
Introduction to Deep Learning
Notably Jeremy has said: “Python just doesn’t cut it” on numerous occasions,
acknowledging that python does not have the performance of a compiled language and
differentiable programming is not at its core, see for example this:
https://www.youtube.com/watch?v=t2V2kf2gNnI
He also wrote: “But I warned that: “Using Swift for numeric programming, such as
training machine learning models, is not an area that many people are working on.
There’s very little information around on the topic”. So, why are we embracing
Swift at this time? Because Swift for TensorFlow is the first serious effort I’ve
seen to incorporate differentiable programming deep in to the heart of a widely
used language that is designed from the ground up for performance.”
Also see: https://www.fast.ai/2019/01/10/swift-numerics/
The vision they had:
Swift
Introduction to Deep Learning
Google intro: https://blog.tensorflow.org/2018/04/introducing-swift-for-
tensorflow.html
The why:
https://github.com/tensorflow/swift/blob/master/docs/WhySwiftForTensorFlow.md
Tutorials: https://github.com/tensorflow/swift#tutorials and
https://www.tensorflow.org/swift
Swift
Introduction to Deep Learning
Conclusion - the project was terminated:
https://github.com/tensorflow/swift
https://www.infoworld.com/article/3608151/swift-for-tensorflow-project-shuts-
down.html
Other languages to keep an eye on: Julia, notably JS also on the watchlist.
Most important languages in ML/DL:
Swift
Introduction to Deep Learning
Main things to remember:
Among the most popular frameworks, TF will probably be used as a backend for Keras
only (except for TF.js and IoT use cases).
Most frameworks offer solutions starting from a tensor type (on GPU) to high level
APIs for neural networks (notably, Keras is a high level layer of tensorflow,
Pytorch torch.nn or Fast.ai).
Levels of abstraction - we can discuss which framework hides the most details.
Summary
Introduction to Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information
--- Content from 1WDHGwWwd4uCL0Qrqn7EZDiqV7lLNOXSp.pptx ---
Artificial Intelligence
Reverse Image Search
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Definition
01
02
Reverse Image Search Engines
Reverse Image Search
00
Motivation for RIS
Working principle
03
04
Image Similarity Problem
05
Non-DL approaches to image similarity
06
DL approaches to image similarity
07
Cosine vs. Euclidean distance
08
Feature extraction
You bought a home and are on the lookout for furniture. You search for a sofa but
can't find a good one. You spot a good one at a reception while interviewing for a
Data Science position. You ask the administrator, but (s)he does not know the model
or where it's from. You take several pictures. You take the picture to numerous
shops, but to no avail. Your friend suggests: "why don't you use the reverse image
search feature in Google". After 2 minutes, you find 1 shop selling it. After 1
more, you find a discount.
Another usecase: you find an image used in a book that you read, but you are not
sure whom to attribute it to, who the author is. You could use RIS.
You find an image of a city, but you don't know which one it is. You upload the
image and it tells you.
More usecases: https://tineye.com/technology
Motivation for RIS
Reverse Image Search
Reverse Image Search (RIS) - the ability to upload an image and get back similar
images with appropriate supplementary (meta)data.
Reverse image search (or as it is more technically known, instance retrieval)
enables developers and researchers to build scenarios beyond simple keyword search.
From discovering visually similar objects on Pinterest to camera-based product
search on Amazon, a similar class of technology under the hood is used. Sites like
Tineye alert photographers on copyright infringement when their photographs are
posted without consent on the internet. Even face recognition in several security
systems uses a similar concept to ascertain the identity of the person.
Multiple synonyms are used to describe reverse image search:
Reverse image search.
Inverse image search.
Inverse / reverse Image retrieval.
Content-based image retrieval (CBIR).
Visual search.
Image-to-image search.
Why are these important to know? Because when you read books / blogs or use google
you will skip important articles about the topic if you do not know that these
words refer to the same thing.
Definitions
Reverse Image Search
A list of popular ones:
Google Image Search / Google Lens
Bing Visual Search
Pinterest Visual Search tool
Amazon and Aliexpress offer image searches as well!
Also a bunch of chrome (and other browser) plugins:
https://chrome.google.com/webstore/detail/aliprice-search-by-image/
aadbahhifnekkkcbapdfandpimaoacmj
More: https://www.clearvoice.com/blog/reverse-image-search-tools/
Facebook autotagging (dropped): https://www.wired.com/story/facebook-drops-facial-
recognition-tag-people-photos/
Note: you can take a picture of an item you have seen on the street and check where
you can buy it.
Reverse Image Search Engines
Reverse Image Search
RIS apps based on CNNs work the following way, expressed in terms of steps:
Get data (duhhh!)
We train + tune to the max a CNN for image related tasks (for example
classification) on a large general dataset containing items that you expect people
to search for.
We extract the features from all images we have with a headless (no FC layer) CNN -
construct the feature space of all the images that you have. Create a vector
database of your own images (can even be the same ones the CNN was trained on, or
different ones).
—------ Inference steps start here —--------
We pass the single image that the user searches by through the headless CNN to get
the final tensor.
We solve the image similarity problem on the fly - finding similar images by some
metric (usually cosine or euclidean distance using embeddings). The search
algorithm is usually KNN.
Return the set of similar images found to the user - top 5, top 20.

Why is this application exciting to implement and learn?


We use the knowledge we already have gained: CNNs, the KNN algorithm, embeddings.
We can use optimization techniques, like PCA/t-SNE, to reduce the dimensionality and
make the algorithms perform better.
We can build something real (i.e. a web app).
We can use this topic to talk more broadly about things like image similarity,
image comparison, embeddings & vector distances.
Working principle
Reverse Image Search
Let’s compare to a naive solution based on classification labels: upload image →
classification → obtain the class → query-based search based on the obtained class
(getting items from the same class). Reducing the problem to classification, then
doing text search on that.
This has disadvantages:
if we add another category, we will need to retrain our model, since it has to
always be able to classify with that category in question. This disadvantage is
shared by the CNN based approach as well, but only partly, because if you don’t have
a specific category you will misclassify an image and assign an incorrect label.
A “kitchen sink” can be similar to “a swan”, so if you misclassify a kitchen sink
and assign the label swan, swans will be returned. Because the feature extractor
will still extract similar features when not relying on classification results, the
model will be more robust than a model relying on classification labels.
this will not be precise, as we would just get a category “mens brown pants” and
then return all other items in the class/category without ordering by similarity.
Deep learning based approaches with KNN usually return the most similar items by
visual similarity (image similarity)! Not just the ones that belong to the same
category.

Features vs. labels - features win!


Working principle
Reverse Image Search
How do we tell that two images are similar or even the same?

Usages: forged signature detection would be an interesting and quite easily doable
personal project.
Image Similarity Problem
Reverse Image Search
We can see these approaches being implemented in traditional software.
Non-DL approaches to image similarity
Reverse Image Search
Compare patches - rotation or cropping destroys the viability of this method.
Subtraction comparison - we will implement that.
Hashing - duplicates of an image can be found. One use case for this approach would
be the identification of plagiarism in photographs.
Histogram for RGB approach - used in image deduplication software that finds
repeating images on your hard drive. However, small changes in hue, color or
lighting can make this method struggle.
MSE - mean squared error. We will implement that (see the sketch after this list).
SSIM - structural similarity index measure. We will implement that.
Scale-Invariant Feature Transform (SIFT) - uses feature comparison, less helpful
when comparing deformable / changing objects. Available in OpenCV.
Speeded Up Robust Features (SURF) - uses feature comparison, less helpful when
comparing deformable / changing objects. Available in OpenCV.
Oriented FAST and Rotated BRIEF (ORB) - uses feature comparison, less helpful when
comparing deformable / changing objects. Available in OpenCV.
More:
https://stackoverflow.com/questions/5730631/image-similarity-comparison
https://stackoverflow.com/questions/843972/image-comparison-fast-algorithm
Try it out:
A list of APIs that solve image similarity: https://rapidapi.com/search/image
%2Bsimilarity
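A minimal sketch of the MSE and SSIM comparisons (assuming scikit-image is installed; the two "images" here are synthetic):

import numpy as np
from skimage.metrics import structural_similarity as ssim

img_a = np.random.rand(128, 128)   # hypothetical grayscale image in [0, 1]
img_b = np.clip(img_a + 0.05 * np.random.rand(128, 128), 0, 1)  # perturbed copy

mse = np.mean((img_a - img_b) ** 2)          # lower = more similar
score = ssim(img_a, img_b, data_range=1.0)   # 1.0 = identical
print(mse, score)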
Non-DL approaches to image similarity
Reverse Image Search
An ideal way to find similar images would be to use transfer learning. For example,
pass the images through a pretrained convolutional neural network like ResNet-50,
extract the features, and then use a metric to calculate the distance between the
image feature vectors like Euclidean distance.
Going through the convolution and pooling layers in a CNN is basically an act of
reduction, to filter the information contained in the image to its most important
and salient constituents, which in turn form the bottleneck features (headless CNN
produces feature maps (feature vectors, sometime called bottleneck features)).
Training the CNN molds these values in such a way that items belonging to the same
class have small Euclidean distance between them (or simply the square root of the
sum of squares of the difference between corresponding values is smallest) and
items from different classes are separated by larger distances - essentially what
the embedding layer is doing!
Does a CNN perform dimensionality reduction? Yes:
input: 224 * 224 * 3 = 150,528 values
output: 7 * 7 * 512 = 25,088 values
DL approaches to image similarity
Reverse Image Search
Take a look at the image on the side: the dinosaur is as close to the blonde dog as
the white dog in terms of Euclidean distance. Cosine distance seems to work better
in this case.
Cosine similarity / distance is a metric used to measure how similar vectors
are irrespective of their size. Mathematically, it measures the cosine of the angle
between two vectors projected in a multi-dimensional space.
The cosine similarity is advantageous because even if two similar documents are far
apart by the Euclidean distance (due to the size of the document), chances are they
may still be oriented closer together. The smaller the angle, the higher the cosine
similarity. Cosine similarity ranges between -1 and 1 (cos 0° = 1, cos 90° = 0,
cos 180° = -1).
There is some tension between euclidean (L2) and cosine distance; they can
produce somewhat different results, and it's generally accepted that cosine distance
should be favored. However, I would probably advise trying both & thinking about it
as a hyperparameter to KNN.
We will generalize the cosine distance calculation to N dimensions to calculate the
similarity between images based on CNN-calculated feature vectors (see the sketch
below). This technique is also used for recommendation systems and in NLP for text
similarity that is text-length-insensitive.
More:
https://medium.com/@Intellica.AI/comparison-of-different-word-embeddings-on-text-
similarity-a-use-case-in-nlp-e83e08469c1c
https://cmry.github.io/notes/euclidean-v-cosine#when-to-use-cosine
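A minimal sketch of that N-dimensional generalization (random vectors stand in for CNN features):

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two N-dimensional vectors, size-independent
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.random.rand(2048)   # hypothetical feature vectors
b = np.random.rand(2048)
print(cosine_similarity(a, b))   # 1 = same direction, 0 = orthogonal, -1 = opposite
print(np.linalg.norm(a - b))     # euclidean (L2) distance, for comparison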
Cosine vs. Euclidean distance
Reverse Image Search
We will use transfer learning models from keras trained on the ImageNet(1K) dataset
(source data).
We can use Caltech101 or any other dataset or image (target data) to see how
feature extraction works, see:
http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Note: usually the target image is taken from the source dataset, meaning that we are
literally uploading the same images to find similar items - like we could do in an
e-commerce setting; but a much more powerful application is one that works with
unseen images!
We will extract the features by calling predict on a headless transfer learning
model:

from tensorflow.keras.applications.resnet50 import ResNet50
from numpy.linalg import norm

model = ResNet50(weights='imagenet', include_top=False,
                 input_shape=(224, 224, 3), pooling='max')
features = model.predict(preprocessed_img)  # pass the image through the CNN (no head!)
flattened_features = features.flatten()
normalized_features = flattened_features / norm(flattened_features)  # norm from numpy.linalg

Feature extraction
Reverse Image Search
Mini Project: During training of a CNN you could monitor how the bottleneck features
for two images that belong to the same class get progressively closer
together in the Euclidean feature space, and create a gif illustrating that. “Do
cool s**t and put it on the internet”.

You can explore other techniques for image similarity.


You can research additional classical approaches for image similarity and what apps
it could be used for (this would require understanding their
benefits/characteristics).
Paper to read / implement “Transfer learning for image classification using VGG19:
Caltech-101 image data set”: https://link.springer.com/article/10.1007/s12652-021-
03488-z
Further explorations
Reverse Image Search
Vector databases can be used for more efficient retrieval of similar images
(amongst other things):
https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/vectordb
As of 2023 there is no urgency to test them, learn them, but we can keep an eye on
them.
Further explorations
Reverse Image Search
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Reverse Image Search


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1WkQprYHhlOzjTfmKk2IrfG38vNol5s0T.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Optimization techniques: cython, pypy
Process of optimization
01
02
03
Optimization techniques: numba
Python Crash Course
00
Python performance
05
Parallelism and concurrency
06
Multithreading
07
Multiprocessing
08
Async
04
Optimization techniques: others
09
Microoptimizations
Python performance
Python is an interpreted, dynamically typed language, which makes it slow(er) for
certain kinds of operations compared to other languages (something like that … in
reality languages are neither slow nor fast - their runtimes are). To understand
the problem better let’s go through this article: https://hackernoon.com/why-is-
python-so-slow-e5074b6fe55b
Take note that:
we are talking execution time, mostly for CPU bound processes (not I/O bound).
python is exploding in domains like ML and finance where speed is very important.
developer time is more important than execution time in most cases (100K per year
with a 10% productivity increase?). Established companies transitioned to “faster”
languages at some point in their evolution (twitter → ruby → java, facebook → php →
hack).
There are many measurements or metrics we can compare the performance of our
programs (+languages +libraries) against:
total execution time (s, ms),
startup time (s, ms),
responsiveness to user input,
ability to handle “spikes” (sudden changes in load),
memory footprint (MB, GB),
disk I/O max,
disk I/O total,
networking bandwidth usage (MB, GB),
susceptibility to network latency (and jitter),
usage of network sockets (#),
file handle usage (#) and on.
The most common and important ones are: time and memory usage.
How can we make our python programs faster? … to answer that we probably want to
prove that they are slow first.
NOTE: Python performance as a language will be OK in 95% of the cases; the biggest
bang for the buck is to be achieved by optimizing I/O.
Python Crash Course
Optimization techniques: cython, pypy
Let’s compare Python with C and Js, ref for the implementation:
https://medium.com/codex/how-slow-is-python-compared-to-c-3795071ce82a
Horrible! What are the solutions? Some of them are (almost) zero-cost -
macro-optimizations:
Cython: with cython you write pseudo python code, build an extension in a separate
step and then import that extension as any python library and it should be much
faster because it’s in C! You don’t need to rewrite the whole app just the parts
that are slowest or most frequently called.
Pypy: performance almost for free! A different python interpreter, it needs to be
installed - https://www.pypy.org/download.html. Some disadvantages are noted here
(be careful they can be outdated, you need to benchmark if you want to be sure):
https://www.geeksforgeeks.org/why-pypy3-is-preffered-over-python3/
Mojo: (will see in the future)
Use the newest python version.
Python Crash Course
Optimization techniques: numba
Ref: http://numba.pydata.org/ and you can read a short 5min guide:
https://numba.pydata.org/numba-doc/dev/user/5minguide.html

Which code is best suited for numba? Functions with lots of loops, that use numpy
arrays and functions!
Let’s take a look at @jit, @njit, @vectorize (see the sketch below).
Some advanced usecases: parallelization, running on GPU (CUDA):
https://towardsdatascience.com/supercharging-numpy-with-numba-77ed5b169240 and
https://numba.readthedocs.io/en/stable/cuda/index.html
In serverless / lambda environments it might not be easy to adapt it, but it is
useful to have in the toolbox.
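A minimal @njit sketch - a loop-heavy numpy function, exactly the kind of code numba shines on:

import numpy as np
from numba import njit

@njit   # compiled to machine code on first call; later calls run at C-like speed
def pairwise_sum(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total

data = np.random.rand(2000, 2000)
print(pairwise_sum(data))   # the first call includes compilation time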
Python Crash Course
There are multiple ideas on how to develop a routine for optimizing code and how to
think about it.
Usually it boils down to something like this, see: https://wiki.c2.com/?
RulesOfOptimization
Observe an actual performance issue - there is no issue until someone says: “look,
this is really slow, we need to make it faster”. Remember: “premature optimization
is the root of all evil” and “don’t fix it if it’s not broken”.
Measure the performance: benchmark and profile time and / or memory so you will
know that you actually improved something. There are multiple tools for that (see
the sketch after this list):
https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html
Make sure there are functional tests for the code you are about to optimize!
Find the most expensive part or the part that repeats the most - bottleneck. Or the
most often executed place.
Optimize the most expensive part first and “do the simplest thing first”. Remember:
you can literally rewrite half of the application 5-6 times when optimizing:
rearrange the code, try 5-6 libraries, apply bit twiddling hacks, try different
runtimes, I/O optimization (disk and networking), multithreading and
multiprocessing and even rewriting parts of code in other languages or … buying
better hardware!
Measure again and see if that is enough (usually a 2-3x improvement is more than
enough for most users to notice and be happy; sometimes the optimization has a set
goal (a business requirement, or a technical requirement where another system
depends on the time your system completes a task in, and the chain fails if your
code is not fast enough)).
Ref: https://www.toptal.com/full-stack/code-optimization and
https://llllllllll.github.io/principles-of-performance/how-to-optimize-code.html
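A minimal sketch of the measuring step with the stdlib tools (the function being measured is made up):

import cProfile
import timeit

def slow_sum(n):
    return sum(i * i for i in range(n))

# Benchmark: average wall-clock time over repeated runs.
print(timeit.timeit(lambda: slow_sum(100_000), number=100))

# Profile: which calls actually eat the time?
cProfile.run('slow_sum(1_000_000)')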
Process of optimization
Python Crash Course
Optimization techniques: others
Recommended talk: https://www.youtube.com/watch?v=YjHsOrOOSuI
… in short use python mechanisms instead of writing algorithms in python.
Python Crash Course
Parallelism and concurrency
Concurrency is the execution of multiple instruction sequences at the same time.
This is possible when the instruction sequences to be executed simultaneously have
a very important characteristic, which is that they are largely independent of each
other. This characteristic is important both in terms of the order of execution and
in the use of shared resources.
with order of execution, this means that the order of execution of these
instruction sequences should have no effect on the eventual outcome. If task 1
finishes after tasks 2 and 3, or if task 2 is initiated first, but finishes last,
the eventual outcome should still be the same.
with shared resources, the different instruction sequences should share as few
resources between each other as possible. The more shared resources that exist
between concurrently executing instructions, the more coordination is necessary
between those instructions in order to make sure that the shared resource stays in
a consistent state. This coordination is typically what makes concurrent
programming complicated. However, we can avoid many of these complications by
choosing the right concurrency patterns and mechanisms depending on what we are
trying to achieve. For example read-only operations are easy to make parallel.
Types:
… even instruction level parallelism on a single core of a CPU (SIMD)
Parallel programming - splitting the task into subtasks and assigning the tasks to
separate cores / processors of the machine to be executed simultaneously.
Multiprocessing: best for CPU-bound / CPU-intensive tasks: string processing
(regex), number crunching, search, graphics, etc. Multithreading: best for I/O bound
tasks: db reads/writes, web service calls, data download and upload i.e. file I/O.
Async programming (single thread concurrency) - useful for I/O bound tasks: db
reads/writes, web service calls, data download and upload i.e. file I/O.
Distributed computing / programming (parallelism over multiple machines): DDoS
(primitive version of distributed tasks), Hadoop / Apache Spark computations
(distributed computing engines).
… and so on (multiple datacenters).
Python Crash Course
Parallelism and concurrency

Note, some people say:


concurrency → “doing multiple things at same time while switching between them”.
parallelism → “doing multiple things at same time independently”.
Concurrency vs. parallelism → 2 queues alternating w.r.t. one bank teller vs. 2
queues handled simultaneously by two tellers.
NOTE: in python multithreading does not imply parallelism because of GIL.
Parallelism is a sub-type of concurrency in general.
Python Crash Course
Parallelism and concurrency
One of the best videos on multithreading vs. multiprocessing:
https://www.youtube.com/watch?v=AZnGRKFUU0c
Python Crash Course
Multithreading
Multithreading
A process is a running instance of a program inside the OS that has a process ID
(PID), name, assigned memory, some security attributes, sockets allocated and
associated threads (threads of execution).
A thread is the smallest sequence of instructions that can be managed by the
operating system and independently scheduled to run on the CPU (one thread per CPU
core, for example). Multiple threads belong to a process and share the memory
belonging to that process (while processes cannot access other processes' memory
directly (unless explicitly sent via IPC; otherwise: segmentation fault)).
Threads are used in GUI apps (media player plays the songs, the other thread
downloads the song and the 3rd is responding to user clicks and navigation), web
app - pool of threads to respond to multiple user requests simultaneously (thread
pool pattern).
You have 30 files and 10 threads: in the sequential case it will take 30s, in the
multithreaded case 30 / 10 + C (where C is an overhead coefficient) - the lower
bound of the running time. ~Paradox: a multithreaded program runs faster in terms of
wall-clock time, but performs more instructions in total. Multithreaded programs are
not as energy efficient.
In python we use threads in one of 3 ways (a sketch of each follows below):
passing a function to the threading.Thread constructor (creating a thread object)
creating a class that inherits from threading.Thread and has a run() method
or creating a ThreadPoolExecutor and passing a function to be executed to it
(remember concurrency design patterns?).
A pool is used when there are too many threads to create at once (they are expensive
to create after all, and creating 100 at once can stall the system for a bit) - so
as not to overwhelm the system.
when threads need to be reused (we want to reuse something that is expensive to
create)
when you can only execute a limited amount of threads due to constraints from
external systems - for example an API that starts blocking any client sending more
than 4 r/s.
Let’s see a DEMO of how to create each one in a simple way and then a more realistic
example.
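A minimal sketch of all three ways (download() is a stand-in for real I/O work):

import threading
from concurrent.futures import ThreadPoolExecutor

def download(url):
    print(f'downloading {url}')   # placeholder for real I/O

# 1. Passing a function to the Thread constructor:
t = threading.Thread(target=download, args=('https://example.com/a',))
t.start()
t.join()

# 2. Inheriting from threading.Thread with a run() method:
class Downloader(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url
    def run(self):
        download(self.url)

d = Downloader('https://example.com/b')
d.start()
d.join()

# 3. A ThreadPoolExecutor with a fixed pool of reusable threads:
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(download, [f'https://example.com/{i}' for i in range(10)])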
Python Crash Course
Multithreading
We can monitor threading activity with OS tools
We can suspend a thread to see which thread is responsible for which operation
Python Crash Course
Multithreading
Multithreading, a couple of points to note:
the join() method blocks the main thread until the thread we are waiting for
completes.
each thread has a lifecycle, where in the middle the thread oscillates between the
ready and running states, and is blocked when waiting for I/O.
each thread has its own stack and registers (since a function is run on a thread we
can infer that), but the heap / global variables are shared.
we don’t have control over how the threads are being scheduled to use the CPU,
unless we start blocking / synchronizing them. We can’t choose the CPU core a thread
is run on, how much it is run, or the sequence of threads (again, unless
synchronization is involved).
When the scheduler decides to give some time to another thread it suspends the
current thread mid-execution and performs a thread context switch if the new thread
selected to run is from the same process. If it’s from another process, a more
expensive context switch is performed, see:
https://www.geeksforgeeks.org/difference-between-thread-context-switch-and-process-
context-switch/ and https://stackoverflow.com/questions/5440128/thread-context-
switch-vs-process-context-switch . Both are types of context switch - the process of
saving and restoring the state of a thread or process when the scheduler suspends or
runs the thread.
Python Crash Course
Multithreading
Multithreading, thread interference or race condition:
A problem that is hard to debug, since it may occur in the application only under
heavy load.
Solution: thread synchronization in various ways (locking primarily, but
semaphores, barriers and others are also available).
Python Crash Course
Multithreading
Locking:
A thread acquires a lock when entering a section of code that is used for
manipulating shared data. The lock can only be unlocked by the same thread once
it’s acquired. Other threads, if they enter the same section of code, are BLOCKED
until the first one releases the lock. Releasing locks is usually done in a finally
block or with the with statement (a sketch follows below).
You can do something else if the lock is already acquired; also, you can use
re-entrant locks:
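A minimal locking sketch - without the lock, the final count is usually wrong:

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:        # acquire on entry, release on exit (even on exceptions)
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)            # always 400000 with the lock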

Python Crash Course


Multithreading
Interthread communication
threading.Semaphore - a set of locks that can be given and released when a
thread enters and exits a critical section. Common analogy: a bouncer that admits a
limited number of people to a party, or a security guard admitting only a certain
amount of people (threads) to a shop.

threading.Event - an event is an object created by one thread and then used /


waited upon by other threads. When a thread calls wait() on an event it enters a
blocked state and waits until another thread calls set().

Python Crash Course


Multithreading
Interthread communication
threading.Condition - a more complex interthread communication object. Usually used
for multithreaded producer-consumer pattern implementation, however this is not
used often these days as the producer-consumer pattern can be implemented using
Queues.
Before seeing an example of a Multithreaded Queue let us note, that there are more
threading primitives: https://realpython.com/intro-to-python-threading/#threading-
objects

Python Crash Course


Multithreading
Interthread communication
Queue - just like a queue in real life, allows for processing items in a sequential
manner.
The consumer of the queue takes items as fast as it can w/o being
overwhelmed. This is precisely why queues are sometimes chosen.
It’s a buffer for storing and sequentially retrieving messages / data and,
obviously, a datastructure with a specific API.
An example of message oriented or “message-passing” architecture, rather than a
thread-synchronization and locking based one. A sketch follows below.
Producer-consumer pattern with queue, ref: https://realpython.com/intro-to-python-
threading/#producer-consumer-using-queue
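A minimal producer-consumer sketch with a bounded queue (the sentinel convention is one common way to signal completion):

import queue
import threading

q = queue.Queue(maxsize=10)    # bounded buffer: put() blocks when full
SENTINEL = None

def producer():
    for i in range(20):
        q.put(i)
    q.put(SENTINEL)            # tell the consumer we're done

def consumer():
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        print('consumed', item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()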

Python Crash Course


Multithreading
GIL and multithreading
Use python threading when doing I/O intensive workloads! This is a recommended
practice.
The GIL is a lock inside the python interpreter that prevents multiple threads from
executing on multiple cores at the same time.
It exists to protect the interpreter-internal data structures.
I/O bound work (like web scraping) can benefit from multithreading, since the
majority of time is spent waiting for I/O to complete. When a thread waits it
releases the GIL, and other threads can also initiate or complete their I/O bound
work. (1ms / 1ns ≈ 1M, the same ratio as 11.5 days / 1s.)
A single lock in the GIL, rather than multiple locks, makes the interpreter faster
in a single-threaded scenario - fewer locks to handle.
“GILectomy” is a long discussed process of removing the GIL from CPython; until it
is done there are two alternatives for CPU intensive work: using a non-GIL
interpreter (IronPython, Jython) or Multiprocessing … let’s turn to Multiprocessing
now.
Python Crash Course
Main classes:
Thread: This class represents an individual thread of execution. You can create an
instance of the Thread class by specifying a target function to be executed in the
new thread. The Thread class provides methods for starting, joining, and managing
the thread.
Lock: The Lock class provides a way to synchronize access to shared resources
across multiple threads. It allows you to enforce mutual exclusion so that only one
thread can access the shared resource at a time.
RLock: The RLock class, similar to the RLock class in the multiprocessing package,
represents a reentrant lock. It can be acquired multiple times by the same thread
without causing a deadlock.
Condition: The Condition class provides a way to synchronize threads based on a
condition. It allows threads to wait until a particular condition is satisfied
before proceeding. The Condition class is often used in producer-consumer
scenarios.
Semaphore: The Semaphore class is a synchronization primitive that allows you to
limit the number of concurrent access to a shared resource. It maintains a counter
that represents the number of available resources, and threads can acquire or
release the semaphore based on this counter.
Event: The Event class provides a simple way to communicate between threads using a
flag. It allows one thread to signal an event, and other threads can wait for that
event to occur before continuing their execution.
Barrier: The Barrier class provides a synchronization point for a fixed number of
threads. It allows threads to wait until all participating threads have reached the
barrier before continuing.
Timer: The Timer class is a subclass of Thread that allows you to schedule a
function to run after a certain delay. It is useful when you want to execute a
function in a separate thread after a specified amount of time.

Multithreading
Python Crash Course
Multiprocessing
Multiprocessing
A process (running instance of a program w/ threads, properties and memory) do not
share memory space (like threads inside a process do) by default (shared memory
exists, but it’s not the default)! That means if we could clone a python process we
could potentially execute on multiple procesor cores at the same time!
Other advantages … and disadvantages.

The multiprocessing lib in Python is very similar to the threading library, but
it’s easy to terminate processes from the parent process, while threads are not so
easily terminatable (they can disrespect termination signaling). However, abruptly
terminating processes is only advised if they do not have access to shared
resources, as killing a process will prevent it from running finally blocks and
other exit handlers, potentially leaving the resource in an inconsistent state.
Let’s see a simple multiprocessing demo.
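Something like the following could serve as that demo (a sketch, not the exact one used in class; note the __main__ guard, which multiprocessing needs because child processes re-import the module):

import multiprocessing as mp

def cpu_task(name, n):
    # CPU-bound work that can run on a separate core in its own process.
    total = sum(i * i for i in range(n))
    print(f"{name}: {total}")

if __name__ == "__main__":
    procs = [mp.Process(target=cpu_task, args=(f"worker-{i}", 2_000_000))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()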
Python Crash Course
Multiprocessing
Multiprocessing
We can see that the multiprocessing version is slower - why?
Colab does not allow for multiprocessing (as of 2022).
Spawning processes and threads takes time and CPU cycles! In fact, a multithreaded
or multiprocessed app will always take more CPU cycles than the same app that does
not use these mechanisms, simply due to process and thread management overhead
(start, join and so on, even without any synchronization mechanisms).

Interprocess communication (IPC)


OS supported communication channels: pipes and queues.
multiprocessing.Pipe (bidirectional by default, no locking / consistency
guarantees)
multiprocessing.Queue and .JoinableQueue are more advanced and better

Sharing state
shared memory: multiprocessing.Value
manager processes

Synchronization
Same concepts as in the threading lib
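A small sketch combining the IPC and shared-state primitives named above (values and counts are arbitrary):

import multiprocessing as mp

def producer(q, counter):
    for i in range(3):
        q.put(i)                    # Queue: locking FIFO channel between processes
        with counter.get_lock():
            counter.value += 1      # Value: shared memory guarded by its own lock

if __name__ == "__main__":
    q = mp.Queue()
    counter = mp.Value("i", 0)      # "i" = a C int living in shared memory
    p = mp.Process(target=producer, args=(q, counter))
    p.start()
    for _ in range(3):
        print(q.get())
    p.join()
    print("produced:", counter.value)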
Python Crash Course
Multiprocessing
Main classes:
Process: represents individual process and provides methods for starting,
terminating, and managing it.
Pool: provides a way to create a pool of worker processes. It allows you to easily
parallelize the execution of a function across multiple input values. Manages the
creation and management of the worker processes and provides methods like map(),
apply(), and apply_async() for executing functions in parallel.
Queue: thread-safe First-In-First-Out queue implementation for communication
between processes (IPC). Used to pass data between the parent process and its child
processes.
Lock: a way to synchronize access to shared resources across processes. Mutual
exclusion - only 1 process can access resource at a time.
RLock: a reentrant lock, which means that it can be acquired multiple times by the
same process without causing a deadlock. This is in contrast to a regular Lock,
which would deadlock if a process tries to acquire it more than once.
Event: provides a simple way to communicate between processes using a flag. It
allows one process to signal an event, and other processes can wait for that event
to occur before continuing their execution.
Condition: provides a more advanced synchronization mechanism compared to Lock and
Event. It allows multiple processes to wait for a condition to be satisfied before
proceeding. It is often used in producer-consumer scenarios.
Semaphore: is a synchronization primitive that allows you to limit the number of
concurrent access to a shared resource. It maintains a counter that represents the
number of available resources, and processes can acquire or release the semaphore
based on this counter.
Barrier: synchronization point for a fixed number of processes. Allows processes to
wait until all participating processes have reached the barrier before continuing.
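For instance, Pool.map() from the list above parallelizes a function over an iterable (a toy sketch):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # map() partitions the inputs across 4 worker processes and
    # reassembles the results in the original order.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]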
Python Crash Course
Concurrent.futures
Main classes:
ThreadPoolExecutor: represents an executor that uses a pool of worker threads for
executing tasks concurrently. It allows you to submit functions or callable objects
for asynchronous execution and manages the creation and recycling of threads.
ProcessPoolExecutor: represents an executor that uses a pool of worker processes
for executing tasks concurrently. It is suitable for CPU-bound tasks that benefit
from parallel processing.
Executor: abstract class serves as the base class for ThreadPoolExecutor and
ProcessPoolExecutor. It defines a common interface for submitting tasks for
execution and managing their execution.
Future: represents the result of an asynchronous computation. You can obtain a
future object by submitting a task to an executor. Futures provide methods for
checking the status of the computation, cancelling it, and retrieving the result
when it becomes available.
as_completed(): takes an iterable of futures and returns an iterator that yields
completed futures as they become available. It allows you to process the results of
multiple futures as they finish, without waiting for all of them to complete.
wait(): takes iterable of futures and waits for them to complete. Allows to wait
for a group of futures to finish and optionally specify a timeout.
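A minimal ThreadPoolExecutor + as_completed() sketch (the URL list is just an example):

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

URLS = ["https://example.com", "https://www.python.org"]

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch, u) for u in URLS]
    for future in as_completed(futures):   # yields each future as it finishes
        url, size = future.result()
        print(f"{url}: {size} bytes")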
Python Crash Course
Async
Asynchronous programming is based on single-threaded asynchrony for multiplexing
I/O-bound workloads, i.e. a single thread switching between tasks while it waits on
I/O.
Historically came to prominence with event driven architectures: nginx, nodejs -
use event loops.
While the event loop is internal in nodejs, in Python, if we decide to use async,
we can manage the event loop - we need to engage it explicitly.
A great chess analogy: https://realpython.com/async-io-python/#async-io-explained …
and let’s run some simple demos from this tutorial.
With Python, use the asyncio package to perform explicit asynchronous programming.
We “decorate” our functions with the async keyword, making them into coroutines -
special functions that are wrapped and that have the ability to be executed
asynchronously.
Let’s see a simple demo and get familiar with the syntax - some great examples in
the tutorial: https://docs.python.org/3/library/asyncio-task.html
Why is asynchronous programming useful? A single thread switching between tasks
when they are blocked is much more memory efficient than multiple threads that are
each separately blocked.
NOTE: just like with multiprocessing, there are limitations when using async within
Google Colab: https://stackoverflow.com/questions/55409641/asyncio-run-cannot-be-
called-from-a-running-event-loop
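A minimal asyncio sketch of the ideas above (the delays are arbitrary):

import asyncio

async def fetch(name, delay):
    # await suspends this coroutine and lets the event loop run others.
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"

async def main():
    # gather() runs the coroutines concurrently on a single thread.
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1))
    print(results)  # finishes in ~1s, not ~2s

asyncio.run(main())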
Python Crash Course
Async
Main classes:
EventLoop: represents an event loop, which is responsible for executing coroutines
and managing their execution. The event loop provides methods for running
coroutines, scheduling callbacks, and managing timeouts.
Task: represents a coroutine wrapped in a future. It allows you to schedule and
manage the execution of a coroutine in the event loop. Tasks can be used to monitor
the progress and retrieve the result of a coroutine.
Future: represents the eventual result of an asynchronous operation. Futures
provide a way to interact with coroutines and receive their results. They can be
awaited, cancelled, and used to add callbacks for handling the completion of an
operation.
Semaphore: is a synchronization primitive that limits the number of coroutines that
can access a shared resource concurrently. It uses counters to control the
concurrency and allows coroutines to acquire or release the semaphore.
Queue: an asynchronous queue implementation that allows communication between
coroutines. Coroutines can put items into the queue and get items from it
asynchronously.
Lock: provides a way to synchronize access to shared resources in an asynchronous
context. It allows only one coroutine to acquire the lock at a time, ensuring
mutual exclusion.
Condition: provides an asynchronous equivalent to the Condition class in the
threading package. It allows coroutines to wait for a condition to be satisfied
before proceeding. Coroutines can wait, notify, or notify_all on a condition
object.
TimerHandle: represents a timer handle that can be used to schedule a callback to
be executed after a certain time interval. It allows scheduling timed events in the
event loop.
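As one concrete example from the list, a Semaphore limiting coroutine concurrency (the numbers are arbitrary):

import asyncio

async def worker(sem, i):
    async with sem:              # acquire/release the semaphore
        await asyncio.sleep(1)
        return i

async def main():
    sem = asyncio.Semaphore(3)   # at most 3 workers inside at once
    # 9 tasks with 3 allowed concurrently -> roughly 3 seconds total
    print(await asyncio.gather(*(worker(sem, i) for i in range(9))))

asyncio.run(main())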

Python Crash Course


Microoptimizations
while 1 vs. while True (if you wait on a socket, like a web server).
no longer true in Python 3, see: https://stackoverflow.com/a/3815387/1964707
more about bytecode inspection: https://towardsdatascience.com/understanding-
python-bytecode-e7edaae8734d
__slots__
allow memory savings when we have many instances of a class
usually useful when you hold a lot of objects in a list or other data structure
(see the sketch below)
Ref0: gentle intro: https://youtube.com/watch?v=1UBr94hg0FE
Ref1: more technical overview: https://www.youtube.com/watch?v=Iwf17zsDAnY
More: https://stackify.com/20-simple-python-performance-tuning-tips/
Use the newest version of python if you can
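A quick way to see the __slots__ saving (a sketch; exact byte counts vary by Python version):

import sys

class Plain:
    def __init__(self, x, y):
        self.x, self.y = x, y

class Slotted:
    __slots__ = ("x", "y")      # no per-instance __dict__ is allocated

    def __init__(self, x, y):
        self.x, self.y = x, y

p, s = Plain(1, 2), Slotted(1, 2)
print(sys.getsizeof(p.__dict__))  # the dict every Plain instance carries
print(sys.getsizeof(s))           # Slotted instances are fixed-size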
Python Crash Course
import dis

def f1():
    while True:
        pass

def f2():
    while 1:
        pass

# In Python 3 both functions compile to the same bytecode, so the old
# "while 1 is faster" trick no longer applies (see the link above).
print(dis.dis(f1))
print("---------")
print(dis.dis(f2))
TBD
// disassembling python
// try PyPy
// try Mojo
// do not parallelize until you have optimized via elimination. If sequential
operations are expensive and take a lot of time, parallelizing them can completely
overwhelm the bottlenecked system.
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on

Additional information
--- Content from 1ZXHvn6uwORVQh6P4Ok2CGorMvyyhNzhy.pptx ---
Artificial Intelligence
Sequential Data Analysis
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


FFT, applications
01
02
AR, MA, AR+MA
Sequential Data Analysis
00
Regression or not?
ARCH, GARCH
03
04
ARIMA + GARCH
Anomaly detection
07
08
Deviation bands
06
Multivariate forecasting w/ VAR
STL for anomaly detection
09
10
Luminol library
05
Seasonal decomposition & STL
11
Prophet library
Linear regression (OLS) assumes you have independently and identically distributed
(IID) data. This is not the case with time series data. In time series data, points
near in time (lags) tend to be strongly correlated with one another. In fact, when
there are no temporal correlations, time series data is hardly useful for
traditional time series tasks, such as predicting the future or understanding
temporal dynamics. This makes OLS usage for time series data complicated.
However this does NOT mean that linear regression CAN'T be tried on TS data. It
can, but there are 6 mathematical conditions for it to be used (conditions on both
the data and the errors - the variance of the errors must be independent of time,
for example), making it impractical and very constrained. You would need to prove
these assumptions hold for your dataset, and only then would you know that you can
apply LR. Concretely, we can summarize one of the analysis aspects that needs to be
fulfilled before applying linear regression to time series as the Gauss-Markov
theorem: "OLS is BLUE". This refers to the theorem that if your linear regression
model satisfies the first six classical assumptions, then ordinary least squares
(OLS) regression produces unbiased estimates that have the smallest variance of all
possible linear estimators. Read more about it here:
https://towardsdatascience.com/how-to-model-time-series-data-with-linear-
regression-cd94d1d901c0
Note also that linear models are popular even in fields where those assumptions do
not hold, like high-frequency trading, although they are only used when the stakes
are not high and only because they are fast [citation needed/source?]
Before exploring machine learning methods for time series, it is a good idea to
ensure you have exhausted classical linear time series forecasting methods.
Classical time series forecasting methods may be focused on linear relationships,
nevertheless, they are sophisticated and perform well on a wide range of problems,
assuming that your data is suitably prepared and the method is well configured.
We will look at:
alternatives to linear regression (OLS) - autoregressive models, which aim to
exploit autocorrelation in a time series to predict future values.
Fourier analysis - a method for translating between the time-domain and the
frequency-domain representations of a time series (signal processing)
Ref:
https://www.reddit.com/r/statistics/comments/6ltzyw/why_cant_you_use_linear_regress
ion_for_time/
Regression or not?
Sequential Data Analysis
Fourier analysis is the study of the way general functions may be represented or
approximated by sums of simpler trigonometric functions (how complex waveforms can
be decomposed into simple sinusoids and vice versa).
Some terms:
Fourier Transform: decomposes a complex wave pattern in the time domain into a
frequency-domain representation (which can then be presented as separate sine
waves).
Inverse Fourier transform: mathematically synthesizes the original function from
its frequency-domain representation.
Fast Fourier transform (FFT): the algorithm that efficiently computes the Fourier
transform of a signal's discrete representation on an electronic device.
Some materials:
Intro to FT: https://www.youtube.com/watch?v=spUNpyF58BY
Intro to FT (w/ Homer Simpson): https://betterexplained.com/articles/an-
interactive-guide-to-the-fourier-transform/
DFT: https://www.youtube.com/watch?v=nl9TZanwbBk
FFT: https://www.youtube.com/watch?v=E8HeD-MUrjY
FFT Algorithm: https://www.youtube.com/watch?v=toj_IoCQE-4
Denoising with FFT, Python: https://www.youtube.com/watch?v=s2K1JfNR7Sc
FFT, applications
Sequential Data Analysis
Every signal in the real world is a time signal and is made up of many sinusoids of
different frequencies. A time-domain signal can therefore be converted into the
frequency domain to view the different frequency components of the temporal signal.
One important field where this is used is predictive maintenance (also signal
processing in general, very common in wireless data transmission with various
protocols).
What is predictive maintenance? The activity of trying to gain understanding that
would lead to forecasting of failures of engineering systems (trains, ships, hard
drives, turbines, to name a few). Everything that has a circular motion (and not
only that) can also have a periodic vibration, which can be analyzed with the
Fourier transform.
Some materials:
Video: https://www.youtube.com/watch?v=0TH5SLghYPY
Video: https://www.youtube.com/watch?v=DUznmZvSQOU
Bearing:
https://www.researchgate.net/publication/258178475_The_effects_of_the_shape_of_loca
lized_defect_in_ball_bearings_on_the_vibration_waveform#pf7
The most important application of Fourier transform in context of predictive
maintenance is vibration analysis which makes use of the fact that all rotating
equipment vibrates to a certain degree. The incoming vibration data from the
sensors is converted into the frequency domain where you can analyze the frequency
of vibration and compare it to the standard baseline to see if your equipment is
functioning optimally or not.
FFT, applications
Sequential Data Analysis
FFT can be used for data / signal denoising - eliminating noise, purifying the
signal.
How can noise enter the system? Think about a networking cable near powerful
electromagnetic equipment. The equipment will have a certain frequency (from an
electric motor) that will interfere with the internet cable.
After obtaining the frequency and amplitude information we can use a Butterworth
low-pass filter and ignore data above a certain frequency, where the noise is
present (a NumPy sketch follows the references below).
References:
A simpler example is available here: https://medium.com/@nehajirafe/using-fft-to-
analyse-and-cleanse-time-series-data-d0c793bb82e3
Another example of denoising: https://www.youtube.com/watch?v=s2K1JfNR7Sc
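A minimal NumPy denoising sketch in the spirit of the references above (the signal, sampling rate and 20 Hz cutoff are all made up; scipy.signal would provide the proper Butterworth filter):

import numpy as np

fs = 500                                  # sampling rate, Hz
t = np.arange(0, 2, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)  # 5 Hz tone + noise

X = np.fft.rfft(x)                        # to the frequency domain
freqs = np.fft.rfftfreq(t.size, 1 / fs)
X[freqs > 20] = 0                         # crude low-pass: drop everything above 20 Hz
clean = np.fft.irfft(X, n=t.size)         # back to the time domain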
FFT, applications
Sequential Data Analysis
Forecasting with Fourier extrapolation - repeats your series with period N, where N
is the length of your time series.
Example:

Note: forecasting / extrapolation is just incidental for FFT, while for the next
few models it will be the main purpose. Keep that in mind and don’t think that FFT
and ARMA are closely related - they are not.
FFT, applications
Sequential Data Analysis
Autoregressive (AR) models operate under the premise that past values have an
effect on current values. AR models are commonly used in analyzing nature,
economics, and other time-varying processes. As long as the assumption holds, we
can build an AR model that attempts to predict the value of a dependent variable
today, given the values it had on previous days.
The Moving Average (MA) model assumes that the current value of the dependent
variable depends on the previous days' error terms.
The ARMA model is simply the combination of the AR and MA models.
The AR + MA mechanisms are at the core of a whole family of linear autoregressive
models, which includes ARMA, ARIMA, SARIMA and SARIMAX.
Important: all of these models work well only for short forecasting
horizons/periods, unless the pattern is simple to forecast. There seems to be no
rule of thumb as to how many values (10%, 1%) you can forecast, but at any rate you
are safest when you forecast just a single value - a single-value forecast should
be the most accurate.
Here is a theoretical introduction: https://www.youtube.com/watch?v=kaXKnjCvEUQ
AR, MA, AR+MA
Sequential Data Analysis
Autoregressive Moving Average (ARMA) method models the next step in the sequence as
a linear function of observations and residual errors at prior time steps (lags,
which are the parameters p and q).
The notation for the model involves specifying the order for the AR(p) and MA(q)
models as parameters to an ARMA function, i.e. ARMA(p, q).
An ARIMA model can be used to develop AR or MA models. This is a combination of AR
+ MA models with the added parameter to adjust for non-stationarity of the mean
function (i.e., the trend) via differencing. ARIMA(p, 0, q) is the same as ARMA(p,
q)
Types of ARMA Models:
ARIMA: Non-seasonal Autoregressive Integrated Moving Averages
SARIMA: Seasonal ARIMA used for seasonal data, see: https://www.youtube.com/watch?
v=IK67f3IItfw
SARIMAX: Seasonal ARIMA with exogenous variables.
We can use the pmdarima (pyramid auto-ARIMA) package:
https://pypi.org/project/pmdarima/ - although the library is called pmdarima, it
does evaluate SARIMA models if seasonality is set to true: auto_arima(...,
seasonal = True, …). A usage sketch follows below.
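Usage is roughly as follows (a sketch; y stands for your series, and m=12 assumes monthly data):

import pmdarima as pm

model = pm.auto_arima(
    y,                       # 1D array / pandas Series of observations
    seasonal=True, m=12,     # m = season length (12 for monthly data)
    stepwise=True,           # stepwise search over (p, d, q)(P, D, Q)
    suppress_warnings=True,
)
print(model.summary())
forecast = model.predict(n_periods=10)    # forecast the next 10 steps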
AR, MA, AR+MA
Sequential Data Analysis
Parameter selection process for autoregressive models:
ACF and PACF plots are used to determine the order of the MA and AR terms
respectively.
d - (differencing in ARIMA) check for stationarity with the augmented Dickey-Fuller
test; choose the parameter by trying different values (possibly in a loop).
p - (for AR) use PACF; the model order should be where the lags stop rising above
the significance line
q - (for MA) use ACF; the model order should be where the lags stop rising above
the significance line
the residuals of the model need to have constant mean, no autocorrelation, and be
normally distributed
If you violate these 3 principles in the residual data, then you need to tune your
model or choose a better one, because there is useful information still there.
Or you can use pmdarima, which can select the best model (a statsmodels sketch of
the manual steps follows the resources below).
Good resources:
https://www.baeldung.com/cs/acf-pacf-plots-arma-modeling
https://people.duke.edu/~rnau/411home.htm
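The selection steps above map to statsmodels roughly like this (a sketch; y is the series under study):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(y)
print(f"ADF p-value: {p_value:.4f}")   # > 0.05 suggests differencing (raise d)

plot_acf(y, lags=40)    # read q off the ACF
plot_pacf(y, lags=40)   # read p off the PACF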
AR, MA, AR+MA
Sequential Data Analysis

AR, MA, AR+MA


Sequential Data Analysis
A change in the variance or volatility over time can cause problems when modeling
time series with classical methods like ARIMA. This phenomenon is often present and
is also known as conditional variance, or volatility clustering.
The ARCH or AutoRegressive Conditional Heteroskedasticity method provides a way to
model a change in variance in a time series that is time dependent (but fixed),
such as increasing or decreasing volatility. Example: electrical grid usage TS data
- the volatility is bigger during the day and smaller at night, so the volatility
is changing, but in a non-random (non-stochastic) manner.
Examples of heteroscedastic data: climate, real estate prices, salaries (does
government fight heteroscedasticity; is it natural?)
An extension of this approach named GARCH or Generalized Autoregressive Conditional
Heteroskedasticity allows the method to support changes in the time-dependent
volatility (the volatility periods are themselves stochastic), such as increasing
and decreasing volatility in the same series. GARCH is a better fit for modeling
time series data when the data exhibits heteroskedasticity and volatility
clustering. ARMA models are only effective for homoskedastic data.
ARCH, GARCH
Sequential Data Analysis
Hetero and homoskedasticity - One aspect of a univariate time series that these
autoregressive models do not model is a change in the variance over time.
Classically, a time series with modest changes in variance can sometimes be
adjusted using a power transform, such as by taking the Log or using a Box-Cox
transform. There are some time series where the variance changes consistently over
time. In the context of a time series in the financial domain, this would be called
increasing and decreasing volatility.
In time series where the variance is increasing in a systematic way, such as an
increasing trend, this property of the series is called heteroskedasticity. It’s a
fancy word from statistics that means changing or unequal variance across the
series. If the change in variance can be correlated over time, then it can be
modeled using an autoregressive process, such as ARCH.
There are many variations of these models:
https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity
ARCH, GARCH
Sequential Data Analysis
Stock market prices are a classic stochastically heteroscedastic TS example. But
the values are also dependent on the past values, which means AR/MA models also
apply. These properties imply that we need to use some combination of ARMA + ARCH
models to account for all the information that is embedded in the data (to leave
fewer unexplained patterns, smaller residuals): arch.arch_model(arima_residuals,
p=1, q=1)
How to choose p and q values for GARCH? See:
https://stats.stackexchange.com/questions/175400/optimal-lag-order-selection-for-a-
garch-model
Residual trick: You may choose to fit an ARMA model first and then fit a GARCH
model on the ARMA residuals, but this is not the preferred way. Your ARMA estimates
will generally be inconsistent. (In a special case where there are only AR terms
and no MA terms, the estimates will be consistent but inefficient.) This will also
contaminate the GARCH estimates. Therefore the preferred way is to estimate both
ARMA and GARCH models simultaneously for which there is no Python library (but
there is in R) AFAIA.
One way to overcome this problem (no residual trick) is to train a lot of different
ARIMA(p1, d, q1)-GARCH(p2, q2) models, and select the best working one based on
criteria such as AIC or BIC.
AIC and BIC are Information criteria methods used to assess model fit while
penalizing the number of estimated parameters. When performing model selection, the
one with the lowest AIC or BIC is preferred. BIC penalizes the amount of model
terms more than AIC.
[TODO] Link to AIC vs. BIC.
Research from the real world: https://link.springer.com/article/10.1007/s12053-019-
09800-3
Reference: https://medium.com/analytics-vidhya/arima-garch-forecasting-with-python-
7a3f797de3ff
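The two-step residual trick could look like this (a sketch; `returns` is assumed, the orders are illustrative, and remember the caveat above that joint estimation is preferred):

from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

arima_fit = ARIMA(returns, order=(1, 0, 1)).fit()                  # mean model first
garch_fit = arch_model(arima_fit.resid, p=1, q=1).fit(disp="off")  # GARCH(1,1) on residuals
print(garch_fit.summary())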
ARIMA + GARCH
Sequential Data Analysis
We know that STL stands for seasonal-trend decomposition procedure based on Loess.
This technique gives you an ability to split your time series signal into three
parts: seasonal, trend and residue.
Its more primitive relative - seasonal decomposition - suffers from shortcomings:
Assumes a constant seasonal component
Is not defined at the beginning of the series, since it uses a moving average
Due to the moving average it is not sensitive in catching fast rises
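With statsmodels, STL usage looks roughly like this (a sketch; `series` is assumed to be a pandas Series with a regular DatetimeIndex and a season length of 12):

from statsmodels.tsa.seasonal import STL

res = STL(series, period=12, robust=True).fit()
res.plot()               # seasonal, trend and residual panels
residual = res.resid     # what is left after trend + seasonality are removed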
Seasonal decomposition & STL
Sequential Data Analysis
Vector Autoregression (VAR) is a forecasting algorithm that can be used when two or
more time series influence each other. That is, the relationship between the time
series involved is bi-directional. It is considered as an Autoregressive model
because, each variable (Time Series) is modeled as a function of the past values,
that is the predictors are nothing but the lags (time delayed value) of the series.
How is VAR different from other Autoregressive models like AR, ARMA or ARIMA? The
primary difference is those models are uni-directional, where, the predictors
influence the Y and not vice-versa. Whereas, Vector Auto Regression (VAR) is bi-
directional. That is, the variables influence each other.
In the VAR model, each variable is modeled as a linear combination of past values
of itself and the past values of other variables in the system. Since you have
multiple time series that influence each other, it is modeled as a system of
equations with one equation per variable (time series).
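For two series, the standard VAR(1) system (reconstructed here for reference) is:

Y_{1,t} = \alpha_1 + \beta_{11} Y_{1,t-1} + \beta_{12} Y_{2,t-1} + \epsilon_{1,t}
Y_{2,t} = \alpha_2 + \beta_{21} Y_{1,t-1} + \beta_{22} Y_{2,t-1} + \epsilon_{2,t}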
where Y_{1,t-1} and Y_{2,t-1} are the first lags of time series Y1 and Y2,
respectively. The above equations are referred to as a VAR(1) model, because each
equation is of order 1, that is, it contains up to one lag of each of the
predictors (Y1 and Y2). As you increase the number of time series (variables) in
the model, the system of equations becomes larger.
Multivariate forecasting w/ VAR
Sequential Data Analysis
The anomaly detection problem for time series is usually formulated as finding
outlier data points relative to some standard or usual signal. We ask "what doesn't
fit here?" when looking for anomalies.
Anomaly - something that does not follow a pattern seen before, or the sudden
appearance of a pattern where there was only randomness.
While there are plenty of anomaly types, we’ll focus only on the most important
ones from a business perspective, such as unexpected spikes, drops, trend changes,
level shifts (steps), even sudden disappearance of noise in the industry, sudden
appearance of pattern. In general it might be good to talk about point and trend
anomalies:
Point anomalies are sudden spikes (values that exceed what was seen before)
Pattern anomalies are a bunch of values that don't follow the previous pattern
(trend change, pattern change)
There are many libraries for anomaly detection and it's a very active field:
https://github.com/rob-med/awesome-TS-anomaly-detection
Anomaly detection
Sequential Data Analysis
Terminology used

Anomaly vs. Outlier: Very important when searching for literature and material
(some books and articles are called 'outlier detection', others 'anomaly
detection')! These terms are not strictly defined nor differentiated, and some
argue that there is no difference, especially in everyday talk, but in academic
literature and when searching for information online there is some difference. Read
more: https://datascience.stackexchange.com/questions/24760/what-is-the-difference-
between-outlier-detection-and-anomaly-detection

“Retrospective” vs. realtime: most anomaly detection algorithms are based on
“retrospective” anomaly detection. We are going to do that as well, as this is not
a dedicated course. Realtime anomaly detection can be harder or much harder
depending on the requirements (both methodologically and in terms of
implementation). See:
https://medium.com/pinterest-engineering/building-a-real-time-anomaly-detection-
system-for-time-series-at-pinterest-a833e6856ddd
Anomaly detection
Sequential Data Analysis
We can use rolling statistics to define deviation bands - rolling average + rolling
stddev, for example.
There are common techniques in the literature: ESD (exponential deviation) bands,
Bollinger Bands.
This is a simple and intuitive approach to this problem, however there are
shortcomings:
Inability to detect a change in volatility
Trend changes (depending on tolerance)
Appearance or disappearance of different patterns
They can be overcome using other simple techniques, like detecting changes in the
rolling standard deviation (see the sketch below).
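A pandas sketch of such bands (the window size and the 3-sigma threshold are arbitrary choices):

import pandas as pd

# `s` is assumed to be a pandas Series of observations.
window = 30
mean = s.rolling(window).mean()
std = s.rolling(window).std()

upper = mean + 3 * std                     # 3-sigma deviation bands
lower = mean - 3 * std
anomalies = s[(s > upper) | (s < lower)]   # points outside the bands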
TODO: crest factor based, RMS, statistical moments - other feature engineering
techniques: https://www.analyticsvidhya.com/blog/2022/01/moments-a-must-known-
statistical-concept-for-data-science/
Deviation bands
Sequential Data Analysis
We can simplify anomaly detection with STL.
If you analyze the deviation of the residual and introduce some threshold for it,
you’ll get an anomaly detection algorithm. The non-obvious part here is that you
should use median absolute deviation rather than a plain 3-or-4-sigma threshold
(the three-sigma rule states that 99.73% of values in a normal distribution lie
within three standard deviations of the mean) to get a more robust detection of
anomalies.
The leading implementation of this approach is Twitter’s Anomaly Detection library
(available only in R, abandoned project; the algorithm is described here:
https://arxiv.org/pdf/1704.07706.pdf ). It uses the Generalized Extreme Studentized
Deviate (ESD) test to check whether a residual point is an outlier.
STL for anomaly detection
Sequential Data Analysis
Luminol is a lightweight Python library for time series data analysis created by
LinkedIn.
The two major functionalities it supports are anomaly detection and correlation
between two time series.
It can be used to investigate possible causes of anomaly.
Ref: https://github.com/linkedin/luminol
Luminol library
Sequential Data Analysis
Prophet, or “Facebook Prophet,” is an open-source library for univariate (one
variable) time series forecasting developed by Facebook. Prophet implements what
they refer to as an additive time series forecasting model, and the implementation
supports trends, seasonality, and holidays (significant events).
Ref: https://facebook.github.io/prophet/docs/quick_start.html
Previously import fbprophet now import prophet
Less tuning - can be compared to auto arima in that sense.
Implementations: https://towardsdatascience.com/anomaly-detection-time-series-
4c661f6f165f and https://www.youtube.com/watch?v=0wfOOl5XtcU
Library link: https://github.com/facebook/prophet
Prophet library
Sequential Data Analysis
The envelope of an oscillating signal is a smooth curve outlining its extremes.
Analysis of (wave) envelopes emphasizes the extremes of the time signal. This is
contrary to a lot of techniques, which would dismiss extremes in favor of the more
regular pattern.
Very important in mechanical bearing analysis, ref:
https://www.bksv.com/media/doc/bo0187.pdf
https://en.wikipedia.org/wiki/Envelope_(waves)
https://hal.science/hal-01714334/document
Envelope analysis
Sequential Data Analysis
Fit an OLS regression for time series and compare it to ARMA-family models and
others. Try to prove that OLS is not as useful as autoregressive models (...or
disprove it).
You can explore whether ANOVA methods are used for time series analysis.
You can also investigate Markov models.
Explore the Luminol and Prophet libraries further - are there better libraries?
Which has more features?
Investigate different forecasting models for robustness against anomalies /
outliers (fit ARIMA on an ideal sinusoid, introduce a 1-point anomaly or step
anomaly, see what happens to the prediction, … etc.)
Further explorations
Sequential Data Analysis
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Sequential Data Analysis


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1D9B5X8-cmi0P8aX8eMFfyKgPsOtxsnGw.pptx ---


Artificial Intelligence
Natural Language Processing
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


KMeans
01
02
Gaussian Naive Bayes
Natural Language Processing
00
Classifying with ML
Topic modeling
03
04
Keyword extraction with RAKE
What steps do I need to take to develop an ML classification model (with some
exceptions for DL - not all steps are the same)?
Obtain text samples
Stopword removal - stop words like “and”, “if”, “the”, etc. are very common in all
English sentences and are not very meaningful in deciding the theme of an article,
so these words can be removed from the articles.
Removal of sensitive and redundant data - removing emails or usernames from data
that is supposed to be general, in order to remove model dependence (bias) on
specific information. Or replace them with tokens: <email>, <username>, etc.
Punctuation removal - exclude all punctuation marks from the text:
! # " % $ ' & ( ) + * - , / . ; : = < > ? @ [ ] \ _ ^ ` { } | ~
Word Lemmatization - It is the process of grouping together the different inflected
forms of a word so they can be analyzed as a single item. For example, “include”,
“includes,” and “included” would all be represented as “include”. The context of
the sentence is also preserved in lemmatization as opposed to stemming (another
buzz word in text mining which does not consider the meaning of the sentence).
Digit removal
Feature extraction / vectorization (Tf-Idf, count, others)
Model choice
Model training
Model evaluation
Model tuning (until satisfactory performance is achieved; a sketch of the
preprocessing steps follows below)
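The preprocessing steps above might be sketched like this with NLTK (assumes nltk.download("stopwords") and nltk.download("wordnet") have been run; the regexes are illustrative):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"\S+@\S+", "<email>", text)   # mask sensitive data
    text = re.sub(r"[^\w\s<>]", " ", text)       # punctuation removal (keep <email>)
    text = re.sub(r"\d+", " ", text)             # digit removal
    tokens = [lemmatizer.lemmatize(w) for w in text.lower().split()
              if w not in stop_words]            # stopword removal + lemmatization
    return " ".join(tokens)

print(preprocess("Includes, included: emails like [email protected] and 42 numbers!"))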
Classifying with ML
Natural Language Processing
The same theory about classification applies to NLP.
Binary classification example: positive/negative reviews (sentiment analysis).
Multiclass classification example: questions redirected to a specific department
based on how they were classified (FAQ).
Classifying with ML
Natural Language Processing
K-means clustering can be used to understand the topic of a text.
We train the model on some data - for example sentences about cats and Google.
This should create two clusters that correspond to these topics - although there is
no guarantee that the clustering centroids will be "cat" and "google", they likely
will be if the training data is good.
Then, passing a new sentence should give us a prediction based on the closeness of
the clusters (sketch below).
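A toy sketch of that idea (the corpus is made up; real cluster quality needs far more data):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cats purr and chase mice",
    "my cat sleeps all day",
    "google indexes the web",
    "google search ranks pages",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)                                        # cluster of each training doc
print(km.predict(vec.transform(["the cat chased a mouse"])))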
KMeans
Natural Language Processing
Commonly a naive Bayes model is used for classification.
One of the popular examples in the literature, spam/ham classification, is often
implemented with a naive Bayes model.
We will perform a classification type called sentiment analysis, which tries to
attribute a piece of text to the categories “positive” / “negative”. The IMDB movie
review dataset is commonly used for this purpose.
To refresh your knowledge about NB, see: https://www.youtube.com/watch?
v=H3EjCKtlVog
NB model can be used for multiclass classification.
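A minimal sentiment sketch (the four reviews are made up; MultinomialNB is the usual variant for word counts, since GaussianNB expects dense continuous features):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["great movie, loved it", "terrible plot, awful acting",
           "wonderful performance", "boring and bad"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(reviews, labels)
print(clf.predict(["what a great performance"]))   # expected: ['positive']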
Gaussian Naive Bayes
Natural Language Processing
Definition - essentially topic extraction from documents, aka describing what the
document is about (unsupervised) - topics (technical term). People are good at
describing what a given document is about; let's see how it's done with machines.
Sometimes in this NLP context topics are described as latent factors (that is why
these models can be used in a larger pipeline of techniques).
Topic modeling
Natural Language Processing
Concrete description from wiki (rephrased and modified):
In machine learning and natural language processing, a topic model is a type of
statistical model for discovering the abstract "topics" that occur in a collection
of documents. Topic modeling is a frequently used text-mining tool for discovery of
hidden semantic structures in a text body.
Intuitively, given that a document is about a particular topic, one would expect
particular words to appear in the document more or less frequently: "dog" and
"bone" will appear more often in documents about dogs, "cat" and "meow" will appear
in documents about cats, and "the" and "is" will appear approximately equally in
both.
A document typically concerns multiple topics in different proportions; thus, in a
document that is 10% about cats and 90% about dogs, there would probably be about 9
times more dog words than cat words.
The "topics" produced by topic modeling techniques are clusters of similar words. A
topic model captures this intuition in a mathematical framework, which allows
examining a set of documents and discovering, based on the statistics of the words
in each: what the topics might be and what each document's balance of topics is.
Thus enabling similar document retrieval (recommendation in medium, substack).
Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A
good topic model should result in – “health”, “doctor”, “patient”, “hospital” for a
topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – “Farming”.
Topic modeling
Natural Language Processing
Applications of topic modeling
For example, the New York Times uses topic models to boost their user-article
recommendation engines. Various professionals are using topic models in the
recruitment industry, where they aim to extract latent features of job
descriptions and map them to the right candidates. They are being used to organize
large datasets of emails, customer reviews, and user social media profiles.
Additionally topic model algorithms are unsupervised, but can be used in
combination with supervised algorithms for various tasks, see:
https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-
learning-input-cf8ee9e5cf28
Topic modeling
Natural Language Processing
Let’s discuss the input → model → output chain
Input - document-word matrix
Models
Many exist: LDA, LSA, NMF and more
The objective of all the algorithms is the same: extract the topics (sometimes
called 'the latent factors'), but the implementations differ.
All of these algorithms are complex (although computationally simpler than deep
neural networks). Understanding them completely requires quite advanced linear
algebra. We will cover how to use them and how to understand them from a more
general vantage point. A more in-depth introduction can be found in numerous books
on NLP, but a good starting point would be this:
https://medium.com/@souravboss.bose/comprehensive-topic-modelling-with-nmf-lsa-
plsa-lda-lda2vec-part-1-20002a8e03ae
Output - two matrices: document-topic and topic-word matrices (a number of topics
will be much smaller than the original vocabulary). Additionally, sampling will be
performed in order to improve the topic model. The details are mathematical, so we
will not cover them, but some details can be found here:
https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-
python/
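The input → model → output chain above, as a toy LDA sketch in scikit-learn (the corpus and topic count are made up):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the bone", "dogs love bones and walks",
        "cats meow at night", "my cat naps on the mat"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)                  # input: document-word matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
doc_topic = lda.transform(dtm)                 # output: document-topic matrix
topic_word = lda.components_                   # output: topic-word matrix

terms = vec.get_feature_names_out()
for k, row in enumerate(topic_word):
    print(f"topic {k}:", [terms[i] for i in row.argsort()[-3:]])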
Note: in NLP there are various resolutions of data through which problems are
solved: character-level models, word level, sentence/statement level, document
level (we could separate out n-grams and paragraphs, but these are less common).
Topic modeling is most often a document-level model (because we might want to use
it for recommendations of similar articles).
Topic modeling
Natural Language Processing
A fellow Lithuanian describes his experience with topic modeling:
https://www.youtube.com/watch?v=3mHy4OSyRf0
Criticisms of LDA: https://eigenfoo.xyz/lda-sucks/
Topic modeling
Natural Language Processing
Keywords answer the question: what is important about this document, or more
rigorously: "which sequences of contiguous words are important in the document".
Keywords and topics are closely related, so a topic model can be used in place of a
keyword extraction approach.
However, in this section we want to get familiar with the RAKE algorithm - Rapid
Automatic Keyword Extraction. The assumption RAKE makes is that keywords rarely
contain stop words or punctuation, so candidate keywords are the runs of contiguous
words between them.
Keyword extraction with RAKE
Natural Language Processing
RAKE uses a scoring system. How does it calculate the score for a word? score =
degree / frequency
frequency: how often the word appears
degree: how often the word co-occurs with other words
So a word that co-occurs frequently, but only with a small subset of words, will
have a high score
Generating keywords from text and providing automatic search results. Practical
project example: https://www.youtube.com/watch?v=Cv5U671me-0
More on RAKE: https://www.analyticsvidhya.com/blog/2021/10/rapid-keyword-
extraction-rake-algorithm-in-natural-language-processing/
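With the rake_nltk package, usage looks roughly like this (assumes the NLTK stopwords and punkt data are downloaded; the sample text is made up):

from rake_nltk import Rake

r = Rake()   # uses English stopwords and punctuation as phrase delimiters
r.extract_keywords_from_text(
    "Rapid Automatic Keyword Extraction pulls keyword phrases "
    "from raw text using word frequency and co-occurrence degree."
)
print(r.get_ranked_phrases_with_scores())  # [(score, phrase), ...] best first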
Keyword extraction with RAKE
Natural Language Processing
Check if topic modeling is more useful when stopword removal is performed. Do you
think stopwords help or hinder the topic model to identify the correct topic-word
and document-topic frequencies?
Further explorations
Natural Language Processing
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Natural Language Processing


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1QNMDB06V_xyMOiVz0QrYhEe1guFa2EII.pptx ---


Artificial Intelligence
Sequential Data Analysis
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Problems with DL for Sequences
01
02
RNN Structure and Representation
Sequential Data Analysis
00
RNN Definition and Usecases
Training
03
04
Simple RNN
RNN taxonomy
07
08
LSTMs and GRUs
06
Explaining Batch Size
Common questions and terms
09
10
Further explorations
05
RNNs in Keras
Recurrent Neural Networks (RNN) are a class of artificial neural networks used in
Deep Learning that can process a sequence of inputs and retain state while
processing the next sequence of inputs. Traditional neural networks process an
input and move on to the next one, disregarding the sequential aspect of the data.
Data such as time series have a sequential order that needs to be followed in
order to understand them and to extract all the patterns in the data. Traditional
feed-forward networks cannot comprehend this, as each input is assumed to be
(almost) independent of the others, whereas in a time series setting each input
depends on the previous input to varying degrees.
The order in which you feed tabular data or batches of images into a FCFFNN or CNN
is not statistically significant (except for situations where the initialization
problem / early bias is present - that is why we have shuffling parameters in keras
and pytorch). The breed of a dog or the salary of a person will be predicted
independently of the order of samples presented during training. RNNs work
differently, because the data is sequential - the predictions depend on the
previous values (necessarily at least to some degree, however small it might be).
RNN Definition and Usecase
Sequential Data Analysis
image source: IBM
RNNs are mostly used for ordinal (NLP, Genetic sequences) and sequential (time
series) tasks.
Usecases more concretely:
Video captioning (maybe surprisingly, this is a computer vision task)
Speech / text generation (character and word level, next part in the course)
Language translation
Stock price prediction (forecasting)
Genome anomalies, patterns, generation
Autonomous driving - trajectory prediction
This does not mean that these problems can only be solved using RNNs - just a tool
in our toolbox
However in recent years there are experiments / research to use RNNs for image
classification: https://arxiv.org/abs/2007.15161 . Also, “sequential learning in
the absence of sequential data” is also a thing, like attention steering to read an
image: http://karpathy.github.io/2015/05/21/rnn-effectiveness/#:~:text=Sequential
%20processing%20in%20absence%20of%20sequences
RNN Definition and Usecase
Sequential Data Analysis
There are two problems with FCFFNNs for sequences:
inability to handle data of varying length (same problem for CNNs)
inability to learn in an order dependent manner (Teddy Roosevelt vs. Teddy Bear)
A key challenge is that we somehow need to construct a neural network with a fixed
number of parameters but with the ability to process a variable number of inputs
(this is not a simple point; there is a lot of misunderstanding about which
dimension can vary). Input length needs to be fixed and be the same for both
FCFFNNs and CNNs to work (as noted in the diagram above).
Remember that we actually did sentiment analysis on movie reviews in tabular form
during the discussion of tabular data analysis with Deep Learning. We needed to fix
the size of the review in order to use it with a FCFFNN ('clipping to the minimum'
could be used, where the shortest review dictates how long a text we pass through
the network). And this is a good example of how we CANNOT process variable-length
inputs. An RNN would not care if one review is shorter and another longer; it would
just work with whatever size input it is given (and in this case this would be a
many-to-one RNN - the text to be analyzed is fed into an RNN, which then produces a
single output classification, e.g. "this is a positive review". More on that
later).
As for CNNs - just remember that we had to resize our images, making them fit a
certain shape and form (aka aspect ratio) in order to use a CNN. And we will
discuss later how CNNs can be better understood in light of RNNs, and whether RNNs
can be used for things other than sequential data.
Problems with DL for Sequences
Sequential Data Analysis
Recurrent neural networks have connections that have loops, adding feedback and
memory to the networks over time. This memory allows this type of network to learn
and generalize across sequences of inputs rather than individual patterns.
A powerful type of Recurrent Neural Network called the Long Short-Term Memory
Network (LSTM) has been shown to be particularly effective when stacked into a deep
configuration, achieving state-of-the-art results on a diverse array of problems
from language translation to automatic captioning of images and videos [could be
outdated]. Recurrent neural networks (can deal with sequences of variable length
(unlike feedforward nets)).
Problems with DL for Sequences
Sequential Data Analysis
Structural problems can be easily dealt with using FCFFNNs and CNNs. However, when
we add the time dimension we need RNNs. They enable us to "learn from the past",
i.e. adjust our predictions based on past values.
There are two representations of the structure of a recurrent neuron:
unrolled / unfolded one
rolled / folded one
The dependence of y_t on y_{t-1} is present at training time and at inference time
RNN Structure and Representation
Sequential Data Analysis
The output of the recurrent neuron is also different - not necessarily one value.

Does the network have any weights and biases, like a traditional FCFFNN? Yes.
RNN Structure and Representation
Sequential Data Analysis
Let’s take a look at these introductory videos:
https://www.youtube.com/watch?v=LHXXI4-IEns
https://www.youtube.com/watch?v=yZv_yRgOvMg (theory, contains raw RNN for NLP)
RNN Structure and Representation
Sequential Data Analysis
For training a simple FCFFNN we used the backpropagation algorithm, which was also
used for CNNs. When we add "memory" or "temporal dynamic behavior", things get
complicated. While for problems that are structural and susceptible to being solved
by FCFFNNs and CNNs we used backpropagation, with RNNs we use BPTT or
"backpropagation through time" (see next slide).
Teacher Forcing - when the “ground truth” (y) is re-fed into the model during
training rather than the model's own predictions. It is quite counterintuitive and
not well explained / used in the literature. A list of books is in the Wikipedia
entry for this topic: https://en.wikipedia.org/wiki/Teacher_forcing . You may find
this video somewhat informative: https://youtu.be/C08mT2VSHGg?t=275 . Something
similar to teacher forcing is commonly used for seq-2-seq models / the
encoder-decoder architecture.
TBPTT - truncated backpropagation through time. Truncated Backpropagation Through
Time, or TBPTT, is a modified version of the BPTT training algorithm for recurrent
neural networks where the sequence is processed one timestep at a time and
periodically (k1 timesteps) the BPTT update is performed back for a fixed number of
timesteps (k2 timesteps). https://machinelearningmastery.com/gentle-introduction-
backpropagation-time/
Training
Sequential Data Analysis
Backpropagation through time - the staple technique for training feedforward neural
networks is to back propagate error and update the network weights. Backpropagation
breaks down in a recurrent neural network, because of the recurrent or loop
connections. This was addressed with a modification of the Backpropagation
technique called Backpropagation Through Time or BPTT.
[Need to clarify] Instead of performing backpropagation on the recurrent network as
stated, the structure of the network is unrolled, where copies of the neurons that
have recurrent connections are created. For example a single neuron with a
connection to itself (A->A) could be represented as two neurons with the same
weight values (A->B). This allows the cyclic graph of a recurrent neural network to
be turned into an acyclic graph like a classic feed-forward neural network, and
Backpropagation can be applied.
This unrolling is the depth of the RNN. Because sequences passed through the RNN
can be very long, it is unusual to see many layers in RNNs (3-5 is already quite
big) - they are shallow.
Ref: TODO
Training
Sequential Data Analysis
We will formulate our problem like this – given a sequence of 50 numbers belonging
to a sine wave, predict the 51st number in the series (forecasting problem).
When working with RNNs we have to understand another concept: the data needs to be
fed in a specific format.
X = (seq_count x timesteps_in_each_seq x number_of_features_at_each_timestep) and
Y = (seq_count x predicted_feature_count_for_each_timestep). E.g.: (100, 50, 1),
(100, 1) - we are predicting a single number for each sequence, of which we have a
hundred.
We can create an RNN from scratch:
https://www.analyticsvidhya.com/blog/2019/01/fundamentals-deep-learning-recurrent-
neural-networks-scratch-python/
More object oriented approach: https://github.com/pangolulu/rnn-from-scratch
Simple RNN
Sequential Data Analysis
We have to have input as 3rd order tensors - why?
We will see later that RNNs and LSTMs require 3D input for the X values. This is
often very confusing for beginners and for people who return to the field after a
break (make sure you remember this, because even as an employed data scientist you
might not work with RNNs for a while and forget this information, which is
important if you want to understand RNNs - if you don't understand them you can
still train them, but less effectively). What are those 3 dimensions (see the NumPy
sketch after this list):
Samples / how many sequences. One sequence is one sample. A batch is comprised of
one or more samples (batch size not specified).
Time Steps / how many values in each sample. One time step is one point of
observation in the sample.
Features / column count. One feature is one observation at a time step. This is
essentially how many values at each time step you have (unidimensional vs.
multidimensional data).
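A NumPy sketch of shaping data into (samples, time steps, features), matching the sine-wave setup above (the window counts are illustrative):

import numpy as np

wave = np.sin(np.arange(0, 300, 0.1))                 # toy sine wave
X = np.array([wave[i:i + 50] for i in range(100)])    # 100 windows of 50 steps
y = np.array([wave[i + 50] for i in range(100)])      # the 51st value of each window

X = X.reshape(100, 50, 1)    # (samples, time steps, features) - the required 3D input
print(X.shape, y.shape)      # (100, 50, 1) (100,)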
Simple RNN
Sequential Data Analysis
If you have two columns you have two features (a multivariate time series); here
you would feed the data into the network as two samples of size 5 that both have 2
features: pressure and temperature.
Simple RNN
Sequential Data Analysis
The following things hold true working with RNNs:
The RNN input layer must be 3D.
The meaning of the 3 input dimensions are: no. samples, time steps, and features.
The RNN input layer is defined by the input_shape argument on the first hidden
layer (in keras, for example).
The input_shape argument takes a tuple of two values that define:
the number of time steps per sequence (can be None) and
the number of features (“columns in a table”, in keras) … we’ll see that later,
ref: https://stackoverflow.com/a/61155200/1964707
The number of samples is assumed to be 1 or more.
The reshape() function on NumPy arrays can be used to reshape your 1D or 2D data to
be 3D.
The reshape() function takes a tuple as an argument that defines the new shape.
Simple RNN
Sequential Data Analysis
There are numerous resources for RNNs in Keras on the internet; the essential
theory:
Add one of the three popular types of recurrent layers - SimpleRNN, GRU, LSTM (the
last two will be covered later) - into a sequential container (at least for most
cases). See: https://keras.io/api/layers/recurrent_layers/
You can stack them together if the model is not powerful enough (obtaining a deep
RNN)
If they are stacked, set return_sequences=True in all intermediate layers (the last
RNN layer is an exception to this rule, and there is more theory involved here).
If you don’t, the output will be 2D and the next layer will complain.

Keras has complex RNN layers with multiple regularizers, multiple initializers
(these are due to multiple weight matrices associated with a single cell/neuron),
dropout for the layer and the hidden state, even more than one activation function
that can be different. We will cover some of the options in the future, but please
spend some time to research other options for your specific tasks/problems.
RNNs in Keras
Sequential Data Analysis
Simplest RNN

model = keras.models.Sequential()
model.add(keras.layers.SimpleRNN(1, input_shape=[None, 1]))

That’s really the simplest RNN you can build (contains a single layer, with a
single recurrent unit).
We do not need to specify the length of the input sequences, since a recurrent
neural network can process any number of time steps (this is why we set the first
input dimension to None).
By default, the SimpleRNN layer uses the hyperbolic tangent activation function.
The initial state h_(init) is set to 0, and it is passed to a single recurrent
neuron, along with the value of the first time step, x_0. The neuron computes a
weighted sum of these values and applies the hyperbolic tangent activation function
to the result, and this gives the first output, y_0. In a simple RNN, this output
is also the new state h_0. This new state is passed to the same recurrent neuron
along with the next input value, x_1, and the process is repeated until the last
time step. Then the layer just outputs the last value, y_T. All of this is
performed simultaneously for every time series in a batch.
For each neuron, a FCFFNN model has one parameter per input per time step, plus a
bias term (for 50 time steps, a total of 51 parameters). In contrast, for each
recurrent neuron in a simple RNN there is just one parameter per input and per
hidden-state dimension (in a simple RNN, that’s just the number of recurrent
neurons in the layer), plus a bias term. In this simple RNN, that’s a total of just
3 parameters. This does not mean that the RNN model takes less memory, and it
obviously does not mean that it is faster to train or infer!

RNNs in Keras
Sequential Data Analysis
Deep RNNs - layering
Given a standard feed-forward multilayer Perceptron network, a recurrent neural
network can be thought of as the addition of loops to the architecture. For
example, in a given layer, each neuron may pass its signal laterally (sideways) in
addition to forward to the next layer. The output of the network may feed back as
an input to the network along with the next input vector, and so on.
If your model “hits a wall” during training and the error stops decreasing and
remains stable, that is probably due to a lack of internal state space / power in
the network - you can make it multilayered.
model = keras.models.Sequential()
model.add(keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]))
model.add(keras.layers.SimpleRNN(20, return_sequences=True))
model.add(keras.layers.SimpleRNN(1))

Make sure to set return_sequences=True for all recurrent layers (except the last
one, if you only care about the last output). If you don’t, they will output a 2D
array (containing only the output of the last time step) instead of a 3D array
(containing outputs for all time steps), and the next recurrent layer will complain
that you are not feeding it sequences in the expected 3D format. This parameter
essentially says that the layer should pass all of its outputs/predictions on to
the next layer.

RNNs in Keras
Sequential Data Analysis
Batch size in the case of an RNN indicates how many sequences are passed to the NN
in a single iteration
Explaining Batch Size
Sequential Data Analysis
Consider the following taxonomy of sequence problems that require a mapping of an
input to an output (from Andrej Karpathy).
One-to-One: image classification.
One-to-Many: sequence output, for image captioning, classical example - music
generation by genre.
Many-to-One: sequence input, for sentiment classification, next ts stock
forecasting.
Many-to-Many: sequence in and out, for machine translation (phrase in one language
→ phrase in a another) (encod.-decod.)
Synced Many-to-Many: synced sequences in and out, for video classification.
Implementation of each: https://stackoverflow.com/questions/43034960/many-to-one-and-many-to-many-lstm-examples-in-keras
RNN taxonomy
Sequential Data Analysis
We can discuss another type of taxonomy/classification of RNNs, not based on ratio
of input to output but based on the internal mechanism of the neuron, also called a
“cell”.
Schematically (see side picture).
Why do these variations exist? They solve the main problem of the simple RNN: short-term memory - the inability to effectively use information that is farther back in the chain; simple RNNs have a hard time learning from long sequences.
Let’s check this nice summary: https://www.youtube.com/watch?v=8HyCNIVRbSU
LSTMs and GRUs
Sequential Data Analysis
Some additional things to remember:
LSTMs and GRUs are faster to converge than dense NNs and SimpleRNNs.
LSTMs are more complicated than GRUs.
Historically, GRUs came later than LSTMs and were invented to simplify LSTMs while retaining most of the power they have.
There is no consensus as to which one to use where: as usual, you can/need to try both to see which one works better. Additionally, if we try both we can learn whether our data has long time/lag dependencies.
There are some general considerations: LSTMs have historically been proven to be more powerful and flexible. However, due to their complexity, they are slower and not so practical when building complicated, big recurrent networks. So there is a tradeoff: if your data requires a bigger network, GRUs could be a good choice; if your data requires more powerful cells, LSTMs are usually advised. However, some research has argued that GRUs are more powerful in most cases (citation needed).
However: https://www.researchgate.net/publication.
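Swapping between the two cell types in Keras is trivial, which makes trying both cheap (a minimal sketch, assuming the same univariate time series setup as earlier):

from tensorflow import keras

def build(cell):
    # cell is a recurrent layer class, e.g. keras.layers.LSTM or keras.layers.GRU
    return keras.models.Sequential([
        cell(20, return_sequences=True, input_shape=[None, 1]),
        cell(20),
        keras.layers.Dense(1),
    ])

lstm_model = build(keras.layers.LSTM)
gru_model = build(keras.layers.GRU)
# Train both on the same data and compare validation error and wall-clock time.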
LSTMs and GRUs
Sequential Data Analysis
Handling variable length sequences
We have said that RNNs can accept variable length sequences.
It is possible to do that with ragged tensors. See: https://stackoverflow.com/questions/62031683/ragged-tensors-as-input-for-lstm
Most of the time, variable length sequences are not passed to RNNs directly - they are padded. However, when using frameworks the padding is not that simple; we use framework-specific mechanisms for it - see this tutorial for a full discussion: https://www.tensorflow.org/guide/keras/masking_and_padding#mask_propagation_in_the_functional_api_and_sequential_api - this involves padding: the process of adding fake/ignorable data - and masking: telling the NN model to ignore the padded data via a mask flag.
And: https://stats.stackexchange.com/a/452205/162267
Note - for time series data there is not much need for padding: the data can be naturally assembled into several hundred sequences of the same length using numpy either way.
Also: https://datascience.stackexchange.com/questions/26366/training-an-rnn-with-examples-of-different-lengths-in-keras/27879#27879
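A rough sketch of the padding + masking mechanism in Keras (toy sequences, purely illustrative):

import numpy as np
from tensorflow import keras

# Three sequences of different lengths.
seqs = [[1, 2, 3], [4, 5], [6]]

# Pad with zeros at the end so all sequences share one length.
padded = keras.preprocessing.sequence.pad_sequences(
    seqs, padding="post", dtype="float32")          # shape (3, 3)

# The Masking layer tells downstream recurrent layers to skip padded steps.
model = keras.models.Sequential([
    keras.layers.Masking(mask_value=0.0, input_shape=(None, 1)),
    keras.layers.LSTM(8),
    keras.layers.Dense(1),
])
out = model(padded[..., np.newaxis])                # add the feature dimension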
Sequential Data Analysis
Keras previously had CuDNNGRU and CuDNNLSTM layers for training RNN architectures
on a GPU.
Now keras LSTM and GRU layers implicitly default to GPU implementations under
certain conditions.
What are those conditions: https://stackoverflow.com/a/60468424/1964707
TODO: Demo
TODO: What happens with Pytorch
Training on GPU
Sequential Data Analysis
Do RNNs have biases and how are they connected?
https://stats.stackexchange.com/questions/169329/bias-inputs-in-an-rnn/169555 +
use_bias=True by default in keras

What are Bidirectional RNNs (BRNNs)? Bidirectional RNNs can take past and future sequence values into account. Very useful for analyzing sentences in natural language processing. They cannot be used for real-time tasks, only when you have the whole data, e.g.: "I think Teddy Roosevelt was a good president!" (interpreting "Teddy" depends on "president"). We will be learning about those in the NLP part. With stock prices, BRNNs would not be useful for forecasting but might be useful for data imputation. A sketch below.
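A minimal sketch of wrapping a recurrent layer bidirectionally in Keras:

from tensorflow import keras

# The Bidirectional wrapper runs the inner layer forward and backward in time
# and concatenates both outputs, so the layer sees past and future context.
model = keras.models.Sequential([
    keras.layers.Bidirectional(keras.layers.LSTM(16), input_shape=[None, 1]),
    keras.layers.Dense(1),
])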

Are RNNs the only ones capable of handling sequences? No. Transformers and other
architectures can also do that, see excerpt.

Is data fed into an RNN one time-step at a time? (It is very important to use the term "time-step" here; LLMs start talking about batching if you do not specify something like "one value at a time" - be precise.) Yes: within a batch, all sequences advance together, one time step at a time.

Do sequences have to be of the same length? No.


Common questions and terms
Sequential Data Analysis
How to calculate trainable parameters in an RNN (or LSTM or GRU)? The easiest way is to create a layer and then just call it, passing data - a sketch below.
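For example (a sketch; the comments give the counts expected from the formulas above):

import numpy as np
from tensorflow import keras

rnn = keras.layers.SimpleRNN(1)
rnn(np.zeros((1, 50, 1)))        # build the layer by calling it once
print(rnn.count_params())        # 3: input weight + recurrent weight + bias

lstm = keras.layers.LSTM(1)
lstm(np.zeros((1, 50, 1)))
print(lstm.count_params())       # 12: 4 gates x (input + recurrent + bias)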

How does forward propagation happen in RNN?


Ref: https://jmyao17.github.io/Machine_Learning/Sequence/RNN-1.html [untested]
Ref: https://pub.towardsai.net/whirlwind-tour-of-rnns-a11effb7808f [untested]
Ref: https://mi-pages.informatik.uni-ulm.de/explornn/ [not very useful, but
somewhat]
Ref: https://www.youtube.com/watch?v=u8utlK_c5C8 []
Ref: https://www.youtube.com/watch?v=DFZ1UA7-fxY []
Common questions and terms
Sequential Data Analysis
Comparing to classical forecasting
So we would:
Stationarize the data using STL (or another decomposition)
Train the RNN only on the residuals (this should make the model train faster)
When generating forecasts, also predict the next trend and seasonal values
Combine the predictions (a sketch below)
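A rough sketch of the decomposition step with statsmodels (assuming y is a monthly pandas Series, so period=12):

from statsmodels.tsa.seasonal import STL

# y: a pandas Series with a DatetimeIndex, e.g. monthly data.
res = STL(y, period=12).fit()
residuals = res.resid                     # train the RNN on this component only
trend, seasonal = res.trend, res.seasonal

# At forecast time: predict the residual with the RNN, extrapolate the trend
# and seasonal components, then add the three parts back together.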
Common questions and terms
Sequential Data Analysis
You can go deeper into LSTMs and GRUs
You can explore WaveNet
What is the minimal RNN architecture and the minimal amount of data needed to learn to forecast a Fibonacci sequence, y = x^2, factorial, sine, or summation?
Try to predict sequences from an IQ test (the "what comes next" question). Research whether the general case is possible before starting.
Further explorations
Sequential Data Analysis
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Sequential Data Analysis


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1imf-bdOn8OAGMof8xpXWzzOftV1D6jWb.pptx ---


Artificial Intelligence
Python Crash Course
Lecturer
Mindaugas Bernatavičius

Today you will learn


Lecturer introduction
Teaching methodology remarks
01
02
03
04
05
Student intro and questions about them
Course structure
Python Crash Course
Administrative questions
Lecturer introduction
Python Crash Course
About me:
Name: Mindaugas Bernatavičius - to find me on Teams and LinkedIn. Advice: create a LinkedIn profile and connect.
IT career since ~2011, when I started applying to IT jobs.
Started as a data entry engineer, then QA, QA lead, security engineer, performance engineer, engineer in a networking company, then Principal Software Engineer.
Currently a Software Engineer at a startup company, Freelancer, Mentor.
Languages: Java was the first language (VBA), then PHP, SQL, Bash, Python … etc.
4 parallel career tracks: main job, freelance projects, consulting / teaching individually, and teaching professionally (bootcamps).
About 5 years with Python on an on/off basis (both at the job and freelancing).
Lecturer introduction
Python Crash Course
Pedagogical experience:
6+ years since I started teaching professionally as a personal consultant to people and small businesses.
~10 success stories in helping junior engineers find jobs or increase salary / upgrade positions.
100+ university students consulted.
Probably ~250-300 people taught (Java, C#, C++, Python, Web, Security, Servers, Networking, Data related) in group settings.
4+ years with CodeAcademy (since 2020)
Let’s connect:
LinkedIn: “Mindaugas Bernatavičius”
SO: https://stackoverflow.com/users/1964707/mindaugas-bernatavi%c4%8dius
Github: https://github.com/MindaugasBernatavicius
Blog (currently inactive): https://blog.mindaugas.cf/
Youtube: “Programavimo Mokytojas”
Student intro and questions about them
Python Crash Course
Enough about me, let's talk about you (this will also test your ability to participate verbally).
I will need (this is not mandatory, you can share only things you want):
your name
your preferred operating system / what you will use in this course
your PC / Laptop parameters: RAM, CPU, HDD vs. SSD, GPU
your IT experience
your Python experience
your Math experience
your English level
your goal for this course (be as concrete as possible)
your success criteria for this course - “if I achieved this, I would be happy”
… I will record the answers to understand how to steer the course better.
P.S. You can offer any topic you want to cover yourselves (OCR, excel automation,
fourier analysis, instance segmentation).

Teaching methodology remarks

Python Crash Course


From unconscious incompetence to conscious competence is the goal.
Structured and methodical, but I will try to improvise when students need that.
I like to understand the topic from the fundamentals, then move to higher levels (bottom-up). Will try to use this approach here …
Believer in the learning curve (which can be rephrased as: the 80/20 principle).
Emphasis on the balance between theory and practice (as theory (terminology and concepts) is also important for job interviews, self-guided studies, and working in teams).
Continuous feedback from students - I will ask you to evaluate my work (+ or - at
the end of each class @ minimum)
Administrative questions

Python Crash Course


Main points:
Attendance list will be filled at the start of the lecture (or not)
4 hours per day, 18:00 - 22:00.
Monday through Thursday - 4x a week.
Breaks: 10, 10, 10 (Poll).
Materials: slides, code in notebooks, video recordings, supplementary material from
the web.
Python Crash Course
Course structure

Discussing each part separately


Materials are kept in a table (see below)
Lecturer's goals: supply the materials and the motivation for the students to start coding / creating projects and to research more and more independently as the course progresses.
Homework / class exercises? How much time will you dedicate?
Course structure

Python Crash Course


There are 4 larger parts:
Python and Tools (1-3)
Machine Learning (4)
Deep Learning (5, 7, 9) (we go from Perceptron to MLP, to DNN, CNN, RNN,
Transformers … Style GAN)
Applications of Deep Learning (6, 8, 10-12) (we apply DNNs to Tabular, CNNs to
images and videos)
Assessments

Python Crash Course


Separate tests (during class) for each part (1-13).
Projects (optional, but highly encouraged) for each part (1-13), or for as long as we have time (until approx. part 10, usually).
Final project (mandatory) + Final Test (mandatory)
Final project (mandatory) + Final Test (mandatory)
Scoring:
FINAL_TEST + (AVG(TESTS) * 0.2), if TESTS > 0
FINAL_PROJECT + (AVG(PROJECTS) * 0.2), if PROJECTS > 0
How to succeed

Python Crash Course


In this course:
Dedicated practice - find some additional time (even 15 min. a day). Drill exercises idea and time tracking.
Practice is the most important, but theory (vocabulary, terms, concepts and so on) is a not-so-distant second.
Informal group of like-minded people (doing homework together, projects together and so on). Let's create a Teams group "Student Zone".
More tips? Yes!
In the job market:
Choose a country.
Job market research (companies, skills required, internships, junior roles).
cvmarket.lt / cvbankas.lt / linkedin; indeed.com / linkedin - for roles abroad.
General rule: be specific (choose skill, country, company, role, etc.)
Skills for a role vary by company - get a list, get courses/books, practice, create a few projects (portfolio).
Lead generation: contact companies and recruiters directly.
Personal marketing: blog/LinkedIn articles, YouTube videos, GitHub portfolio, Kaggle profile.
Python Crash Course
https://www.linkedin.com/pulse/top-ai-companies-lithuania-deividas-mata%25C4%258Di%25C5%25ABnas/?trackingId=vzr3nMm%2BR6OJG9NBoKFSNQ%3D%3D
Course plan
You can get familiar with it using this link
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/
Python Crash Course
Detailed course plan
Slides, tasks and so on

--- Content from 1v6OBrathrPgeQoEMNiAi4wvqXuasiRl4.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Overview
01
02
Data Structures
Python Crash Course
00
Why Pandas? Installation
Data input formats
03
04
Grouping operations
05
06
07
Indexing and filtering
Datascience pipeline and datascience pyramid
08
Vizualization and matplotlib
Pandas performance
Why Pandas? Installation
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language + other deps.
Ref: https://pandas.pydata.org/
Pandas is the de facto tool for tabular data in data analysis tasks.
The main benefits: speed, the size of the community giving support, and the intuitive data formats used to represent the data.
Installation is simple with pip, and pandas is already installed in Google Colab.
Built on top of numpy, so we have that as a dependency.

Additionally, we will see more dependencies later on that various pandas methods use behind the scenes (see ref. below). Please use these integrations if possible, because they are often more performant than doing things in Python.
https://pandas.pydata.org/docs/getting_started/install.html#optional-dependencies
Python Crash Course
Overview
Main points:
"goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language"
Suitability for heterogeneous tabular data.
Easy indexing, slicing and munging (cleaning and transforming)
… and much more:

Ref: https://pandas.pydata.org/docs/getting_started/overview.html

Performing operations @ the database level vs. @ pandas vs. @ python. Example: load data from a DB and enrich it with data from a file - perform as many operations as possible in the DB, then in pandas (for the data loaded from the file).

Python Crash Course


Data Structures
2 main data structures (there are more, but these are the main ones):

Why more than 1: https://pandas.pydata.org/docs/getting_started/overview.html#why-more-than-one-data-structure
Declarative syntax
What about mutability? Favours immutability: https://pandas.pydata.org/docs/getting_started/overview.html#mutability-and-copying-of-data

Python Crash Course


Data Structures
See: https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html
Main points: each column in a DataFrame is a Series (and a single row extracted from a DataFrame is also returned as a Series).
Demo: Creating series and dataframes. 4 ways from python - which one is best?
Python Crash Course
Data input formats
Pandas can load multiple types of data right into a dataframe.
This includes things like XLS (and other binary formats), CSV, JSON, HTML (and other text files, even flat text files) and SQL / relational data.
If there is something that Pandas does not support directly, you can use Python objects - and any data can be converted to Python with code!
Ref: https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html
Demo: Data input formats

Since we are working with a DataFrame that has two dimensions, what should we do with a highly nested JSON? Pandas has a tool for flattening hierarchical data like JSON: pd.json_normalize()
Demo: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd

Python Crash Course


Indexing and filtering
Examples (these are all important, use cheat sheets: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf ):
df['x'] → select by column
df.col_name → select by column; this can have unexpected results and is not the best practice.
If column names are not strings, or if they conflict with methods of DataFrame, attribute-style access is not possible.
df[['x', 'y']] → select multiple columns
pd.unique(col) → get unique values + can be used with len()
s.value_counts() → counts how many items of each value there are in a series
df['x'].value_counts() → powerful way of getting a count of how many times different values in column x appear. It is filterable!
df['x'] == 'xyz' → filtering, returns a boolean series with index
(df['x'] == 'x') & (df['y'] == 'y') → needs to be put in parentheses due to the high precedence of the & (bitwise and) operator
df.loc[df['x'] == 'xyz', :] → select items by labels from the dataframe using row and column indexers: df.loc[<row_indexer>, <col_indexer>]
df.iloc[x, y] → select items by position of rows and columns
df['x':'y'] → slicing refers to rows, not columns - do not mix it up with multi-column indexing: df[['x', 'y']]
More: https://stackoverflow.com/a/17071908/1964707
Python Crash Course
Indexing and filtering
Examples (a sketch below):
df['x'] * df['y'] → values are multiplied.
Often an error can occur and we need to convert values to numbers: pd.to_numeric(df['a'], errors='coerce'). Coerce means that values that cannot be converted are forced to NaNs.
df.loc[:, 'x'] = pd.to_num... → converting a certain column to numbers
df.assign(c=df['a'] * df['b']) → create a new column from the values of old ones - can apply operators
df['x'].max() → get the biggest value
df['x'].idxmax() → get the index of the biggest value; it is common to use df.loc[df['a'].idxmax(), :]
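A tiny end-to-end illustration of a few of these calls (hypothetical data):

import pandas as pd

df = pd.DataFrame({"x": ["10", "20", "oops"], "y": [1, 2, 3]})

# Coerce non-numeric strings to NaN instead of raising an error.
df["x"] = pd.to_numeric(df["x"], errors="coerce")

# Derive a new column from existing ones.
df = df.assign(z=df["x"] * df["y"])

# Row with the largest z value (NaNs are skipped by idxmax).
print(df.loc[df["z"].idxmax(), :])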

Python Crash Course


Grouping operations
Motivation:
Aggregation (.min(), …, .std() | .agg()): Problems like "what was the most profitable month for our company" are straightforward when we have profit data month by month. A more complex problem is: what is the most profitable month of each year?
Transformation (.transform()): Filling values by some statistical measurement: imputation of the most common value (mode), or the average, interpolation, etc. Data imputation is one form of data transformation and one way to handle missing data.
Group filtering (.filter()): Drop the groups where some column x does not have duplicate values. There is the .filter() function for group filtering, and it is notable that there are many more ways to filter non-group data: https://towardsdatascience.com/8-ways-to-filter-pandas-dataframes-d34ba585c1b8

Examples (a sketch below):
for key, group in df.groupby('a') → iterate over a dataframe groupby object (<class 'pandas.core.groupby.generic.DataFrameGroupBy'>).
This yields one dataframe per distinct value of the column we grouped by.
df.groupby('a')[col].head() → shorthand form
df.groupby('a').agg(func) → apply aggregation
df.groupby('a').transform(func) → apply transformation
df.groupby('a').filter(func) → apply filter
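A small sketch of all three groupby flavors on hypothetical data:

import pandas as pd

df = pd.DataFrame({"year": [2022, 2022, 2023, 2023],
                   "month": ["Jan", "Feb", "Jan", "Feb"],
                   "profit": [10, 30, 25, 5]})

# Aggregation: the most profitable month of each year.
idx = df.groupby("year")["profit"].idxmax()
print(df.loc[idx, ["year", "month", "profit"]])

# Transformation: broadcast each year's mean back onto every row.
df["year_avg"] = df.groupby("year")["profit"].transform("mean")

# Group filtering: keep only years whose total profit exceeds 35.
print(df.groupby("year").filter(lambda g: g["profit"].sum() > 35))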

Python Crash Course


Datascience pipeline and datascience pyramid
We can define data science as the process of converting raw data into insights and informed actions (in terms of input - output). Data-driven decisions.
Data science is also a process that describes how this transformation happens - a process which can be described as stages in a pipeline.
The structure of a pipeline in general terms: data source(s) → various stages of data processing → data sink (an ML model can be a sink as well)
There are multiple versions on the web, but we can see the main commonalities.
Why is this helpful: you can position / orient yourself more easily when solving a problem (from a simple classroom task → to creating a data science company).
Case analysis: BiDroid freelance project, Citybee hack, Question → Data Acquisition → … → Answer / Insight

Python Crash Course


Datascience pipeline and datascience pyramid
Datascience pyramid - explains where most "data people" work and where the demand is highest. Also our course structure. Also explains that without a good data collection process and a good data storage process it is impossible to do data science and ML.
Additionally: MLOps

Python Crash Course


Datascience pipeline and datascience pyramid
Let's imagine a company that specializes in detecting empty spots in parking lots:
Data Infrastructure: cameras, network (LTE vs. Ethernet).
Data Engineering: create an efficient database / datastore, choose databases, add indexes for efficient querying, etc.
Data Analyst: create dashboards with service usage data and service load data, answer simple questions: how many cars have we processed in the last several years and what is the expected growth?
Data Scientist: more complicated questions (usually statistical in nature) … or create the empty spot detection models.
ML Engineer: create the models that do the empty spot detection.

Python Crash Course


Visualization and matplotlib
Pandas integrates with matplotlib seamlessly.
When you call Series.plot() or DataFrame.plot() you are calling a matplotlib wrapper function.
We will talk about data visualization and matplotlib more in the upcoming parts, but we should learn the basics in the context of pandas.
The central concept in matplotlib is the figure. The figure is a container for plots (or subplots):

Demo: creating, configuring and saving the plot (a sketch below).
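A minimal sketch of that workflow (hypothetical data):

import matplotlib.pyplot as plt

# The figure is the container; axes (subplots) live inside it.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([2019, 2020, 2021, 2022], [10, 12, 9, 15], marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Snowfall (cm)")
ax.set_title("Yearly snowfall")
fig.savefig("snowfall.png", dpi=150)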


Question about visualization: you have a plot where several values are "black swans", "six-sigma events" - all the other values are very small in comparison and you can't even see the trend in them (let's say each year the amount of snow increases on average). What simple concept can we use to see the trend without eliminating the extreme values? Where in IT can we see such scenarios?

Python Crash Course


Pandas performance
When we talk about pandas performance we are talking about improving calculation / transformation performance. Tasks like loading file / network data to create dataframes can be easily handled by traditional I/O performance improvements - like launching multiple threads to load many files.
One such example is not trying to concatenate dataframes inside a loop: https://stackoverflow.com/a/36489724/1964707
See general recommendations: https://pandas.pydata.org/docs/user_guide/enhancingperf.html
See recommendations for additional dependencies: https://pandas.pydata.org/docs/getting_started/install.html#recommended-dependencies
Explanations and examples: https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html

Numexpr performance: https://medium.com/productive-data-science/speed-up-your-numpy-and-pandas-with-numexpr-package-25bd1ab0836b
Ways to quickly generate large dataframes: https://stackoverflow.com/questions/52588653/why-is-pandas-eval-with-numexpr-so-slow
Pandas faster than Numpy in certain cases: https://stackoverflow.com/questions/62424612/why-is-pandas-faster-then-numpy-on-simple-mathematical-operations
Using the swifter package with .apply(): https://towardsdatascience.com/10x-times-faster-pandas-apply-in-a-single-line-change-of-code-c42cb5e82f6d

Python Crash Course


Pandas performance
Additional optimizations: https://towardsdatascience.com/10x-times-faster-pandas-apply-in-a-single-line-change-of-code-c42cb5e82f6d

Python Crash Course


BONUS: Pandas in Pycharm
One way to inspect pandas dataframe inside Pycharm is to debug the program and
press “view as Dataframe” on the variable!
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1itGTPFRhyQF6XG-h7gvFpJ05Rwk7x2vL.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Histograms
01
02
Line charts
Python Crash Course
00
Data Visualization
Scatter plots
03
Bar graphs
04
05
06
Other graphs and charts
Aggregating and sampling
07
Lying with graphs
08
Bokeh and Seaborn
Data Visualization
Data visualization is a very important process in the data science pipeline and a skill in your career.
Often one can get a much better insight visually.
People working with data often want to present their findings in meetings visually to be more convincing.
We already know how to visualize data using pandas, but this time we will talk more about visualization techniques and when to use which, and answer questions like: when should we use a pie chart, a numeric histogram and so on.
Additionally, we will include a few tools that are commonly used for data visualization in the Python world, like bokeh and seaborn.

Python Crash Course


Histograms
This type of chart displays data grouped into "bins" and how much of each bin there is in a dataset. Essence: bins; x - independent variable, y - quantity of measurements.
Some of the examples where histograms are useful:
You see that the average time to first byte (TTFB) - an important metric in the web development / SEO world - became worse all of a sudden this week. A histogram can easily tell you whether all your requests became slower or maybe just the requests that were previously slow are now even slower. Essentially, by comparing histograms you would be able to form a troubleshooting hypothesis (if only the slowest requests got slower - could you explain why that could be? What are the possible reasons if all requests got slower?).
Compare salary bins and their counts between two companies that report very similar salary averages, to understand whether you have a better chance at getting paid more in one company or another.
What the average / standard deviation tells us we can begin to understand with histograms - if we see that the average TTFB got worse, it does not tell us why. But a histogram can tell us why.
A histogram can in some way be a "fingerprint" for specific data, a specific situation. Vibration data - if the vibrations in a specific frequency bin increase, it might mean that the mechanical system is on its way out.

Python Crash Course


Histograms
Types of histograms: https://en.wikipedia.org/wiki/Histogram#Examples
Questions: What interpretations / reasons could you give for a bimodal or trimodal distribution of company salaries?
How to choose the binning strategy: a change in resolution should not change the shape of the histogram too much, and it is advisable to experiment with several bin values when doing analysis (a sketch below). A histogram should not hide or abstract away important information (like outliers or multimodality in the distribution).
Binning choice can confuse without enlightening - this is not data anymore, but interpretation, and you have to be careful.
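A small sketch of experimenting with several bin counts (on fake, lognormal-ish TTFB data):

import matplotlib.pyplot as plt
import numpy as np

ttfb_ms = np.random.lognormal(mean=5, sigma=0.6, size=1000)  # fake TTFB data

# Try several resolutions: the overall shape should stay stable across them.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [10, 30, 100]):
    ax.hist(ttfb_ms, bins=bins)
    ax.set_title(f"bins={bins}")
plt.show()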

Python Crash Course


Line charts
Very often used to examine a trend over time, so commonly seen when time series data is involved.
Can be used in non-time-related contexts, but often are not.
Examples:
Your website performance becomes noticeably bad. Was it getting bad continuously, or did it get bad suddenly?
Same for sales - abrupt changes indicate an impactful sudden event, while long trends can indicate issues that are much harder to track and understand. YouTube channel growth - a sudden change is easier to explain than a long-term trend.
Best practices:
The dots connecting the line should be either logarithmically spaced or equidistant.
When doing comparative analysis it is usually a good idea to bring the compared line charts to the same scale (see GDP curves). Normalization or log-scale graphs are two ways to solve this.
See: https://www.fusioncharts.com/line-charts
Python Crash Course
Scatter plots
Show how two variables are related to each other and how strong that relationship is (correlation). More dispersion between the points - less correlation.
Examples:
People's height and weight correlate, but there is variation. Muscle tissue weighs more.
"Job performance score" vs. years in the company. It is very easy to measure this in sales, or in some highly monotonic and deterministic manual jobs.
Skill and years of practicing that skill (chess).
A scatter plot does not average out the values and, more generally, is free of inferential artifacts (unlike line charts and histograms); thus it can help us see clusters of / individual outliers - maybe someone who only practiced sales for 1 year outsells someone who did it for 5 - we can learn what that person does and spread the knowledge throughout our company.
Python Crash Course
Bar graphs
Bar graphs can be used to display many things; however, they provide the most value when different categories are compared side by side.
Examples:
Expenditures on food per age category/age group.
Social media usage per age category/age group.
Usage of non-renewable energy sources per country for a given year (can commonly be displayed on a line chart as well) - kWh/capita.
You need to understand the revenue vs. cost that each product or service your company provides is bringing in? A bar graph can help with that.
A stacked bar chart/graph is a common variation of the bar chart worth knowing. It can be used the same way as a multi-bar chart, but it is commonly used when displaying the growth of something over a period of time and how big a part each category takes up. In the example of social media usage, however, a stacked bar chart would be disadvantageous because it would make it hard to compare the proportions of each age group.

Python Crash Course


Bar graphs
Stacked multi-bar chart.
A stacked chart can provide some intuitions about the additive components of some aggregate measurement.

Python Crash Course


Other graphs and charts
Pie chart - used to illustrate the subdivision of a whole and how much each part takes up. Examples:
How much of the total market does each company take up (mobile phone market, cryptocurrency market)?
What kinds of energy, and in what proportions, does a country use?
Question: if you want to display how the proportions are changing over time, which other type of chart would you use and why?
Variations: donut chart / 3D/2D pie chart and such.
Stacked (multi) bar charts can do everything that pie charts do + visualize a system's evolution over time easily.

Heatmap - used to show the comparative value and intensity of a variable that is dependent on two independent variables. A very powerful visual technique for finding outliers / intensity areas / specialization / high correlation and so on. Examples:
Farmers vs. products → size of the yield
Book titles vs. bookstores → sales
Students vs. subjects → grades

Python Crash Course


Other graphs and charts
Even more graphs: https://matplotlib.org/stable/gallery/index.html
Combinations ("chimeras" / "hybrids") are possible, e.g.: a scatter plot with histograms on its axes.

Python Crash Course


Summary of chart types and examples
TBD: this will be a table someday…

Histogram - how many items in our dataset fall into a specific category/range (TTFB / response time buckets)
Line chart - evolution over time, often for comparison (AAPL stock price)
Scatter plot - correlation between two variables (weight vs. height) - least interpretation, most raw visualization
Bar chart - useful for comparing different categories against the same metric (profit vs. cost), sometimes time evolution
Pie chart - how much of a proportion of the whole a specific category occupies
Heatmap - intensity of a variable that depends on 2 independent variables
Boxplot - a variable's max/min/25th percentile/75th percentile/median and outliers - all in one!

Addendum: Animated graphs are not as powerful in terms of immediate information understanding as static graphs - because you cannot compare the first and last visuals/frames as easily. The ultimate graph would display all information at once.
Python Crash Course
Aggregating and sampling
When you have too much data to display you need to either aggregate (show the mean or the 95th percentile) or sample (choose random items - sample each set that you want to display separately if they are not equally represented in the data, e.g.: marketing vs. IT employees and their salaries - maybe marketing has 9 employees for every 1 IT employee in the sample).
This introduces interpretation / bias into the visualization, just like splining. If I asked what the max was in Figure 1 - what would you say?
Spline: https://en.wikipedia.org/wiki/Spline_(mathematics)

Python Crash Course


Lying with graphs
As a person working with data you should know how data and statistics can mislead people. This should also help you avoid being misled yourself.
Examples:
https://towardsdatascience.com/stopping-covid-19-with-misleading-graphs-6812a61a57c9
https://medium.com/@hypsypops/axes-of-evil-how-to-lie-with-graphs-389c1656d538
https://heap.io/blog/how-to-lie-with-data-visualization
Common misleading constructs:
Scale discrepancies (different scales of variables compared)
Visuals do not match the numbers (Russian covid charts - the numbers are increasing but the curve is flattening)
Y axis starts from non-zero (most common [opinion])
Missing labels, axis ticks or similar
No inference about individuals from group statistics (not visualization-based, just a general rule)
Spurious correlations visualized (Internet Explorer usage and murder statistics) - correlation, not causation.
Logarithmic scales / double y-axis - can be misleading for non-technical audiences

Python Crash Course


Seaborn
Seaborn is another data visualization library.
It is built on top of matplotlib and addresses some of its limitations.
What are those limitations:
Visual appeal - seaborn visualizations are better looking
Verbosity - seaborn uses a more minimalistic syntax
Seaborn - high level (an abstraction on top of matplotlib); matplotlib is lower level.
Ref: https://seaborn.pydata.org/
Seaborn groups plots according to datatype:
relplot() → for plotting relationships - line and scatter plots
catplot() → for categorical data plots
distplot() → histogram (deprecated in newer versions in favor of displot() / histplot())
kdeplot() → kernel density estimator
Ref: https://www.youtube.com/watch?v=qc9elACH8LA
Ref: https://www.youtube.com/watch?v=x5zLaWT5KPs
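A minimal sketch using seaborn's built-in example dataset:

import seaborn as sns

tips = sns.load_dataset("tips")   # small example dataset shipped with seaborn

# Relationship plot: scatter of total bill vs. tip, split by smoker status.
sns.relplot(data=tips, x="total_bill", y="tip", hue="smoker")

# Distribution plot: histogram of the total bill (modern replacement for distplot).
sns.histplot(data=tips, x="total_bill", bins=20)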
Python Crash Course
Bokeh
Python Crash Course
An interactive plotting and visualization library for showing / creating visualizations that are embeddable into web pages - this is its main use case: exporting interactive graphing content onto a web page.
Not based on matplotlib.
Basic plotting does not require special data structures, but for more advanced cases we need to use the ColumnDataSource Bokeh data structure to be able to display data, although for simple numpy arrays and Python lists Bokeh can wrap them itself: https://docs.bokeh.org/en/latest/docs/first_steps/first_steps_8.html#using-columndatasource
The different visual elements that can be added to a figure are called glyphs (a sketch below).
Ref: https://docs.bokeh.org/en/latest/docs/first_steps.html#first-steps
Also provides an ajax datasource: https://docs.bokeh.org/en/latest/docs/user_guide/data.html#ajaxdatasource
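A minimal sketch of the glyph-based API:

from bokeh.plotting import figure, show

p = figure(title="Example", x_axis_label="x", y_axis_label="y")
# Each visual element added to the figure is a glyph.
p.line([1, 2, 3, 4], [4, 7, 2, 5], legend_label="trend", line_width=2)
p.circle([1, 2, 3, 4], [4, 7, 2, 5], size=8)
show(p)   # renders the interactive plot as HTML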
Plotly … and others
Python Crash Course
A very commonly used library
Nice-looking graphs
Commonly people use plotly for interactive 3D plotting

Funnily enough, the MATLAB version of plotly has a 3D pie chart and the Python version does not: https://plotly.com/matlab/3d-pie-plots/

Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1xkOzq5IODhSguHoBA0LlvFZSi1Bi7qCA.pptx ---


Artificial Intelligence
Recommender Systems
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Collaborative Filtering Example
01
02
Learning the embeddings with NN
Recommender Systems
00
Collaborative Filtering
Improvements to the model
03
Further explorations
07
06
Fast.ai and book recommendations
Content-based filtering (CB) makes recommendations based on user preferences for product features. If someone likes this item, the person will like this other similar item.
Collaborative filtering (CF) mimics user-to-user recommendations (not from a friend). People like what other similar people like (in this case, people with similar taste in movies / books / items). CF predicts a user's preferences as a linear, weighted combination of other users' preferences.
Both methods have limitations:
Content-based filtering can recommend a new item, but needs more data on user preferences in order to incorporate the best match.
CF needs a large dataset with active users who have rated products before in order to make accurate predictions.
A combination of these different recommendation systems is called a hybrid system and can overcome these limitations, but it is harder to implement.
Novel items, outside of a user's direct preferences, are a problem which CF solves better.
More advanced problems: how do you recommend something similar when the item is a once-in-a-decade purchase (a tool for 500€)? You recommend what similar people bought.
Collaborative Filtering
Recommender Systems
Both methods have limitations (cont.)
It is great to be able to recommend items based on their content, i.e. content-based similarity measurement using some distance metric. However, this requires domain knowledge to be able to capture the data / features that would best represent item similarity and also provide novelty, to be able to expand the user's interests.
We do not always know which features to capture.
For example: does the user choose a travel offering because of the location or because of the price/location combination (then maybe it would help to create other features that better capture the specifics of the items)? In short: feature engineering is emphasized in content-based filtering.
Another example: we used movie tags for movie recommendations. How about we add genre information? What if the genre information was already encoded in the tags (then we would not need it)? What if only part of the genre information was encoded - then we would need to merge them all. How about the movie length - some users evaluate movies poorly based on how long they were.
So in general it is accepted that CF systems or hybrid systems are better.
In ideal conditions CF outperforms CB.
But in specific cases …
Collaborative Filtering
Recommender Systems
One interesting phenomenon that can arise in collaborative filtering recommenders is the popularity of old items.
Example: "Touching the Void" (book) appeared much earlier than "Into Thin Air" (book), but after an online book store started noticing that people who buy "Into Thin Air" also buy "Touching the Void", they started to recommend it, and now it outsells the newer item 2 times over.
We can also note that CF has the ability to make items viral. Why? Because as the count of users that "liked" an item grows, so does the probability that other similar users will also get this recommendation.
Collaborative Filtering
Recommender Systems
CF is not a monolith whose implementation is set in stone so that everyone can just read it and implement it. There are various ways of doing it.
Ref: https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0
Recommender systems taxonomy: https://www.researchgate.net/figure/Taxonomy-of-Recommender-Systems_fig2_323726564
Collaborative Filtering
Recommender Systems
This example
(https://keras.io/examples/structured_data/collaborative_filtering_movielens/)
demonstrates Collaborative filtering using the Movielens dataset
(https://www.kaggle.com/c/movielens-100k) to recommend movies to users. The
MovieLens ratings dataset lists the ratings given by a set of users to a set of
movies. Our goal is to be able to predict ratings for movies a user has not yet
watched. The movies with the highest predicted ratings can then be recommended to
the user.
Collaborative Filtering Example
Recommender Systems
The steps in the model are:
Map user ID to a "user vector" via an embedding matrix
Map movie ID to a "movie vector" via an embedding matrix
Compute the dot product between the user vector and the movie vector to obtain a match score between user and movie (the predicted rating).
Train the embeddings via gradient descent using all known user-movie pairs.
Make predictions - pass (user_id, movie_id) pairs (same user id) to get the predicted ratings by that user, and sort them.
Tune the model and repeat. (A sketch of the model below.)
A Deep Learning model / Neural Net can help learn the embeddings that comprise the latent factors that decompose the user-to-movie matrix.
Importantly, embeddings connect 6 topics that look different on the surface: recommenders (CF), tabular data with NNs, NLP word embeddings, dimensionality reduction, Transformers, and geometric distance between concepts. Important to remember for job interviews.
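A minimal sketch of that dot-product model in Keras (the sizes, and the user_ids / movie_ids / ratings training arrays, are hypothetical):

import numpy as np
from tensorflow import keras

n_users, n_movies, dim = 1000, 500, 32   # hypothetical sizes

user_in = keras.Input(shape=(1,))
movie_in = keras.Input(shape=(1,))
user_vec = keras.layers.Flatten()(keras.layers.Embedding(n_users, dim)(user_in))
movie_vec = keras.layers.Flatten()(keras.layers.Embedding(n_movies, dim)(movie_in))

# Dot product of the two embeddings = the predicted match score / rating.
score = keras.layers.Dot(axes=1)([user_vec, movie_vec])

model = keras.Model([user_in, movie_in], score)
model.compile(loss="mse", optimizer="adam")
# model.fit([user_ids, movie_ids], ratings, ...) on all known user-movie pairs.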
Learning the embeddings with NN
Recommender Systems
Demo with excel:
Lesson 7: Practical Deep Learning for Coders 2022 https://youtu.be/p4ZZq0736Po?t=4153 (1:09:13)
Excel demo / playground: https://docs.google.com/spreadsheets/d/1U4IKBztUZiD4dy1J1skQAAz0qec2a2As/edit?usp=sharing&ouid=100991015018856999546&rtpof=true&sd=true
Demo with NN:
We can have an initial model that has only Dot product between embeddings & sigmoid
activation.
We can then add biases
Then add Dense layer
Add multiple Dense layers
Add Dropout regularization
… and so on, whatever helps.

Remember: with NNs we are trying to achieve a constant learning curve - as the network gets exposed to more data, it should learn something. If the learning curve is flat or declining, we have a problem.
Improvements to the model
Recommender Systems
Fast.ai framework takes collaborative filtering seriously
… they have a dedicated part in their course (https://youtu.be/cX30jxMNBUw?t=5114 -
precise point)
… and in their library documentation for collaborative filtering
(https://docs.fast.ai/collab.html)
See the link to get the from-scratch implementation and some more explanations: https://colab.research.google.com/github/fastai/fastbook/blob/master/08_collab.ipynb#scrollTo=AF5XmS6Fwtya
Let’s implement a recommender using goodbooks dataset (this is the second dataset
for recomm.). This dataset can be used for content-based filtering using tags.
Fast.ai and book recommendations
Recommender Systems
collab_learner() is dual - it can be dot-product based and it can be neural network based.

Interpretation of the bias term - the fast.ai documentation states that a single number is learned per movie - the bias parameter represents the "intrinsic value of the movie". How did they arrive at this conclusion? The correlation between the average rating given by all people vs. the bias for each movie is strong (we should verify that).
We can research: what does the user bias represent? How - just check various parameters that it correlates with. It might be that the user bias represents "the intrinsic value of the user" :D … that would obviously be in the context of this dataset; maybe users that have the most reviews would be thought of as valuable. Check the correlation between user review count and the bias (we can do that with a model trained in Keras or Fast.ai).
We learned an important lesson about model interpretability - check what various parameters are correlated with. From that we can understand what the neural network (or other models) represent with that parameter.
Fast.ai and book recommendations
Recommender Systems
Apply the knowledge you gained here to a new dataset.
Research: take any neural network we trained and check the bias of the last layer. Is it correlated with anything? For example, maybe we are trying to guess a flat's price by size and #rooms, and the bias is high when the room count is big and low in the reverse case (just one possible example).
Most recommenders (at least in the literature) do not take into account how recently the evaluation of a movie (and thus the adjustment to the user's preferences) was made. Movies watched by the user recently should probably have a higher impact on the person's current taste than those watched 15 years ago (IMDB accounts are >15 years old). How would we implement that? We would save a timestamp of when the evaluation was made and then apply some logic to dampen the evaluations as they become old: eval - (eval * 0.1 * N_years) … of course a real implementation would be different. Recommendation systems should not be time-independent. Also, keep in mind that the algorithms do not say that the recommended item is better in any substantive / objective manner.
Further explorations
Recommender Systems
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Recommender Systems
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 162F3V0NWTigZWvnichcXmZSygwOfb8Ue.pptx ---


Artificial Intelligence
Tabular Data Analysis with Pandas
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Goals of EDA
01
02
Datasets and Demo
Tabular Data Analysis with Pandas
00
What is EDA
Common questions
03
//
04
05
06
//
//
07
//
08
//
What is EDA
EDA - exploratory data analysis. It is nothing more than common sense (+ stats and visualization techniques) applied to unknown data - if you see a dataset (a bunch of images, videos, songs, texts, excel or csv tables), what questions will you ask about it? What immediate things would you like to know?
Images: how many? RGB, CMYK, BGR? Expected categories? Incorrect categories?
Videos: how many? Size? Format? Expected categories? Exceptions to the common rules?
Tabular data: dimensions, column data types, columns as data features, missing values? NaN?
Time series data: missing values…
Text data: …
Although the data science process is typically depicted linearly, it is obviously not linear (side images).
EDA + Transforms / Cleaning + Feature engineering > 50% of the data science process / data project time.

Exploration → Cleaning / Transformation → Feature Engineering

New dataset → first look → explanation files (.html, .pdf, .txt, .names, papers: https://arxiv.org/abs/1505.04868 HMDB51 dataset (https://ieeexplore.ieee.org/document/6126543), etc.) → EDA techniques.
Write down questions while looking through the data and answer those questions (talk to people if possible, use technical skills).
Tabular Data Analysis with Pandas
The amount of data collected grows - in the problem domains where we know precisely what we are solving we use a database; where we don't know, we use more unstructured storage types: data lakes (S3, filesystem, FTP) and warehouses. But availability is only half of the story - you also need quality. Quality of data is ensured by EDA + data cleansing informed by EDA. Data storage choices can sometimes make EDA easier / harder (video cannot be stored in a database that allows querying).
Main goals:
understand the dataset dimensions and parameters in general (nan, n/a, missing values, corrupt values and so on).
understand the domain to which the dataset belongs (crime stats, real estate prices, tree health problems, etc.).
ensure the data is not garbage (n.b.: GIGO) and develop techniques to eliminate garbage (nonsensical values, outliers (sometimes requiring domain knowledge), null / nan and other values).
prepare for data cleaning (and other transformations).
prepare for feature engineering, modeling and other data project pipeline steps.
The goal of EDA depends (to a small degree) on your goals and other constraints in the data project. For example, if you are trying to model only a specific narrow band of data (flats between 30-50 m^2) and you have only a few datapoints to work with in that band, you might preprocess the data in that band differently (impute missing values instead of eliminating them).
EDA is a "research" / spike / POC-level step/activity or a one-off job (kaggle competition) activity, as opposed to an automated data pipeline (no EDA). EDA is performed by a human/person to understand the data (but EDA automation also exists).
Goals of EDA
Tabular Data Analysis with Pandas
Datasets and Demo
Dataset:
New York city taxi dataset
Demos:
https://www.youtube.com/watch?v=OY4eQrekQvs (tree stump sizes is a googlable
question!)
https://www.youtube.com/watch?v=QiqZliDXCCg
https://www.youtube.com/watch?v=-o3AxdVcUtQ

https://ocw.mit.edu/courses/6-s897-machine-learning-for-healthcare-spring-2019/video_galleries/lecture-videos/

https://www.youtube.com/playlist?list=PLoazKTcS0Rzb6bb9L508cyJ1z-U9iWkA0 (very good playlist)
Tabular Data Analysis with Pandas
Common questions
EDA or cleaning / cleansing first? EDA: https://www.kaggle.com/questions-and-answers/103089
What are common pandas methods used in EDA? .describe(), .isna(), .info(), .isna().sum() (a sketch after this list)
What is EDA, and why is it important?
What are the different techniques you can use for EDA?
How do you approach data cleaning and preparation before performing exploratory
data analysis? Is it possible that the data was pre-cleaned by someone else?
Organizational structure: can data engineering interfere with ML engineering / data
science?
What are the different visualization techniques you can use to explore data?
How do you identify outliers in your data? Algorithmic vs visual methods.
How do you deal with missing or incomplete data in your analysis? Elimination, Imputation, Enrichment. Modeling techniques like multiple imputation (imputing in different ways and choosing one based on the models' performance) or domain-knowledge-based imputation are just specific examples of imputation.
What statistical tests can you perform to explore relationships between variables?
What is the difference between correlation and causation, and how do you determine causation? https://www.youtube.com/watch?v=dFp2Ou52-po ; https://github.com/py-why/dowhy ; https://www.pywhy.org/dowhy/v0.9.1/getting_started/index.html
Is correlation necessary for causation? https://theincidentaleconomist.com/wordpress/causation-without-correlation-is-possible/
[For interviews] Can you give an example of a challenging dataset you have worked
with in the past, and how you approached exploring it? How to determine if the
dataset is challenging: new domain to you? Required enrichment / augmentation? A
lot of dimensions (dim. reduction)?
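A minimal first-look sketch for the common pandas EDA methods mentioned above (taxi.csv is a hypothetical file name):

import pandas as pd

df = pd.read_csv("taxi.csv")   # hypothetical file

df.info()                      # dtypes, non-null counts, memory usage
print(df.describe())           # summary statistics for numeric columns
print(df.isna().sum())         # missing values per column
print(df.head())               # eyeball the first rows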

Tabular Data Analysis with Pandas


Common questions
How do you determine which variables are most important in your dataset (feature
importance)?
[For self-reflection] Which of these questions were obvious? Why? Why not?

Tabular Data Analysis with Pandas


Additional tooling
Pandas and Excel - primary.
Additional tools for automation: https://www.polymersearch.com/blog/exploratory-data-analysis-tools
For separate EDA tasks: scikit-learn (feature importance), … (PCA, feature importance).
For different datatypes: audio, video, image metadata (Python file packages).

Tabular Data Analysis with Pandas


Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Tabular Data Analysis with Pandas


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1HXaex-7KDAhDEO6Rp5RgUAz-hTvT5LiS.pptx ---


Artificial Intelligence
Natural Language Processing
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


//
01
02
//
Natural Language Processing
00
Intro
//
03
04
//
//
07
08
//
06
//
09
//
10
//
05
//
11
//
spaCy - open-source software library for advanced natural language processing
(Python / Cython).
MIT license. Main devs: Matthew Honnibal and Ines Montani (founders of "Explosion").

Unlike NLTK, which is widely used for teaching and research, spaCy focuses on
providing software for production usage.
spaCy also supports deep learning workflows that allow connecting statistical
models trained by popular machine learning libraries like TensorFlow, PyTorch or
MXNet through its own machine learning library Thinc.
Using Thinc as its backend, spaCy features convolutional neural network models for
part-of-speech tagging, dependency parsing, text categorization and named entity
recognition (NER).
Prebuilt statistical neural network models for 23 languages (English, Portuguese,
Spanish, Russian, Chinese … )
There is also a multi-language NER model.
Additional support for tokenization for >65 languages.

Ref: https://spacy.io/
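A minimal usage sketch (assumes the small English model has been downloaded with: python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # part-of-speech, dependency
for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities (NER)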
Intro
Natural Language Processing
How much of a tradeoff are we talking about?
Efficiency vs. accuracy
Natural Language Processing
//
TRF vs. Other model types
Natural Language Processing
//
Further explorations
Natural Language Processing
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Natural Language Processing


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1y6c4V7EfaRsMEBEhKCXTEd-UlhGFDmLx.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius
2 Level
1 Chapter
Today you will learn
Complex indexing
Broadcasting
01
02
03
Arrays with structured data
Python Crash Course
00
Complex datatype definitions
05
Complex reshaping
06
Miscellaneous operations
07
Numpy financial
08
Numpy performance
04
Vectorization
Complex datatype definitions
Because numpy automatically converts mixed data to strings, we can't perform certain operations on mixed data arrays - element-wise addition is one such example.
To bypass this we can specify the types explicitly. This is why we need to know how.
Sometimes it can be hard to convert or understand numpy dtypes.
This is especially true when working with external data.
Attention: if you see a numpy array like this: [(..), (..), .. , (..)], its type will be void; this is a general flexible datatype in numpy: https://stackoverflow.com/a/25247533/1964707
You can have mixed datatypes in numpy - however, use this as a last resort. A multidimensional array with mixed datatypes can be understood as a table / tabular data, and it's more common to use Pandas for this purpose.
Numpy - homogeneous arrays, Pandas - heterogeneous "arrays".
Remember: numpy i8 is a signed int, u8 - an unsigned int; also remember that there are no protections against overflow.
See: https://numpy.org/doc/stable/user/basics.types.html
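A small sketch of an explicit structured dtype (hypothetical fields):

import numpy as np

# One record = (name, age, height); explicit types per field.
dt = np.dtype([("name", "U10"), ("age", "i8"), ("height", "f4")])
people = np.array([("Ona", 30, 1.68), ("Jonas", 25, 1.82)], dtype=dt)

print(people["age"] + 1)      # element-wise operations now work per field
print(people.dtype.names)     # ('name', 'age', 'height')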
Python Crash Course
Complex datatype definitions
Refs: https://numpy.org/doc/stable/reference/arrays.dtypes.html#arrays-dtypes
Python Crash Course
Complex indexing
Complex indexing, like boolean masks and indexing arrays with arrays as indexes, is sometimes called fancy indexing.
What is fancy indexing? Advanced indexing in numpy - using arrays as indexes, using boolean arrays as indexes, or using conditions!
Can we use arrays as indexes when indexing numpy arrays? Yes!
Python Crash Course
Complex indexing
… you can change multiple values with array indexing with a single operation.
Python Crash Course
Complex indexing
Boolean arrays and boolean conditionals for indexing
Python Crash Course
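A small sketch of both kinds of fancy indexing:

import numpy as np

arr = np.arange(10, 20)

print(arr[[0, 3, 5]])        # index with an array of indexes -> [10 13 15]
arr[[0, 3, 5]] = 0           # change multiple values in a single operation

mask = arr > 14              # boolean array
print(arr[mask])             # boolean mask indexing
arr[arr == 0] = -1           # a boolean condition used directly as the index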
Arrays with structured data
Arrays can contain complex data
Python Crash Course
Broadcasting
Broadcasting - the name of a mechanism numpy has that allows it to work with arrays of non-equal shapes in operations.
Applies to element-wise operations!
Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes.
Animations: https://matteding.github.io/2019/04/11/numpy-broadcasting/
Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.
Broadcasting is CPU- and memory-efficient, as no copies are made.
Broadcasting scalars - the scalar value is "broadcast" over the array. Simple to visualize and understand; does not produce errors due to shape incompatibilities.
Broadcasting arrays - the shapes have to be the same or compatible.
Examples (a sketch below):
(3, 6) and (3, 1) can be broadcast
(3, 6) and (3, 2) can't be broadcast
(3, 6) and (2, 1) can't be broadcast
Compatibility rule: when array broadcasting is performed, every dimension needs to match, or the dimension that does not match has to be 1 (e.g. (1, 6) * (5, 1) works because every non-matching dimension is 1).
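A small sketch of the rule in action:

import numpy as np

a = np.ones((3, 6))
b = np.arange(3).reshape(3, 1)    # shape (3, 1)

print((a + b).shape)              # (3, 6): the size-1 dimension is stretched

c = np.arange(2).reshape(2, 1)    # shape (2, 1)
# a + c  -> ValueError: shapes (3, 6) and (2, 1) are not compatible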
Python Crash Course
Broadcasting
Examples.
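For instance (a minimal sketch, shapes chosen for illustration):

import numpy as np

a = np.ones((3, 6))
b = np.arange(3).reshape(3, 1)   # a column vector

print((a * 2).shape)    # the scalar is broadcast over the whole array -> (3, 6)
print((a + b).shape)    # the (3, 1) column is stretched across 6 columns -> (3, 6)

# Incompatible shapes raise an error:
# np.ones((3, 6)) + np.ones((3, 2))   # ValueError: operands could not be broadcast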
Python Crash Course
Complex reshaping
Adding a new axis:

print(np.array([1, 2])[np.newaxis, :]) -> [[ 1, 2 ]]


print(np.array([1, 2])[:,np.newaxis]) -> [[1], [2]]

Automatic reshaping - we can reshape w/o specifying one of the dimensions (any one
dimension can be passed as -1, the unknown dimension).
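A quick illustration of the unknown dimension:

import numpy as np

arr = np.arange(12)
print(arr.reshape(3, -1).shape)   # numpy infers the unknown dimension -> (3, 4)
print(arr.reshape(-1, 6).shape)   # -> (2, 6)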
Python Crash Course
Vectorization
Numpy can convert a function to work with arrays automatically even if it is
declared to accept a single scalar parameter (np.vectorize).
It’s a convenience tool - essentially a loop in disguise - it does not make the
code run faster.
Many tricks implemented in numpy can become research projects for the curious
minds, I would encourage anyone who is interested in how numpy speeds things up to
research that. G. Hotz implements fast matrix multiply, comparing it to numpy using
FLOPS: https://www.youtube.com/watch?v=VgSQ1GOC86s
See: https://stackoverflow.com/a/3379505/1964707
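A minimal np.vectorize() sketch (the step function is illustrative):

import numpy as np

def step(x):                 # written for a single scalar
    return 1 if x > 0 else 0

vstep = np.vectorize(step)   # convenience wrapper, not a speedup
print(vstep(np.array([-2, 0, 3])))   # -> [0 0 1]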
Python Crash Course
Numpy financial
Provides financial functions to help with financial calculations.
Needs to be installed and imported separately: import numpy_financial as npf
Refs:
Repo: https://github.com/numpy/numpy-financial
Docs: https://numpy.org/numpy-financial/
Function reference: https://numpy.org/doc/1.17/reference/routines.financial.html
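A minimal sketch (the loan numbers are illustrative):

import numpy_financial as npf

# Monthly payment on a hypothetical 200000 loan, 5% annual rate, 30 years
payment = npf.pmt(0.05 / 12, 30 * 12, -200000)
print(round(payment, 2))   # ~1073.64

# Future value of saving 100 per month for 10 years at 5% annual interest
print(round(npf.fv(0.05 / 12, 10 * 12, -100, 0), 2))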
Python Crash Course
Miscellaneous operations
np.argmax(), np.argmin() - get the position of the biggest or smallest value
np.argsort(arr, kind='mergesort') - return the indexes that would sort the array
arr[np.argsort(arr, kind='mergesort')] - display values in sorted order
np.linspace(0, 5, 5) - return evenly spaced numbers over a specified interval
np.geomspace() - geometric progression
np.random.randint(10, 50, size=(2, 3)) - generate random matrix
np.random.randint? - display the docstring
saving numpy objects for future work:
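A minimal save/load sketch (the filename is illustrative):

import numpy as np

arr = np.random.randint(10, 50, size=(2, 3))
np.save('my_array.npy', arr)         # binary .npy format
loaded = np.load('my_array.npy')
print(np.array_equal(arr, loaded))   # -> True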
Python Crash Course
Digression on statistical distributions - recommended: uniform, pareto,
gaussian/normal.
Numpy performance
Numpy is fast. Much faster than regular python when it comes to array processing,
some estimates range from 2 to 1000 times faster.
Some tests performed and described: https://towardsdatascience.com/how-fast-numpy-
really-is-e9111df44347
There are numerous reasons for that:
Dense, homogeneous arrays with the added benefit of locality of reference.
Reimplementation of common operations in C (do not reduce the answer to “Numpy is
implemented in C”, as Python lists are implemented in C too - this is a common
mistake to make).
SIMD, vector processing instruction usage with 3rd party BLAS/LAPACK linear algebra
processing libraries:
https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms
Some parallelism is also involved (BLAS has some multithreading)
Take note, if you try to optimize a regular program by adding numpy it might not
work as numpy performance benefits begin to show up only for large arrays:
https://stackoverflow.com/questions/52603487/speed-comparison-numpy-vs-python-
standard
Refs:
https://stackoverflow.com/questions/8385602/why-are-numpy-arrays-so-fast
https://stackoverflow.com/questions/41365723/why-is-my-python-numpy-code-faster-
than-c
Python Crash Course
Practical Task P2
Two choices / two tasks:

There is a “dog breeding by picture shredding” meme (googlable) on the internet.
Some say that this is proof that we live in a simulation (see videos). Refs:
We live in a matrix video:
https://www.tiktok.com/@aleksandr.ne.bloger/video/7008549854318251265?lang=en
https://www.youtube.com/watch?v=f1fXCRtSUWU
https://digg.com/video/shredder-multiplying-photos
https://laughingsquid.com/a-clever-collage-shredding-trick/
Your task is to try to do the same, but with numpy. “Prove that we live in a
simulation” by cutting the image, rotating the cut pieces and displaying the
intermediate results. Include the final result (picture) in the github readme or
colab notebook.
Please provide the results as a github link to the code or google colab link to the
notebook.

Combine the knowledge you obtained from part1 with what we are learning in part2:
Scrape some data from the internet (does not need to be complex).
Initialize a numpy array from/with that data.
Perform some calculations over it (mean / average, sum, other simple descriptive
statistics, and some array filtering must be included).
Please provide a complete working solution that has some documentation on how to
run it and what it does (readme / collab).
Please provide the solution as either a github repo or collab notebook with all the
data necessary to just launch it.
** You can augment your previous PP1 if you want.
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/
Python Crash Course
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1Jj2m3g08Vx3pmb3fXPR1G982IJBt9vLz.pptx ---


Artificial Intelligence
Advanced Computer Vision
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


YOLO version comparison
01
02
Image Object Detection w/ YOLO
Advanced Computer Vision
00
Explaining YOLO
Video Object Detection w/ YOLO
03
Connecting to a webcam
05
04
YOLOv5 on a custom dataset
06
Further explorations
07
Practical Project 13
YOLO means “You Only Look Once”
A single-shot object detection model / architecture. A CNN with relatively fewer
layers and filters than some other rival models. Single-shot detection is the
differentiating factor that makes the model fast.
Originally this model / architecture was created with a Deep Learning CV framework
called “darknet” (written in C and CUDA parallel computing platform)
You can preload weights trained on certain datasets to locate objects in your data.
For example COCO dataset - 80 categories. So in many cases it works out of the box.
There are several videos explaining how YOLO works:
https://www.youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF
Of note is the fact that YOLOv1 is said to have [citation needed] been the first
DL object detection algorithm to be real-time capable.
Another dedicated YOLO video [TBD]
For low-power devices (embedded IoT, drones, wearables) the TinyYOLO variant is
often used. It is faster, has fewer convolutional layers and uses smaller filter
sizes, a smaller grid and fewer output layers.
Explaining YOLO
Advanced Computer Vision
YOLOv1 - 2016
YOLOv2 - 2017, earning an honorable paper mention at CVPR 2017. The architecture
made a number of iterative improvements on top of YOLO including BatchNorm, higher
resolution, and anchor boxes.
YOLOv3 - 2018. (YOLOv3 paper is perhaps one of the most readable papers in computer
vision research given its colloquial tone.) YOLOv3 built upon previous models by
adding an objectness score to bounding box prediction, added connections to the
backbone network layers, and made predictions at three separate levels of
granularity to improve performance on smaller objects.
YOLOv4, released April 2020
YOLOv5, released June 2020
PP-YOLO and so on. Baidu released their own YOLO version with some improvements. PP
stands for “PaddlePaddle” and it is a DeepLearning framework from Baidu (worth
exploring after the course).
YOLOv6, released ___
YOLOv7, released ___

Read more: https://blog.roboflow.com/guide-to-yolo-models/


YOLO version comparison
Advanced Computer Vision
Image Object Detection w/ YOLO
Advanced Computer Vision
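A minimal image-detection sketch, assuming the YOLOv5 PyTorch Hub interface (model size and image URL are illustrative):

import torch

# Load a small pretrained YOLOv5 model from PyTorch Hub (trained on COCO, 80 classes)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

results = model('https://ultralytics.com/images/zidane.jpg')  # path / URL / numpy image
results.print()   # summary of detected objects
results.show()    # render bounding boxes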
Video Object Detection w/ YOLO
Advanced Computer Vision
It is possible to train YOLO models on a custom dataset.
Many resources online
Ref: https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/
Ref: https://blog.paperspace.com/train-yolov5-custom-data/
You might encounter difficulties with:
Labeling and label format
YOLOv5 on a custom dataset
Advanced Computer Vision
OpenCV has direct bindings to the webcam on Windows (maybe cross-platform).
Reference: https://stackoverflow.com/a/606154/1964707 and starter code (below).
Install with: pip install opencv-python. Do not launch while other software has the
webcam acquired.

import cv2

cv2.namedWindow("preview")
vc = cv2.VideoCapture(0)

if vc.isOpened():  # try to get the first frame
    rval, frame = vc.read()
else:
    rval = False

while rval:
    cv2.imshow("preview", frame)
    rval, frame = vc.read()
    key = cv2.waitKey(20)
    if key == 27:  # exit on ESC
        break

cv2.destroyWindow("preview")
vc.release()
Connecting to a webcam
Advanced Computer Vision
Launch on GPU
Train with custom dataset and hyperparameters
Real time YOLO based object detection - use your webcam.
Further explorations
Advanced Computer Vision
For this part
use any of the tools / approaches (YOLO, RetinaNet, SRCNN) on your own
data.
* end2end project with labeling and single object detection.
** real time object detection with webcam
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 13 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the collab notebook (double check the share options of the
notebook) when finished for review and evaluation.
Practical Project 13
Advanced Computer Vision
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Advanced Computer Vision


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 18ZrD9J-VzyTACP8KmgN-vaiEFgpnkW8c.pptx ---


Artificial Intelligence
Machine Learning
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
ML Algorithm Classification
01
02
ML Pipeline
Machine Learning
00
Machine Learning Definition
Scikit-learn
03
04
Bias-variance tradeoff
05
06
07
Regression Intro
ML as a service
08
Data Science Process Frameworks
ML Certification
09
MWE
Machine Learning Definition
We have a few main goals in the first lecture:
Define Machine Learning and distinguish, separate it from other, related fields.
Walkthrough the complete ML pipeline process.
Explain what regression problem is.
Explain the different learning algorithms that can solve regression problem.

The most confusing part for the students is the plethora of ML models that exist:
when to use which
how do they differ
why are there so many algorithms that can be used for the same problem.

This is not the lecture that will answer this question completely, but it will be a
start! And importantly: pay attention and ask yourself “did I understand when to
apply a particular model” - this is the most important thing in this part.

With that out of the way let’s move to the definition!


Machine Learning
Machine Learning Definition
“Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and by the use of data. It is seen as a part of
artificial intelligence. Machine learning algorithms build a model based on sample
data, known as "training data", in order to make predictions or decisions without
being explicitly programmed to do so [... “validation” and “test data” (authors
addition)]. Machine learning algorithms are used in a wide variety of applications,
such as in medicine, email filtering, speech recognition, and computer vision,
where it is difficult or unfeasible to develop conventional algorithms to perform
the needed tasks.

A subset of machine learning is closely related to computational statistics, which


focuses on making predictions using computers; but not all machine learning is
statistical learning. The study of mathematical optimization delivers methods,
theory and application domains to the field of machine learning. Data mining is a
related field of study, focusing on exploratory data analysis through unsupervised
learning. Some implementations of machine learning use data and neural networks
(Deep Learning) in a way that mimics the working of a biological brain. In its
application across business problems, machine learning is also referred to as
predictive analytics.”

Source: https://en.wikipedia.org/wiki/Machine_learning

Python Crash Course


Machine Learning Definition
From this definition we need to highlight:
Automatic improvement, no hardcoded rules. What are the properties of problems
solved with ML that do not allow classical approaches?
High data dimensionality - we can’t create algorithms based on loops and branches
that accommodate hundreds of dimensions (features like patient blood
characteristics, urine characteristics, lifestyle characteristics) and their
unknown relative importance.
Large amounts of data (data is the new oil) - data mining uses machine learning
techniques.
Incomplete understanding of the problem domain - how would you create an autonomous
helicopter that does tricks with classical programming? You wouldn’t.
Variability in data - there is variation in cats. It is impossible to define a
procedure that would identify all the cats from all possible angles.
What made ML hot in recent years - large amount of data constantly accumulating,
algorithms / research matured and great results were achieved, computational power
of even home computers became huge (thanks to gamers in a way).
Note on the “vague understanding of the domain” - sometimes we know something to be
right, but don’t know the precise programmable criteria to define it. Example:
explicit sexual content - “we know it when we see it”, but it is hard to define the
precise line. Also, an ML model can segment customers into risk / profitability
profiles w/o us even knowing which criteria were important or their relationships.

Python Crash Course


Machine Learning Definition
It is worth noting that the problem’s complexity dictates how much data we need,
and that in turn dictates the models that can represent this data complexity. Of
course, just because you have a lot of data does not mean you need a big model.
The problem complexity dictates that.

Sometimes a problem is solved (a model that can act / prognosticate is constructed)
by reducing complexity, not by increasing the model size or data requirements.

Python Crash Course


Machine Learning Definition
From this definition we need to highlight (cont.):
No hardcoded rules in the model - this is the primary differentiating factor
between machine learning and classical programming.
Not all of ML is statistical learning applied - this is a common misconception,
that there is no difference between machine learning and statistical learning (some
even say statistics). This is an opinionated question, but this graph summarizes it
well: https://www.datasciencecentral.com/profiles/blogs/machine-learning-vs-
statistics-in-one-picture
Python Crash Course
Machine Learning Definition
From this definition we need to highlight (cont.):
… although there are more differentiation arguments:
https://stats.stackexchange.com/questions/442128/machine-learning-vs-statistical-
learning-vs-statistics
https://stats.stackexchange.com/questions/6/the-two-cultures-statistics-vs-machine-
learning
Let’s take a look at the applications mentioned: “such as in medicine, email
filtering, speech recognition, and computer vision”:
First we see that there is a scale difference in the examples given - medicine is a
very broad sphere of human activity, email filtering is a very narrow and specific
application. In medicine ML has helped in - and this is one example amongst many -
diagnostics, the task being classification of patients into “positive” and
“negative” based on the images (MRI, X-Ray, CT scan, etc). Discussion: Is
predicting the number of patients that will come into a hospital really a machine
learning application in medicine? How would you model that?
Another thing to note is that most of the applications here are where mainly deep
learning is shining - non-DL ML algorithms like kNN
(https://pubs.acs.org/doi/abs/10.1021/ie049667a , American Chemical Society),
Random Forests for Statistical Genomics, microarray methods
(https://link.springer.com/chapter/10.1007/0-387-29362-0_16 , Springer:
Bioinformatics and Computational Biology Solutions Using R and Bioconductor) and
dimensionality reduction techniques are useful in various scientific fields (see:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030116#pcbi
-0030116-b038
Civil engineering: https://www.amazon.com/Primer-Machine-Learning-Applications-
Engineering… - example: recognize the count and types of components in a
electronics diagram, formulate order.
...let’s try to find some applications in your field of interest! Sports - given
data about basketball players at age 16 “predict” if a given 16-year old aspiring
basketball player has the potential / will it be easy for him.
Python Crash Course
Machine Learning Definition
From this definition we need to highlight (cont.):
Classification of ML - how is it different than AI? ML is part of AI, but AI has a
more general goal of creating intelligent systems. What is something that is part
of AI, but not part of ML? Classical NLP that performs natural language tasks using
formal grammar syntax (universal grammar), gaming AI in classical games is also
classical, not ML/DL-based, “Boston Dynamics” robots are also using hardcoded rules
rather than ML/DL (see: https://www.quora.com/How-does-Boston-Dynamics...). See
more: https://www.linkedin.com/pulse/... Control Theory.
Expert systems, classical grammar rules based NLP, classic CV for face recognition
- AI, not ML.
We will talk about differences between ML and DL in the next part on DL. But in
short, DL is concerned with Neural Networks, which are believed to be the most
generalizable set of algorithms / datastructures, solving the most complex
issues (games, self-driving cars, more precise image processing, language
generation, question answering and so on). Deep learning methods won over other ML
methods in all supervised problems except tabular data problems (by predictive
power).

Python Crash Course


There are broad categories of machine learning algorithms:
Supervised learning algorithms - we have data x and labels y. We want to discover
the f(x) = y relationship for unseen data based on data we’ve seen. In supervised
learning we know the right answers for a large number of cases and learn from them.
Tasks: regression (predicting continuous variables) and classification (predicting
discrete variables).
Unsupervised learning algorithms aim to organize data based on the internal
properties of the data. No labels, no ground truth is known beforehand. They are
limited in scope and applicability, but very useful when we can use them. Sometimes
used in feature engineering to make other models, like deep neural networks more
efficient (i.e. reduce the dimensionality of the problem). Tasks: clustering,
anomaly detection and dimensionality reduction.
Reinforcement learning algorithms - goal oriented learning. Based on the actor /
agent environment abstraction and a reward function that is being maximised. Mainly
used in autonomous systems that operate w/o human guidance - games for example.
There are many more categorizations we will see throughout this course. One
additional way to categorize ML algorithms would be by the way data is ingested
during the learning process - learning modes: batch and online learning, see:
https://stats.stackexchange.com/questions/... and https://vitalflux.com/difference-
between-online-batch-learning/
SQ: self-driving cars (the ultimate problem for ML) perform all / many types of
learning - they solve problems of image segmentation, object localisation, action
using RL, etc. Because self-driving is now watered down to the “best possible
driver assist system”, it might be that personal assistants will be the most
profitable problem to solve.
ML Algorithm Classification
Python Crash Course
Problems
Supervised learning
House price prediction using regression on tabular data.
Bounding box around a face in a picture - bounding box regression.
Classification of malignant or benign cancer based on the age of the patient and
the tumor size (or more features / dimensions).
Classification of spam vs. non-spam emails.
Unsupervised learning
Grouping of news articles that are about the same thing: google news.
Grouping people by genomic profile.
Organize computers into computers that work together in a datacenter - when one is
active the others are active too. Minimizing the distance can make them work more
efficiently, google search: “data center optimization with machine learning”
Automatically finding groups of friends in a social graph.
Finding dating matches.
Market segmentation and so on.
Cocktail party algorithm to separate people speaking over each other [really
unsupervised? TBD].

Visuals taken from: https://www.coursera.org/learn/machine-learning/


ML Algorithm Classification
Python Crash Course
ML Pipeline
There are many visualizations of what the ML Pipeline looks like. Some emphasize
certain parts, some other parts, but this is a good one:

We can note that ML training is an iterative process (what isn’t these days) -
this means the model is never finished. You achieve an MVP, release it and improve
it. Source: https://towardsdatascience.com/not-yet-another-article-on-machine-
learning-e67f8812ba86

Python Crash Course


Note: this is pretty much the same pipeline we saw when we talked about data
science; the only real differences are in the algorithms and techniques used - an
ML pipeline necessarily involves creating an ML model. In general, the steps are
the same.

This applies to Deep Learning as well, as deep learning is a subset of ML, so the
same way of thinking can be used.

Note: the pipeline has feedback mechanisms! It is not linear!


ML Pipeline
1. Problem Definition: Define the business problem you require an answer for. It
has to be clearly defined.
… there can be more but these are the most common. Example: which features maximize
the differences between datapoints (dim. reduct.)

Python Crash Course


ML Pipeline
2. Data Ingestion: Identify and gather the data you want to work with. Sometimes
there is no data and you need to acquire it, even doing research on the legal
aspects of it (remember the scraping lawsuits? or customer data distribution by
facebook / Cambridge Analytica).

Files (with different formats), SOAP / REST / GQL endpoints, SQL database, non-
relational / no-sql databases, sensors in proprietary formats - all structured
data. Unstructured data like audio, images and videos are also commonly used.

Even if we take just textual data we can see how many different data sources we can
have: tweets, facebook posts, emails, blogposts, notes in notepad can all be
data sources of a textual type.
Python Crash Course
ML Pipeline
3. Data Preparation: Since the data can be raw and unstructured, it is rarely in
the correct form to be processed. It usually involves filling missing values or
removing duplicate records or normalising and correcting other flaws in data, like
different representations of the same values in a column for instance, dropping
outliers, etc. This is where the feature extraction, construction and selection
takes place too. It is important to remember: most ML models work only with data
expressed / encoded into numbers - so non numeric data needs to be converted. This
is often the most time consuming step in the ML pipeline; some estimates say that
this is where 40-70% of the time is spent [citation needed].

First we might perform data analysis in order to:


understand our data (univariate: mean, median, percentiles, standard deviation,
outliers and so on, bivariate/multivariate: correlation).
apply data visualization which is part of data analysis as well, but also we can do
it before we even attempt to create a model.
check if the data is sorted, some models are sensitive to initially observed
datapoints and get into the bias-groove.

Feature engineering:
Apply domain knowledge to transform features or create new ones (e.g.: hemoglobin /
lipid quotient).
Simple examples: turn date of birth into age, as it will most likely be simpler for
the model to understand; convert weekdays / month names to numbers. If you have
patient data, maybe you would create a BMI column from height and weight.
Feature scaling / normalization:
Most machine learning models work best on data that is of similar scale.
If you have thousands of meters and tens of grams it might be a good idea to
express distance as 1.2 km instead of 1200 m.
For example, the K-means clustering algorithm relies on Euclidean distance and is
affected by the scale of the different attributes of your datapoints.
4 ways of data scaling: https://en.wikipedia.org/wiki/Feature_scaling - important:
normalization and standardization.
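A minimal scaling sketch with scikit-learn (the data values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical two-feature data on very different scales (meters and grams)
X = np.array([[1200.0, 20.0], [800.0, 35.0], [1500.0, 10.0]])

print(StandardScaler().fit_transform(X))   # standardization: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))     # normalization: rescaled to [0, 1]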

Questions:
What kind of data preparation could be done on a video?
What kind of data preparation could be done on an audio recording? (cutting off the
silence at the beginning and end - can we detect it? Can we detect it
deterministically). Do we need meta information on audio file, can we remove it?

Python Crash Course


ML Pipeline
4. Data Segregation: Split the data into subsets to train the model (train set),
test it (validation set) and further validate (test set) how it performs against
new data. Sometimes we work with just two sets: train and validation. We will talk
about the need for a test dataset in the future.

There are many problems / questions in this situation:


why not use the entire data for training? Two reasons:
Inability to evaluate model against unseen data if everything is used. When you
have used up your data, what will you use then?
Detection of overfitting - model will “memorize” patterns of data and not
generalize to unseen data. We will see that the effect of overfitting becomes more
prominent the more powerful the model we have is. Intuitively you can think of this
as a student memorizing the material and not being able to apply it because he has
no understanding. The opposite is underfitting - then the model simply does not fit
the pattern because it’s undertrained.

how much data should we use from all the data we have for each set? 70/30, 80/20,
90/10
should this number change if we have a lot of data? small amount of data? The less
data, the more we use for training. This is because ML/DL models are inherently
quite hungry for data, so no complex problem can be solved w/o some minimal
threshold of data that would be able to satisfy the complex model modeling the
complex problem domain (data augmentation).
do we just split the data randomly into 2 big portions? No, we randomize samples!
Sensitivity to initial samples: imagine you were training Tarzan to recognize cars
and showed red cars at the beginning - he might associate red-ness with car-ness,
and the next samples will have to negate this false connection between the concepts
red and car! That’s why we randomize data. We can test whether samples have similar
statistical properties: if you dedicate all expensive flats to the test dataset and
the train dataset contains only cheap flats, the model might work well for
cheap flat prediction but fail for expensive ones, and you will waste time
searching for a better model while the only problem was with your skill of data
splitting - both test and train (and validation) should have the same (or just very
similar) statistical properties as the entire dataset.
what if we have series/sequential data? To avoid selection bias, should we simply
have a cut-off point or randomize the datapoints so that test and train data
would have values from any period assigned randomly (hint: this depends on whether
there is a time-related correlation / trend)? EKG classification for heart disease
vs. stock price prediction.
Python Crash Course
ML Pipeline
Python Crash Course
4. Data Segregation (cont.)

We have two main techniques for segregating the data:


Hold-out or test/train split - recommendation: start from the Pareto principle,
80/20.
K-fold cross validation / k-fold x-validation - we split the training data (those
70-80%) into parts. The part count is the parameter K. We then train the model K
times, each time holding a different fold out for validation. In this case the
error is calculated as the average error obtained when training over each fold.
Recommendation: start from K~10 and test different values. K-fold is slower, but
more accurate. A minimal sketch of both techniques follows the refs below.
Refs: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-
7637112d3f8f and https://datascience.stackexchange.com/questions/52632/cross-
validation-vs-train-validate-test
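A minimal sketch of both techniques with scikit-learn (the synthetic data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X = np.arange(100).reshape(-1, 1)
y = 3 * X.ravel() + np.random.randn(100)   # a noisy linear relationship

# Hold-out split: 80/20, samples shuffled/randomized by default
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# K-fold cross validation with K=10: the reported score is averaged over the folds
scores = cross_val_score(LinearRegression(), X, y, cv=10)
print(scores.mean())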

ML Pipeline
Python Crash Course
4. Data Segregation (cont.)

Selection bias and why we need to shuffle and randomize the samples:
In the standard formulation of machine learning problems, the learning algorithm
receives training and test samples drawn according to the same distribution (i.e.:
for example 60/40 ECG’s of sick/well patients in training set and 60/40 test set,
ideally). This assumption often does not hold in practice. The training sample
available is biased in some way, which may be due to a variety of practical
reasons: cost of data labeling or acquisition, data selection when feeding to the
model (if the data initially fed is only from sick patients, the model can become
overly sensitive to sickness indicators and see them where there are none - this
can happen if a large portion of the initial samples is from a single category).
If you feed your data in a series and don’t randomize it, it might have selection
bias, as it is known that parts of the dataset tend to cluster around the same
values.
Ability to identify biased datasets is a research topic as well as a topic for EDA.
Check your dataframes/ arrays/tensors for asymmetric distributions.
More biases: https://developers.google.com/machine-learning/crash-course/fairness/
types-of-bias
We assume sample independence when we shuffle data - we predict only based on
features of a single sample. This does not concern sequential / time series.

Scikit-learn
We will use the scikit-learn library, which is arguably the most important machine-
learning (non-DL) specific library in the world.
This library is open source and built on top of numpy, scipy and other powerful
libraries: https://github.com/scikit-learn/scikit-learn
It contains most of the ML non-DL algorithms and some DL capabilities.
Also contains all the peripheral algorithms for scaling the data (scaler objects),
segregating (train_test_split, cross_validate) data, visualization (ROC, confusion
matrix, f1 score) and so on.
You can essentially learn most of what is non-DL machine learning by just learning
this one library.

Python Crash Course


ML Pipeline
5. Model Training: Use the training subset of data to let the ML algorithm
recognise the patterns in it. In this step we can try several algorithms or
ensembles of them to find the most suited models / algorithms that fits the problem
best.

6. Candidate Model Evaluation / Tuning: Assess the performance of the model using
test and validation subsets of data to understand how accurate the prediction is.
This is an iterative process and various algorithms might be tested until you have
a model that sufficiently answers your question. Different metrics can be evaluated
independent of accuracy for example - time to train, inference time, memory used /
CPU / GPU / electricity consumed (remember learning algorithms are also
algorithms). There are several ways to improve the model: hyperparameter
optimization / tuning and ensemble learning (adding another ML algorithm to the
mix).

7. Model deployment / operationalization: Once the chosen model is produced, it is


typically exposed via some kind of API / UI web app and is embedded in a decision-
making framework as a part of an analytics solution … or even a script. MLOps is a
new DevOps off-shoot for deployment and operationalization of ML solutions, ref:
https://en.wikipedia.org/wiki/MLOps . Example: an API that the fire department
operator can call, providing currently known data like humidity and temperature, to
gain understanding of whether she/he should call for backup from other
fire departments. Or you constantly get data and want to constantly retrain
models on an ever-increasing dataset and immediately deploy the model to production
(ML pipeline automation) - MLOps engineers are the dedicated people who do that.

8. Performance Monitoring: The model is continuously monitored to observe how it


behaves in the real world, and calibrated accordingly. New data is collected to
incrementally improve it. Logging and collecting input data are the most important
techniques. The most difficult part is obtaining the labels, which often is not
possible. In our previous example it is possible that the fire department will
calculate what the actual area was and enter it into the system for us. But in the
majority of cases it is hard to get the actual values - some companies crowdsource
the step of getting the right answer (delfi.lt - propaganda comments
classification). Additionally there is a self-defeating paradox here: if the model
was trained on data that was obtained when there was no such model, and then it
gets fed data where fires were mitigated faster due to predictions of the current
model, we will have problems. This demonstrates that the problem of continuous
improvement is not a simple one.
Python Crash Course
ML Pipeline
Pipeline automation: MLFlow, NeptuneAI
MLflow - open source platform for managing the end-to-end machine learning
lifecycle. It tackles four primary functions: tracking experiments to record and
compare parameters and results (MLflow Tracking), doing it collaboratively,
packaging ML code in a reusable, reproducible form in order to share with other
data scientists or transfer to production (MLflow Projects), managing and deploying
models from a variety of ML libraries to a variety of model serving and inference
platforms (MLflow Models) and providing a central model store to collaboratively
manage the full lifecycle of an MLflow Model
NeptuneAI - similar service to MLflow.
Databricks - not the same as these two, it’s a unified data analytics platform
built on Apache Spark. Databricks provides a managed Spark environment in the
cloud, along with tools for data ingestion, ETL, streaming, and machine learning.
It can be used for building and deploying machine learning models, but its scope is
broader than just model management and experiment tracking.

Python Crash Course


Regression
ML model trained to predict numerical values, like prices of the house, area
affected by the fire is said to be solving a regression problem.
We need to distinguish between the following types of regression problems:
univariate regression, multiple and/or multivariate regression, linear regression,
non-linear/polynomial/multinomial regression, multi-output regression.
With respect to each ML model we need to understand:
How the error is calculated - moderate difficulty to understand.
How the model learns, i.e. minimizes aggregate errors by adjusting its internal
state - can be hard to understand. Learning algorithms.
How the model predicts - not difficult in many cases.
The simplest regression problem is univariate linear regression. The objective of
an ML algorithm when solving univariate linear regression problem is to learn the
coefficients of an equation of a line that produce the smallest total error - in
that sense ML algorithms are said to be minimization algorithms.
Line: y = mx + b, where m is the slope and b is the vertical shift (intercept). The
model predicts y for a given x with the learned parameters m and b. This model has
only 2 parameters (DL models - billions of parameters). This is the model’s
“representational power”.
Total error is the sum of all errors for each prediction. An error for each
prediction in a regression problem is defined by the difference between the
prediction made by the model - ŷ (“y hat”) - and the actual value y ⇒ y - ŷ. Hence
the total error is sum(y - ŷ). To produce only non-negative values we square each
error term, and then, to obtain the standard measurement - MSE (mean squared
error) - we take the average: MSE = (1/n) * sum((y - ŷ)²).
There are many ways to calculate errors when machine learning problems are solved;
MSE is just one of them. … that’s the error calculation.
R2 is the “percentage of the variation explained by the regression line”, aka the
“Coefficient of determination”, see: https://www.youtube.com/watch... and
https://stats.stackexchange.com/... . Note - when training models R2 can be
negative, but it is always upper-bounded by +1.0 (no concrete negative bound,
probably could be expressed as a function). Now why is that the case? See:
https://datascience.stackexchange.com/questions... and
https://stats.stackexchange.com/questions/… .. but in short: R2 is negative
when the model predicts an opposite trendline to the one the actual data follows.
We will cover learning in more depth later on - for now just know that there are
many learning mechanisms: OLS, Gradient Descent, Newton's method.
Very important to distinguish between Problem > Model > Learning Algorithm. The
problem is predicting the price of a flat given its size. A linear model can be
chosen. OLS / SGD / etc. can fit the data for that model.
Additional resources. Ref: https://www.youtube.com/watch?v=PaFPbb66DxQ ; Ref:
https://www.youtube.com/watch?v=nk2CQITm_eo ; Ref:
https://towardsdatascience.com/linear-regression-explained-1b36f97b7572
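A minimal univariate linear regression sketch tying these pieces together (the data is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data roughly following y = 2x + 1 with noise
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

lr = LinearRegression().fit(X, y)     # OLS learning of the line's parameters
print(lr.coef_[0], lr.intercept_)     # learned slope m and intercept b

y_hat = lr.predict(X)                 # ŷ for every x
print(mean_squared_error(y, y_hat))   # MSE: mean of squared (y - ŷ)
print(r2_score(y, y_hat))             # R2, upper-bounded by +1.0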
Python Crash Course
Bias-variance tradeoff
A model is said to have high bias when the model is in principle unable to learn
the underlying relationship due to a lack of learning power.
Low bias - fewer assumptions about the form of the target function, model is
flexible / powerful.
High bias - more assumptions about the form of the target function, model is rigid.
A model is said to have high variance when the model does not generalize because it
is too highly dependent on the data it has seen. High flexibility of the model is
due to its ability to remember / its complex internal state.
Low variance - only small changes to the estimate of the target function with
changes to the training dataset.
High variance - large changes to the estimate of the target function with changes
to the training dataset.
All models can be categorized by their intrinsic mechanism as low or high
bias/variance. Examples
low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors
and Support Vector Machines.
high-bias machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
low-variance machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
high-variance machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
Important heuristic: high bias implies low variance synon. w/ underfitting, high
variance implies low bias, synon. w/ overfitting.

Fundamentally: “Increasing the bias will decrease the variance. Increasing the
variance will decrease the bias.” - if you make the model more powerful (like using
neural networks for regression problems and adding more and more neurons) or use an
algorithm that is more powerful due to its design, it will increase variance. In
other words: the larger the number of parameters, the larger the search space for
optimization to get lost in.
How do you increase the bias or variance? Two ways: by choosing models and by
tuning them (random forest - deeper trees - higher variance).

Ref: https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-
trade-off-in-machine-learning/
Python Crash Course
Regression regularized
Many types of algorithms to solve regression problems - ridge, lasso, support
vector regression and so on https://www.jigsawacademy.com/blogs/data...
Some of them are algorithms that add regularization parameters to classical
regression models.
Why do we need it? In general regularization is performed to increase the
performance of the model mostly by constraining it - so avoiding overfitting /
reducing variance.
There are 3 notable examples that use regularization for linear regression problem:
Ridge regression - adds an additional term to the error function for linear
regression. Most suitable when a data set contains a higher number of predictor
variables than the number of observations or there is collinearity between the
features (multicollinearity). See: https://www.youtube.com/watch?v=Q81RR3yKn30 and
https://stats.stackexchange.com/questions... and
https://satishgunjal.com/univariate_lr_scikit/
Lasso regression - also adds a regularization term, but a different one. It tends
to eliminate the weights of the least important features (i.e., set them to zero),
so use it for multivariate problems with many useless features (could we test this
using a synthetic MWE?)
Elastic net regression - combines the techniques from lasso and ridge. Elastic Net
is a middle ground between Ridge Regression and Lasso Regression. The
regularization term is a simple mix of both Ridge and Lasso’s regularization terms,
and you can control the mix ratio r.
Note: these regularization parameters can be added to univariate (although not
needed), multivariate, multioutput and other types of regression we will see later
on. You can have Polynomial Ridge Regression and Polynomial Multioutput Elastic Net
Regression, just like you can have a red bus and a red supercar.
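A minimal sketch of the Lasso-zeroing claim above, on synthetic data (all parameters are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data with many useless features - a setting where Lasso should shine
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # l1_ratio is the mix ratio r

print((lasso.coef_ == 0).sum())   # Lasso zeroes out many of the useless weights
print((ridge.coef_ == 0).sum())   # Ridge shrinks coefficients but rarely zeroes them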
Exercise: think about what generated dataset would prove the effectiveness of ridge
and lasso regression (compared to regular and against each other) - generate that
dataset and compare it with unregularized linear regression.
Exercise: how would you go about disproving the statement that you need more
than 50 samples?
Python Crash Course
Regression regularized
Exercise: implement the full ML flow (problem → analysis → model → …) using:
student grades dataset
a linear regression model with OLS learning
and this ref: https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
( https://archive.ph/ei77a )
if the data link is dead use this:
https://github.com/MindaugasBernatavicius/DeepLearningCourse/blob/master/
04_ML_Intro/student_scores.csv

Python Crash Course


AutoML
Def.: automated model tuning.
Ref: https://www.automl.org/automl/
Google probably the leading company offering AutoML solutions in the cloud (now
Vertex AI platform): https://cloud.google.com/automl/docs
AutoML Natural Language – enables users to build models for tasks like text
classification.
AutoML Tables – enables users to build models for tasks such as structured data
prediction.
AutoML Video – allows users to build models for tasks like video classification.
IBM SPSS auto-classifiers.
More?

Python Crash Course


ML as a service
Sometimes abbreviated as MLaaS.
Main players: Google with GCP and google collab notebooks and APIs (cloud vision,
cloud translation), Microsoft with Azure ML, Amazon with AWS ML.
Should we use them? This is a question of productivity and what employers are using
(for example in my current company we use AWS to host notebooks, so training is on
AWS), a question of control, a question of cost, and a question of how standard the
problem you are trying to solve is.
Most of them you can try for free. Azure used to be accessible only with non-prepaid
cards like credit cards (until 2021), but now you can access it with just a debit
card.
How is this useful? In many ways:
Use these services for learning - what kind of models exist, how can I quickly
configure and test them?
Use them for experimentation (trying out new ideas) and fast prototyping and
initial benchmarking.
Use these to cross-validate the results achieved by a model implemented in Python,
or for fast prototyping.
Squark: https://squarkai.com/squark/
… and others.
Python Crash Course
ML as a service
[DEPRECATED] Demo: studio.azureml.net
Load the data
Visualize
Split the data
… and so on.
… finally check the “Coefficient of determination” in the “Evaluate model” cell.
This value is the R^2 which needs to be as close to 1 as possible.
See: https://www.youtube.com/watch?v=csFDLUYnq4w
Demo3: https://youtu.be/8aMzR8iaB9s?t=299

Note, the model will probably be very inaccurate.


Observe that the encoding of string values - transforming them into numbers - is
not too good: the “Feature Hashing” node and “Convert to Indicator Values” node
create a lot of columns, increasing the feature space (when “Convert to Indicator
Values” produces an error about the “Month” column not being allowed, just mark it
as categorical using another “edit metadata” step).

Python Crash Course


ML as a service
Demo: https://portal.azure.com/
Create a microsoft account
Create a subscription ($200 of credits; if they are used up, create a pay-as-you-go
subscription)
Create a new resource group and workspace
Go to Azure AI | Machine Learning Studio after it is created
Go to designer
Import dataset (needs to be pretty clean)
Add Split Data, ML Model, Train Model, Score Model, Evaluate Model nodes.
Submit the pipeline for running (it will ask you to create compute resources if
there are none)
Try other models
You can deploy them
After you are done, don’t forget to delete the resources (cancel subscription).

See: https://youtu.be/8aMzR8iaB9s?t=299
Python Crash Course
Data Science Process Frameworks
We talked about the data science pipeline before. There is a formalized, somewhat
established (although not hugely recognized in the industry) version of it in the
data-mining field: CRISP-DM (CRoss Industry Standard Process for Data Mining):
https://www.datascience-pm.com/crisp-dm-2/
Many have recognized that both the DS / ML pipeline and CRISP-DM processes do not
take into account the different people / roles in the organization wrt/ the
chronological process of doing industrial data science - for example data storage
vs. modeling - does a single role just do everything?
Microsoft offered their solution called TDSP - Team Data Science Process. It
offers:
A data science lifecycle definition
A standardized project structure
Recommended infrastructure and resources
Recommended tools and utilities
Tasks and responsibilities for each project role
Ref: https://www.datascience-pm.com/tdsp/ and
https://docs.microsoft.com/en-us/azure/architecture/data-science-process/overview
This is for project management / for project managers.

Python Crash Course


ML Certification
ML certifications are generally tool/framework based - not general knowledge /
concepts based.
There are a lot of IaaS, AIaaS / MLaaS providers that offer certifications for
their platforms - these are more general usually.
See the following certifications from the MLaaS provider we mentioned before:
https://cloud.google.com/certification/machine-learning-engineer
https://aws.amazon.com/certification/certified-machine-learning-specialty/
(probably recommended)
https://docs.microsoft.com/en-us/learn/certifications/browse/?terms=data%20science
https://learn.microsoft.com/en-us/credentials/applied-skills/train-and-deploy-a-
machine-learning-model-with-azure-machine-learning/

The closest to the general certification:


https://www.coursera.org/specializations/machine-learning-introduction (earlier:
low abstraction level, Octave, not python, v2 (2022) Python based)
https://www.coursera.org/specializations/deep-learning (medium abstraction level,
python based)
These would be the most recommended after our course!

Discussions about this topic to gain more knowledge


https://www.quora.com/What-are-best-machine-learning-certifications-available
https://towardsdatascience.com/6-machine-learning-certificates-to-pursue-in-2021-
2070e024ae9d

As a general rule in IT/CS/SE certifications do not provide a strong guarantee of


anything (some exceptions: Networking, IT Security, Cloud Tech, QA-Testing, Java,
C#). An established record of work and projects is important to get the job and
word-of-mouth and professional network is somewhat important to get into the
highest positions (CTO and such). However, if you can’t get a desired position
easily with your projects, you might consider using the certification path as a
“plan B (or C)”. Additionally, they provide a good study plan for those of you who
don’t have projects you are interested in.
Python Crash Course
MWE
When learning ML/DL it is very important to be able to create what can be called
MWEs - minimal working examples.
These are examples you can use for experimentation to gain an intuitive
understanding of the concepts being learned. +Progression - make them more and more
complex and observe how model tuning helps.
MWEs - complete working examples of reduced complexity designed with intended goal
of being simple to launch and play around with, more about it:
https://en.wikipedia.org/wiki/Minimal_working_example
To create MWEs we need data. How to manage that?
Either keep a small dataset that is simple in git somewhere.
Or we learn to generate a simple dataset (with noise injection) “manually” or using
scikit make_regression() function.
Or we use inbuilt datasets (many frameworks have them):
https://scikit-learn.org/stable/datasets.html (note, there is a famous boston
housing dataset, which will be removed:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
)
Examples:
MWE for K-fold cross validation
MWE for test/train split
MWE for univariate linear regression
MWE for data gen, like scikit make_regression()
MWE for multivariate linear regression…
… and so on, for each model (maybe? You decide)
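For instance, a minimal data-generation MWE (all parameters are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate a simple dataset with noise injection
X, y = make_regression(n_samples=200, n_features=1, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R2 on unseen data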
Python Crash Course
Glossary
Important terms and differentiation of concepts:
Bias vs. Variance - high bias: assumes a lot, usually oversimplified, and is
synonymous with underfitting (linear regression); high variance: is synonymous
with overfitting, the model being more powerful and having more internal power to
represent complex arrangements (NN).
Model vs. Algorithm - learning algorithm + data + training time ⇒ model, model +
unseen data ⇒ predictions
Regression - a problem type, not an algorithm, not a model (even though there is a
common phrase “develop regression models”).
Classification - another problem type, where a given datapoint is assigned to a
specific category or class based on its properties.
Regularized Regression Algorithms - OLS + regularization. Use Ridge when you have
more columns than rows and multicollinearity, Lasso when you suspect most features
are irrelevant.
AutoML - tools and processes automating machine learning work, like automatically
tuning hyperparameters.
… and others … K-Fold CV, R^2 score, MSE.
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1yxTMK4dGMbWSFZZueCNL_9BZDmVy7-u7.pptx ---


Artificial Intelligence
Natural Language Processing
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Using pre-trained word embeddings
01
02
Word2Vec
Natural Language Processing
00
Word Embeddings
GloVe
03
04
Choosing between embeddings
Bidirectional RNNs
06
07
Recurrent dropout
05
Classification with Keras
08
Further explorations
Today we have 3 new, important things to discuss:
Pretrained embeddings
Bidirectional RNN
Recurrent dropout
Intro
Natural Language Processing
With shallow techniques we can use one-hot encoding, count vectorization or tf-idf
to vectorize language data. With deep techniques, like neural networks, it's easy
and recommended to use word embeddings. Keras has an implementation of a trainable
Embedding layer. Importantly: embeddings have the magical quality of learning
geometric distances between words/concepts. This is a worked out example:
https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-
work

The Embedding layer is best understood as a dictionary mapping integer indices


(which stand for specific words) to dense vectors. It takes as input integers, it
looks up these integers into an internal dictionary, and it returns the associated
vectors. It's effectively a dictionary lookup.
The Embedding layer takes as input a 2D tensor of integers, of shape (samples,
sequence_length), where each entry is a sequence of integers. It can embed
sequences of variable lengths, so for instance we could feed into our embedding
layer above batches that could have shapes (32, 10) (batch of 32 sequences of
length 10) or (64, 15) (batch of 64 sequences of length 15). All sequences in a
batch must have the same length, though (since we need to pack them into a single
tensor), so sequences that are shorter than others should be padded with zeros, and
sequences that are longer should be truncated.
Word Embeddings
Natural Language Processing
This layer returns a 3D floating point tensor, of shape (samples, sequence_length,
embedding_dimensionality).
samples ... how many sequences we have
sequence_length ... how many samples we have in each sequence
embedding_dimensionality ... how long is the embedding vector for each word
So input (samples, sequence_length) → (samples, sequence_length,
embedding_dimensionality)
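A minimal shape sketch (vocabulary size and dimensions are illustrative):

import numpy as np
from tensorflow.keras.layers import Embedding

emb = Embedding(input_dim=1000, output_dim=8)       # 1000-token vocabulary, 8-dim vectors
batch = np.random.randint(0, 1000, size=(32, 10))   # (samples, sequence_length)
print(emb(batch).shape)                             # -> (32, 10, 8)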

Such a 3D tensor can then be processed by a RNN layer or a 1D convolution layer (or
just a Dense layer after Flatten)
When you instantiate an Embedding layer, its weights (its internal dictionary of
vectors) are initially random, just like with any other layer. During training,
these word vectors will be gradually adjusted via backpropagation, structuring the
space into something that the downstream model can exploit. Once fully trained,
your embedding space will show a lot of structure -- a kind of structure
specialized for the specific problem you were training your model for (problem
specificity).
For some nice visuals see this: https://medium.com/@kadircanercetin/intuitive-
understanding-of-word-embeddings-with-keras-6435fe92a57b
Word Embeddings
Natural Language Processing
Assuming all values in the initial matrix are unique, so the dict size is 10.
Sometimes, you have so little training data available that you could never use your
data alone to learn an appropriate task-specific embedding of your vocabulary. What
to do then?
Instead of learning word embeddings jointly with the problem you want to solve, you
could be loading embedding vectors from a pre-computed embedding space known to be
highly structured and to exhibit useful properties -- that captures generic aspects
of language structure. The rationale behind using pre-trained word embeddings in
natural language processing is very much the same as for using pre-trained convnets
in image classification (same as transfer learning): we don't have enough data
available to learn truly powerful features on our own, but we expect the features
that we need to be fairly generic, i.e. common visual features or semantic
features. In this case it makes sense to reuse features learned on a different
problem.
Important reading material:
https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469 (I recommend
implementing this as a homework)
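A minimal loading sketch, assuming embedding_matrix has been filled from a downloaded GloVe file (vocab_size, dim and the file name are illustrative):

import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

# embedding_matrix is assumed to be filled row-by-row from a downloaded GloVe
# file (e.g. glove.6B.100d.txt), one row per token in your vocabulary
vocab_size, dim = 10000, 100
embedding_matrix = np.zeros((vocab_size, dim))   # ... parse the GloVe file into this

emb = Embedding(vocab_size, dim,
                embeddings_initializer=Constant(embedding_matrix),
                trainable=False)                 # freeze the pre-trained vectors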
Using pre-trained word embeddings
Natural Language Processing
Such word embeddings are generally computed using word prediction based on a context
window (“shallow window” - word2vec) and/or word co-occurrence statistics
(observations about which words co-occur in sentences or documents - GloVe), using a
variety of techniques, some involving neural networks, others not.
The idea of a dense, low-dimensional embedding space for words, computed in an
unsupervised way, was initially explored by Bengio et al. in the early 2000s, but
it only started really taking off in research and industry applications after the
release of one of the most famous and successful word embedding schemes: the
Word2Vec algorithm, developed by Tomas Mikolov at Google in 2013. Word2Vec dimensions
capture specific semantic properties, e.g. gender.
Additional explanations:
(Using word embeddings, Andrew Ng) https://www.youtube.com/watch?v=ARIqkgvYUbY -
(Learning word embeddings, Andrew Ng) https://www.youtube.com/watch?v=yXV_Torwzyc -
intro to concept of context and training of embeddings with different contexts
(last 4 words vs. surrounding 4 words, last 1 word) and with different “problems”
(classification / prediction).
(Word2Vec, Andrew Ng) https://www.youtube.com/watch?v=3eoX_waysy4 - skipgram
selection based word2vec DNN (embeddings + softmax layer) using nearby 1 word.
Paper: https://arxiv.org/abs/1301.3781 and code:
https://code.google.com/archive/p/word2vec/
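A minimal sketch of training Word2Vec with gensim (not from the slides; the toy sentences and hyperparameters are purely illustrative):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["cats", "chase", "dogs"]]
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
print(model.wv["cat"])               # the learned 16-dimensional vector
print(model.wv.most_similar("cat"))  # nearest words in the embedding space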
Word2Vec
Natural Language Processing
The whole playlist: https://www.youtube.com/playlist?list=PLhWB2ZsrULv-
wEM8JDKA1zk8_2Lc88I-s
Word2Vec
Natural Language Processing
There are various pre-computed databases of word embeddings that we can download
and start using in a Keras Embedding layer. Word2Vec is one of them. Another
popular one is called "GloVe", developed by Stanford researchers in 2014. It stands
for "Global Vectors for Word Representation", and it is an embedding technique
based word co-occurrence statistics. Its developers have made available pre-
computed embeddings for millions of English tokens, obtained from Wikipedia data /
Common Crawl data.
Additional explanations:
Video: https://www.youtube.com/watch?v=InCWrgrUJT8
(GloVe, Andrew Ng): https://www.youtube.com/watch?v=EHXqgQNu-Iw
Original paper: https://nlp.stanford.edu/projects/glove/
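A sketch of loading pre-trained GloVe vectors into a frozen Keras Embedding layer, under two assumptions: you downloaded and unzipped glove.6B.100d.txt from the Stanford page, and word_index below is a hypothetical toy vocabulary (in practice it would come from a Tokenizer):

import numpy as np
import tensorflow as tf

embedding_dim = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        embeddings_index[word] = np.asarray(coefs.split(), dtype="float32")

word_index = {"movie": 1, "great": 2}                     # hypothetical vocabulary
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec                         # unknown words stay all-zeros

embedding_layer = tf.keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    trainable=False,                                      # freeze the pre-trained vectors
)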
GloVe
Natural Language Processing
We have a few choices
Whether to train our embeddings or use pre-trained ones
If we choose pre-trained ones - which ones (word2vec / glove / other)?
Word2Vec vs. GloVe
The choice is mostly empirically derived - try and see.
There are some tasks that word2vec is better at and some that GloVe is better at.
Sometimes you can guess based on the data they were trained on.
Importantly all embedding models are evaluated against word analogies dataset,
however it is not necessarily predictive of the performance you will achieve for
your NLP task [needs confirmation].
In general - whichever is bigger will probably require a bigger neural network in
order not to underfit. But neural networks are rarely on the verge of
underfitting, quite the opposite.
Performance reasons (?)
Choosing between embeddings
Natural Language Processing
The bigger decision is whether to train your own embeddings or use pre-trained
ones. Two considerations are in order:
Do you have a lot of data? Remember we said before - if you have enough data you
can train your own embeddings and they will outperform generic ones. Just like with
CNNs and transfer learning - the more data you have, the less you need pretrained
models in the transfer learning models (you can judge if you have insufficient data
by the network underfitting).
Does your task use similar words to the ones that the embedding was trained on? A
very dissimilar reusable embedding will perform much worse than a very similar one.
Can you make them trainable? Yes. You can find that out from the papers. Although
the datasets are so extensive that these embeddings are more like embeddings for
the whole language.
Do LLMs use pretrained embeddings? No, this would limit their ability to learn
task-specific representations, which are learned in an end-to-end training fashion.
Also, using pre-trained embeddings is for brokies, not big companies like Google,
MS or OpenAI.
Summary: after all, the decision on what to use is primarily based on how much
data you have - if it’s a small dataset, just use pretrained ones, otherwise try
all possibilities. If one pretrained embedding does not work better than a custom
one, chances are the others will not work better either.
Lithuanian pretrained embeddings?
Choosing between embeddings
Natural Language Processing
Let’s see how to use custom embeddings and prebuilt embeddings, and see what
accuracy we can achieve with the IMDB dataset (a sketch follows the list below).
You can find some usage examples here:
https://keras.io/examples/nlp/pretrained_word_embeddings/
https://www.tensorflow.org/tutorials/keras/text_classification
https://www.tensorflow.org/tutorials/keras/text_classification_with_hub
https://keras.io/examples/nlp/bidirectional_lstm_imdb/
TODO: fast.ai
TODO: pytorch
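A minimal end-to-end sketch in the spirit of those tutorials (not their exact code; the tiny model and the hyperparameters below are illustrative and untuned):

import tensorflow as tf

vocab_size, maxlen = 10000, 200
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=vocab_size)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),   # trainable, task-specific embeddings
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.2)
print(model.evaluate(x_test, y_test))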
Classification with Keras
Natural Language Processing
Another technique that we will introduce in this section is called "bidirectional
RNNs". A bidirectional RNN is a common RNN variant that can offer higher performance
than a regular RNN on certain tasks. It is frequently used in natural language
processing (translation of full text, classification/sentiment analysis).
RNNs are notably order-dependent, or time-dependent: they process the timesteps of
their input sequences in order, and shuffling or reversing the timesteps can
completely change the representations that the RNN will extract from the sequence.
This is precisely the reason why they perform well on problems where order is
meaningful, such as our temperature forecasting problem. A bidirectional RNN
exploits the order-sensitivity of RNNs: it simply consists of two regular RNNs,
such as the GRU or LSTM layers that you are already familiar with, each processing
the input sequence in one direction (chronologically and antichronologically), then
merging their representations. By processing a sequence both ways, a bidirectional
RNN is able to catch patterns that may have been overlooked by a one-direction RNN.
However, it is not always helpful, so you need to test it on your problems!
Remarkably, the fact that the RNN layers in this section have so far processed
sequences in chronological order (older timesteps first) may have been an arbitrary
decision. At least, it's a decision we made no attempt at questioning so far. Could
our RNNs have performed well enough if they were processing input sequences in
antichronological order, for instance (newer timesteps first)? Let's
see what happens if we reverse the sequences to be in reverse chronological order.
This would be a good task - how much can we learn processing in reverse order.
Bidirectional RNNs
Natural Language Processing
To instantiate a bidirectional RNN (BRNN) in Keras, one would use the Bidirectional
layer, which takes as first argument a recurrent layer instance. Bidirectional will
create a second, separate instance of this recurrent layer, and will use one
instance for processing the input sequences in chronological order and the other
instance for processing the input sequences in reversed order.
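A minimal sketch of that wiring (an illustrative toy model, assuming TensorFlow/Keras):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 32),
    # Bidirectional creates a second, reversed copy of the wrapped LSTM and merges both.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
dummy = np.random.randint(0, 10000, size=(2, 20))   # (samples, sequence_length)
print(model(dummy).shape)                           # (2, 1)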

(TODO: find replacement) Explanation (Andrew Ng): https://www.youtube.com/watch?v=bTXGpATdKRY
Bidirectional RNNs
Natural Language Processing
We have two dropout regularization (i.e.: helps avoid overfitting) techniques for
RNNs: regular dropout and recurrent dropout, see:
https://stackoverflow.com/questions/44924690/keras-the-difference-between-lstm-
dropout-and-lstm-recurrent-dropout
You are already familiar with a classic technique for fighting this phenomenon -
dropout - consisting of randomly zeroing-out input units of a layer in order to
break happenstance correlations in the training data that the layer is exposed to.
How to correctly apply dropout in recurrent networks, however, is not a trivial
question. It has long been known that applying dropout before a recurrent layer
hinders learning rather than helping with regularization. In 2015 (while LSTMs
date back to 1997), Yarin Gal, as part of his Ph.D. thesis on Bayesian Deep Learning, determined
the proper way to use dropout with a recurrent network: the same dropout mask (the
same pattern of dropped units) should be applied at every timestep, instead of a
dropout mask that would vary randomly from timestep to timestep.
What's more, in order to regularize the representations formed by the recurrent
gates of layers such as GRU and LSTM, a temporally constant dropout mask should be
applied to the inner recurrent activations of the layer (a "recurrent" dropout
mask). Using the same dropout mask at every timestep allows the network to properly
propagate its learning error through time; a temporally random dropout mask would
instead disrupt this error signal and be harmful to the learning process (so
Dropout() should not be used before an LSTM / GRU / SimpleRNN (only before
Dense())).
Yarin Gal did his research using Keras and helped build this mechanism directly
into Keras recurrent layers.
Every recurrent layer in Keras has two dropout-related arguments (see the sketch below):
dropout, a float specifying the dropout rate for the layer's input units (a
standalone Dropout() layer, by contrast, belongs only before Dense() layers), and
recurrent_dropout, specifying the dropout rate of the recurrent units.
Let's add dropout and recurrent dropout to our GRU/LSTM layer and see how it
impacts overfitting. Because networks being regularized with dropout always take
longer to fully converge, we train our network for twice as many epochs (think
about convergence as "potential to learn more": a neural net that still has the
potential to increase its accuracy is not yet converged).
Advanced detail: recurrent dropout is not implemented in cuDNN kernel (CUDA side,
as of mid 2021) so it will be slower to train on a GPU than CPU when recurrent
dropout is applied (need to test TPU). Ref:
https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM#for_example
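A minimal sketch of the two arguments (the rates below are illustrative, not tuned):

import tensorflow as tf

layer = tf.keras.layers.LSTM(
    32,
    dropout=0.2,            # dropout on the layer's inputs, same mask at every timestep
    recurrent_dropout=0.2,  # temporally constant dropout on the recurrent state
)
# Note: a non-zero recurrent_dropout disables the fast cuDNN kernel on GPU.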
Recurrent dropout
Natural Language Processing
Recurrent batch norm exists in the literature: https://arxiv.org/abs/1603.09025
For keras: https://github.com/keras-team/keras/issues/10193
Recurrent batch norm
Natural Language Processing
Research any new embeddings (improvements over GloVe and Word2Vec), are there now
better ones?
Pytorch/fastai with pretrained embeddings.
Word2Vec + Keras.
Construct an experiment - MWE - proving that using pretrained embeddings is
beneficial (the notebook has the beginning of such project, you can extend it).
Either top accuracy or reach the accuracy faster.
Check what happens when dropout is applied before the intermediate recurrent layer
of DRNN LSTM(dropout=0.2) → LSTM() → Dense() . Is the negative impact noticeable?
Severe?
Try removing html tags (<br>), all the misplaced quotation marks from the training
data and see if the accuracy increases even if the test data has them.
Further explorations
Natural Language Processing
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Natural Language Processing


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1D7jmQV-Ywzs7Hs8m-KUVlvsmXdjjwRyC.pptx ---


Artificial Intelligence
Python Crash Course
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Installation and tool preparation
Python history and versions
01
02
03
04
05
Python as a programming language
Other Interpreters
Python Crash Course
Python interpreter
00
Computer literacy test / discussion
Computer literacy test / discussion

Python Crash Course


To be in any of the data roles, we need to be a proficient user of a computer > a
programmer > ML engineer. Let’s discuss, do you know:
What is a variable?
What is an array / list? (contiguous?)
What can we represent with 3D array? What about 4D? (list of names? GPS
coordinates)
Where are the variables stored during the execution of the program?
Is the computer executing the same code that we wrote?
What is a compiler? Which languages use it? What is an interpreter? Which languages
use it?
Sequential execution / conditional execution / repeating execution (loop) - what
mechanisms in programming language correspond to these terms? Are there more types
of execution? Parallel?
What is the opposite of synchronous? How is it related to programming?
What is the use of comments in code? How much should we comment or should we do it
at all?
What is sorting and what are the fundamental operations that sorting algorithms do?
What is a data structure? Can you give some examples? SQL tables? DAG?
What is the difference between an array and a linked list?
What is a garbage collector (... apart from being a dream job)? Memory collector?
Can a GPU do what CPU does? How about vice versa?
Do you know how to send code snippets in MS teams and create screenshots?
How is your mic, screen sharing capabilities and webcam?
What is a software engineer / developer / ML engineer and what does he/she do?
Code? Problems?
Watch this video as homework for more (Computer Scientist Answers Computer
Questions From Twitter (Wired)): https://www.youtube.com/watch?v=QUNrBEhvXWQ
Installation and tool preparation
Python Crash Course
We will use these tools (and more) throughout the course:
Pycharm (Python, Numpy, Pandas) / VSCode (optional) / Neovim / Other
Google Colab (rest of the course, most frequent), need to have a google account
Jupyter notebooks (we’ll only introduce)
ChatGPT, Claude, Google Gemini, Bing Chat, Github Copilot (peripheral tools)
Installation and tool preparation
Pycharm
IDE (integrated development environment)
Downloading and installing
Launching
Virtual environment - library and interpreter separation for different projects.
Can be used for projects on Python 3.6 while we want to use Python 3.9 globally. It
is the best practice to use them.
Creating new projects
It’s recommended to create new projects in a separate folder
If you make a mistake, just delete the folder, there are no global changes made by
creating new projects (let’s take a look at the project structure and see a demo).
Привет мир / Hello world / Labas Pasauli application.
if __name__ == '__main__': → useful if we have modules, so that a module wouldn’t be
automatically executed when it’s imported:
https://stackoverflow.com/a/419185/1964707 . It is also a good practice to use this
(a sketch follows below). We will talk way more about modules later on.
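A minimal sketch of the guard (main.py is a hypothetical file name):

# main.py
def greet():
    print("Hello world")

if __name__ == '__main__':
    greet()   # runs on `python main.py`, but not on `import main`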
Python Crash Course
Installation and tool preparation
Google Colab
You will need to reach it through this link at first, or just google search:
https://colab.research.google.com/notebooks/basic_features_overview.ipynb
Press “New”
It will create the folder: “Colab Notebooks” in Google Drive
Addon for right-click https://workspace.google.com/u/0/marketplace/...
Colab Pro is available for Lithuania from 2022 (more on that later)

Demo
Create a new notebook
Привет мир / Hello world
Saving your work and launching again
Using shell commands inside Colab (!apt install iputils-ping)
Adding screenshots / heading (hierarchy) for personal notes

Questions from students


Can we install python libraries in Colab? How?
How to downgrade a library version?
What are the advantages of Colab?
Python Crash Course
Installation and tool preparation
Jupyter notebooks
Many people use conda, however there can be some issues on windows installing
nvidia drivers and launching models on gpu.
It is good to use classical Jupyter notebooks that can be installed using pip: ref:
https://jupyter.org/install
If you know Java, C#, PHP, Javascript then pip is similar / same as maven/gradle,
nuget package manager, composer, npm
pip -V → check if pip is already installed
pip install notebook (not JupyterLab) → install classic jupyter notebooks
Launch notebooks → need to go into the root folder where your notebooks are.
jupyter notebook
python -m notebook
Hotkeys:
Esc
DEMO: exporting a Google Colab notebook and launching it on our machine:
We will need to install libs
We will need to use the shell appropriate to our OS! So bash commands might not
work!
Images are not importable from Jupyter to Colab
DEMO: local notebook to Google Colab (local → cloud)
Python Crash Course
Python as a programming language
Python Crash Course
Main things:
February 20, 1991
Python and Monty Python?
Ref: https://www.javatpoint.com/python-tutorial
Ref: https://www.javatpoint.com/python-features

Python advantages:
Python (Very HLL) vs. Java / C# / C++ (HLL)
Does not use c-like syntax - quite unique, whitespace sensitive / significant
whitespace
Easy to learn the basics and commonly praised for developer productivity.
Runtime speed vs. writetime speed.
Interpreted language, but: Link1 and https://i.imgur.com/PJME67T.png
Dynamically typed
GC’ed - no explicit allocation / deallocation (of course we can’t accumulate data
ad infinitum, so leaks are still possible)
Multiparadigm: procedural, object oriented, with functional mechanisms.
Suitable for scripts / automation, data processing, GUI programming, simple games,
web applications, AI, security.
Popularity: google trends / tiobe / github stars / polls (you can see yourself at
dirbkit.lt (cvbankas) / linkedIn). List of companies for ML/DS.
Python has gone the “batteries included” path - a large standard library (time, os,
sys) and a giant pypi (python package index) package system
Python as a programming language
Python Crash Course
Disadvantages:
Not widely used in mobile, desktop realms.
Interpreted languages are often slower than compiled ones (remember: languages are
not slow or fast - runtimes are).
Unique syntax - means more difficult to switch to other languages.
Python as a programming language
Python Crash Course
Reports:
Python as a programming language
Python Crash Course
Reports:
Python history and versions
Main things:
Guido van Rossum, BDFL
PEP (python enhancement proposal), PEP8 and PEP20 (PEP 20, we can see it with import
this). Root page: https://peps.python.org/
PEPs are “community driven” - evolve by community effort (Microsoft has contributed
and others)
Pypi index: https://pypi.org/
2nd version, 2.7 not supported from 2020.
https://en.wikipedia.org/wiki/History_of_Python
Which version will we use?

Python3 vs. Python2:


print(“A”) vs. print “A”
unicode support
there is a lot of Python2 code. And it will exist for many years!
Python Crash Course
Python interpreter
Python Crash Course
The Python interpreter can be accessed from the command line - an interactive
programming language!
python -V → version
python / py → REPL (read eval print loop)
ctrl + Z → exit
python -c “” → one-liners and piping to Linux shell commands.

We can launch the source code file!


notepad
save
run
Other Interpreters
4 main implementations:
CPython: It can be considered the “reference implementation” of the language.
CPython is a compiler, interpreter, and set of built-in and optional extension
modules, all coded in standard C.
Jython: python on JVM.
IronPython: python on CLR.
Pypy: python on python, performant, but not as popular.
… (honorable mention) Micropython: https://micropython.org/

Which to use:
CPython
but if you need speed - Pypy (of course you need to benchmark whether your
application will see speed improvements).
Well and if you have a lot of Java / C# code or you know a good library and want to
call it from Python code then the other options.
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1yaHEZojmi3CcDDAZIItjSC1oekxTMflV.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
DDL
DQL
01
02
03
DML
Python Crash Course
00
SQL
05
Joins
06
Python database connectivity
07
08
04
Grouping and aggregation
Relationships and data modeling
BONUS: mongodb, neo4j, apache spark
09
Recommended resources
SQL
Structured Query Language is a very popular language for querying information from a
relational database - an RDBMS.
RDBMS is an acronym for relational database management system. It refers to common
software products like: MS SQL server, Mysql, Oracle 18c, PostgreSQL, etc.
An RDBMS is also synonymous with the term “database server”.
An RDBMS is a highly optimized software product for the purpose of writing,
storing, retrieving data in addition to controlling access to it and many other
operations (vs. flat files).
Each installation of RDBMS can contain many databases which can be defined as a set
of logically related tables, which in turn are composed of rows and columns.
Each column represents a quality and each row a particular entity of some specific
type (like: Person, Employee, Item, Invoice, etc). That way in each cell we store
the values of a particular quality for that entity (Name of Person).
R in RDBMS comes from the relational model (Codd, 1969-1970) where data is described
as sets of tuples arranged in a table-like structure, where the tables themselves
can have relations.
NoSQL trend refers to databases that use not-only SQL - non relational databases,
commonly schema-less, like: MongoDB, Redis and such.
There are 4 main categories of NoSQL databases in total (key-value, document, wide-column, graph) + time series.
Ref: https://db-engines.com/en/ranking and
https://db-engines.com/en/ranking_trend . Check this video for more:
https://www.youtube.com/watch?v=W2Z7fbCLSTw
Importance of SQL for dataroles: https://www.google.com/search?... and
https://www.reddit.com/r/SQL/... and
https://www.youtube.com/@thedatajanitor9537/videos
Python Crash Course
SQL
We have 4 groups of SQL statements:
DDL – Data Definition Language (schema descr.)
DQL – Data Query Language (the important thing)
DML – Data Manipulation Language
DCL – Data Control Language
We are going to use MySQL to learn SQL. This is the most popular open source
database in the world
We need to install MySQL: https://dev.mysql.com/downloads/installer/
Also we will install MySQL Workbench to be able to write and execute queries
interactively against MySQL: https://dev.mysql.com/downloads/workbench/
If you don’t want to install anything just use: http://sqlfiddle.com/
Create a new database → table → record (and reverse these actions).
Define the columns, datatypes: https://www.w3schools.com/sql/sql_datatypes.asp . We
will need:
VARCHAR(?)
INT/BIGINT (signed/unsigned)
FLOAT/DOUBLE
DATE/DATETIME
Question: we want to save a phone number - which datatype should we use? How about:
isdn, barcode, mac address, social security number, etc.
We can potentially import more data from CSV. Use online generator to generate it:
https://extendsclass.com/csv-generator.html . Exercise.
Python Crash Course
SQL
Python Crash Course
SQL
Python Crash Course
DDL
CREATE – to create database and its objects like (database, table, index, views,
stored procedure, function and triggers)
ALTER – alters the structure of the existing database
DROP – delete objects from the database
There are more, but these are the important ones.
TRUNCATE – removes all records from a table (the space allocated for the records is
also freed), but not the table itself
COMMENT – add comments to the data dictionary
RENAME – rename an object
Python Crash Course
DDL
Explanation:
MySQL datatypes: https://www.mysqltutorial.org/mysql-data-types.aspx important
ones: INT, DOUBLE, VARCHAR, DATETIME
NOT NULL - Each row must contain a value for that column, null values are not
allowed (on insert).
DEFAULT value - Set a default value that is added when no other value is passed.
UNSIGNED - Used for number types, limits the stored data to positive numbers and
zero.
AUTO_INCREMENT - MySQL automatically increases the value of the field by 1 each
time new record added (on insert).
PRIMARY KEY - (pirminis raktas) Used to uniquely identify the rows in a table. The
column with PRIMARY KEY setting is often an ID number, and is often used with
AUTO_INCREMENT. Primary key = unique + not null.
DEFAULT CURRENT_TIMESTAMP → current time will be inserted if the caller does not
provide any value.
Python Crash Course
CREATE DATABASE Test; CREATE DATABASE `Test Database`;

USE test3; -- prevents the “no database selected” error (-- starts a comment)

CREATE TABLE employees (id int); -- one of the simplest possible create statements

CREATE TABLE MyGuests (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    firstname VARCHAR(30) NOT NULL,
    lastname VARCHAR(30) NOT NULL,
    email VARCHAR(50) UNIQUE,
    update_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
DDL
Python Crash Course
ALTER TABLE table_name ADD column_name datatype;

ALTER TABLE table_name DROP COLUMN <column_name>;

ALTER TABLE Persons MODIFY COLUMN <column_name> <datatype>;

ALTER TABLE testalter_tbl RENAME TO alter_tbl;

ALTER TABLE Customer CHANGE <column_name> <column_name_renamed> <new_datatype>;

DROP TABLE [IF EXISTS] <table_name>; -- with IF EXISTS a missing table only issues
a warning; without it, dropping a missing table is an error.
DROP DATABASE [IF EXISTS] <db_name>;
INSERT - add data to the table.
Skip columns when there is no NOT NULL constraint
We don’t need to specify all the columns if we insert all of them - specify only
when not all columns are inserted (positional binding).
The order of where the value will be inserted is determined by the column name
( `surname`, `salary`, `name`) (binding by name)

UPDATE - update data in the database.
Simple update: UPDATE <TABLE> SET <COL> = XYZ
Multiple column update: UPDATE <TABLE> SET <COL1> = X2, <COL2> = X2
Update with filter: UPDATE <TABLE> SET <COL1> = X2, <COL2> = X2 WHERE COL2 > X
Update with a filter on the updated column
Update acting on the same column (SET age = age + 1)

DELETE - delete data.

There are more, but these are the most important ones:
MERGE - UPSERT, REPLACE, CALL, LOCK / UNLOCK table.

Exercise:
Create table Employee (id, name, surname, salary).
Add some data to the table using INSERT statement.
Update all rows that make less than salary X to add 10%
Create two rows with same information (just the id is different) - try to delete
it.
DML
Python Crash Course
INSERT INTO Guests
(`id`, `firstname`, `lastname`, `email`)
VALUES
(1, "Mindaugas", "Bern", "[email protected]"),
(2, "Jonas", "Kur", "[email protected]"),
(3, "Petras", "Per", "[email protected]"),
(4, "Juozas", "Ten", "[email protected]");

INSERT INTO Guests (`firstname`, `lastname`)
VALUES ("Ponas", "Krabas");

UPDATE Guests SET Lastname = 'N/A';

UPDATE Guests SET
    Firstname = CONCAT(Firstname, "A"),
    Lastname = CONCAT(Lastname, "A")
WHERE id >= 10;

UPDATE users SET
    -- email = CONCAT('fuitcake2', '@yahoo.com'),
    lastname = CONCAT('>', UUID_SHORT())
WHERE id > 2;

DELETE FROM Guests WHERE Firstname = 'Jonas';


DQL
Many queries and clauses:
SELECT
ORDER BY
LIMIT
WHERE
JOIN (later)
UNION
SUBSELECT (later)

SELECT used for retrieving records from one or more database tables.
Select “A” → simplest select (for checking connectivity)
Select * → select all columns
Select coll_1, coll_2 → only certain columns
Select coll_2, coll_1 → change the order of columns at display time
Select count(*) | count(id) → combined with function count() it can tell you how
many items are in the table
Select x AS y from … → alias, useful for renaming columns in the result set
Limit and Offset → select only a portion of data (from, how many)
Where → can be combined with multiple operators which then can be
chained using AND / OR operators
Python Crash Course
SELECT * FROM test.guests;
SELECT count(id) AS `#Guests` FROM test.guests;
SELECT * FROM test.guests Limit 2, 4;
SELECT * FROM test.guests Limit 4 OFFSET 2;

SELECT * FROM test.guests


WHERE firstname = 'Mindaugas'
AND email = '[email protected]';

SELECT * FROM test.guests ORDER BY email, firstname;


SELECT * FROM test.guests ORDER BY email DESC, firstname DESC;
SELECT * FROM test.guests ORDER BY firstname;
SELECT * FROM test.guests ORDER BY email;
Grouping and aggregation
Python Crash Course
Grouping squashes the rows that have the same value at a particular column.
Grouping is useful for calculating statistics - for example the average male salary
by department.
Common statements:
GROUP BY - clause used to achieve grouping
HAVING - filter by any aggregated column or combo of them: HAVING std < 50
Math functions: COUNT(), MIN(), MAX(), SUM(), AVERAGE(), STD().
Non-math functions: GROUP_CONCAT(), SUBSTRING(), CONCAT(), CONCAT_WS()
GROUP_CONCAT - concatenates values of a column within each group; CONCAT_WS -
concatenates two or more columns with a separator.
SELECT count(id) as count_at_update_time, update_date
FROM test.guests GROUP BY update_date;

SELECT count(id) as count_at_update_time, update_date
FROM test.guests GROUP BY update_date
HAVING count(id) > 2;

CREATE TABLE Employee (
    id INT,
    name VARCHAR(255),
    surname VARCHAR(255),
    department_name VARCHAR(255),
    salary INT
);

INSERT INTO Employee (`id`, `name`, `surname`, `department_name`, `salary`)
VALUES (1, "Mindaugas", "Bern", 'IT', 500),
(2, "Jonas", "Kur", 'Marketing', 1500),
(3, "Petras", "Per", 'IT', 500),
(4, "Juozas", "Ten", 'Marketing', 100),
(5, "Mindaugas", "Xyz", 'IT', 550);

SELECT MIN(salary), MAX(salary), AVG(salary), SUM(salary) / COUNT(salary) FROM test.employee;

SELECT
    MIN(salary),
    MAX(salary) / MIN(salary), -- by how many times does the best earner out-earn the worst earner
    MAX(salary),
    AVG(salary),
    SUM(salary) / COUNT(salary),
    STD(salary),
    department_name
FROM test.employee
GROUP BY department_name;

SELECT CONCAT_WS(" , ", name, surname) as fullname, salary FROM Employee;

SELECT GROUP_CONCAT(`name` SEPARATOR "; ") as `dep_staff`, department_name
FROM Employee GROUP BY department_name;

SELECT
    GROUP_CONCAT(CONCAT_WS(" ",
        name,
        CONCAT_WS(".",
            SUBSTRING(name, 1, 1),
            SUBSTRING(surname, 1, 1)),
        surname)
    ) as fullname,
    salary, department_name
FROM test.employee
GROUP BY department_name;
Grouping and aggregation
Exercise 1:
create a table called Cities with the following columns: id, country_name,
country_code, city, number_of_residents_in_city
calculate the total number of people in each country
calculate the average city size in terms of number of people in each country
filter out the countries by average size of the city (choose the cut off point)
order by total number of people in a country
** use single select statement (for each task mentioned before)

Exercise 2:
Construct the postcode from first two letters of the city name + “-” + current post
code, i.e.: city-post_code_number. E.g.: NE-AX8485 (New York-AX8485)
Display all postcodes for a country separated by comma and space.
Python Crash Course
CREATE TABLE cities (
id INT PRIMARY KEY AUTO_INCREMENT,
country_name VARCHAR(255) NOT NULL,
country_code VARCHAR(10) NOT NULL,
city VARCHAR(255) NOT NULL,
post_code_number VARCHAR(10) NOT NULL, -- codes like 'AX8485' are alphanumeric, so not INT
number_of_residents_in_city INT NOT NULL
);

SELECT * FROM cities;

INSERT INTO cities (id, country_name, country_code, city, post_code_number, number_of_residents_in_city)
VALUES (1, 'United States', 'US', 'New York', 'AX8485', 8336697),
(2, 'United States', 'US', 'Los Angeles', 'MX845', 3999759),
(3, 'United States', 'US', 'Chicago', 'AC8415', 2716450),
(4, 'Canada', 'CA', 'Toronto', 'XX85', 6301095),
(5, 'Canada', 'CA', 'Montreal', 'CAX885', 1704694),
(6, 'Canada', 'CA', 'Vancouver', 'MAX85', 675218),
(7, 'United Kingdom', 'UK', 'London', 'AX9485', 8900000),
(8, 'United Kingdom', 'UK', 'Manchester', 'AX9985', 545500),
(9, 'United Kingdom', 'UK', 'Birmingham', 'AX2385', 1100000),
(10, 'Australia', 'AU', 'Sydney', 'AX8115', 5514000),
(11, 'Australia', 'AU', 'Melbourne', '118485', 4736000),
(12, 'Australia', 'AU', 'Brisbane', '228485', 2189800)
;
Relationships and data modeling
Nouns (domain models), how many tables / relations, table names, columns, datatypes
of columns.
There are 3 types of relationships
one-to-one (1:1) : spouses in western world (single table, self-referential), phone
and client w/o phone history or phone types.
one-to-many (1:M) : customer and orders, phone and client, student mentors and
mentees (1:M but single table, self-referential).
many-to-many (M:M) : books and authors, classes and teachers, categories and
products. A junction table is used.
M:M can express 1:M and 1:1. It is the most flexible.

Relationships are established by primary keys and foreign keys.
PK uniquely identifies a row in a table even if all other values between two rows
are the same.
FK - a column pointing to the primary key of another table, establishing a
referential relation. Also a constraint.
FK is maintained by the “child” table - the many side in the 1:M or M:M
relationships. The child table is the junction table in M:M.
Python Crash Course
Relationships and data modeling
1:M - clients and phone numbers (left as an exercise): create all the tables and
columns necessary to represent 1:M relationship between customers and phones.
Please add a few records when solving this.

M:M - books and authors + foreign key constraint;

CREATE TABLE IF NOT EXISTS Books_Authors (
    book_id INT,
    author_id INT,
    FOREIGN KEY (book_id) REFERENCES Books(id),
    FOREIGN KEY (author_id) REFERENCES Authors(id)
) ENGINE=INNODB;
-- Error Code: 1824. Failed to open the referenced table 'books'

CREATE TABLE IF NOT EXISTS Books (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL
) ENGINE=INNODB;

CREATE TABLE IF NOT EXISTS Authors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
) ENGINE=INNODB;
Python Crash Course
-- Error Code: 1452. Cannot add or update a child row: a foreign key constraint fails
-- INSERT INTO books_authors (`book_id`, `author_id`) VALUES(1, 1);

INSERT INTO books (`id`, `title`)
VALUES (1, 'Durnių laivas');

INSERT INTO authors (`id`, `name`)
VALUES (1, 'V. Landsbergis');

INSERT INTO books_authors (`book_id`, `author_id`)
VALUES(1, 1);
CREATE TABLE Clients (
    id int auto_increment primary key,
    name varchar(30)
);

CREATE TABLE Phones (
    id int auto_increment primary key,
    number varchar(30),
    c_id int,
    FOREIGN KEY (c_id) REFERENCES Clients(id)
);

INSERT INTO Phones (`number`) VALUES ('+3701');
INSERT INTO Phones (`number`, `c_id`) VALUES ('+3702', 3); -- error! no client with id 3 yet
INSERT INTO Clients (`name`) VALUES ('Jonas'), ('Petras');
INSERT INTO Phones (`number`, `c_id`) VALUES ('+3703', 1), ('+3704', 2);
Joins
The JOIN clause helps us connect two or more tables into a resulting set based on
some matching condition (column(s)).
Usually used with the SELECT statement (but it can be used with others, like
UPDATE, to perform cross-table updates).
Most common and most important join types (there are more):
(Inner) Join
Left (outer) join
Right (outer) join
Full (outer) Join
P.S. Table1 is the table specified in the FROM clause, and Table2 in the JOIN
clause.
Python Crash Course
SELECT C.id, C.name, P.number FROM Clients AS C
JOIN Phones AS P ON C.id = P.c_id;

SELECT C.id, C.name, P.number FROM Clients AS C
RIGHT JOIN Phones AS P ON C.id = P.c_id;

SELECT C.id, C.name, P.number FROM Clients AS C
LEFT JOIN Phones AS P ON C.id = P.c_id;

Joins
More Join types
If we were to define joins as a logical construct connecting two or more tables,
this infographic would be valid, but usually we treat joins as a syntactic
concept, not only a logical one.
Python Crash Course
Joins
Exercise:
Create a data model to present a company that provides online training for
programmers, like CodeAcademy.
There should at least be Students and Courses represented (bonus points for
additional domains models represent).
Think about the column types, names, relations, PKs, FKs, etc.
Include students that do not attend any courses.
Write a query that would display full student information including which course(s)
the student attend(s).
* Phone numbers, emails, contacts, teachers can be included into the modeled domain
for additional karma points.
Python Crash Course
Joins
Let’s illustrate all the join types using simple tables:
https://gist.github.com/MindaugasBernatavicius/ac6f3b4583f7d83a64a3e39ea9f15f86
DEMO: Simple joins (inner, left, right, full). Multi joins.
Question: why do we need right and left joins if we can just flip the tables?
Mathematically they are equivalent
… however, if you have a complex join with more than 2 tables and you want to add
additional ones, you will not be able to change which table is Table1 in the
join
Note - we can provide some formulations on tasks that can be handled / solved using
joins (problem solving or example based approach vs. conceptual introduction):
Get all clients that have phone numbers and provide their names and phone numbers.
Get all phones that do not have clients assigned (right join with exclusion)
Get all … TBD
… etc.
Python Crash Course
SELECT
GROUP_CONCAT(authors.id SEPARATOR ', ') as a_id,
GROUP_CONCAT(authors.name SEPARATOR ', ') as authors,
COUNT(authors.id) as '# authors',
books.id as b_id, books.title
FROM authors
JOIN books_authors ON books_authors.author_id = authors.id
JOIN books ON books_authors.book_id = books.id
GROUP BY books.title;
SELECT
    C.id, C.name,
    GROUP_CONCAT(P.number) as `phone numbers`,
    COUNT(P.number) as `# numbers`
FROM Clients AS C
INNER JOIN Phones AS P ON C.id = P.c_id
GROUP BY C.name
ORDER BY `# numbers` DESC; -- backticks, not quotes: ORDER BY '...' would sort by a constant string

Json and relational data


Don’t take these notes as set in stone, but take them into consideration and
research more on json datatype and semi-structured data
Python Crash Course
Order of execution
Common job interview question - given a particular query, how will it be executed?
Another question: should WHERE be executed first or ORDER BY, and why? (WHERE runs before ORDER BY - filtering first leaves fewer rows to sort.)
Python Crash Course
Order of execution
Python Crash Course
Python database connectivity
There are many ways to connect to a database in Python.
Some of the most used ones: database drivers (different for each database, low
level), Pandas SQL capabilities, SQLAlchemy (ORM).
Connecting using databases drivers is pretty simple - simply install the driver and
write the appropriate code to establish a connection.
The python connector: mysql-connector-python

Pandas has read_sql_query() method. More on Pandas SQL capabilities:


https://blog.panoply.io/how-to-read-a-sql-query-into-a-pandas-dataframe
Python Crash Course
from mysql.connector import connect, Error

try:
    with connect(host="localhost", user="root", password="mysql") as connection:
        users_query = ("SELECT Customer.name, Address.City, COUNT(Orders.id) "
                       "FROM joinsexample.Customer "
                       "INNER JOIN joinsexample.Address ON Customer.id = Address.Customer_id "
                       "INNER JOIN joinsexample.Orders ON Customer.id = Orders.Customer_id;")
        with connection.cursor() as cursor:
            cursor.execute("SET sql_mode=(SELECT REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', ''));")
            cursor.execute(users_query)
            result = cursor.fetchall()
            for row in result:
                print(row)
except Error as e:
    print(e)

SET sql_mode = 'STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION,ONLY_FULL_GROUP_BY';
SET sql_mode = 'STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION';

cursor.execute("SET sql_mode=(SELECT CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY'));")

from mysql.connector import connect, Error
import pandas as pd

try:
    with connect(host="localhost", user="root", password="mysql") as connection:
        df = pd.read_sql("SELECT * FROM dvwa.users", con=connection)
        print(df)
except Error as e:
    print(e)
Python database connectivity
SQLAlchemy is an ORM in the Python ecosystem.
An ORM is an “Object relational mapper” and it essentially allows us to query and
manipulate data w/o writing any or very little SQL.
It translates or maps the OOP world of objects with properties and SQL world with
tables rows and columns.
So in short it translates data stored in a database to python objects and vice
versa + allows us to perform CRUD operations.
Ref: https://www.tutorialspoint.com/sqlalchemy/index.htm and
https://towardsdatascience.com/sqlalchemy-python-tutorial-79a577141a91
pip install mysqlclient sqlalchemy
Python Crash Course
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://root:[email protected]/joinsexample', echo=True)
Base = declarative_base()

class Customer(Base):
    __tablename__ = 'customer'
    id = Column(Integer, primary_key=True)
    name = Column(String)

Session = sessionmaker(bind=engine)
result = Session().query(Customer).all()
# print(Session().query(Customer))
for user in result:
    print(user.id, user.name)
Student questions:
How to understand an already fully developed database if you are joining the
project?
There are many tips; one of the best starters I have found:
https://www.youtube.com/watch?v=HubezKbFL7E

What about performance?
One important rule: almost never should you do filtering in Python if you can
delegate it to the database (unless the database load is explicitly being preserved
and data querying/aggregations are moved to the clients). I have had cases where a
query went from 37s to 0.2s (~160x).
Is reading a file faster than reading a database table? Reading from a file, if
made maximally efficient, is faster than reading from a DB, because a DB is an
abstraction over a file.
Python Crash Course
Recommended resources
SQLBolt: https://sqlbolt.com/lesson/
Hacker rank: https://www.hackerrank.com/domains/sql
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1Z6e0Auiz-y3Dq4V0HS69ogrHyN4xDnIc.pptx ---


Artificial Intelligence
Natural Language Processing
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


History of NLP
01
02
Lithuanian?
Natural Language Processing
00
Definition of NLP
Obtaining data
03
04
Preprocessing and mining
Words as numeric data
06
07
Bag of words and Bag of ngrams
05
Auto summarization
08
Further explorations
NLP or natural language processing is a part of artificial intelligence, that deals
with processing natural languages with computers.
Example of natural language: English, Lithuanian.
Non-natural (synthetic) would be programming languages like java, php, python,
synthetic grammars, esperanto.
The peculiarities in natural languages that are hard for computers to deal with
are:
Abstract terms that don't have precise meaning (family, love, nation) - we need to
define them in science and law all the time.
Non-precise meaning, meaning depending on the context (context dependence).
Allegories, metaphors, word-play, etc.
These things make it hard for computers to understand what is being said, but also
give beauty and flexibility to natural languages.

NLP essentially tries to answer questions like: how do we deal with written (text)
and spoken (audio) natural language in an algorithmic, programmable way, allowing
next-level automation based on semantic meaning (like computer vision) to exist and
develop.
If you ever find yourself thinking about a particular sphere of machine learning,
think in a simplified way - it's just automation. What could we automate if
computers were able to understand human languages?
Definition of NLP
Natural Language Processing
Common tasks - an incomplete list of NLP applications:
Giving commands to robots / computers (Cortana, Siri, Alexa, Google Assistant
(Google Duplex)): https://www.youtube.com/watch?v=D5VN56jQMWM (was 5 years ago.
Anything came from it? They are still working on it.)
Transcription (speech to text)
Automatic reading - (text to speech, ebooks/audiobooks)
Sentiment analysis - classifying statements - positive/negative,
constructive/destructive, ToS compliant/ToS violating (or anything). Facebook
content moderation.
Topic modeling (classification) - topic assignment given a text, automatic tagging.
Automatic text summarisation (automatic tagging).
Text generation (recent revolutions are often here, like GPT-3). Code generation.
Autocomplete (gmail currently has the best autocomplete; you need to connect to
gmail through a US proxy).
Translation
The primary use cases sometimes are not impressive for people new to NLP, but a
sentiment analysis model can be used for stock trading decisions - if you see a
hashtag trending on twitter or some news in the media about your company, a
negative news message can be an indicator to exit your current position before many
people do and before the price of the stock goes bust, potentially saving you
thousands to millions.
Text generation in the form of code autocomplete is huge with the reveal of
GPT-3, OpenAI Codex, Github Copilot.
Definition of NLP
Natural Language Processing
Definition of NLP
Natural Language Processing
Starting from the earliest:
Turing test - using language to interact with others is distinctively human.
Chomskian revolution in linguistics / rule based models (1. Arguments against
behaviorists & 2. Theory of universal grammar)
Machine learning based models - linguistics is not necessary anymore, we don’t need
a ruleset, algorithms learn the rules!
Starting from about 2014-2015 (bidirectional) LSTMs begin to dominate NLP, 2018 -
Transformer architecture.
See: https://www.dataversity.net/a-brief-history-of-natural-language-processing-
nlp/# and https://en.wikipedia.org/wiki/...
Ref: https://buggyprogrammer.com/what-is-natural-language-processing/
History of NLP
Natural Language Processing
One of the more impresive and creative products:
Github Copilot (X)
Microsoft Copilot
ChatGPT based products
http://sentdex.com/political-analysis/us-politicians/
http://sentdex.com/how-sentdex-works/
https://zyro.com/ai/content-generator Ai text generator
etc.

Imaginary NLP pipeline:


Audio data
Converted to text (transcription)
Converted to tensors (numbers)
Trained RNN / Transformer or an ML model
Deploy model to production to process new data
$$$ ?
Pipeline and products
Natural Language Processing
So English is of course the most used language in NLP. Other major languages are
also used, and models / corpora / embeddings for them are available.
https://informatica.vu.lt/journal/INFORMATICA/article/1083/text
https://www.google.com/search?q=speech+to+text+lithuanian (try it:
https://sonix.ai/languages/transcribe-lithuanian-audio )
https://semantika.lt/ (try it and think about whether you could build something better)
TODO: mobile apps

Lithuanian is a small language, so if you create some impressive ML/DL model or a
product based on that model, you might be the first one to do it (...or a leader).
So knowing a small language is both a blessing and a curse in that sense.
Lithuanian?
Natural Language Processing
The internet is basically mostly text (F. Chollet). Natural language data can be
obtained by various means:
pre-preped dictionaries and datasets available on the internet
scraping websites / forums / dating sites / use APIs of those websites
video / audio transcription
generating your own data (literally just writing text)

Prepreped:
End of this article: https://towardsdatascience.com/ultimate-beginners-guide-to-
collecting-text-for-natural-language-processing-nlp-with-python...
NLTK corpora: https://theflyingmantis.medium.com/exploring-natural-language-toolkit-
nltk-e3009de61576
Useful list: https://www.nltk.org/book/ch02.html
Kaggle (text mostly, audio is more about animal sounds, not speech)

APIs:
A list of APIs is in the resource above
APIs of particular interest: twitter, facebook, reddit and so on. “Simple”
(~understandable) task - create a chrome plugin that, near each comment, would show
whether it’s a positive or negative comment.

Scraping
We had an extended discussion on scraping which will help you for this part!
Obtaining data
Natural Language Processing
Transcription
We can use the SpeechRecognition package in Python to obtain text from audio files.
This is essentially a facade for multiple speech recognition APIs, like Google,
Microsoft Bing and so on:
recognize_bing(): Microsoft Bing Speech
recognize_google(): Google Web Speech API
recognize_google_cloud(): Google Cloud Speech - requires installation of the
google-cloud-speech package
recognize_houndify(): Houndify by SoundHound
recognize_ibm(): IBM Speech to Text
recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
recognize_wit(): Wit.ai
https://pypi.org/project/SpeechRecognition/
The following file formats are supported:
WAV: must be in PCM/LPCM format
AIFF
AIFF-C
FLAC: must be native FLAC format; OGG-FLAC is not supported
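A minimal sketch (assumes pip install SpeechRecognition; "sample.wav" is a hypothetical PCM WAV file):

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:
    audio = recognizer.record(source)       # read the entire file into an AudioData object
print(recognizer.recognize_google(audio))   # Google Web Speech API; needs internet access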
Obtaining data
Natural Language Processing
Once we obtained the words / text we can do the preprocessing - here are some
common preprocessing actions we need to do:
Preprocessing and mining
Natural Language Processing
For many tasks in natural language processing we can use the NLTK library - the
toolbox for NLP! NLTK is a leading platform for building Python programs to work
with human language data. It provides easy-to-use interfaces to over 50 corpora and
lexical resources such as WordNet, along with a suite of text processing libraries
for classification, tokenization, stemming, tagging, parsing, and semantic
reasoning, wrappers for industrial-strength NLP libraries. See the following
resources if you want to learn more: http://www.nltk.org/nltk_data/
The following are the popular tasks for NLP:
Sentence tokenization
Word tokenization
Stopword removal (commonly occurring words that do not add much meaning)
Building n-grams (sequences of n words (bigrams, trigrams and so on)). Helps us
construct “bag of n-grams” models
Stemming - only leave the stem of the word
Lemmatization - grouping together the inflected forms of a word so they can be
analysed as a single item
Tagging part of speech (POS tagging): https://pythonprogramming.net/part-of-speech-
tagging-nltk-tutorial/ . Why is it useful? We could reduce the dimensionality of
our data by just leaving nouns and verbs, for example. Remember - as long as you are
getting adequate performance for your problem, you can do whatever you want
with the data. Knowing which words are verbs and nouns and filtering them as such
would need to be achieved through PoS tagging first. So it’s a preprocessing and
dimensionality reduction technique.
Displaying parse tree
Additional tasks that you might need to perform, but with raw python
Data cleaning - html/json/xml entity removal would be an example
Punctuation removal
Lowercasing, uppercasing, etc.
Data obfuscation (like email, ssn, pwd obfuscation, email is usually replaced with
placeholder <email>). Replacing the data fields/words that are particular to some
individual with predefined tokens/variables gives better generalization, because
the text becomes less specific.
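A minimal sketch of several of these steps with NLTK (an assumption: the 'punkt', 'stopwords', 'wordnet' and 'averaged_perceptron_tagger' resources were fetched once via nltk.download()):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The striped bats are hanging on their feet for best."
words = nltk.word_tokenize(text)                              # word tokenization
words = [w for w in words if w.lower() not in stopwords.words("english")]
print(list(nltk.bigrams(words)))                              # n-grams (here: bigrams)
print([PorterStemmer().stem(w) for w in words])               # stemming
print([WordNetLemmatizer().lemmatize(w) for w in words])      # lemmatization
print(nltk.pos_tag(words))                                    # part-of-speech tagging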
Preprocessing and mining
Natural Language Processing
Useful when you want a summary - let's say you have an ecommerce website and you
want to get the basic gist of what the customer is saying from a long text. There
are many ways to do it, but without a complex model we can exploit the idea that a
few important sentences in the text can disclose the meaning of the entire piece of
text - our goal is to find those sentences.
Interesting applications as example:
summarizing facebook / twitter posts - see how much meaning they retain.
summarizing CNN/Delfi articles and comparing the summaries to the ones that the
author provided. Maybe the one we generated will be less click-bait-y.
can a summary be only a paragraph? Maybe a table with nouns | verbs | adjectives -
we could use PoS tagging for that purpose.
scientific research paper summarization (and automated “Abstract” synthesis)
We have many ways of performing auto summarization, however there is a
deterministic classic algorithm - just find the most important words and sentences.
How do we decide which words are important? Word importance == word frequency,
without counting stop words - they are the most frequent, but least representative
of the topic being discussed.
Which sentences are most important? The more important words a sentence has, the
more important the sentence is! The sum of word importances.
After determining the most important sentences we can preserve their relative
order.
After you have the most important words and sentences choose how long the summary
needs to be and create it.
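A from-scratch sketch of that algorithm (an assumption: NLTK's 'punkt' and 'stopwords' resources are available):

import nltk
from collections import Counter
from nltk.corpus import stopwords

def summarize(text, n_sentences=2):
    sentences = nltk.sent_tokenize(text)
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = Counter(words)                                  # word importance == frequency
    scores = {i: sum(freq.get(w.lower(), 0) for w in nltk.word_tokenize(s))
              for i, s in enumerate(sentences)}            # sentence score == sum of word scores
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)             # keep original sentence order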
Auto summarization
Natural Language Processing
Neural networks represent a differentiable function mapping inputs to outputs,
hence they work only with numbers.
We need to convert words into numbers (remember the tabular data analysis part
where we converted categorical data to one-hot encoding). So we need some
representation for the words we are going to feed into our networks.

General steps:
obtain the text
tokenize the text into words
represent the words as vectors

What is the best/popular ways to represent words as vectors?


One hot encoding (ohe)
Frequency based encodings (tf-idf and others)
Prediction based encodings (word embeddings)
Words as numeric data
Natural Language Processing
One hot encoding
When you have a table of all the words, and for each word that is in the sentence
you write a 1 (the word is present).
Has a lot of downsides: a large, sparse matrix; no ordering preservation (BoW);
loss of frequency information; semantic relationships are lost - but it is very
easy to understand (and implement, even from scratch). Why is "easy to understand"
a virtue / advantage? Because you can even memorize it and use it in job
interviews. It's also useful as a baseline to benchmark against.
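A from-scratch sketch (toy sentences; the presence-only encoding described above):

import numpy as np

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]
vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}

onehot = np.zeros((len(sentences), len(vocab)), dtype=int)
for row, sent in enumerate(sentences):
    for w in sent:
        onehot[row, vocab[w]] = 1   # presence only: order and counts are lost
print(vocab)
print(onehot)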
Words as numeric data
Natural Language Processing
Frequency based encodings
Count frequency encoding captures how often each word occurs. Otherwise it has the
same disadvantages as one-hot.
TF-IDF (term frequency–inverse document frequency) captures how often the word
occurs in a document as well as how often it occurs in the entire corpus. It
captures both frequency and relevance: the more often a word occurs in a document,
the more important it is; the more often it occurs in the entire corpus, the less
relevance it has. But didn't we just say that word occurrence in a sentence makes
the sentence more important? Yes, but a word is not more important if it is
frequently used everywhere. Think about stopwords - they appear very often but are
often considered to be noise in a textual dataset. We can also think that the
oft-occurring words are very generic and non-specific, therefore lack "specificity"
("jack of all trades, master of none" words). The drawbacks: no context and no
relationships with other words. A common technique.
Common technique.
Co-occurrence: is a measure of how frequently two words have occured together in
the context window of a certain size. Requires huge matrix, but can preserve
relationships between words, see: https://stackoverflow.com/a/24076711/1964707
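A minimal sketch contrasting raw counts with TF-IDF (assumes scikit-learn; the toy documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
print(CountVectorizer().fit_transform(docs).toarray())   # raw frequency counts
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())               # words shared by all docs get downweighted
print(tfidf.get_feature_names_out())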
Words as numeric data
Natural Language Processing
One-hot-encoding
Frequency Vectors
Tf-idf
Ref: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/
ch04.html
Words as numeric data
Natural Language Processing
Embeddings
We already talked about them. They capture both meaning and semantic relationships.
They are learned.
Based on some vector distance in Euclidean space (Euclidean distance, cosine
distance and so on).
This is the most powerful way to represent words in numeric form.
They automatically provide dimensionality reduction.
Compared to OHE:
While the vectors obtained through one-hot encoding are binary, sparse (mostly made
of zeros) and very high-dimensional (same dimensionality as the number of words in
the vocabulary), "word embeddings" are low-dimensional floating point vectors (i.e.
"dense" vectors, as opposed to sparse vectors). Unlike word vectors obtained via
one-hot encoding, word embeddings are learned from data. It is common to see word
embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when
dealing with very large vocabularies. On the other hand, one-hot encoding words
generally leads to vectors that are 20,000-dimensional or higher (capturing a
vocabulary of 20,000 tokens in this case). So, word embeddings pack more information
into far fewer dimensions.
Words as numeric data
Natural Language Processing
Embeddings are so powerful that the following numeric expression approximately holds:
king - man + woman = queen

Who could have guessed that arithmetic on concepts is possible!


Embeddings (cont.)
There are two ways to obtain word embeddings:
Learn word embeddings jointly with the main task you care about (e.g. document
classification or sentiment prediction). In this setup, you would start with random
word vectors, then learn your word vectors in the same way that you learn the
weights of a neural network.
Load into your model word embeddings that were pre-computed/pre-trained (in the
spirit of transfer learning) using a different machine learning task than the one
you are trying to solve. These are called "pre-trained word embeddings". There are
two popular pre-trained ones: word2vec, GLoVe (more about them in the next slides)
Words as numeric data
Natural Language Processing
Embeddings (cont.), paraphrased from Chollet’s book
The simplest way to associate a dense vector to a word would be to pick the vector
at random. The problem with this approach is that the resulting embedding space
would have no structure: for instance, the words "accurate" and "exact" may end up
with completely different embeddings, even though they are interchangeable in most
sentences. It would be very difficult for a deep neural network to make sense of
such a noisy, unstructured embedding space.
To get a bit more abstract: the geometric relationships between word vectors should
reflect the semantic relationships between these words. Word embeddings are meant
to map human language into a geometric space. For instance, in a reasonable
embedding space, we would expect synonyms to be embedded into similar word vectors,
and in general we would expect the geometric distance (e.g. L2 distance) between
any two word vectors to relate to the semantic distance of the associated words
(words meaning very different things would be embedded to points far away from each
other, while related words would be closer). Even beyond mere distance, we may want
specific directions in the embedding space to be meaningful.
In real-world word embedding spaces, common examples of meaningful geometric
transformations are "gender vectors", "plural vectors" and "royalty vectors". For
instance, by adding a "female vector" to the vector "king", one obtains the vector
"queen". By adding a "plural vector", one obtains "kings". Word embedding spaces
typically feature thousands of such interpretable and potentially useful vectors.
Is there some "ideal" word embedding space that would perfectly map human language
and could be used for any natural language processing task? Possibly, but in any
case, we have yet to compute anything of the sort [similarly to the Chomskian
universal grammar project]. Also, there isn't such a thing as "human language": there are
many different languages and they are not isomorphic, as a language is the
reflection of a specific culture and a specific context. But more pragmatically,
what makes a good word embedding space depends heavily on your task: the perfect
word embedding space for an English-language movie review sentiment analysis model
may look very different from the perfect embedding space for an English-language
legal document classification model, because the importance of certain semantic
relationships varies from task to task.
It is thus reasonable to learn a new embedding space with every new task.
Thankfully, backpropagation makes this really easy, and Keras makes it even easier.
It's just about learning the weights of a layer: the Embedding layer.
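A minimal sketch of such a jointly-learned Embedding layer in Keras (vocabulary size,
dimensions and the classification head are arbitrary assumptions, not from the
slides):

import tensorflow as tf

vocab_size = 20_000  # assumed number of distinct tokens
embed_dim = 256      # each token id becomes a dense 256-d float vector

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),  # weights start random, learned by backprop
    tf.keras.layers.GlobalAveragePooling1D(),          # collapse the sequence into one vector
    tf.keras.layers.Dense(1, activation="sigmoid"),    # e.g. a sentiment prediction head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Training with model.fit(padded_token_ids, labels) would learn the embedding
# weights jointly with the classifier, starting from random vectors.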
Words as numeric data
Natural Language Processing
What we discussed are bag-based models (frequency encoding, tf-idf). They disregard
order but maintain count (frequency).
These terms are often mentioned in tutorials and books. Bag-of-words models can be
useful for classification tasks (sentiment analysis), but not so much for
sequence-based tasks (autocompletion / text generation / translation / summarization).
Take note that one-hot encoding does not preserve frequency, therefore it is
technically not classified as a bag-of-words (BoW) model, but it is definitely a
rather simplistic way to represent words as numbers.
Bag-of-n-grams models are just a logical extension (generalization) of bag-of-words
models. Same definition: disregard order, maintain frequency. Sometimes it is useful
to check whether these models could give better results.
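A minimal sketch of a bag-of-n-grams representation using scikit-learn's
CountVectorizer (corpus made up):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the movie was not good", "the movie was good"]

# Unigrams + bigrams: "not good" becomes its own feature,
# which a plain bag-of-words representation would lose.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(counts.toarray())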
Bag of words / Bag of ngrams / Sequential
Natural Language Processing
LLMs also must accept numbers, not words, so they use some kind of tokenization +
token encoding scheme out of the many that exist.
Because there are many schemes, we can better understand their properties by
grouping them into categories/types: character level, sub-word level, word level.
A list of known token encoding schemes:
one-hot encoding (bow)
count vectorization (bow)
tf-idf encoding (bow)
label encoding (sequential)
word embeddings
character embeddings
byte-pair encoding (BPE)
WordPiece
SentencePiece
byte-level Encodings
We know that some of these are BoW encodings, while LLMs are mostly used in ways that
preserve the sequential aspect of natural language. So not all encodings are
particularly suited for LLMs.
A list of token encoding schemes used by LLMs:
BERT uses a WordPiece tokenizer variant:
https://huggingface.co/docs/transformers/v4.40.1/en/model_doc/bert#transformers.Ber
tTokenizer
GPT-2 uses a BPE tokenizer.
GPT-3 uses a BPE tokenizer.
GPT-3.5 uses a BPE tokenizer (tiktoken).
GPT-4 uses a BPE tokenizer (tiktoken), see:
https://datascience.stackexchange.com/questions/97630/what-tokenizer-does-openais-
gpt3-api-use
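A minimal sketch of BPE tokenization with the tiktoken package (assuming it is
installed; cl100k_base is the encoding publicly documented for the GPT-3.5/GPT-4
family):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Tokenizers split words into sub-words.")
print(tokens)                             # a list of integer token ids
print([enc.decode([t]) for t in tokens])  # see how words split into sub-word pieces
print(enc.decode(tokens))                 # round-trips back to the original string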
Encoding and Embeddings for LLMs?
Natural Language Processing
An idea for a final project: perform an analysis of the Lithuanian language
resources for NLP:
is there a text corpus?
could it be contributed to NLTK?
how about stemming/lemmatization (does it work)?
do standard frameworks support Lithuanian language (NLTK, spacy)?
any pre-trained word embeddings?
are there any apps / services for LT language (additional ones to what we have in
the slides)?
Autosummarization with ML/DL - what's the current SOTA (GPT-4)? Which type of neural
network is best? (when googling autosummarization, add NLP in front); one example -
the encoder-decoder architecture:
https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-
summarization-using-deep-learning-python/
Further explorations
Natural Language Processing
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Natural Language Processing


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1Vekz8CozwQwNQrlB3zqzOPzF1ahrDNKr.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Iterators / Iterables
Decorators
01
02
03
Closures
Python Crash Course
00
Inner classes
05
Callables
06
Abstract Class
07
Interfaces
04
Digression on topic combinations
Inner classes
Python Crash Course
You may have wondered, coming from other languages, whether it is possible to
create nested or inner classes in Python? The answer is yes, it's possible,
however (see below, ref: https://stackoverflow.com/question.... )
Inner classes are used as a design choice. When one class is inextricably tied to
another and it does not make sense to have it as a standalone class, we can nest it
inside.
A common example is the Iterator pattern, where a ContainerIterator class is nested
inside a Container class (names chosen arbitrarily), or Person and FullName classes,
etc.
If you have a class representing a collection and you want to iterate it in many
ways you can nest iterator classes - a common pattern in languages like Java / C#.

Iterators / Iterables
Python Crash Course
Definition:
An iterator is an object that contains a countable number of values, almost like a
“container class”.
An iterator is an object that can be iterated upon, meaning that you can traverse
through all the values.
Technically, in Python, an iterator is an object which implements the iterator
protocol, which consists of the methods __iter__() and __next__().
Iterator vs Iterable:
Lists, tuples, dictionaries, and sets are all iterable objects. Strings are
iterable objects too, and can return an iterator as well.
They are iterable containers which you can get an iterator from. All these objects
have a __iter__() method which is used to get an iterator.
The differences can be summarized:

Iterators / Iterables
Python Crash Course
Iterating over an iterator:
You can do it with next().
You can also do it with a for loop. The for loop actually creates an iterator object
and executes the next() method on each loop iteration.

Creating iterators:
To create an object/class as an iterator you have to implement the methods
__iter__() and __next__() on your object.
__iter__() - returns the iterator object itself.
__next__() method also allows you to do operations, and must return the next item
in the sequence.

Stopping iterations:
An iterator would continue forever if you had enough next() statements, or if it was
used in a for loop.
To prevent the iteration from going on forever, we can use the StopIteration
built-in exception.
In the __next__() method, we can add a terminating condition to raise the exception
if the iteration has been done a specified number of times (see the sketch below).
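A minimal sketch of a custom iterator implementing the protocol described above
(class name made up):

class CountUpTo:
    # Iterator yielding 1, 2, ..., limit and then stopping.
    def __init__(self, limit):
        self.limit = limit
        self.current = 0

    def __iter__(self):
        return self  # returns the iterator object itself

    def __next__(self):
        if self.current >= self.limit:
            raise StopIteration  # the terminating condition
        self.current += 1
        return self.current

for n in CountUpTo(3):
    print(n)  # 1, 2, 3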

Iterators / Iterables
Python Crash Course
Container class:

Exercise:
We have a Flight class.
It contains two lists (+ other scalar fields).
First list contains objects of Passenger class
The next one contains objects of CrewMember class
The Flight class can accept the two lists as parameters to __init__(self,
passengers, crew)
Create an iterator to iterate over all of the people on the plane.

When doing the exercise, think about how you can simplify it by creating a
simplified version of the problem first. After solving the simplified version
first, solve the final version.

Please try not to perform memory operations in the __next__ or __iter__ methods,
such as concatenating lists, etc.
Closures
Python Crash Course
Closures are functions that are nested inside other functions and have access to
the variables of the outer function.
All of this behavior is enabled by Python's functions being first-class citizens:
https://www.geeksforgeeks.org/decorators-in-python/
We have already seen how treating functions as first class citizens helps us in
creating things like the map, filter, reduce functions - since we can treat a
function as a variable, we can pass it to another function and then call it when
needed:

We could pass any function to the custom my_map(func, list) function:


Closures
Python Crash Course
Knowing that we can treat functions as variables allows us to also return a
function from another function - hence we have closures.
Closures have lexical scope - they can access variables (called "free variables")
from the outer context, see:
https://stackoverflow.com/questions/1047454/what-is-lexical-scope
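A minimal sketch of a closure capturing a free variable (names made up):

def make_multiplier(factor):
    # 'factor' is a free variable captured by the inner function.
    def multiply(x):
        return x * factor
    return multiply

double = make_multiplier(2)
triple = make_multiplier(3)
print(double(10), triple(10))               # 20 30
print(double.__closure__[0].cell_contents)  # 2 - the captured value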
Closures
Python Crash Course
A common useful example is creating “decorator functions” that extend the
capabilities of normal functions in some way.
Decorators
Python Crash Course
Decorators use closures for their implementation. Decorators are a very powerful
and useful tool in Python since they allow programmers to modify the behavior of a
function dynamically, without rewriting it. Decorators allow us to wrap another
function in order to extend the behavior of the wrapped function without
permanently modifying it - giving us dynamic extensibility and flexibility without
reimplementation.

There are 3 ways to make functions flexible (more reusable, more DRY):
parameters,
the functional strategy pattern - part of the logic is injected as a variable, see:
https://stackoverflow.com/a/30465042/1964707
decorators
Decorators
Python Crash Course
Decorators are usually used as metadata on other functions - the @ sign + the name
of the decorator.
Decorators
Python Crash Course
We sometimes need to pass data to the decorator - when the decorated function
accepts arguments.
Note: decorators compatible with arguments are the default way of creating them!

Decorators
Python Crash Course
Practical examples: logging and timing
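A minimal sketch of a timing decorator in this spirit (functools.wraps also keeps
the wrapped function's metadata intact, which matters when stacking decorators, as
noted two slides below):

import functools
import time

def timed(func):
    @functools.wraps(func)  # preserve func.__name__, docstring, etc.
    def wrapper(*args, **kwargs):  # works for any signature
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.6f}s")
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

print(slow_sum(1_000_000))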

Decorators
Python Crash Course
Multiple decorators - the way we implemented the decorators, we can't stack them,
or they become order-dependent.
That is because the function's name is being reassigned as each decorator is
applied.
To make it work we can use a helper decorator: functools.wraps

Decorators
Python Crash Course
Real world examples of decorators: https://realpython.com/primer-on-python-
decorators/#a-few-real-world-examples

Since decorators are mostly just functions, they can also accept arguments
themselves: https://stackoverflow.com/questions/5929107/decorators-with-parameters

Decorators
Python Crash Course
Classes can be used as decorators using the dunder __call__ method (aka
callables).
In the case of callables, class names are often written in lowercase letters.
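A minimal sketch of a class used as a decorator via __call__ (the lowercase class
name follows the convention mentioned above; the example is made up):

class count_calls:
    # Decorator class counting how many times a function is called.
    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        print(f"call #{self.calls} of {self.func.__name__}")
        return self.func(*args, **kwargs)

@count_calls
def greet(name):
    return f"Hello, {name}!"

print(greet("Ada"))
print(greet("Alan"))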

Decorators
Python Crash Course
Useful decorators
There are many useful decorators, see: https://towardsdatascience.com/10-fabulous-
python-decorators-ab674a732871
We will learn about @lru_cache, @cache, @jit later in the course when we talk about
performance optimization.
We learned about @property, @classmethod, @<x>.setter/deleter and so on in the
previous lecture.
Most common ones: @property, @<x>.setter/deleter, @staticmethod, @classmethod,
@lru_cache, @dataclass
Now let's take a look at one of the most useful and time-saving decorators -
@dataclass - it eliminates the need to write all the class syntax explicitly and
simplifies the writing of classes (see the sketch below).
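A minimal sketch of @dataclass (field names are illustrative):

from dataclasses import dataclass

@dataclass
class InventoryItem:
    name: str
    unit_price: float
    quantity: int = 0  # default value, like a default parameter

    def total_cost(self):
        return self.unit_price * self.quantity

# __init__, __repr__ and __eq__ are generated for us.
item = InventoryItem("widget", 2.5, quantity=4)
print(item)               # InventoryItem(name='widget', unit_price=2.5, quantity=4)
print(item.total_cost())  # 10.0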
Decorators
Python Crash Course
Dataclasses are oriented towards classes that represent data (InventoryItem,
Person, Employee, Teacher).
Features of dataclasses (see screenshot)
Decorators
Python Crash Course
Useful resources on decorators:
https://realpython.com/primer-on-python-decorators/ (contains useful examples and
real world / framework usage)
https://towardsdatascience.com/using-class-decorators-in-python-2807ef52d273
(simple example)
https://www.geeksforgeeks.org/class-as-decorator-in-python/ (just some examples)
https://github.com/lord63/awesome-python-decorator (more examples)
A simple and understandable cache implementation for functions:
https://stackoverflow.com/a/115349/1964707

Decorators
Python Crash Course
A final theoretical point on Decorators.
Some say they are just the "Pythonic way to implement the decorator design
pattern". However, opinions differ, as can be seen in the Wikipedia article on the
Decorator pattern:

Digression on topic combinations


Python Crash Course
We saw how iterators connect concepts like objects, dunder methods and loops
together.
We saw how iterators and inner classes connect.
We also saw how closures, functions as first class citizens and decorators are
connected as well.
We could draw a pair-combination matrix of learning and explore the
interconnections between topics. Some topics are related more closely, some are
not connected at all.

Callables
Python Crash Course
The Call Operator. When we use or invoke a function in Python, we typically say
that we call the particular function. The “calling” is achieved by placing a pair
of parentheses following the function name, and some people refer to the
parentheses as the call operator - ()

Without the parentheses, the Python interpreter will just print a representation of
the function as an object itself - the function doesn't get called. Let's see the
nuance of using the function with and without the call operator. We can also test
whether we have a callable using the callable() function:

The important thing is the dunder call method.


Objects can become callables if we define the dunder __call__ method.

Callables
Python Crash Course
In short:
In Python there are many things that are callable: functions, methods, classes,
and objects, to name a few.
One common use case for callables is decorator classes.
Read more about how to use them: https://stackoverflow.com/questions/3369640/when-
is-using-call-a-good-idea

Callables
Python Crash Course
For completeness, let me mention that not only can objects be decorators (when the
classes they are derived from implement dunder __call__), but we can also decorate
classes with decorators and extend their functionality just like we extended the
functionality of functions. This, however, is a rare use case, commonly not advised
because it is replaceable with inheritance:
https://stackoverflow.com/questions/681953/how-to-decorate-a-class
https://stackoverflow.com/questions/9906144/decorate-a-class-in-python-by-defining-
the-decorator-as-a-class
Abstract class
Python Crash Course
Abstract classes in OOP languages allow us to create a class that we can use for
inheritance purposes, but cannot instantiate.
This is commonly described as "an abstract class is a template for other classes" -
this definition, however, fails to capture the difference between an abstract class
and an interface in other OOP languages, but for beginner programmers it might be
sufficient.
In classical OOP we have classes, interfaces and abstract classes. Classes have
concrete methods and data; interfaces have just abstract methods; abstract classes
have at least one abstract method and can have concrete data and methods.
Python does not provide abstract classes (or interfaces) as a native language
construct, but we can use a package for that.
ABC - abstract base class (a sketch follows below)
It is hard to find a unique case where an abstract class would be the most
appropriate solution. The closest thing seems to be when you (1) need to inherit
certain fields, (2) have some common interface that you need to keep in the
abstract class, (3) and you don't have a reason to provide a default
implementation, i.e. you must tell the concrete classes to provide their own
implementation (no default).
A weaker argument for a design that includes abstract base classes is when a class
is needed only for inheritance.
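A minimal sketch using the abc package mentioned above (Shape/Circle are made-up
names):

from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        # Concrete subclasses must provide their own implementation.
        ...

class Circle(Shape):
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3.14159 * self.r ** 2

# Shape() would raise TypeError: can't instantiate abstract class.
print(Circle(2).area())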
Interface
Python Crash Course
An interface is a type that only declares abstract methods in classical OOP.
When you define a class you can declare that it implements a certain interface, and
the concrete class then must implement the abstract methods that the interface
declares - this provides a guarantee that objects of the concrete class can be
passed to all functions that accept instances compliant with that interface (and
also added to collections that can then be iterated over with a method called on
each member, or passed to some method). In that case we have polymorphism enabled
by interfaces, which is much more abstract.
Why is this better than simple inheritance-based polymorphism? Because interfaces
allow us to be very abstract and not provide any implementation, just a requirement
that the class must conform to a certain behavior (provide certain methods).
Another thing that Python does not offer natively.
It is common to simulate interface functionality by simply having an ABC with only
abstract methods - this way we can simulate interfaces in Python!
… also, there is a Python package for even more realistic interfaces:
https://stackoverflow.com/questions/2124190/how-do-i-implement-interfaces-in-python
, more concretely: https://stackoverflow.com/a/52960955/1964707
Course plan
You can get familiar with it using this link
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on

--- Content from 1kx9d3m8dJ2BreHFoVHnJq0DPc4W24Fb3.pptx ---


Artificial Intelligence
Advanced Computer Vision
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Data processing inequality
01
02
Historical summary
Advanced Computer Vision
00
Introduction
Upscaling
03
SRGAN
05
04
SRCNN / ESPCNN
06
Measuring image quality objectively
07
Further explorations
Super resolution is the process of upscaling and/or improving the details within an
image. Approximately: more pixels!
More strictly: SR = denoising + deblurring [needs reference]
Often a low resolution image is taken as an input and the same image is upscaled to
a higher resolution, which is the output.
The details in the high resolution output are filled in where the details are
essentially unknown. Super resolution is essentially what you see in films and
series like CSI where someone zooms into an image and it improves in quality and
the details just appear.
Let's see a video intro: https://www.youtube.com/watch?v=KULkSwLk62I
Different upscaling algorithms are compared here:
https://en.wikipedia.org/wiki/Comparison_gallery_of_image_scaling_algorithms

The benefit is gaining a higher-quality image where one never existed or has been
lost. This could be beneficial in many areas or even life-saving in medical
applications (CAT, Roentgen (X-ray), MRI scans can be enhanced), or CCTV quality
enhancement for suspect identification. Another use case is compression for
transfer between computer networks. Imagine if you only had to send a 256x256 pixel
image where a 1024x1024 pixel image is needed; correspondingly, maybe we save
storage (1TB -> 500GB). REMEMBER: we are not talking about compression /
decompression algorithms here.
Introduction
Advanced Computer Vision
Ref: https://en.wikipedia.org/wiki/Data_processing_inequality
Data processing inequality
Advanced Computer Vision
We can differentiate the following historical methods for SR:
Classical methods: nearest-neighbour interpolation, bilinear, bicubic.
ML: SVR based: http://citeseerx.ist.psu.edu/viewdoc/download?
doi=10.1.1.367.2012&rep=rep1&type=pdf
(SR)CNN: does not dream up details (source?): https://arxiv.org/pdf/1501.00092.pdf
(SR)GAN: does dream up details, gives nice-looking images:
https://arxiv.org/pdf/1609.04802.pdf
SOTA models: https://paperswithcode.com/sota/image-super-resolution-on-set14-4x-
upscaling
Another interesting distinction is single-frame vs. multi-frame super-resolution
algorithms (used in mobile phones, for example).
We see a tradeoff between precision (CNN) and details that can be dreamed up (GAN).
Historical summary
Advanced Computer Vision
We can perform simple upsampling if we want our image to be larger without losing
too much quality. This can be done with OpenCV and is one of the most popular ways
to do it in code (see the sketch after the list below). See:
https://chadrick-kwag.net/cv2-resize-interpolation-methods/

Interpolation methods:
INTER_NEAREST – a nearest-neighbor interpolation
INTER_LINEAR – a bilinear interpolation (used by default)
INTER_AREA – resampling using pixel area relation. It may be a preferred method for
image decimation, as it gives moiré-free results. But when the image is zoomed, it
is similar to the INTER_NEAREST method.
INTER_CUBIC – a bicubic interpolation over 4×4 pixel neighborhood
INTER_LANCZOS4 – a Lanczos interpolation over 8×8 pixel neighborhood
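A minimal sketch of upscaling with cv2.resize (the file name is made up; assumes
opencv-python is installed):

import cv2

img = cv2.imread("input.jpg")  # made-up file name
h, w = img.shape[:2]

# 4x upscaling with two different interpolation methods.
nearest = cv2.resize(img, (w * 4, h * 4), interpolation=cv2.INTER_NEAREST)
cubic = cv2.resize(img, (w * 4, h * 4), interpolation=cv2.INTER_CUBIC)

cv2.imwrite("up_nearest.jpg", nearest)  # blocky
cv2.imwrite("up_cubic.jpg", cubic)      # smoother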
Upscaling
Advanced Computer Vision
BSDS500
DIV2K (high res: https://www.tensorflow.org/datasets/catalog/div2k )
Datasets
Advanced Computer Vision
Ref: https://keras.io/examples/vision/super_resolution_sub_pixel/

Interesting parts:
Why is BICUBIC used for the Cr and Cb channels while model.predict() is used for Y?
Why is the model trained on a single channel and not on all 3? Performance?
Would it make the model worse if we used RGB rather than YUV?
SRCNN / ESPCNN
Advanced Computer Vision
Architecture: https://pyimagesearch.com/2022/06/13/enhanced-super-resolution-
generative-adversarial-networks-esrgan/ (note how generator generates SR images and
discriminator tries to distinguish them).
Ref: https://www.tensorflow.org/hub/tutorials/image_enhancing
SRGAN
Advanced Computer Vision
Ref: https://www.mathworks.com/help/images/image-quality-metrics.html
Measuring image quality objectively
Advanced Computer Vision
Create Nearest-neighbor interpolation SR from scratch.
There are many more approaches to image superresolution, explore them.
One of them is this tool: https://github.com/fperazzi/proSR
How would you go about creating a web tool that would allow users to increase the
size of an uploaded image? Would an SRCNN approach be feasible (a CNN does not have
arbitrary output dimensionality; maybe we would offer just some subset of standard
scales: 640x480, 1024x768, 1280x1024)? This discloses an interesting advantage that
classical upscaling techniques have. Maybe we would use a mixed approach: use an
SRGAN for initial upscaling to something bigger than the user requested, then
downscale it with classical techniques to get the resolution the user wants, see:
https://github.com/thekevinscott/UpscalerJS
Further explorations
Advanced Computer Vision
Try some of the available models or online SR tools on your own images.
… try digitizing and improving images from old photographs / video tapes.
… check if you can recover all the details from images saved in Google Photos
(after downscaling) (hint: you would probably need a way to measure the quality
objectively in case you can't see improvements subjectively).
Further explorations
Advanced Computer Vision
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Advanced Computer Vision


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 12AOyljL1i0jzCMm2HilqmLXGGkCEL_LC.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Using a debugger
Generators
01
02
03
Lambda
Python Crash Course
00
Functions
Basic file operations
04
Functions
Python Crash Course
Functions are named blocks of code that can be reused when needed (DRY).
Functions can accept parameters and can return a result with the return keyword.
Based on how they return results, functions are often classified into:
Pure functions - functions that have no side effects and only return a value based
on the parameters passed in.
Functions with side effects - functions that may or may not return a value, but
necessarily perform an action "outside of memory" (e.g. print text to the screen,
write to a file / database, make a network request) or modify a global variable,
etc.
Pure functions are easy to test, parallelize and understand, and are also highly
valued in the functional programming ecosystem.
Practical application: write as many pure functions as possible, keep all impure
functions separate, and try to concentrate all side effects in a few functions or
even in the global script scope (e.g. computing the average length of today's delfi
headlines: get_raw_html(url) → extract_titles(raw_html) → get_avg_titles([titles])
→ print(avg)).
Based on how they accept parameters, functions can be classified into:
Functions with parameters.
Functions without parameters.
Functions with default parameters (important concepts: positional binding, named
arguments).
Based on who writes the functions, we can divide them into:
Built-in functions - those that already exist in the Python language once the
interpreter is installed (print(), len(), sum(), .append()) or come with libraries.
User-defined - functions we create ourselves.

More: https://www.w3schools.com/python/python_functions.asp
Functions
Python Crash Course
It is very important to understand the difference between a function declaration
and a function call. A function that is only declared but never called will never
run.
Built-in functions we practically always only call, while user-defined functions we
both declare and call.
A function can have parameters, specified in parentheses.
Function declaration schema:

def function_name(parameter1, parameter2, ... ):
    code / logic

Interestingly, in Python we can write a function without an implementation using
the keyword "pass" - that is the simplest possible Python function.
Naming: print_something(), verbs: print(), with exceptions such as list() (although
"to list something" is also a verb).
Functions
Python Crash Course
Default arguments:
Functions can have so-called default parameters, which must always be declared
after the non-default ones - if we do not pass that parameter when calling the
function, it takes its default value. This way we can make function parameters
optional.
When calling a function, plain parameters are bound by position, but if we use
parameter names (i.e. keyword arguments) we can pass the parameters in any order,
since binding is done by name. Keyword arguments must come after positional ones,
if there are any.
Important: do not use mutable values as default parameters in functions. Default
parameters are initialized when the function is created, so if you use mutable
collections and append to them inside the function, you may get unexpected
results. How to solve it: https://stackoverflow.com/a/11416002/1964707

*args and **kwargs - for very generic functions (see the sketch below):

If we need to pass many values into a function, we can use the *args parameter
(inside the function it becomes a tuple).
If we need to pass many keyword arguments, we use **kwargs (inside the function it
becomes a dict).
Examples: print(*values, …) → print("A", "B", "C")
We can combine these mechanisms: def f(*args, **kwargs), see:
https://medium.com/spatial-data-science/unlock-the-power-variadic-arguments-in-
python-functions-a591bf572c2
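A minimal sketch of *args / **kwargs (the function name is made up):

def describe(*args, **kwargs):
    # args is a tuple of positional values, kwargs a dict of named ones.
    print(args, kwargs)

describe(1, 2, 3, mode="fast", retries=2)  # (1, 2, 3) {'mode': 'fast', 'retries': 2}

values = ["A", "B", "C"]
print(*values)  # unpacking: same as print("A", "B", "C")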
Functions
Python Crash Course
Pure functions are those that have no side effects; their result is fully
determined by their parameters and the algorithm described inside them.
Pure functions print nothing to the screen, write nothing to files, do not make
network requests, and do not change global variables.
They have many advantages (compared to functions with side effects):
Much easier to understand.
They help structure the program.
Easy to test with unit tests (Pytest), since there is no need to check whether the
function produced a side effect. And since understanding is grounded in testing,
they are easier to understand as well.
Easy to use in applications with many threads (multithreaded programming).

What is a side effect?

I/O operations (writing to a file, database, API, screen).
Affecting / changing (mutating) global variables.
Modifying an argument passed into the function in such a way that the parameter is
already changed outside of the function (by reference).

These functions are not evil - every useful program has more than one of them.
However, when building an application it pays to follow the rule: as many functions
without side effects as possible, and all side effects concentrated in a small
number of functions. E.g. the delfi headlines example on the previous slide.
Functions
Python Crash Course
Return:
An interesting detail of the Python language is that a function always returns
something, even if there is no return keyword - in that case None is returned.
Also, unlike many languages, Python allows multiple return values.
The multiple return mechanism relies on tuple unpacking: a, b, *c = (1, 2, 3, 4)
Return terminates the function - regardless of whether the return statement is used
inside a loop, inside a conditional, etc.
Functions
Python Crash Course
Functions can call other functions and thus delegate part of the work.
This is called functional decomposition of a program - the initial form of a
program may be just one giant source code file. In it we can spot certain pieces of
code that repeat - we replace such pieces with a function call. The called function
will contain the same logic that was extracted from the file used in our thought
experiment.
And then we can call other functions from a function, modularizing the code even
further.

Functions can call themselves - this is called recursion.

Classic examples of problems solved with recursive functions are:
Factorial calculation → 5! = 5*4*3*2*1 ⇒ 5! = 5 * 4! ⇒ n! = n * (n-1)!
Fibonacci number generation → 1, 1, 2, 3, 5, 8. Fn = Fn-1 + Fn-2
Enumerating all members of a tree data structure (tree traversal).
Sorting algorithms (some are implemented with the help of recursion).
left_pad("ab", 4) ⇒ |__ab|
The essence of recursion consists of two ideas:
the function calls itself, but each time with a different parameter value (the
recursive step), and
there exists a case after which the recursion stops, called the "base case"
(stopping condition) - see the sketch below.
The function call stack.
Visualization of the function call stack and the recursion tree.
Stack overflow error - if the function never reaches the base case, the legendary
SO error eventually occurs.
The domino example!
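A minimal sketch of the two ideas (recursive step + base case) using factorial:

def factorial(n):
    if n <= 1:  # base case - stops the recursion
        return 1
    return n * factorial(n - 1)  # recursive step with a smaller value

print(factorial(5))  # 120; call stack: factorial(5) -> factorial(4) -> ... -> factorial(1)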
Functions
Python Crash Course
Passing parameters to functions
Parameters can be data, "flags" / control parameters, and even other functions (the
passing syntax is the same, but the semantics differ).
In C, C++, C# and PHP, parameters can be passed to functions by reference or by
copy (by value).
Passing by reference means that the function receives the variable's address in
memory, and if it changes the variable, the changes are reflected outside the
function as well. Where is this useful, where is this approach used in practice? By
reference is used when large data structures (big arrays) need to be passed, so
that another copy does not have to be made.
Passing by copy means that the parameter is copied and all changes made to it in
the function apply only inside that function.
In Python, the semantics is said to be "pass by assignment" (or
pass-by-object-reference: references are copied, not the values) - the variable
passed into the function is simply bound to a new name. In that case, during the
function call two variables exist that hold a reference to the same memory
location. If the variable is changed in place (possible only for mutable
variables), the change will be visible outside the function; if the variable is
immutable (string, int, etc.), any assignment creates a new variable and it will
look as if the variable was not changed outside the function. Ref:
https://stackoverflow.com/questions/986006/how-do-i-pass-a-variable-by-reference
E.g.:

def test(a):
    a += 1       # rebinds the local name 'a'; the caller's int is untouched (immutable)
    print(a)     # 2

a = 1
print(a)   # 1
test(a)    # 2
print(a)   # 1 - unchanged outside the function

def test2(lst):          # renamed from 'list' to avoid shadowing the built-in
    lst.append('a')      # mutates the shared list in place
    print(lst)           # ['a', 'a']

lst = ['a']
print(lst)    # ['a']
test2(lst)    # ['a', 'a']
print(lst)    # ['a', 'a'] - the change is visible outside
Using the debugger
Python Crash Course
After learning how functions, loops and conditionals (if, else) work, we can
effectively learn how to use a debugger in the IDE.
Concepts:
Debugger - a tool to see what is happening in the program line by line.
Alternative: print()
Breakpoint - a point in the execution of the program that, when hit, will stop the
execution and engage the debugger.
Execution flow - how the program is executed (with all the loops, jumps to
functions and all that).
DEMO: grouping exercise with dicts (jumping inside loops and conditionals): step
over
DEMO: refactor the same example with functions: step into, resume with multiple
breakpoints.
DEMO: how to change variables during the execution of a debugging session.
NOTE: Built-in functions written in C are not debuggable (you can't step into them
with the pydev debugger). There is also advanced debugging where the CPython
interpreter is launched with breakpoints to see the internals of how the
interpreter works - mixed-mode debugging:
https://learn.microsoft.com/en-us/visualstudio/python/debugging-mixed-mode-c-cpp-
python-in-visual-studio?view=vs-2022

Lambda
Python Crash Course
Anonymous functions in the Python language are called lambda functions or simply
lambdas.
They have no def and return keywords; they can have default parameters, but that is
rarely used.
E.g.: sum = lambda x, y : x + y
Often used when a callable object needs to be passed as a function argument, but
there is no point in creating a named function, because the logic will be used only
once.
We have already seen the sort() function. E.g.: let's sort a list that contains
other lists or tuples by the second value (see the sketch below):

Also used for sorting lists of objects by attribute (we will touch on this after
getting acquainted with object-oriented programming).
Ref: https://www.w3schools.com/python/python_lambda.asp
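A minimal sketch of the sort-by-second-value example mentioned above (data made up):

pairs = [("b", 3), ("a", 1), ("c", 2)]

# key receives each element; the lambda picks out the second value.
pairs.sort(key=lambda item: item[1])
print(pairs)  # [('a', 1), ('c', 2), ('b', 3)]

# sorted() works the same way but returns a new list.
print(sorted(pairs, key=lambda item: item[1], reverse=True))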
Lambda
Python Crash Course
Like most modern programming languages, Python is multi-paradigm and lets you
write procedural code (plain statements), object-oriented code (we have classes and
objects), and functional code. Functional programming principles are all present:
the ability to pass functions into other functions as variables (functions as
variables, functions as first-class citizens), a preference for immutable data
structures and for recursion instead of iteration, and the use of declarative
(regex, SQL) code. The map, filter and reduce functions are an example of the last
principle:
Map - we apply an operation to every element of a collection, ref:
https://www.w3schools.com/python/ref_func_map.asp . Map can accept multiple
iterables, and then the function passed to map must accept the same number of
arguments. The map function is lazily evaluated - meaning we get no result until we
call next() or list() on what it returns. It returns as many members as it
receives: [1...n] → [1..n] . The map function can be used for generating data, but
it is better to do that with comprehensions. A better name for it would be
transform.

Filter - we apply a condition to every member of a sequence and return only the
members that satisfy the condition. From many, filter returns many, fewer, or
nothing: [1...n] → [1..n] | [1..a] where a < n | [ ] . Ref:
https://www.w3schools.com/python/ref_f...

Reduce - a transformation producing one value out of many. We apply the same
operation to the result obtained from the previous iteration of the operation. The
operation produces a single value out of many: [1...n] → 1. Examples: sum, avg,
len, mul, stddev, max/min, median, mode, string concat, etc. Interestingly, the
BDFL and many Python programmers disapprove of frequent use of the reduce()
function, ref:
https://stackoverflow.com/questions/181543/what-is-the-problem-with-reduce

Lambda
Python Crash Course
Combining map with reduce (see the sketch below):
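A minimal sketch (sum of squares as the made-up example):

from functools import reduce

numbers = [1, 2, 3, 4]

squares = map(lambda x: x * x, numbers)             # lazy: nothing computed yet
total = reduce(lambda acc, x: acc + x, squares, 0)  # folds many values into one
print(total)  # 30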

Type hints
Python Crash Course
Python 3.5 (2015) added the type hinting mechanism.
It makes code easier to understand, types appear in IDE function previews, and it
can also be used with a static type checker (a tool), such as mypy.
DEMO:
pip install mypy
mypy type_hints.py

def greeting(name: str) -> str:
    return 'Hello ' + str(name)

print(greeting(1))

result:

Note: type hints have no effect on how the program runs at runtime (at least as
long as the CPython interpreter pays no attention to them), so you will be able to
run the program above without problems.
Hints let you detect problems earlier than at runtime:
"Compile" time error (error before launch)
Runtime error (error after launch)
No error, but a real bug is still present
Types: since Python 3.9, the typing module is no longer needed for collection types.
Ref: https://docs.python.org/3/library/typing.html
As of version 3.12, type hints are not mandatory and can't be made mandatory, ref:
https://stackoverflow.com/a/63838550/1964707
Type hints
Python Crash Course
PyCharm type check severity configuration:
Note: PyCharm option names and visual appearance may differ; the screenshot is only
an approximate illustration.
Generators
These are functions that let you generate objects of an iterable type one after
another.
They use lazy evaluation - until the calling code demands a value, it is not
generated.
For this reason, memory to hold the value is not allocated immediately, but only
when it is needed.
Infinite sequences can be modeled without using much memory, so generators are used
for optimization.
The yield keyword is used.
All the values that would be returned from a generator can be extracted simply by
passing the generator to the list()/tuple()/etc. constructor (... remember that
this cancels the performance advantages of using generators; if the generator is
infinite, we will exhaust all memory). Terminating a generator / generator
termination / consuming a generator.
Q: Can we know in advance how many values a generator will generate? We can, from
the semantics of the code.
Q: Can a generator be reset? Theoretically yes, but it is not advisable. It is
better to simply create a new generator (you can keep a copy of it if you know you
will need one), because it is a cheap object, ref: https://www.quora.com/If-a-python-
generator-is-exhausted-is-it-possible-to-get-to-the-first-value-again
A case that can help understand generators: we have a function that generates
numbers or names/words and we use the values it returns in a loop that can run from
0 to thousands of times. A generator could help performance here - it would
generate only as many values as it was asked for (see the sketch below).
We note that map/filter return lazy iterator objects for the optimizations they
offer (reduce, in contrast, produces a single value). Because map/filter/reduce
touches on various topics - functions, lazy evaluation, generators (although these
functions return iterators), functional programming (functions as variables), the
functional strategy pattern, lambda expressions, and method chaining (which Python
does not allow with map, filter, reduce) (7 topics at least) - with one simple
example we can branch out to all of these topics in a job interview. There are a
few such cases in programming where multiple topics meet in a very simple example.
Chained generators create a pipeline that never exceeds a certain amount of memory
(never requires a huge allocation of memory) - think Raspberry Pi.
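A minimal sketch of a yield-based generator and a chained pipeline as described
above (names made up):

def naturals():
    # Infinite sequence - fine, because values are produced lazily.
    n = 1
    while True:
        yield n
        n += 1

def squares(seq):
    for x in seq:
        yield x * x  # chained generator: another pipeline stage

pipeline = squares(naturals())
print(next(pipeline), next(pipeline), next(pipeline))  # 1 4 9
# list(pipeline) here would never finish - the generator is infinite.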
Python Crash Course
Generator Expressions
Another way to instantiate a generator.
Syntax: (num ** 2 for num in range(10))
Python Crash Course
Basic file operations
Python Crash Course
For learning Python basics and foundational concepts, the console and console
programs are often enough (the most you can build right now: an arithmetic learning
game, a language learning game).
However, useful programs often require a graphical interface, file processing, and
communication with a database or an external system (service).
The simplest way to open a file is the open() function.
It needs to be given an option (also called a flag) that says whether you want to
read (r), write (w), read+write (r+), or append data to the end of the file (a). We
can also create a new file with the 'x' option, though note that a new file is also
created, if it does not exist, when the 'w' option is passed.
Ref: https://www.w3schools.com/python/python_file_handling.asp and
https://mkyong.com/python/python-difference-between-r-w-and-a-in-open/
If we want to work with binary files, the flags are wb and rb.

We can also use the so-called context manager expression, with the with keyword -
this is the recommended way, because it automatically closes the file when the with
block ends or an error occurs (see the sketch below). We will talk about context
managers in a dedicated lecture.
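A minimal sketch of the flags and the recommended with form (the file name is made
up):

# Write (creates the file if missing), then append, then read it back.
with open("notes.txt", "w") as f:  # the file is closed automatically on block exit
    f.write("first line\n")

with open("notes.txt", "a") as f:  # 'a' appends to the end
    f.write("second line\n")

with open("notes.txt", "r") as f:
    for line in f:
        print(line.rstrip())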

Perhaps some students already know that I/O operations are expensive compared to
in-memory operations - when a program reads a file, the interpreter turns to the
operating system with a so-called syscall, and the operating system, using its I/O
drivers, returns the information from disk. We can see this on the next slide.
Examples of I/O operations: writing to files, accessing a network resource, user
input (input()).

Python Crash Course


Basic file operations
Python Crash Course
So far we have talked about unstructured text files. Other file formats: CSV
CSV - comma separated values
CSV files are plain text files (not binary) where values are separated by commas or
semicolons (sometimes even by other delimiters / separators).
CSV files are not Excel files, although they are often confused with them, because
Excel is the program that opens csv files by default on Windows.
At the top of a CSV file there is sometimes a header explaining what information is
stored in each column ... or there isn't.
For processing CSV files, the csv module exists in Python ... and other libraries
can do it as well (Pandas).
While the most common way to read and write csv files is the reader() and writer()
functions...
... we can also use DictReader() and DictWriter(); that way we can use the column
names when processing the file (see the sketch below).
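A minimal sketch of DictReader / DictWriter (file and field names made up):

import csv

rows = [
    {"id": 1, "name": "John", "hourly_rate": 30},
    {"id": 2, "name": "Peter", "hourly_rate": 70},
]

with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "hourly_rate"])
    writer.writeheader()
    writer.writerows(rows)

with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):  # each row is a dict keyed by column name
        print(row["name"], row["hourly_rate"])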
Aside: CSV is a flat / relational format, while e.g. JSON is a hierarchical format
(we will talk about it) - each has advantages for representing different kinds of
data and performing different operations.

To generate data we can use the Python Faker library:


https://zetcode.com/python/faker/
https://faker.readthedocs.io/en/master/
https://www.geeksforgeeks.org/python-faker-library/
The LT locale is supported as well, so Lithuanian names can be generated.
Online generators
We can also generate data with ChatGPT.
How could you compare all of these tools? Which criteria would you use? These
tools should be the first ones to go if ChatGPT is as effective as we would think.

Basic file operations


Python Crash Course
Is the CSV valid - are fields missing?
File validation is important - HTML, CSV, JSON, whatever it might be.
This is one of the situations where the 3rd kind of error (no error, but a real bug
still present) can easily appear; compare:

id,name,hourly_rate,age
1,John,30,50
2,Peter,70

id,name,hourly_rate,age
1,John,30,50
2,Peter,,70

When a field is completely missing and you do not validate the CSV or check for
that, the hourly_rate and age fields can easily be mixed up.
Imagine if someone gave you the code that parses this id,name,hourly_rate,age file
and this bug was hidden in it. Would it be easy to find?
Course plan
You can get familiar with it using this link
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
--- Content from 1Nl1GeEvFW3WKTrtKfmyCgPeiu1kBZoEd.pptx ---
Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
JSON
01
02
REST APIs
Python Crash Course
00
Serialization
03
GraphQL APIs
04
Pickle
05
XML
06
SOAP
Serialization
Python Crash Course
Having worked with files, exceptions and context managers we now turn to
serialization.
Serialization is the process of converting structured data (e.g.: Python objects or
lists) to a format that allows sharing/transmission or storage of the data in a
form that allows recovery of its original structure. Object in memory → string
(json, xml) / binary (pickle)
This means that we have an object in programs memory → we serialize it aka turn it
into some storable / transmisable representation → store it or transmit it → then
we want to recover the data back we need to deserialize it. So deserialization is
the reverse process.
In some cases, the secondary intention of data serialization is to minimize the
data’s size which then reduces disk space or bandwidth requirements (compression -
json vs. xml).
Common serialization forms:
Serializing to Json
Serializing to XML
Serializing to binary file format / proprietary format.
… less popular options: yaml / yml, serializing to a string
Python has several built in modules for serializing objects: marshall, json,
pickle, xml package
Serialization
Python Crash Course
Marshal
The oldest of the three serialization modules. It's primarily used to read and
write compiled bytecode from Python modules. If you've ever seen .pyc files pop up
in your working directory when importing modules (in order to create a __pycache__
directory with .pyc files you need to create a module), that's marshal working
behind the scenes.
The biggest takeaway is not to use marshal. It's mainly used by the interpreter
itself and can have breaking changes that would mess up your code. Also remember
the terms marshaling / unmarshaling.
Json
json is the newest of the three serialization modules. It produces standard JSON
output.
That means that it’s human-readable and it works very well with other languages
that have ways of parsing JSON files.
An issue with json is that it only works with certain data types.
You should use this if you want to serialize data into a human readable format or
when the client calling your code supports json. It is very popular with REST APIs
and in general when interoperability between python and other languages is
necessary.
Pickle
Pickle serializes data in a binary format, which means that it’s not human-
readable.
A benefit to pickle is that it works out of the box with many Python data types,
including custom ones that you define, it’s very fast.
You should use this if you want to serialize python objects for storage and don’t
need them to be readable.
It is quite popular in the world of Python, but you are a bit tied to Python,
although there are libraries that can read pickled Python objects in other
languages: http://formats.kaitai.io/python_pickle/csharp.html
Used in ML/DL to pickle models for distribution
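A minimal sketch of pickling an arbitrary Python object (class and file names made
up):

import pickle

class Model:
    def __init__(self, weights):
        self.weights = weights

m = Model([0.1, 0.2, 0.3])

with open("model.pkl", "wb") as f:  # binary mode is required
    pickle.dump(m, f)

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)  # the full object state comes back
print(restored.weights)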
Other options:
There are plenty of other serialization options, protobuffers, serialization to
string with repr and so on.
Ref: https://docs.python-guide.org/scenarios/serialization/ - for more extended
list.
Serialization
Python Crash Course
Serialization vs. simple writing to a file can be distinguished by:
noting that serialization saves the entire object state (usually - so we think not
about lines saved in the file, but about objects saved, not necessarily as lines),
and
the intention of the former to recreate the object from the serialized state
(pickle to some Python object) - deserialization.
Summary of which type to use when:
JSON
Python Crash Course
Definition, RFC: https://www.rfc-editor.org/rfc/rfc7159.txt
Json uses k:v pairs and supports primitive types (numbers and strings) as well as
complex types like arrays [ ] and objects / dicts { } . They can also be nested.
At the root of a json document there must be a single element - you can't leave two
objects dangling.
Ints/floats can't be keys: { 0: 0, 1: 1 }
Another peculiarity: no comments are allowed.
Comparing json to other formats like XML - less verbose (see next slide).
How is Json used in the real world?
Imagine our client exposes an API (a REST API) where we can get information about
the events their company organizes.
This data is retrieved as Json, but we will want to modify / transform / sort /
filter it using Python, and this is where we will need to deserialize it into some
workable form (json strings or even objects).
We use the json module to work with json in Python. We use two methods (see the
sketch below):
dump() → serialize to file
dumps() → serialize to string
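A minimal sketch of both methods (the event data is made up):

import json

event = {"title": "PyCon LT", "year": 2024, "tags": ["python", "conference"]}

text = json.dumps(event, indent=2)  # serialize to a string
print(text)

with open("event.json", "w") as f:
    json.dump(event, f)  # serialize straight to a file

with open("event.json") as f:
    restored = json.load(f)  # deserialization: back to a dict
print(restored["title"])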
JSON
Python Crash Course
Comparing json to other formats:
Note that JSON and YAML are ~40% "slimmer" than XML, but they still duplicate data
for field names (unlike SQL / relational tables / CSV). Field name duplication
allows for skipping certain fields - an object is self-sufficient even if a field
is missing compared to other objects of the same / similar kind. In SQL the
representation is more compact, but also more rigid (and relations are represented
differently). NoSQL document stores exploit this property of the JSON
representation, allowing for structureless / schemaless storage of data.
JSON
Python Crash Course
Comments are absent in JSON - probably the biggest disadvantage. Also sensitive to
dangling commas. Whitespace insensitive. Probably the best way to represent
“hierarchical data”.
Sometimes config files are written in json, where comments are extremely important.
XML - disadvantage is its verbosity.
YML - whitespace sensitivity (we claimed this to be an advantage when talking about
Python (for avoiding “style debates”), but this prevents YML from being used
effectively for configuration files (unless proper tooling is used)).
JSON
Python Crash Course
JSON is not what is known as a framed format (has a single root): this means it is
not possible to call dump more than once in order to serialize multiple objects
into the same file, nor later call load more than once to deserialize the objects,
as would be possible with pickle (or csv). Same applies to XML.
So updating { departments: [ .. ] } is more complicated - you cannot just append a
last line, because that would produce invalid json. That means it is much more
comfortable to deserialize json into (Python) objects and then append the data
using the Python APIs for those objects instead of treating JSON as a string.
So, technically, JSON serializes just one object per file. However, you can make
that one object be a list, or dict, which in turn can contain as many items as you
wish.
JSON
Python Crash Course
We would probably want to deserialize json into application objects / models right
away to have the ability to use the associated methods.
This can be accomplished several ways:
https://stackoverflow.com/questions/15476983/deserialize-a-json-string-to-an-
object-in-python
A popular one is to use object_hook with the loads() function. Ref:
https://realpython.com/python-json/#encoding-and-decoding-custom-python-objects
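A minimal sketch of the object_hook approach (the Event class is made up):

import json

class Event:
    def __init__(self, title, year):
        self.title = title
        self.year = year

def as_event(d):
    # Called for every JSON object; turn matching dicts into Event instances.
    if "title" in d and "year" in d:
        return Event(d["title"], d["year"])
    return d

e = json.loads('{"title": "PyCon LT", "year": 2024}', object_hook=as_event)
print(type(e).__name__, e.title, e.year)  # Event PyCon LT 2024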
If you ever see errors when deserializing json, you can validate that json using
online validators (or offline ones). These tools can also add quotes and format
json in case it is invalid (try with this: { id: 5 } )

JSON
Python Crash Course
You can obtain json (for testing and development) in various ways:
Gen: https://www.json-generator.com/ and https://extendsclass.com/json-
generator.html
Repos: https://github.com/jdorfman/awesome-json-datasets … and more
JSON
Python Crash Course
Short intro to data modeling w/ Json
How do we represent data in popular data structures like lists, dictionaries and so
on most effectively?
(Note: this is not the same as the field of data structures. TBD)
Let's model hospital admissions
You can represent relationships with Json
Your representation can be normalized or denormalized (single source of truth)
How would you model Code Academy data: Courses, Lecturers, Students.
How would you model flight company data: Flight, Plane, Customer

JSON
Python Crash Course
Hospital admissions with this data: [{ jonas, jonaitis, male }, …]
would make it easy to calculate:
how many people are currently sick
how many males vs. females are sick
but would not support:
how many people are repeated admissions (have we seen this person already)
what is the average between repeated admissions
are there any correlations between repeated admission and inherent properties of
the patient
what is the average duration of successful and negative health outcomes
Hospital admissions with this data: [{ jonas, jonaitis, male, 01-01, 01-15},
{ jonas, jonaitis, male, 03-22, 03-27}, …]
would make it easy to calculate:
all of the above
how many people are currently sick (though not as conveniently as when admitted
patients are deleted from the list when they leave)
average stay time for all patients (globally)
Hospital admissions with this data: [{ jonas, jonaitis, male, admissions: [ [01-01,
01-15], [03-22, 03-27] }, …]
would make it easy to calculate person specific information (average stay time for
jonas),
average stay time for all people (which would inform on whether a particular
patient is an outlier or not)
but would not support: efficient global calculations on admissions (it would be a
bit harder to compute the global average stay time), or how many admissions there
were today (or in some particular date interval).

JSON
Python Crash Course
Combinations - the main idea is to have the same information, but arrange it into
different forms. That is the primary concern of data modeling:
[{ jonas, jonaitis, male, 2022-01-01, 2022-01-15}, { jonas, jonaitis, male, 2022-
03-22, 2022-03-27}, …]
[{ jonas, jonaitis, male, admissions: [ [2022-01-01, 2022-01-15], [2022-03-22,
2022-03-27] }, …]
{ 2022-01-07: [{ jonas, jonaitis, male}, { petras, petraitis, male }, …], 2022-01-
08: [{ anelė, aneliutė, female }, …], …}
{ male: [{}, {}, …], female: [{}, {}, …] }

Thus far we only asked about the supported operations and their difficulty (the
algorithmic side). Which of these representations would be the most memory
efficient? Compare:
[{ jonas, jonaitis, male, 2022-01-01, 2022-01-15}, { jonas, jonaitis, male, 2022-
03-22, 2022-03-27}, …]
[{ jonas, jonaitis, male, admissions: [ [2022-01-01, 2022-01-15], [2022-03-22,
2022-03-27] }, …]

Relational vs. hierarchical: We said that one structure better supports global
operations, and another supports operations on an individual patient. Can we
efficiently support both? And not lose too much memory? Yes, we can store the
information in a relational fashion:
[
{ 1, jonas, jonaitis, male },
{ 2, anelė, aneliutė, female },
{ 3, petras, petraitis, male }
]
[
{ 2022-01-01, 2022-01-15, id: 1 },
{ 2022-03-22, 2022-03-27, id: 2 },
{ 2022-01-01, 2022-01-15, id: 1 }
]
Pydantic
Python Crash Course
Dataclasses with more power (including json serialization/deserialization).
Ref: https://www.youtube.com/watch?v=XIdQ6gO3Anc

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    surname: str

class Employee(Person):
    badge_id: int

json_str = '{ "badge_id": "1", "name": "John", "surname": "Johnatan"}'

employee = Employee.model_validate_json(json_str)
print(f"{employee.model_dump_json()}")

This is useful when you want to create validated data models, for example when building REST APIs (FastAPI uses pydantic for exactly this).


REST APIs
Python Crash Course
Working with data obtained from a Web API.
Although you are learning the skill of doing data science / ML / DL, it is important
to know some things about the world of web development, as it is usual to obtain
data for data science projects from such applications.
We have several popular Web API types in the current web app market: REST, GraphQL,
SOAP … others: gRPC, webRTC, web sockets (?)
We are going to describe some of them in the next slides, there are also good
resources on the web, like: https://cloud-trends.medium.com/grpc-vs-restful-api-vs-
graphql-web-socket-tcp-sockets-and-udp-beyond-client-server-43338eb02e37

REST APIs
Python Crash Course
To work with WEB APIs we need to first understand the client-server paradigm and
the HTTP protocol:
Server: software that listens for / accepts requests & sends back response
(overloaded term, also means physical server and it is often drawn as a physical
server: apache, nginx, h2o, Tomcat/Jetty, Kestrel, Flask / Django, email / ftp /
database servers).
Client: software that sends a request / initiates the interaction and receives and
handles the response (also not a physical box, but software like: browser, chrome,
ff, edge, python libraries (urllib, requests)).
Client-server paradigm is often juxtaposed to P2P communication (Torrent, webRTC).
HTTP versions: 0.9, 1.0, 1.1, … a lot of time … 2, 3
HTTP request / response - body / headers (use any website with curl and --trace-
ascii - )
HTTP req. methods / verbs (GET / POST + DELETE, PUT, PATCH, OPTIONS, etc.)
URL structure: http(s)://app1.de.myapp.com/some/path?a=1&b=2#some-fragment

REST APIs
Python Crash Course
REST API
A type of web api used to get and manipulate data on the server by clients
REST stands for Representational State Transfer; it is an (almost) standard style of API.
It uses features like:
URL tunneling (to represent resources): /api/books ; /api/books/1 ;
/api/books/1/authors ; /v2/posts - hierarchical like json
HTTP verbs (to indicate actions on the resources):
GET (read),
POST (create),
DELETE (delete),
PUT (update full),
PATCH (partial update), see: https://www.baeldung.com/spring-rest-json-patch#json-
patch
Response Codes indicate the result of the operation: (200, 201, 204, etc.) - see
next slides.
JSON as a message format (usually, but not always)
Hypermedia to describe itself and aid discovery (HATEOAS).
An example: https://blog.mindaugas.cf/wp-json/wp/v2/posts ;
https://www.makalius.lt/wp-json/wp/v2/posts
More on that: https://towardsdatascience.com/api-guide-for-data-scientists-
e373f997ed61

REST APIs
Python Crash Course
REST Api table summarizing responses and requests.
Note: you need to know how to send queries to REST APIs - both how to get data and
how to send data to them.

REST APIs
Python Crash Course
REST response codes:

REST APIs
Python Crash Course
What HTTP method and what URL would you use to:
Obtain information about all employees?
Obtain information about the manager of employee with id 5?
Obtain information about the manager with id 5, will it be the same information as
in the previous request?
Create a new piece of information about an author “Jack Back”?

Obtain information about all comments on a specific product:


/api/products/9562/comments → 6612, 949491, 23131
Obtain information about all comments: /api/comments
REST APIs
Python Crash Course
We can use json-server to create our own REST API
This helps to prototype solutions and learn how to use REST
We need NPM for this. NPM - node package manager (JavaScript's package manager, analogous to pip)
To install npm simply go to: https://nodejs.org/en/

npm install -g json-server


Create a file db.json in your favorite directory
json-server --watch db.json
Data on the top right of the slide.
Some simple testing and interactions can be performed with curl.
Curl is an API testing tool that can work in server environments where there is no
graphical user interface. A GUI option would be Postman.
Let’s learn how to do CRUD actions with this API and change the data model to fit
our needs.
You can find how to use json-server here: https://github.com/typicode/json-server
{
"products": [
{ "id": 1, "title": "Shoes", "count": 150, "price": 555.9 },
{ "id": 2, "title": "Dress", "count": 300, "price": 99.99 },
{ "id": 3, "title": "Pants", "count": 99, "price": 66.99 },
{ "id": 4, "title": "Pants", "count": 185, "price": 88.99 }
],
"comments": [
{ "id": 1, "text": "Labas pasauli!", "productId": 6 },
{ "id": 2, "text": "Sudie pasauli!", "productId": 6 }
]
}
REST APIs
Python Crash Course
Exercises (it is recommended to use a Git Bash terminal if you are on Windows):
Write a curl query to get single product information:
curl http://localhost:3000/products/3 -X GET
Write a curl query to display products with their comments:
curl http://localhost:3000/products/3/comments -X GET
Write a curl query to update single product information:
curl http://localhost:3000/products/3 -X PUT -H "Content-Type: application/json" -d '{ "id": 3, "title": "Pants", "count": 119, "price": 79.99 }' -s -D -
Write a curl query to create a new product:
curl http://localhost:3000/products/ -X POST -H "Content-Type: application/json" -d '{ "title": "Banana", "count": 119 }' -D - -s
curl http://localhost:3000/products -X POST -H "Content-Type: application/json" -d "{ \"title\": \"Bananas\", \"count\": 9959 }" (for Windows cmd)

If you find this difficult, just use Postman (or Insomnia, or the PyCharm HTTP Client plugin)

REST APIs
Python Crash Course
It would be nice to learn how to authenticate to REST APIs, as many APIs require
authenticated access. For this we can use hai-server:
npm install -g hai-server
hai-server --watch db.json --auth auth.json
Auth data on the bottom right.
curl http://localhost:3000/auth/login -X POST -d '{"email":"x","password":"x"}' -s -H 'Content-Type: application/json'
curl "http://localhost:3000/auth/login" -X POST --data
"{ \"email\": \"[email protected]\", \"password\":\"mindas\" }" -H "content-type:
application/json"
curl http://localhost:3000/comments -s -H 'Authorization: Bearer XXX'
------------ db.json -----------
{
"products": [
{ "id": 1, "title": "Shoes", "count": 150, "price": 555.9 },
{ "id": 2, "title": "Dress", "count": 300, "price": 99.99 },
{ "id": 3, "title": "Pants", "count": 99, "price": 66.99 },
{ "id": 4, "title": "Pants", "count": 185, "price": 88.99 }
],
"comments": [
{ "id": 1, "text": "Labas pasauli!", "productId": 3 },
{ "id": 2, "text": "Sudie pasauli!", "productId": 4 }
]
}

------------ auth.json -----------


{
"secretKey": "123456789",
"expiresIn": "1h", // can be 10s
"users": [
{
"id": 1,
"name": "mindas",
"email": "[email protected]",
"password": "mindas"
},
{
"id": 2,
"name": "jonas",
"email": "[email protected]",
"password": "qwertyui"
}
],
"authenticationURL": "/auth/login",
"authenticatedURL": [ "/comments" ]
}
REST APIs
Python Crash Course
What we need to do now is learn to call RESTful APIs with Python to obtain data from them.
We can use the requests and json packages to interact with them.
We need to know how to get data from them and how to send data to them.
Bot, botnet, robot.
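A minimal sketch of both directions with requests, assuming the json-server from the earlier slides is running on localhost:3000:

import requests

BASE = "http://localhost:3000"

# GET data from the API
resp = requests.get(f"{BASE}/products/3")
print(resp.status_code, resp.json())

# POST (create) data on the API; the json= argument also sets the Content-Type header
resp = requests.post(f"{BASE}/products", json={"title": "Banana", "count": 119})
print(resp.status_code, resp.json())  # expect 201 Created with the new resource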
REST APIs
Python Crash Course
The concepts of how REST APIs work (like URL tunneling and token-based authentication)
will be helpful in querying them; however, you need to know and expect that not all
APIs will produce clean, easy-to-obtain data. One example is WordPress sites - take
a look at my blog: https://blog.mindaugas.cf/wp-json/wp/v2/posts … you can see
there are many instances of escaped HTML entities and plain HTML tags there (don't
worry if you don't know what they are, we will learn a bit more about them in the
future).
Making this data more human-readable / standard is part of the data cleaning
(scrubbing) process that is often encountered in data science and data engineering
(we will talk extensively about it in the future).

REST APIs
Python Crash Course
HTML is made up of HTML elements, that are made up of tags, which can contain
content and attributes. Blogs that have API sometimes return HTML from their API
and if they do the HTML might be “escaped” HTML. Escaping HTML is needed when we
want to display HTML metacharacters (like “<” and “>” → “<p>”) in our HTML page,
like so: <p>&lt;p&gt;</p>. For this to be displayed as <p> it needs to be escaped.
Cleaning this up is called "unescaping" (a word you can google tools by); it can
be done like so: https://stackoverflow.com/questions/... . See
more on HTML escaping:
How to query REST APIs with Python is described everywhere. We will use the
requests library, but there are many (note: the internet is a dumpster of old URLs,
non-working APIs and examples; these examples might become old as well at some point):
Ref: https://www.geeksforgeeks.org/g...
Ref: https://www.nylas.com/blog/use...
<!DOCTYPE html>
<html>
<head>
<title>My awesome page!</title>
</head>
<body>
<h1>This is my poem</h1>
<p>Roses are red, violets are blue, I&#8217;m not dumb, not sure about
you.</p>
</body>
</html>
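Unescaping such entities can be done with the standard library html module, for example on the poem line above:

import html

escaped = "I&#8217;m not dumb, not sure about you."
print(html.unescape(escaped))  # I’m not dumb, not sure about you.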
REST APIs
Python Crash Course
Here is a list of APIs you can use for your data science or general programming
projects (and of course you can find much more on the internet):
Google: https://www.creativebloq.com/features/google-apis
Twitter: https://developer.twitter.com/en/docs/tweets/post-and-engage/overview
SpaceX: https://api.spacexdata.com/v5/launches
More: https://www.springboard.com/library/data-science/top-apis-for-data-
scientists/
Still more: https://public-apis.io/
And then some: https://github.com/public-apis/public-apis
… enough already: https://rapidapi.com/collection/list-of-free-apis

REST APIs
Python Crash Course
Google Translate API:

REST APIs
Python Crash Course
You will need to add Billing details but you will also get 300$ for free.
If by any chance you get billed - contact support and they will refund you.
After entering debit / cc data you get this dialog:

REST APIs
Python Crash Course
After enabling the access to GCP you will need to set up the translation API,
install the package with PIP, export credentials as environment variables and so
on, see: https://cloud.google.com/translate/docs/setup
Then use this as a quickstart guide:
https://cloud.google.com/translate/docs/basic/quickstart#translate_translate_text-
python
REST APIs
Python Crash Course
After installation
REST APIs
Python Crash Course
Create the project folder and files (if not done before)
set GOOGLE_APPLICATION_CREDENTIALS=translate-key.json (use set on Windows instead of export)
from google.cloud import translate

def translate_text(text="Hello, world!", project_id="translation-demo-delete"):
    client = translate.TranslationServiceClient()
    location = "global"
    parent = f"projects/{project_id}/locations/{location}"

    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",
            "source_language_code": "en-US",
            "target_language_code": "es",
        }
    )

    for translation in response.translations:
        print("Translated text: {}".format(translation.translated_text))

translate_text()
REST APIs
Python Crash Course
Facebook API
provides social graph information
you need a facebook account
with that facebook account you create a facebook developer account

GraphQL APIs
Python Crash Course
GraphQL API
GraphQL is a query language and a type of API
Created by Facebook (internally in 2012, before being publicly released in 2015)
It is not at all compatible with RESTful APIs - they are completely different in
terms of how they accomplish data representation.
GraphQL APIs use the GraphQL query language to define mutations and queries
Commonalities: JSON, stateless (no session auth), client-server paradigm, HTTP
(a limited set of HTTP features is used by GraphQL).

GraphQL APIs
Python Crash Course

GraphQL APIs
Python Crash Course
Communication with the API

GraphQL APIs
Python Crash Course
Common misconceptions addressed

GraphQL APIs
Python Crash Course
Tools for learning GraphQL API:
Playground: https://graphqlzero.almansi.me/api
Fake: https://github.com/marmelab/json-graphql-server
npm install -g json-graphql-server
json-graphql-server db.json (not db.js)

GraphQL queries and mutations:


A query is used for information retrieval; queries are wrapped inside a query {}
field.
One big difference is that GQL is explicitly designed to only return the fields that
the client requests. REST returns all the fields by default.
Who benefits from this default "return only what I ask for"? Clients with minimal
resources or energy constraints (less processing) and those who are charged for
network bandwidth (mobiles).
A mutation is used to change the information on the server, wrapped inside a
mutation {} field.
GraphQL uses a schema that describes the data, we have type, input, directive, enum
data. When querying we usually use types.
Simple queries:
query { allPosts { id title } }
query { post(id: 2) { id title } }
query { post(id: 2) { id title user {id} } } → because user is a complex type, we
need to provide subfields again!
Simple mutations:
mutation { updateAlbum(id: 1, input: { title : "Bruh!"}){ title } }
note that you will have to specify the fields you want to be returned after the
mutation is performed. This is another difference with REST in which it is common
not to return any data for actions that change the data on the server.
GraphQL APIs
Python Crash Course
Data (on the left)
Query: query { Product(id: 6) { id title price Comments { id text } } }
Example screenshot:
{
"products": [
{
"id": 1,
"title": "Shoes",
"count": 150,
"price": 555.9
},
{
"id": 2,
"title": "Dress",
"count": 300,
"price": 99.99
},
{
"id": 3,
"title": "Pants",
"count": 99,
"price": 66.99
},
{
"id": 6,
"title": "Pants",
"count": 185,
"price": 88.99
}
],
"comments": [
{
"id": 1,
"text": "Labas pasauli!",
"product_id": 6
},
{
"id": 2,
"text": "Sudie pasauli!",
"product_id": 6
},
{
"id": 3,
"text": "Labai geras daiktas",
"product_id": 3
}
]
}
GraphQL APIs
Python Crash Course
A more complex query: query { post(id: 2) { id title comments(options: { paginate:
{ page: 1, limit: 2 }} ) { data { id, body } }}}
GraphQL APIs
Python Crash Course
After learning the basics of querying GQL APIs, let's try the fake API tool, which
should help us see a more real-world example of how to use mutations (they will be
persistent now) and also allow us to build more complicated data models that we can
CRUD to continue learning.
The installation and launch are well documented, and we will use the same data as
before with the fake REST API.
mutation {
updateProduct(id: 2, title: "New"){
id title price
}
}
GraphQL APIs
Python Crash Course
Comparing json-graphql-server versions, we see that the product moves towards a more
GQL-like interface.
It now returns the mutated instance after a mutation operation instead of a boolean:
GraphQL APIs
Python Crash Course
We can also send queries with parameters

GraphQL APIs
Python Crash Course
Sending the query with curl (remember to include the content type header):
curl 'http://localhost:3000' -H 'Content-Type: application/json' --data-raw '{"query":"query {Product(id: 3) { id title }}"}' -s (not for win cmd)

GraphQL APIs
Python Crash Course
Authentication is handled very similarly to REST if handled not by the GQL API
mutations but by standard authentication middleware. It can be also handled with
the GQL API, like described here: https://www.apollographql.com/blog/… . In this
case we just need to send a mutation to login, obtain the token and send the token
with all subsequent requests that must be authenticated.

GraphQL APIs
Python Crash Course
There are multiple libraries in the Python world for both creating and querying GQL
APIs: https://graphql.org/code/#python
Attention, you need a client for querying, not server - the link opens in the
section about Python servers for GQL.
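A minimal sketch of querying a GraphQL endpoint with plain requests (assuming the local json-graphql-server from the earlier slides), mirroring the earlier curl command:

import requests

query = "query { Product(id: 3) { id title price } }"
resp = requests.post(
    "http://localhost:3000",
    json={"query": query},  # GraphQL requests are POSTs with a JSON body
)
print(resp.json())  # {"data": {"Product": {...}}}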

Pickle
Python Crash Course
Pickle
What we have seen with the json module – saving and loading Python objects to and
from files using some predetermined file format – is a useful tool.
In particular, it is quite common to share weights of pretrained models in ML and DL
as serialized files that contain objects like lists, tensors and other things. You
might know that the result of training a machine learning or deep learning model
is not classified images, numbers identifying the bounding boxes of object detection
algorithms or sentiment expressed in a comment. It's actually vectors of weights –
multidimensional matrices of numbers. If you have the same neural network topology
(think: structure) you can import those weights to perform the tasks this network
was trained to do.
In fact you can save the entire model to disk using built-in library tools
(model.save() in Keras, for example). They don't necessarily use pickle, but the
idea is the same.
This reuse of pretrained models is called transfer learning and it's very popular
and useful (we will cover it extensively later).
We now turn our attention to another very popular serialization package in Python -
Pickle.

Pickle
Python Crash Course
Pickle
A module for serializing data / code in Python. Uses binary serialization.
Serializing and deserializing via these modules is also known as pickling and
unpickling.
Important note on versions: in Python 2 we had pickle and cPickle, where cPickle was
a more performant version. In Python 3, pickle is optimized and very performant, so
there is usually no need for any custom pickles / serializers.
Loading objects and data by unpickling is usually much faster than loading them
as JSON or getting data from a database if we are talking about a huge number of
objects (10s-100s of times faster).
It also can be easily compressed for faster transfers over the network.
Ref: https://realpython.com/python-pickle-module/
File extension: .pickle and .pkl , but other are sometimes also used:
https://stackoverflow.com/questions/40433474/preferred-or-most-common-file-
extension-for-a-python-pickle
One important note: there is a way to save Python objects and even code as strings
– load them and even execute Python code from a text file. This can be achieved via
a call to the eval() function. The eval() function is very powerful – in fact too
powerful (eval is evil). If an "attacker" found a way to feed some Python code to
an eval function used anywhere in our program, he might do almost anything the
permissions of the user running the Python process would allow him/her to do (for
example deleting your entire file system, sending some requests to another site
(perhaps one that is controlled by the attacker) with some system files, etc.)
Unpickling is different from eval, because you will not obtain "crystallized RAM
contents" via code sharing as a string, but pickling is essentially that - sharing a
snapshot of RAM at a particular time.

Pickle
Python Crash Course
Serialization, like deep copying, implies a recursive walk over a directed graph of
references. pickle preserves the graph’s shape: when the same object is encountered
more than once, the object is serialized only the first time, and other occurrences
of the same object serialize references to that single value. pickle also correctly
serializes graphs with reference cycles. However, this means that if a mutable
object o is serialized more than once to the same Pickler instance p, any changes
to o after the first serialization of o to p are not saved.
In short: You should not try to change the objects while they are in the process of
serialization.
Pickle serializes classes and functions by name, not by value. Pickle can therefore
deserialize a class or function only by importing it from the same module where the
class or function was found when pickle serialized it. In particular, pickle can
normally serialize and deserialize classes and functions only if they are top-level
names for their module (i.e., attributes of their module).
For example, consider the following:

def adder(augend):
    def inner(addend, augend=augend):
        return addend + augend
    return inner

plus5 = adder(5)

Trying to pickle this raises a pickle.PicklingError exception (in v2; just an
AttributeError in v3): a function can be pickled only when it is top-level, and the
function inner, whose closure is bound to the name plus5 in this code, is not top-
level but rather nested inside the function adder. Similar issues apply to pickling
nested functions and nested classes (i.e., classes that are not top-level).
You can use dill module if pickle is not enough (we will discuss it after a few
slides).

Pickle
Python Crash Course
Pickles API:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict",
buffers=None)
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict",
buffers=None)
In v3, protocols range from 0 to 5, inclusive; the default is 4 (from 3.8), which
is usually a reasonable choice, but you may explicitly specify protocol 2 (to
ensure that your saved pickles can be loaded by v2 programs), or protocol 4,
incompatible with earlier versions but with performance advantages for very large
objects (DL models?). It is always recommended to pickle with at least version 2.
0 is for ASCII and should only be used if compatibility with ancient Python
versions is required.
Choosing the version: https://stackoverflow.com/questions/23582489/python-pickle-
protocol-choice
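A minimal round-trip sketch with an explicit protocol choice (the file name is illustrative):

import pickle

data = {"weights": [0.1, 0.2, 0.3], "epochs": 10}

# dump/load work with binary file objects; the protocol can be pinned explicitly
with open("model.pkl", "wb") as f:
    pickle.dump(data, f, protocol=2)  # loadable by Python 2 programs as well

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

# dumps/loads do the same round trip in memory
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
assert pickle.loads(blob) == restored == data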

Pickle
Python Crash Course
Dill
The dill module extends the capabilities of pickle. According to the official
documentation, it lets you serialize less common types like functions with yields
(generators), nested functions (closures), lambdas, and many others.
With dill you can even serialize the entire session of the interpreter and then
load it - like saving your work with all the initialized objects.
Before you use dill instead of pickle, keep in mind that dill is not included in
the standard library of the Python interpreter and is typically slower than pickle.
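A minimal sketch of what dill adds (dill is a third-party package: pip install dill):

import dill

square = lambda x: x * x        # plain pickle cannot serialize a lambda
blob = dill.dumps(square)
restored = dill.loads(blob)
print(restored(4))              # 16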

Compression
Although the pickle data format is a compact binary representation of an object
structure, you can still optimize your pickled string by compressing it with bzip2
or gzip.
When using compression, bear in mind that smaller files come at the cost of a
slower process.
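A sketch of combining gzip with pickle (the file name is illustrative):

import gzip
import pickle

data = list(range(100_000))

# gzip.open returns a file object, so pickle can write to it directly
with gzip.open("data.pkl.gz", "wb") as f:
    pickle.dump(data, f)

with gzip.open("data.pkl.gz", "rb") as f:
    restored = pickle.load(f)

assert restored == data  # smaller file on disk, but slower to (de)serialize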

What does a data engineer / data scientist / ML engineer need to know about
compression and why? A case study from job interviews: "A security operations team
stores log files for analysis that are around 50GB in size, on 50-150 servers
worldwide. You need to create a tool that performs aggregate/global data analysis
on them - describe what steps you would take to design a solution and which criteria
you would base your decisions on." You provide this service to the security team as
a data engineer, so you must think about the context in which people will use your
solution.

Pickle
Python Crash Course
Pickle __getstate__, __setstate__:
Some objects like database connections are not pickle’able even with dill.
You solve the problem by reinitializing the objects using serialized data during
deserialization
For this case you can use __getstate__, __setstate__ methods - magic methods used
by pickle
__getstate__ - use __getstate__() to define what should be included in the pickling
process. This method allows you to specify what you want to pickle. If you don’t
override __getstate__(), then the default instance’s __dict__ will be used.
__setstate__ - do some additional initializations while unpickling.
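A minimal sketch of the pattern with a hypothetical class wrapping an sqlite3 connection:

import pickle
import sqlite3

class LogStore:
    def __init__(self, path):
        self.path = path
        self.conn = sqlite3.connect(path)  # connections are not picklable

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["conn"]                  # exclude the connection from pickling
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.conn = sqlite3.connect(self.path)  # reinitialize while unpickling

store = pickle.loads(pickle.dumps(LogStore(":memory:")))
print(store.conn)  # a fresh connection recreated in __setstate__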

Implications for security:


Do not use pickle module to deserialize objects from untrusted sources and insecure
networks (where they can be modified in transit via MITM attacks).
Why? Because a pickled object can contain malicious code in the __setstate__ method
Read more here: https://realpython.com/python-pickle-module/#security-concerns-
with-the-python-pickle-module

XML
Python Crash Course
XML stands for eXtensible Markup Language.
XML was designed to store AND transport data (there are resources claiming
otherwise) … but originally intended for printable documents.
XML was designed to be both human and machine readable.
It is very similar to HTML - it has elements, tags, attributes.
Elements can be nested, giving HTML the tree-like hierarchical structure - same
applies to XML. Because of this notions like parent, child elements, root and leaf
nodes are used to describe the document.
Unlike HTML we can use any tags we like - we can define them.
Scalar data can be represented in attributes, alleviating the impact of opening
and closing tags.
Applications of XML: RSS feeds that blogs and news sites can expose to announce
their updated content; also used by SOAP services, Microsoft Office documents, and
draw.io to represent drawings. Also used as a file "database" and for config
files of various tools.
In python we have several modules for working with XML data:
Minidom - mimics the JS DOM API
ElementTree
etc: https://docs.python.org/3/library/xml.html#
Let’s turn to code examples and illustrate these
XML has a big ecosystem of tools: https://webreference.com/xml/basics/xml-
technologies/
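As a first code example, a minimal ElementTree sketch over a hypothetical document:

import xml.etree.ElementTree as ET

xml_doc = """
<library>
  <book year="2016"><title>Fluent Python</title></book>
  <book year="2017"><title>Python Tricks</title></book>
</library>
"""

root = ET.fromstring(xml_doc)        # parse from a string (use ET.parse() for files)
for book in root.findall("book"):    # iterate over child elements by tag
    print(book.get("year"), book.find("title").text)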

XML
Python Crash Course
XML has namespaces that often can be confusing
A namespace is just way to separate identifiers in a programing language or a
document so they would not clash.
XML uses them as well.
Let's inspect an RSS feed (RSS is just a data format; it is neither push nor pull,
so it can be either, see: https://stackoverflow.com/a/47703199/1964707 )
XML
Python Crash Course
Querying XML with XPATH:
XPATH is a query syntax for XML and XHTML documents.
It is often used for locating elements in an XML document.
Syntax: https://www.w3schools.com/xml/xpath_syntax.asp
Examples:
//*[@id="post-2151"]/div/div/p
$x('//channel/item/title/text()').forEach((i) => { console.log(i.data) }); //
https://www.huffingtonpost.co.uk/feeds/index.xml
Cheatsheet: https://devhints.io/xpath
This article describes how to do it in Python:
https://web.archive.org/web/20230130213714/https://dzone.com/articles/processing-
xml-python
XPATH is declarative (like SQL and Regex)
We will see more of XPATH when we talk about web scraping.

XML
Python Crash Course
Online testers:
http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm
https://www.freeformatter.com/xpath-tester.html
http://www.xpathtester.com/xpath (also offers xslt and xquery testing capabilities)

<items>
<item>
<id>1</id>
</item>
<item>
<id>2</id>
</item>
</items>
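The same <items> document can be queried from Python; note that ElementTree supports only a limited XPath subset (lxml supports full XPath 1.0):

import xml.etree.ElementTree as ET

items_xml = "<items><item><id>1</id></item><item><id>2</id></item></items>"
root = ET.fromstring(items_xml)

for id_el in root.findall(".//item/id"):  # XPath-like path expression
    print(id_el.text)                     # 1, then 2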
SOAP
Python Crash Course
Web services standard, Simple Object Access Protocol:
In reality it is not so simple and is much more complex than REST or GraphQL
It is an exclusively XML-driven protocol for transmitting data, usually over the
HTTP protocol (but SMTP or JMS can be used as well)
Uses a schema described in a WSDL file - it describes the available services and
their expected parameters:
https://www.dataaccess.com/webservicesserver/numberconversion.wso?WSDL
Read a comparison between SOAP and REST: https://smartbear.com/blog/soap-vs-rest-
whats-the-difference/
SOAP
Python Crash Course
Summary comparison of REST and SOAP (some items can be questioned)

SOAP
Python Crash Course
SOAP request and response structure - we can see how verbose SOAP was

SOAP
Python Crash Course
Learning SOAP with mocked endpoints:
https://www.mockable.io/a/#/space/demo4933883/soap/new?inwizzard=true → does not
fully imitate crud actions easily
https://getsandbox.com/ → does not seem to work with SOAP imports from URL (does
not create the appropriate endpoints after the import is done).
https://www.soapui.org/downloads/soapui/ → SOAPUI (not READY API) - once a paid
tool, now free and capable of generating SOAP endpoints from WSDL that you
can then enhance. SOAP UI has additional capabilities like exporting the mocked
SOAP endpoints to a Java WAR file, that you can then add to Tomcat or Jetty servlet
containers, see: https://stackoverflow.com/a/12750792/1964707
https://doughellmann.com/posts/evaluating-tools-for-developing-with-soap-in-python/
→ creating SOAP API with Python is not that popular. There are some tools, but it’s
not a common usecase so tutorials are old, scarce and not the best quality.

SOAP
Python Crash Course
Creating a mocked service with SOAPUI for learning:
Unfortunately you need to know Groovy to be able to script the responses, and you
don't get the full CRUD services which we got from json-server, see:
https://www.soapui.org/docs/soap-mocking/creating-dynamic-mockservices/
Simple usecase demo, ref:
https://www.dataaccess.com/webservicesserver/numberconversion.wso?WSDL

You can find a bunch of SOAP APIs here:


https://documenter.getpostman.com/view/8854915/Szf26WHn

Using curl to call SOAP endpoints:


Obviously we can also send SOAP requests with curl (or postman, but with postman
you need to add custom Content-Type)
https://stackoverflow.com/questions/12222607/how-to-do-a-soap-wsdl-web-services-
call-from-the-command-line
Example:

curl https://www.dataaccess.com/webservicesserver/numberconversion.wso -X POST -d


'<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:web="http://www.dataaccess.com/webservicesserver/">
<soapenv:Header/>
<soapenv:Body>
<web:NumberToDollars>
<web:dNum>23.50</web:dNum>
</web:NumberToDollars>
</soapenv:Body>
</soapenv:Envelope>' -s -H 'Content-Type: text/xml;charset=UTF-8'

curl http://DESKTOP-S8L0LJ8:8088/mockNumberConversionSoapBinding12 -X POST -s -d


'<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:web="http://www.dataaccess.com/webservicesserver/">
<soap:Header/>
<soap:Body>
<web:NumberToWords>
<web:ubiNum>?</web:ubiNum>
</web:NumberToWords>
</soap:Body>
</soap:Envelope>'
SOAP
Python Crash Course
There are many SOAP client libraries in Python:
https://stackoverflow.com/a/206964/1964707
A good one to choose might be Zeep, but you should do research before deciding, as
with any library.
Ref for Zeep: https://docs.python-zeep.org/en/master/ and some friendly
description: https://stoplight.io/api-types/soap-api/
How do you make a SOAP call:
You pass in the WSDL location when constructing the client
Call any one of the operations defined
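A minimal sketch with Zeep against the public number-conversion service mentioned above (pip install zeep):

from zeep import Client

wsdl = "https://www.dataaccess.com/webservicesserver/numberconversion.wso?WSDL"
client = Client(wsdl)  # the WSDL location is passed when constructing the client

# Call operations defined in the WSDL
print(client.service.NumberToWords(ubiNum=42))
print(client.service.NumberToDollars(dNum=23.50))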

Exercise:
find a SOAP service from the list we have seen previously
call it with python and print the result

Homework
Python Crash Course
Watch this Postman tutorial - https://www.youtube.com/watch?v=VywxIQ2ZXw4 , unit 1
Complete the exercises in the collab notebooks, that you were not able to do in
class
Launch json-server
Launch graphql-server

Course plan
You can get familiar with the course plan here
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
You will find tasks, past slides, etc.

--- Content from 1UiPl2qGIFDTg6pPTSjSaEFng2EjXXvB_.pptx ---


Artificial Intelligence
Neural Networks for Tabular Data
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Fast.ai module structure
01
02
Fast.ai dataLoaders
Neural Networks for Tabular Data
00
Fast.ai tabular
Fast.ai problem to solve
03
04
Fast.ai automatic model construction
05
06
07
Fast.ai categorical embeddings
Fast.ai entity embeddings in depth
Embeddings in other frameworks
08
Practical project 6
Fast.ai is the first DL framework to take tabular data seriously (to my knowledge,
ref. needed) and have dedicated utilities for it.
The fastai.tabular package is one of the main sub-packages in the fastai library:
Collab - this submodule handles the collaborative filtering problems.
Tabular - this sub-package deals with tabular (or structured) data.
Text - this sub-package contains Natural Language Processing tools.
Vision - this sub-package contains the classes that deal with Computer Vision.
Medical - utilities to read DICOM files (MRI, etc.)
Each of these packages contains appropriate data loading and preproc utilities for
very simple and fast initial model creation. Let’s see those.
We are going to see more fastai.vision in the next part of the course (07):
computer vision and image classification.
In this lecture we will talk about fastai.tabular.
Fast.ai tabular
Neural Networks for Tabular Data
//
Fast.ai module structure
Neural Networks for Tabular Data
Transformations available for tabular data
Categorify - transform the categorical variables
FillMissing - fill the missing values in continuous columns.
FillStrategy - instructs on how to fill missing data:
median: nans are replaced by the median value of the column
common: nans are replaced by the most common value of the column
constant: nans are replaced by fill_val
Normalize - normalize the continuous variables.
add_datepart - automatically splits a date object into usable date information (day
of the year, day of the week and so on).
cont_cat_split - automatically splits a pandas dataframe into continuous and
categorical variables.
Ref: https://fastai1.fast.ai/tabular.transform.html
Fast.ai dataLoaders
Neural Networks for Tabular Data
We will solve another well known problem with tabular data.
We are going to try to predict whether an adult human being earns <50K/y or >50K/y.
We will use tabular_learner model for that:
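A minimal sketch following the standard fastai adult-sample tutorial (the column names come from that dataset):

from fastai.tabular.all import *

path = untar_data(URLs.ADULT_SAMPLE)
dls = TabularDataLoaders.from_csv(
    path/'adult.csv', path=path, y_names="salary",
    cat_names=['workclass', 'education', 'marital-status',
               'occupation', 'relationship', 'race'],
    cont_names=['age', 'fnlwgt', 'education-num'],
    procs=[Categorify, FillMissing, Normalize])

learn = tabular_learner(dls, metrics=accuracy)  # the model is built automatically from the data
learn.fit_one_cycle(2)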

Fast.ai problem to solve


Neural Networks for Tabular Data
Fast.ai automatically provides categorical embeddings for all categorical columns!
There will be as many embeddings as there are categorical columns - the 0'th will be
for the first categorical feature and so on.
The parameters Embedding(10, 6) - emb_szs is Embedding(ni, nf, std=0.01).
The ni, nf come from Pytorch:
https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
in_channels (int) – number of channels in the input. Should correspond to
cardinality of the column + 1
out_channels (int) – number of channels produced. This can be arbitrary, often less
than in_channels
Embed. layer count: cat columns + 1 ?
Fast.ai categorical embeddings
Neural Networks for Tabular Data
Here is how you specify the layers and neuron count in each layer!

Fast.ai automatic model construction


Neural Networks for Tabular Data
Embeddings: are fixed length vectors of numbers in euclidian space, that represent
a certain data point in our original data (think: row in a table). They have a
property of being close to each other in that space when the underlying values they
represent are “conceptually” similar.
In other words: you have a bunch of rows in a table, they can be represented as
numeric vectors, those vectors will be close to each other in that space if the
underlying data rows are conceptually similar w.r.t. the problem that the ML/DL
model is trained to solve.
Examples:
word embeddings in NLP (Word2Vec)
collaborative filtering (recommendation systems)
categorical embeddings (for categorical data in tabular datasets)
Paper: https://arxiv.org/pdf/1604.06737.pdf
Article: Link
Fast.ai entity embeddings in depth
Neural Networks for Tabular Data
What is an entity embedding?
How is an entity embedding constructed?
It is learned. Each category is mapped to a distinct vector, and the properties of
the vector are adapted or learned while training a neural network w/ backprop. The
vector space provides a projection of the categories, allowing those categories
that are close or related to cluster together naturally.
Is entity embedding part of our data or our network?
Probably both.
What are the benefits of using an entity embedding?
Lower memory usage due to compression
Data relations preserved or even learned - more precision
Fast.ai entity embeddings in depth
Neural Networks for Tabular Data
How do entity embeddings reduce memory usage and speed up neural networks?
Any guesses?
What kinds of datasets are entity embeddings especially useful for?
Any guesses?
Fast.ai entity embeddings in depth
Neural Networks for Tabular Data
How do entity embeddings reduce memory usage and speed up neural networks?
Fixed length, so smaller than one-hot. Can be made even smaller.
… although the precise amount needs to be measured on a case by case basis.
What kinds of datasets are entity embeddings especially useful for?
For tabular datasets where cardinality is high (a lot of unique values).
Natural language processing where feature relationships are unclear, non-local and
learnable.
Fast.ai entity embeddings in depth
Neural Networks for Tabular Data
Embeddings come from NLP - word embeddings. And commonly replace one-hot encoding
Fast.ai entity embeddings in depth
Neural Networks for Tabular Data
Obviously the technique is considered powerful and very useful, so other frameworks
should also support it to be able to compete.
Keras: https://keras.io/api/layers/core_layers/embedding/
Pytorch:
https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embeddin
g
Embeddings in other frameworks
Neural Networks for Tabular Data
This time the practical project is simple - you need to go through the 6 models we
have trained in this part (imdb, reuters, boston housing, etc.).
You will need to create a separate notebook, choose an example - 1 of those 6 - and
improve it.
How can you improve it, there are a few ways:
Either tune the model to be more precise / accurate (dropout, batch size tuning,
batch norm, embeddings).
Either implement at least 1 TODO task written at the end of the notebook (indicate
which one).
Add error / acc. vs. epoch visualization, decision boundary visualization,
regression line visualization.
Or come up with some kind of another improvement on your own (indicate which one).
Please provide a small written paragraph (5-10 sentences) of what you have improved
in the notebook.
Please provide a link to the notebook (double check the share options of the
notebook) when finished for review and evaluation.
Advice: rewatch the Tabular data analysis with NNs video while doing this to get
some ideas on what to do.
Practical project 6
Neural Networks for Tabular Data
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Neural Networks for Tabular Data


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1NMCyzB0Zfp4wS68OK_uItX46D6MpGPW2.pptx ---


Artificial Intelligence
Reverse Image Search
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Minimal Flask App
01
02
Flask App with RIS
Reverse Image Search
00
Delivering RIS
Further explorations
03
04
Practical project 8
After learning to tune the feature extraction, feature set and the KNN algorithm in
multiple ways let’s talk about delivering a RIS app. We have many options for that:
A traditional MVC web app / SSR - the simplest way; it can look similar to the
Google image search UI.
An API (REST, etc.) that can accept requests from a mobile app frontend, web
interface (SPA/CSR), desktop app, or script.
A script that calls a model accepting image path as a parameter.
Other options.
Would it be possible to create a RIS system w/o a backend? Why?
If vector size is 2048 and we are using 32bit floats in them, how big will be the
vector database for 1million images? ~8GB
Delivering RIS
Reverse Image Search
Schematically
Delivering RIS
Reverse Image Search
Flask is a minimal Python framework (microframework) for creating SSR/MVC web apps
and APIs with Python.
A good introduction to Flask: https://flask.palletsprojects.com/en/2.0.x/
Alternatives exist in Python ecosystem like Django, FastAPI and others.
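A minimal Flask app to start from:

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello, world!"

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://127.0.0.1:5000 by default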
Minimal Flask App
Reverse Image Search
Steps:
Install python, flask, tensorflow/keras, scikit-learn (for KNN) and import
dependencies
Create a simple hello world web app and test it
Add a form to upload an image and test that form
Import all the prerequisites (CNN generated features, the dataset from which to
return the found images - image database). You can get the caltech101 dataset here:
http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Implement an endpoint that extracts CNN features and calls KNN when the user uploads
an image
Return HTML with the neighbors to the user

Troubleshooting
If the app does not return appropriate pictures, inspect extracted CNN features in
a t-SNE plot
… also, add the head, freeze the weights, check how well the network is
classifying
… if it does not predict well - train it more, or even change the architecture if
it’s not sufficient.
Try other KNN algorithms / libraries or PCA n_components parameter.
Flask App with RIS
Reverse Image Search
We can suggest finding examples of images for which the reverse search is not
returning good results, then further tuning the network until it does, and then
recreating the web app with the improved model.
Reimplement the system with your own custom images (CALTECH-256, Imagenet, or some
images downloaded from google, stanford cars dataset:
https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset/data ).
Could we use TF lite for this app? Could we use some smaller KNN library? Any other
means of making the dependencies smaller / fewer?
Further explorations
Reverse Image Search
Take all the code available for this part in the notebooks (all 3 of them) and
create a single notebook that has all the main parts of the code - code in one
place (CiOP):
downloads the data (caltech101 or similar),
initialization of the model,
extracting the weights from the images using the CNN,
using KNN for similarity search and
the flask web app code.
… you can skip the optimizations for KNN / PCA and so on.
You don’t need to improve the model or the webapp, just make it comfortable to
create the web app from the code in the notebook - the only requirement is that the
code should work.
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 8 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the collab notebook (double check the share options of the
notebook) when finished for review and evaluation.
Practical project 8
Reverse Image Search
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Reverse Image Search


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1tRkoO5FJ6kNejDI6KOIyWELjlylBeNLN.pptx ---


Artificial Intelligence
Introduction to Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Learning
01
02
Hyperparameters, tuning
Introduction to Deep Learning
00
Deep Learning Processes
Propagation
03
04
ReLU, Leaky ReLU, PReLU
05
06
07
Deep Neural Networks
Softmax
08
Radial Basis Function - RBF
Swish
09
Activation functions: summary
In this lecture we will answer the questions:
What are the processes that underlie Deep Neural Networks,
What does the network and we as ML engineers (or whatever the designation might be)
do in deep learning?
So things like:
learning / training vs. inference,
forward propagation and backpropagation,
hyperparameter tuning,
After we discuss the processes underlying deep learning we will finish off with a
more complete list of activation and error functions.
Deep Learning Processes
Introduction to Deep Learning
What does the word learning mean in the context of deep learning? Learning in the
context of a NN is the process of finding / searching for a set of weights (and
biases) that would minimize a certain cost/error function.
In terms of lingo: we as people are training the network, and the network is
learning. Sometimes used interchangeably.

Process:
Start with values (often random) for the network parameters (weights and biases).
Take a set of examples of input data and pass them through the network to obtain
their prediction - forward pass (dot + sigmoid).
Compare predictions obtained with the values of expected labels and calculate the
loss with them using the cost / loss function (y_hat - y)
Perform backpropagation in order to propagate this loss to each and every one of
the parameters that make up the model of the neural network.
Use this propagated information to update the parameters of the neural network with
the gradient descent in a way that the total loss is reduced.
Continue for a specific number of iterations until we consider that we have a good
model.

Let’s take a step back - what do we notice about the training / learning?
It's performed in a loop, like gaming (some frameworks hide it, like Keras and
Fast.ai; some don't, like TF and Pytorch).
Can we optimize the starting weights?
What else?

Demo: Let’s implement the learning for our perceptron. Ref, same as before:
https://www.youtube.com/watch?v=kft1AJ9WVDk
Additionally there is an object oriented version as well:
https://www.youtube.com/watch?v=Py4xvZx-A1E
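A minimal numpy sketch of this training loop for a single sigmoid neuron (toy data: the label is simply the first input bit):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [0], [1], [1]])

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 1))                      # 1. random starting weights
for _ in range(10_000):                          # 6. repeat for many iterations
    y_hat = sigmoid(X @ w)                       # 2. forward pass (dot + sigmoid)
    error = y - y_hat                            # 3. compare prediction with labels
    grad = X.T @ (error * y_hat * (1 - y_hat))   # 4. backpropagate through the sigmoid
    w += 0.5 * grad                              # 5. update weights to reduce the loss

print(sigmoid(X @ w).round(2))                   # close to [[0], [0], [1], [1]]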

Learning
Introduction to Deep Learning
A hyperparameter is an external parameter set by the operator of the neural network
as opposed to learnable parameters called “weights”.
Examples: number of iterations of training, number of hidden layers, or activation
function type.
Different values of hyperparameters can have a major impact on the performance of
the network.
Hyperparameters determine how the neural network is structured, how it trains, and
how its different elements function.
Related question: do you think neural networks are classified as parametric model
or not? Ref: https://stats.stackexchange.com/questions/322049/are-deep-learning-
models-parametric-or-non-parametric
The manual or automated adjustment of hyperparameters is called "tuning" of the
neural network.
Some hyperparameters related to neural network structure:
Number of hidden layers
Number of neurons in hidden layers
Activation function
Weights initialization (not weights, but their initializers:
https://keras.io/api/layers/initializers/ )
Whether to use bias or not
Some hyperparameters related to the training algorithm
Learning rate
Epoch, iteration and batch counts/size
Optimizer algorithm
Momentum (dampens the zig-zag pattern when searching for error minimum)
Loss / error function
Dropout amount
Hyperparameters, tuning
Introduction to Deep Learning
Momentum and different optimizers
Hyperparameters, tuning
Introduction to Deep Learning
Tuning is the process of optimizing the network's non-learnable parameters
(essentially structural properties) in order for it to learn more efficiently.
Optimizing hyperparameters is an art (or bruteforce ;^) ): there are several ways,
ranging from manual trial and error to sophisticated algorithmic methods.
Following are common methods used to tune hyperparameters:
Manual hyperparameter tuning - an experienced operator can guess parameter values
that will achieve very high accuracy. Requires trial and error.
Grid search - systematically testing multiple values of each hyperparameter,
retraining the model for each combination.
Random search - a research study showed that using random hyperparameter values is
actually more effective than manual search or grid search.
Bayesian optimization - trains the model with different hyperparameter values over
and over again, and tries to observe the shape of the function generated. It then
extends this function to predict the best possible values. This method provides
higher accuracy than random search.
There are libraries and tools that help with hyperparameter tuning
(hyperparameter optimization frameworks, see: https://towardsdatascience.com/10-
hyperparameter-optimization-frameworks-8bc87bc8b7e3 ). Automated hyperparameter
tuning with Keras: https://medium.com/analytics-vidhya/automated-hyperparameter-
tuning-with-keras-tuner-and-tensorflow-2-0-31ec83f08a62 . Also AutoKERAS
Demo: implement tanh function for our perceptron and change this hyperparameter.
Hyperparameters, tuning
Introduction to Deep Learning
Forward Propagation
The use of the NN w/ its current parameters to compute a prediction for each
example in our training dataset. This involves simple math (summation, activation).
We use the known correct answer that a human provided to determine if the network
made a correct prediction or not. An incorrect prediction, which we refer to as a
prediction error, will be used to teach the network to change the weights of its
connections to avoid making prediction errors in the future.
Backpropagation
Backward propagation of error, or more succinctly, back propagation. In this step,
we use the prediction error that we computed in the last step to properly update
the weights of the connections between each neuron to help the network make better
future predictions. This is where all the complex calculus happens. We use a
technique called gradient descent to help us decide whether to increase or decrease
each individual connection's weights, then we also use something called a training
rate to determine how much to increase or decrease the weights during each training
step. [...] Essentially, we need to increase the strength of the connections that
assisted in predicting correct answers, and decrease the strength of the
connections that led to incorrect predictions. We repeat this process for each
training sample in the training dataset, and then we repeat the whole process many
times until the weights of the network become stable. When we're finished, we have
a network that's tuned to make accurate predictions based on all the training data
that it's seen.
Backpropagation intuition
Backpropagation algorithm is sometimes not explained in detail in courses.
While it is not strictly necessary in order to use DL, an intuitive explanation is
no doubt helpful.
Good video on the topic: https://www.youtube.com/watch?v=s8pDf2Pt9sc
Note: bias term is only applicable to non-input layers:
https://stackoverflow.com/questions/7241537/should-an-input-layer-include-a-bias-
neuron
Note: is bias a trainable parameter? Yes:
https://stackoverflow.com/a/54347129/1964707
Note: XOR problem is not solvable using a simple perceptron, this is related to
first AI winter, read more about it: https://towardsdatascience.com/history-of-the-
first-ai-winter-6f8c2186f80b#:~:text=Fall%20of%20Connectionism
Propagation
Introduction to Deep Learning
So finally what are deep NNs?
Deep NNs (abrev. DNNs) are neural networks with > 1 hidden layer.
Shallow NN aka Perceptron = 1 layer.
Adding more hidden layers allows the network to model progressively more complex
functions. The ability to model more complex functions is what gives deep neural
networks their power.
Abstractness increases with each additional layer. This is what we refer to as
learning hierarchical representations of underlying data. There's a hierarchy of
components starting from the low-level details to the high-level abstractions. A
deep neural network learns to model this compositional hierarchy in order to make
predictions.
A MLP - sometimes a simple fully connected feed forward neural network (FCFFNN or
FCFFANN) is called a multilayer perceptron.
Deep Neural Networks
Introduction to Deep Learning
We are not going to code a deep neural network from scratch, although there are
plenty of resources if you are interested. The main idea would be to pass from one
perceptron to another the output. Here is a good example:
https://developer.ibm.com/articles/neural-networks-from-scratch/
Scikit implements a Multilayer Perceptron for both Regression and Classification
tasks: sklearn.neural_network
Demo: Scikit MLP
We will use scikit MLP to familiarize ourselves with other activation functions and
optimizers. Unfortunately scikit does not support changing loss functions as
hyperparameters, so this will have to wait a bit till we reach serious Deep
Learning frameworks like Keras and Pytorch.
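A minimal scikit-learn MLP sketch, where the structural hyperparameters are explicit keyword arguments (the dataset choice is just for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# hidden_layer_sizes, activation and solver are tunable hyperparameters
clf = MLPClassifier(hidden_layer_sizes=(10, 10), activation="tanh",
                    solver="adam", max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))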
Deep Neural Networks
Introduction to Deep Learning
Leaky ReLU and PReLU are preferred. Only ReLU is supported in scikit-learn.
ReLU, Leaky ReLU, PReLU, GELU
Introduction to Deep Learning
Converts a set of values into a probability distribution, preserving the relative
sizes of the input values.
If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024,
0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where
the '4' was in the original input.
This is what the function is normally used for: to highlight the largest values and
suppress values which are significantly below the maximum value. Why not just use
the values themselves? Prior to applying softmax, some vector components could be
negative, or greater than one; and might not sum to 1; but after applying softmax,
each component will be in the interval (0,1], and the components will add up to 1,
so that they can be interpreted as probabilities.
Softmax is often used as the activation for the last layer of a classification
network because the result can be interpreted as a probability distribution (the
loss usually paired with it is called cross-entropy loss: https://www.quora.com/Is-the-
softmax-loss-the-same-as-the-cross-entropy-loss ).
To gain more intuition, it’s good to play around with calculators:
https://redcrabmath.com/Calculator/Softmax
Softmax
Introduction to Deep Learning
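For intuition without leaving Python, here is a minimal NumPy sketch that reproduces the numbers above (subtracting the max before exponentiation is an extra numerical-stability trick, not part of the definition):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

print(softmax([1, 2, 3, 4, 1, 2, 3]).round(3))
# -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
print(softmax([1, 2, 3, 4, 1, 2, 3]).sum())  # components add up to 1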
RBF is associated with RBF networks, commonly used in signal processing
applications, but it can be used for classification and other things.
These are not part of your course, so if you want to explore something on your own,
this is a perfect target! Implement it as a custom Keras activation function, write
about it on LinkedIn and promote yourself.
More:
https://en.wikipedia.org/wiki/Radial_basis_function_network
https://www.researchgate.net/publication/
280445892_Introduction_of_the_Radial_Basis_Function_RBF_Networks
https://ieeexplore.ieee.org/abstract/document/651633
Radial Basis Function - RBF
Introduction to Deep Learning
Aka: SiLU.
Not an often-used function, but standard implementations exist in most frameworks.
Swish
Introduction to Deep Learning
Why do we need activation functions?
They introduce non-linearity into our NN. We especially need non-linear functions,
like ReLU.
NN’s can be described as approximators of functions and to approximate non-linear
functions, we need non-linear activations (...or some other mechanism to introduce
non-linearity).
If we did not have non-linearities, the only problem we could solve would be linear
regression.
Activation functions are usually not complex, due to the fact that we want our
forward pass to be fast.
NNs learn “slow”, but we want them to work fast (<~1ms) once they have learned - in
the inference phase.
Why do we need so many of them?
They have different properties. Some cause fast convergence / training. Some have
vanishing gradients.
Different situations.
What are the most common activation functions?
Sigmoid, tanh, ReLU (and variants of ReLU).
Which one to use and when?
Leaky ReLU - in the hidden layers (try others as a tuning exercise).
Softmax for classification, nothing (linear) for regression in the last layer.
Activation functions: summary
Introduction to Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 18O3k0nwKN3AztMiA6WQZ3GDXrWSFNQcf.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Assertions
01
Python Crash Course
00
Exceptions
02
Context managers
Exceptions
Python Crash Course
All programs have a regular execution path. But in some rare cases circumstances
that deviate can arise, causing errors.
We call those errors “Exceptions” - objects representing errors - in Python and in
many other modern programming languages.
Prior to exceptions (for example in the C programming language) programmers relied
on flags and function return codes to indicate that an error happened; in modern
languages errors are mostly handled with exceptions. This is because, compared to
exceptions, error codes have disadvantages: they always need to be checked for
explicitly and as such can easily be ignored:

High-abstraction-level languages usually use exception objects; low-level
languages use return codes and flags (C, Rust/Zig (?)).
Examples where exceptions are commonly used / encountered:
Program is instructed to read a file but the file does not exist
Program is instructed to read a file but the directory does not exist
Program is instructed to read a file but the user under which the program is
running does not have sufficient access privileges
Program is instructed to access an external service but it can not reach that
service
User enters invalid data, like 0 to the division operation (error class: user input
errors)
We can see from this list that the I/O boundary / user input is a very common
source of exceptions! Others: IndexError, KeyError
Exceptions
Python Crash Course
We will talk about exception handling and the process of using exceptions.

To learn how to raise exceptions ourselves we need to learn how to handle them
in other people's code. Knowing how to handle exceptions is more important than
knowing how to throw them, because 3rd party code will throw them even in trivial
applications.
Exceptions
Python Crash Course
Exception messages and the propagation path through the function call stack are
contained in the traceback:

KeyError - the exception type


eleven - the exception message
The exception propagated from: STRDIGIT_STRINT[key] → convert_stoi() → module / the
interpreter.

From this we can formulate an important rule: if an exception is not handled, it
will terminate the execution of the program at the location of the exception's
first occurrence.

Exceptions
Python Crash Course
Exception handling is done with try except block/statements (similar to try-catch
statements in other languages).
The try statement surrounds the code that can throw the exception.
The except block contains the code written with the purpose of executing if / when
the exception happens. It is the exception handler.
Both of these statements create a block of execution.
There are no limitations on how many try blocks you can potentially have in a
single function.
You can have many except blocks in one function for a single try block (multi-
except).
The except block can target multiple exception types if we want to handle those
exceptions in a similar way.
Exceptions
Python Crash Course
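To make this concrete, here is a minimal sketch of a try with multiple except blocks, reusing the STRDIGIT_STRINT / convert_stoi names from the traceback example (the exact handler bodies are illustrative assumptions):

STRDIGIT_STRINT = {"one": 1, "two": 2}

def convert_stoi(key):
    try:
        return STRDIGIT_STRINT[key]
    except KeyError:
        print(f"Unknown word: {key!r}")   # handled: the program keeps running
        return None
    except (TypeError, AttributeError):   # one handler targeting several types
        print("Key must be a hashable string")
        return None

convert_stoi("two")                 # happy path
convert_stoi("eleven")              # handled KeyError
convert_stoi(["not", "hashable"])   # a list key is unhashable -> TypeError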
Not all errors / exceptions should be handled - these errors should almost always
be handled at development time (unless you are developing tooling like IDEs,
interpreters, formatters, validators, linters, programs that accept code as input
or similar):

Additionally you can use the pass keyword when the except block does not have much
to do.
You can use multiple returns if you need to return different values between the
successful and the exceptional path.
You can interrogate the exception object in the except block and print the
information from it, using an f-string like f”{e!r}”, which prints the repr of the
exception object.

Exceptions
Python Crash Course
Exceptions can be raised and re-raised.
Re-raise an exception when you want to do something within the function and still
pass the original exception to the caller. This can often be done to indicate to
the caller that it might retry (with a different parameter or after some time).
This splits the handling code into two places, which might not be ideal from a
design perspective.

Raise a standard or custom exception type when:


an exception occurs due to incorrect arguments; when you implement input
validation, the exception type will be completely different (ZeroDivisionError →
ValueError). The same applies to setters in OOP and regular functions.
you want to hide details of the original exception behind something that the users
of your code can better understand (domain specificity / situation specificity).
something unexpected is detected in a custom algorithm (with an if condition,
similarly to the input validation case).
Exceptions
Python Crash Course
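A minimal sketch of both patterns (the flaky_fetch helper is a hypothetical stand-in for a network call, not a real API):

import random

def flaky_fetch(url):
    # illustrative stand-in for a network call that may fail
    if random.random() < 0.5:
        raise ConnectionError(f"could not reach {url}")
    return "response"

def fetch_with_logging(url):
    try:
        return flaky_fetch(url)
    except ConnectionError:
        print("fetch failed, the caller may retry")
        raise                      # re-raise the original exception

def divide(a, b):
    # input validation: translate the low-level error into a clearer type,
    # so the caller sees ValueError instead of ZeroDivisionError
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b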
Exceptions are part of the protocol of the function and need to be documented
appropriately so that the callers would know what to expect.
This fact also implies that there are standard exceptions implemented for sequences
and other types.
Follow the patterns of established Python libraries and the code in your company if
the general rules fail.

Common exception types (for ContainerClasses like we have seen: Flight):


IndexError - index out of range: x = [1, 2]; x[3]
ValueError - the object is of the correct type, but has a value that is not
compatible with the function's operation: int("somestring")
KeyError - when a lookup in any mapping fails (can be used not only with
dictionaries but with custom types and classes)

def avg(lst):
    """
    Calculate the arithmetic average of a list.

    :param lst: the list to calculate the average on
    :raises ValueError: when an empty list is passed
    :return: the arithmetic average
    """
    if len(lst) == 0:
        raise ValueError("Can't pass empty list!")
    return sum(lst) / len(lst)
Exceptions
Python Crash Course
In Python we often avoid explicit type checking in our functions (citation needed).
So we avoid catching TypeError preemptively.
This increases a function's reusability.
We let the calling code handle TypeErrors if they arise, usually relying on
functional testing.
Also in Python we usually follow the EAFP approach - we place the emphasis on the
“happy path” to keep the code clean and leave the exceptional cases to be handled
by exceptions rather than pre-conditions (with the notable exception of expensive
I/O operations).

“Errors should never pass silently. Unless explicitly silenced.” - this explains
why exceptions in Python are ubiquitous. Exceptions are very intrusive and this is
a benefit - the developer is left without a choice but to handle the exception if
he/she wants the app to continue working.

Exceptions
Python Crash Course
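A small sketch contrasting the two styles (the config dict is a hypothetical example):

config = {"host": "localhost"}      # hypothetical configuration dict

# LBYL - look before you leap: explicit pre-condition checks
if "user" in config:
    name = config["user"]
else:
    name = "anonymous"

# EAFP - easier to ask forgiveness than permission: emphasize the
# happy path and let the except block handle the rare failure
try:
    name = config["user"]
except KeyError:
    name = "anonymous"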
Cleaning up is performed with the standard finally block.
The finally block is always run (except if the computer explodes 💣🤯 … or loses
power).

Common use cases
Closing files
Closing connections to external services
Closing database connections
A note on try-else: https://stackoverflow.com/questions/855759/what-is-the-
intended-use-of-the-optional-else-clause-of-the-try-statement-in

Exceptions
Python Crash Course
Python exception hierarchy - in Python, exceptions are arranged into a hierarchy
using inheritance.
This facilitates catching exceptions by their base classes (polymorphically).
For example IndexError and KeyError happen in similar circumstances - and they are
indeed related: both inherit from LookupError.
The hierarchy changes depending on the version.
All non-system-exiting exceptions inherit from Exception.
Ref: https://docs.python.org/3/library/exceptions.html#exception-hierarchy

What if a function throws multiple exceptions that belong to multiple levels of


this hierarchy? We have two options: either we do not care about differentiating
the exception types, at which point we can just handle the most general exception
type we want (Exception); or, if we care and want to differentiate, we catch the
most specific exception first.
Exceptions
Python Crash Course
Defining custom exception types is also possible.
This is done when the built-in exceptions are not sufficient (TODO :: DDD). This
insufficiency usually comes from them not being domain-specific enough (dealing
with a CSV file: CSVFileFormatException; a web app: AuthenticationException; etc.)
to capture what the exception tries to warn about. Another situation: when writing
libraries.
We usually inherit from the Exception class, not BaseException, when creating
custom exceptions.
A simple exception class MyCustomException(Exception): pass is a valid exception,
since it inherits all the necessary attributes from the base exception, and you can
pass a string argument when throwing it.
Ref: https://stackoverflow.com/a/10270732/1964707

Exceptions
Python Crash Course
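A minimal sketch of such a domain-specific exception, using the CSVFileFormatException name mentioned above (the parsing rule is an illustrative assumption):

class CSVFileFormatException(Exception):
    """Raised when a CSV row does not match the expected format."""
    pass

def parse_row(row):
    fields = row.split(",")
    if len(fields) != 3:
        # domain-specific error instead of a generic ValueError
        raise CSVFileFormatException(f"expected 3 columns, got: {row!r}")
    return fields

try:
    parse_row("a,b")
except CSVFileFormatException as e:
    print(f"{e!r}")    # !r prints the repr of the exception object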
It is recommended that exception payloads be of string type.
Most built-in exceptions that we can throw accept a single argument in their
constructor.
PEP 352 is clear:

For more information about the cause of the exception object attributes can be used
(like UnicodeError):

Exceptions
Python Crash Course
Exceptions in Python are chained - a thrown exception is associated with another
one.
You might have seen this if you ever saw the message in the screenshot.

This facilitates better diagnostics and context representation in the exception


messages / tracebacks.
Two exceptions can arise in situations when one exception was being handled and
meanwhile another one arose - this is implicit chaining. Python associates the two
by setting the __context__ attribute of the second exception to the first one. It's
important to remember because the 2nd exception is thrown inside the except block
of the first, but both are shown! Don't be fooled: the first exception might have
been properly handled, so pay attention to the last one.
The other way two exceptions can be linked is explicit chaining. This uses the
__cause__ attribute and the “from e” syntax.

Exceptions
Python Crash Course
Tracebacks are objects that can be interacted with using the traceback module in
Python.
The traceback module has many useful methods: print_tb, format_tb, etc.
Never store tracebacks in a collection for later use, as they are associated with
the function call frames, which in turn keep the stack variables alive - this would
cause memory problems fast. Avoid accessing traceback objects beyond the scope of
the current exception; save them to a database or file (which is actually what
logging does) right away, so that the object can be destroyed.

Exceptions
Python Crash Course
Exceptions and with block (context manager) patterns:
https://stackoverflow.com/questions/713794/catching-an-exception-while-using-a-
python-with-statement

Exceptions
Python Crash Course
The Python interpreter is not cross-platform - it is platform-specific (Linux has
one build, Windows has another).
When working with platform-specific / cross-platform code it is not uncommon to
handle exceptions that happen during import.

Exceptions
Python Crash Course
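A common sketch of this pattern (fcntl is a Unix-only standard library module; the no-op fallback is an illustrative assumption):

# fcntl exists only on Unix-like platforms; catching the ImportError
# at import time is a common pattern for cross-platform code
try:
    import fcntl

    def lock_file(f):
        fcntl.flock(f, fcntl.LOCK_EX)
except ImportError:
    def lock_file(f):
        pass  # no-op fallback, e.g. on Windows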
Lastly, a question for the audience. We have a program where a mistake is present
in the code - there is a defect in it. However it's not handleable: no one knows
how to reproduce it or fix it. Which of the following scenarios is the best and
which is the worst:
The error causes an error at startup - the program does not even start.
The error causes an error at runtime - the program starts and executes for a
potentially long time before terminating.
The error does not cause the program to terminate at all - the program keeps
executing even though the bug / mistake is there.
… and why?
… heisenbug.
… mathematically provable languages.
https://en.wikipedia.org/wiki/Formal_verification
https://stackoverflow.com/questions/4065001/are-there-any-provable-real-world-
languages-scala
https://en.wikipedia.org/wiki/Coq
Exceptions
Python Crash Course
Summary:
Raising an exception interrupts normal program flow and transfers execution to the
handling code, or propagates through the call stack until either some code handles
the exception or the program terminates.
try (for exception-prone code; you will need to exercise some judgement to choose
what to include in the try block), except (for handling the exceptions; don't put
complex code in the except block, to avoid exceptions there), finally (for code
that must execute regardless - cleanup; finally is even executed if the except
block has a sys.exit() call)
Python uses exceptions pervasively, even for logic decisions/conditionals - as a
control structure.
Avoid catching programmer errors (IndentationError, SyntaxError, etc.)
raise w/o argument reraises current exception
raise can throw a new exception type that is more readable and understandable
depending on the context (domain specific)
Prefer built-in exception types if possible (TypeError)
!r - repr in f-string
return codes are ignorable, so prefer exceptions (over return codes and error flags)
use EAFP in general (as we did for platform specific code)
always catch specific exceptions and never leave the except block empty unless you
explicitly know what you are doing.
Assertions
Python Crash Course
A short note on assertions:
Associated more with LBYL.
They are used to validate that the code written inside the function / method is
correct - the invariants of the function.
They are NOT (questionably) used to validate that the arguments passed by the
caller are valid for the logic the function encapsulates.
AssertionError is just another exception and we can pass a message to it:
assert <boolean condition>, <message on failure>
Often used when writing tests. Sometimes used to check initial conditions when a
program starts (e.g. not enough RAM).
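A minimal sketch of an assertion guarding a function's own invariant rather than validating caller input (reusing the earlier avg example):

def avg(lst):
    result = sum(lst) / len(lst)
    # an invariant of the function's own logic, not caller input validation:
    # the average must lie between the minimum and maximum of the list
    assert min(lst) <= result <= max(lst), "average fell outside value range"
    return result

print(avg([1, 2, 3]))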
Context manager
Python Crash Course
Context manager - an object designed to be used in a with statement, ensuring that
resources are properly managed and automatically released.
An extended discussion is provided here:
https://realpython.com/python-with-statement/
Two methods: one is executed at the beginning of the with block and the other at
the end, even if there is an exception. Conceptually you can think of these methods
as setup and teardown / enter and exit.
We have used the file context manager with this mechanism:
Context manager
Python Crash Course
The context manager protocol consists of only two methods: __enter__() and __exit__()
Order of operations. Note: the return value of the __enter__ method is bound to the
name “x”, not the value of the with expression. The with block can be exited in two
ways, exceptionally or via normal termination - in both cases __exit__ is executed.
Context manager
Python Crash Course
The enter method:

the __enter__ is relatively simpler than the __exit__ method


Context manager
Python Crash Course
The exit method:

Because the behavior of the exit method depends on whether an exception was or was
not thrown, it is common to check the exception type inside the exit method.
By default __exit__ propagates exceptions from the with block to the enclosing
context. To control this we use the return value of the exit method: if it is
falsy, the exception will be propagated further (all functions return None by
default - that is why propagation is the default). It is recommended not to
re-raise exceptions from __exit__; simply return a falsy value. Only raise
exceptions when something bad happens inside the exit method itself (?)
Context manager
Python Crash Course
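A minimal sketch of the protocol as a class (a hypothetical Timer; the falsy return in __exit__ lets exceptions propagate):

import time

class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self                    # this value is bound by "as"

    def __exit__(self, exc_type, exc_value, traceback):
        # runs on both normal and exceptional termination of the with block
        print(f"took {time.perf_counter() - self.start:.3f}s")
        return False                   # falsy -> exceptions propagate further

with Timer():
    sum(range(1_000_000))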
Context manager decorator and contextlib
For some applications creating context managers as classes is too much - this is
where this decorator is useful.
The @contextmanager decorator is used for creating new context managers (as
functions).
It is used on a generator function that uses the yield statement. It does not use
two separate methods, just the two halves of the underlying generator function
(enter: the code before the yield; exit: the code after the yield, usually in a
finally block).

Note: context managers created with the decorator do not propagate exceptions


automatically - use standard exception handling techniques.
Re-raising the exception is what is commonly done.
Context manager
Python Crash Course
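A minimal sketch of the decorator form (the opened function and the data.txt path are illustrative assumptions):

from contextlib import contextmanager

@contextmanager
def opened(path):
    f = open(path)        # "enter": the code before the yield
    try:
        yield f           # this value is bound by "as"
    finally:
        f.close()         # "exit": runs even if the with block raised

with opened("data.txt") as f:
    print(f.readline())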
Multiple context managers in a single with statement - this can be easily
accomplished.
Note that using multiple context managers is equivalent to nesting the with
statements.
Because of this equivalence we can infer how exception handling will work when
using multiple context managers:
Context manager
Python Crash Course
A realistic example of context manager usage would be resource handling: file
handling, database connections, network connections, locks (acquisition and
release) in a multithreaded program and so on.
Let's imagine we are writing a library to manage database connections and
transactions. We want our transaction management to be usable as a context manager,
so that the users of our library would not forget to commit the transaction when
everything is OK and roll back when there is an exception inside the with block.
Transactions are a group of database statements that happen all together or not at
all (atomicity). DBMS systems will not commit an open transaction unless you, the
application developer, tell them to, so managing transactions is a good candidate
to implement in a database connectivity library in Python.

Another example:
https://gist.github.com/MindaugasBernatavicius/93f20ad9c2d44dab1d7ccc3a62d510bd
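A minimal sketch of such a transaction manager, using sqlite3 purely for illustration (the class name and the table are assumptions, not taken from the gist above):

import sqlite3

class Transaction:
    def __init__(self, connection):
        self.conn = connection

    def __enter__(self):
        return self.conn.cursor()

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self.conn.commit()       # everything OK -> commit
        else:
            self.conn.rollback()     # exception in the block -> roll back
        return False                 # propagate the exception, if any

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
with Transaction(conn) as cur:
    cur.execute("INSERT INTO t VALUES (1)")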
Course plan
You can get familiar with it using this link
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on

--- Content from 10psfPIlfR78idjSslofOykXnkJIdUChQ.pptx ---


Artificial Intelligence
Introduction to Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Perceptron for regression
01
02
Bias, Learning Rate and Epoch
Introduction to Deep Learning
00
DL frameworks
Normalization and denormalization
03
04
Tensorflow execution modes
06
07
08
Tensorflow intro
Tensorflow datatypes
09
Tensorflow debugging
Tensorflow estimators
10
Tensorflow checkpoints
Keras intro and installation
12
11
12
Keras sequential model
13
Keras functional model
Keras model visualization
14
Keras callbacks & checkpoints
05
Tensorflow installation
11
Tensorflow summary
15
Keras summary
In this part we will get an introduction to the most popular deep learning
frameworks and tools used in the industry.
Let's start with a no-framework intro to linear regression using a neural network.
Then let's implement it in Tensorflow and Keras.
Then we will talk about Pytorch and Fast.ai.
After that we will talk about some honorable mentions.
DL frameworks
Introduction to Deep Learning
Although scikit offers some DL capabilities, it is not suited for deep learning on
an industrial or even professional level (no GPU support, lack of layer-specific
activation functions, etc.).
The most popular and advanced Deep Learning frameworks now are Tensorflow, Keras,
Pytorch and Fast.ai. We will talk about them in depth in the future parts of the
course.
Comparing Tensorflow, Keras, Pytorch and Fast.ai. in terms of popularity.
There are a few honorable mentions like Theano, MS CNTK and Pytorch Lightning.
DL frameworks
Introduction to Deep Learning
2019
We saw how we can predict 1 or 0 from some features - binary classification.
A neural network can solve the linear regression problem as well.
In fact the minimal neural network that is sufficient for this task is a
perceptron!
We don't even need an activation function. The summation and a bias term are
enough!
Important: it would not be possible to solve the LR problem with activations like
BSF or sigmoid, since these functions have a very constrained output (unless, of
course, the target values are normalised).
Demo: LR with simplest NN
Perceptron for regression
Introduction to Deep Learning
We mentioned the bias term in the first lecture. However we did not use it (at
least explicitly) in our first simple NN that we used to classify people as
potential “badies” / “non-badies”.
For the linear regression problem the bias term is necessary, and in general every
network we use from now on will have a bias term. Why is it necessary for
linear regression? Y = mx + b, where b is the intercept.
If after the summation (Σmx + b) we had an activation function (like a
sigmoid), then the bias term would shift the activation left-right.
Concrete examples and explanations?
Real problem where a simple NN can't classify without a bias term:
https://stackoverflow.com/a/38253140/1964707
Another real world problem: https://stackoverflow.com/a/1714030/1964707
Explanation: https://stackoverflow.com/a/2499936/1964707
Bias is a trainable parameter, but also often a hyperparameter, in the sense that
you can choose not to include it.
Additionally, when drawing the plot of neural network error as a function of the
weights, the bias is also an independent variable - a dimension in that plot.
Bias, Learning Rate and Epoch
Introduction to Deep Learning
The learning rate is a hyperparameter.
It controls how much to change the model in response to the estimated error each
time the model weights are updated.
Choosing the learning rate is challenging, as a value too small may result in a
long training process that could get stuck, whereas a value too large may result in
learning a sub-optimal set of weights too fast or an unstable training process.
The learning rate usually has a small positive value, often in the range between
0.0 and 1.0.
In the future we will see advanced strategies for learning rate tuning (like
Leslie Smith's one-cycle policy).
Bias, Learning Rate and Epoch
Introduction to Deep Learning
What is an epoch? In neural networks generally, an epoch is a single pass through
the full training set.
Passing the entire dataset through a neural network once is not enough - we need to
pass the full dataset through the same neural network multiple times. Keep in mind
that we are using a limited dataset and, to optimise the learning, Gradient
Descent, which is an iterative process. So updating the weights with a single pass,
or one epoch, is not enough.
As the number of epochs increases, the weights are updated more times, and the
curve goes from underfitting to optimal to an overfitting curve.
A perceptron is a high-bias, low-variance ML model, but an MLP is a high-variance
model and can easily overfit - in general it is reasonable to experiment with a low
learning rate and a big epoch count, see: https://nnfs.io/pog/
Bias, Learning Rate and Epoch
Introduction to Deep Learning
What is the relationship between learning rate and epochs?
The learning rate controls how quickly the model is adapted to the problem. Smaller
learning rates require more training epochs, given the smaller changes made to the
weights on each update, whereas larger learning rates result in rapid changes and
require fewer training epochs.
Why is that? Because if we have a model that learns slowly (a small LR) we need to
guarantee that it will reach the global minimum - which we can achieve by
increasing the epoch count.
However, if the model is not small (has many parameters - high variance) it can
easily overfit when many epochs are used. So start with smaller networks, more
epochs and small learning rates.
Bias, Learning Rate and Epoch
Introduction to Deep Learning
Neural networks are models that are sensitive to differences in feature scale
(unlike decision trees).
For that reason we need to normalize the inputs to put them on the same scale when
using neural networks.
This is done very frequently, almost by instinct.
Reminder (a sketch follows below):
In the training phase we normalize X_train.
When validating the performance we normalize X_test.
In the inference phase we normalize the features of the new data before feeding
them to the network.
Deep learning frameworks have utilities to perform feature scaling /
standardization and normalization, for example:
https://keras.io/api/layers/normalization_layers/ … and many people just use
the same scikit scalers we have seen before.
Normalization and denormalization
Introduction to Deep Learning
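A minimal sketch of this train/test/inference discipline with a scikit scaler (the arrays are made-up toy values):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[1.5, 300.0]])

scaler = MinMaxScaler()
X_train_n = scaler.fit_transform(X_train)  # fit on the training set only
X_test_n = scaler.transform(X_test)        # same scale for validation
# ... at inference time, transform new samples with the same fitted scaler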
What is Tensorflow?
TensorFlow is an open-source software library created by Google for distributed
numerical computations, used to implement machine learning and deep learning
systems using the concept of a dataflow graph.
It is an interface for expressing machine learning algorithms and for
implementing such algorithms using more primitive datatypes.
Structure of Tensorflow?
Efficiently works with mathematical expressions involving multi-dimensional arrays
(optimized).
Good support of deep neural networks and machine learning concepts.
GPU/CPU computing where the same code can be executed on both architectures.
High scalability of computations across machines and huge data sets - no need to
change the models developed on the developers machine.
Note on deprecations
TensorFlow is one of those Google projects that seemed like an experiment in its
initial incarnation, V1. Then it was rewritten, a lot was deprecated, and so on.
There are a lot of pain points the users are feeling:
https://github.com/tensorflow/tensorflow/issues/26844
For example, even the tf.estimator API was deprecated in favor of the Keras APIs.
Thus, we will not discuss it in much depth, but some discussion is required.
TF.js and TF Lite are still usable directly.
Tensorflow intro
Introduction to Deep Learning
TF is not just one library, it's an ecosystem.
TensorFlow runs everywhere:
TF: datacenters, servers, developer machines
TF JS, see: codelabs/tfjs-training-regression/index.html#8
TF for IoT: TF Lite
Languages: which are supported and where are they used?
There are many use cases you might want to use it for, and levels of users:
Tensorflow intro
Introduction to Deep Learning
Why is it called TensorFlow?
The computation graph is at the core, through which tensors are flowing.
Data (all data are tensors) and operations.
Since data are tensors, we can represent pictures of dogs and cats, for example.
The output is another tensor.
Tensorflow intro
Introduction to Deep Learning
What is a DAG?
A directed acyclic graph.
TF uses it for the computations.
It is portable: the graph engine is written in C++, so a model developed on one
platform can usually be run on a different platform for which a TensorFlow
execution environment is available (a similar idea to what the JVM did for regular
programming).
Tensorflow intro
Introduction to Deep Learning

Tensorflow intro
Introduction to Deep Learning
How is TensorFlow used?
Most of the time we will just work with the Keras high-level API and the Data API.
But sometimes you need extra control: to write custom loss functions, custom
metrics, layers, models, initializers, regularizers, weight constraints, and more.
You may even need to fully control the training loop itself, for example to apply
special transformations or constraints to the gradients (beyond just clipping them)
or to use multiple optimizers for different parts of the network. We will cover all
these cases in this chapter, and we will also look at how you can boost your custom
models and training algorithms using TensorFlow's automatic graph generation
feature.
Tensorflow intro
Introduction to Deep Learning
Installation considerations:
Consider: OS (on Windows, only Python 3)?
Docker, direct, or virtual environment?
CPU / GPU? CUDA is required, so you need an NVIDIA card:
https://www.tensorflow.org/install/gpu
OpenCL and AMD GPUs: we don't know the current status, but people have run TF on
AMD GPUs.
Google Colab
Recommendation:
If you have Windows and an NVIDIA GPU - go native.
If you have Linux with an NVIDIA GPU - maybe go native.
Anything else: use CPU, native.
If you are always online - use Google Colab (we'll see how to do it later).
AWS / GCP / Azure
Tensorflow Installation
Introduction to Deep Learning
Installation w/ Anaconda:
conda install -c conda-forge tensorflow
there can be problems when installing tensorflow-gpu (YMMV)
Installation w/ Pip:
GPU: https://www.codingforentrepreneurs.com/blog/install-tensorflow-gpu-windows-
cuda-cudnn https://stackoverflow.com/questions/51306862/how-to-use-tensorflow-gpu
pip3 install tensorflow-gpu==2.2.0 (or whichever version is newest)
Much simpler to install and use the GPU (than with Anaconda)
Obviously, if you don't use Anaconda you will need to install Jupyter with pip as
well
Tensorflow Installation
Introduction to Deep Learning
Verify the installation:
import tensorflow as tf

with tf.compat.v1.Session() as sess:
    # verify that the math works
    a = tf.constant(50)
    b = tf.constant(51)
    print("a + b = {0}".format(sess.run(a + b)))
Tensorflow Installation
Introduction to Deep Learning

Verify the ability to execute on the GPU:


# import and suppress warnings (needed for Anaconda sometimes)
# (if you don't want to suppress them, you need to match the numpy version)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

from IPython.display import Markdown, display
import tensorflow as tf
# tf.__version__

from tensorflow.python.client import device_lib

display(Markdown(str(device_lib.list_local_devices())))
Additional log messages about hardware:
If you are running on a GPU you will see this:

If you see that your CPU supports some instructions that TF was not compiled with,
ignore them for now; but if you want, you can try to compile TF on your machine
with the full instruction set. This will improve the performance of your TF
programs:

Tensorflow Installation
Introduction to Deep Learning

Tensorflow Execution modes


TensorFlow has two execution modes: lazy and eager.
TF 1.X had lazy execution by default. A Session object was used.
TF 2.X has eager execution by default. Session is legacy. “Functions instead of
sessions”.
This change is interesting to see, as the previous approach (lazy, session-based)
was said to be much better in TF 1.X. Allegedly, the developers improved the eager
execution mode so much that it could become the default. However, many suspect that
this is just a cover for the real reason: the massive adoption of Pytorch in
research around 2018, due to it being easier to use, as it just runs right away.
We will see and understand lazy execution, as you can encounter it in real-world
projects - TF 1.X code is still widespread albeit deprecated.

Introduction to Deep Learning


Lazy evaluation
First, in lazy mode, you create a DAG.
Then you execute it in the session context by calling session.run().
What benefits does lazy evaluation provide?
Optimization (fusing operations) before running.
Transformation (adding debug nodes).
Partitioning across devices / machines before running.
Not the recommended way to use TF since 2.0.
How do you get the performance benefits of TF1's lazy mode in TF2? The answer is
the @tf.function decorator, which compiles a function into a callable TensorFlow
graph (a sketch follows below).
Demo: Graph Execution

Tensorflow Execution modes


Introduction to Deep Learning
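A minimal sketch of the decorator (the function body and values are illustrative assumptions):

import tensorflow as tf

@tf.function            # compiles the function into a callable TF graph
def weighted_sum(a, b):
    return tf.reduce_sum(a * b + a)

x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([4.0, 5.0, 6.0])
print(weighted_sum(x, y))   # runs as an optimized graph, with eager-like syntax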
Eager evaluation
Default mode in TF2
TensorFlow's eager execution is an imperative programming environment that
evaluates operations immediately, without building graphs: operations return
concrete values instead of constructing a computational graph to run later. This
makes it easy to get started with TensorFlow and debug models, and it reduces
boilerplate as well.
Demo: eager execution

Tensorflow Execution modes


Introduction to Deep Learning

Tensors
We saw how to evaluate scalars - now let’s evaluate tensors
Evaluating multiple tensors.
Demo: Tensors

Tensorflow datatypes
Introduction to Deep Learning

Tensors
Essentially we are using TF as numpy, since a tensor is similar to an ndarray.
They have a shape, a datatype and indexes.
What are other operations on tensors?
Stacking
Slicing
Reshaping
More: https://www.tensorflow.org/api_docs/python/

PS: the same operations are available as in NumPy, but they sometimes behave
differently. For example transpose: with NumPy's T attribute, t.T is just a
transposed view on the same data, while in TensorFlow a new tensor is created with
its own copy of the transposed data.

Tensorflow datatypes
Introduction to Deep Learning

Variables
Variables are a bit more involved and currently not necessary for us.
They are used when we want to do something low-level in TF (custom ML).
A variable is a tensor that gets initialized and whose value gets changed as the
program runs.
Ref: https://www.tensorflow.org/api_docs/python/tf/Variable

Tensorflow datatypes
Introduction to Deep Learning
Placeholders
Used to feed values into the computation graph
Not used much in TF2 eager mode, only if you need lazy mode
Tensorflow datatypes
Introduction to Deep Learning

Other data structures
Sparse tensors (tf.SparseTensor) - efficiently represent tensors containing mostly
zeros. The tf.sparse package contains operations for sparse tensors.
Tensor arrays (tf.TensorArray) - lists of tensors. They have a fixed size by
default but can optionally be made dynamic. All tensors they contain must have the
same shape and data type.
Ragged tensors (tf.RaggedTensor) - represent static lists of lists of tensors,
where the tensors can have different sizes. The tf.ragged package contains
operations for ragged tensors.
Tensorflow datatypes
Introduction to Deep Learning

Comparing tf to numpy: hardcoded values

Computing areas of triangles using Heron’s formula


Tensorflow datatypes
Introduction to Deep Learning
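Since the screenshot is not reproduced here, the following is a minimal sketch of what such a Heron's-formula computation looks like in TF (the side lengths are made-up values):

import tensorflow as tf

def triangle_areas(sides):
    a, b, c = sides[:, 0], sides[:, 1], sides[:, 2]
    s = (a + b + c) / 2                       # semi-perimeter
    # Heron's formula: area = sqrt(s(s-a)(s-b)(s-c))
    return tf.sqrt(s * (s - a) * (s - b) * (s - c))

# each row holds the three side lengths of one triangle
sides = tf.constant([[5.0, 3.0, 7.1], [2.3, 4.1, 4.8]])
print(triangle_areas(sides))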
In-memory datasets
There is good compatibility between pandas (DataFrame), numpy (ndarray) and TF.
If you have your data in a df or ndarray, TF has convenience methods that accept
those.
TensorFlow also offers an efficient data API (a big subject itself):
https://www.tensorflow.org/guide/data_performance

Tensorflow datatypes
Introduction to Deep Learning

Debugging can be tricky in the lazy evaluation paradigm. You will not see the
result / intermediate results until you execute the program.
That is why eager execution is useful, and it is one of the reasons it became the
default in v2.
However eager is not a panacea, and you still need to know how to debug lazy TF if
you work with it.
Here are the guidelines from Google:

Tensorflow debugging
Introduction to Deep Learning
Common problems
Shape problems. Solution: reshape.
Datatype mismatch. Solution: casting.

Tensorflow debugging
Introduction to Deep Learning

Type conversions

Tensorflow debugging
Introduction to Deep Learning

Additional troubleshooting: appropriate logging, the tfdbg utility, TensorBoard

Tensorflow debugging
Introduction to Deep Learning

Estimator API (deprecated, so don't use it for future projects)


Estimators: a higher-level API that provides production-ready models:

TF estimator workflow: https://www.tensorflow.org/guide/estimator


You could write the training loop yourself, but estimators provide ready code for:

Tensorflow estimators
Introduction to Deep Learning
Estimator API (deprecated, so don't use it for future projects)
Most common TF estimators:
LinearClassifier - logistic regression
BoostedTreesClassifier
LinearRegressor
DNNClassifier - classification models based on dense, feed-forward neural network.

Source: TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level


Machine Learning Frameworks - https://arxiv.org/pdf/1708.02637.pdf

Tensorflow estimators
Introduction to Deep Learning

Predicting house price


Let's first see how TF handles house price prediction.
We'll first use linear regression so that only TF would be new to you.
Process:
Features: square feet.
Model: LinearRegressor estimator.
Demo: Linear Regression with Tensorflow

Tensorflow estimators
Introduction to Deep Learning

Checkpoints are saved training models: checkpoints capture the exact value of all
parameters (tf.Variable objects) used by a model.
To restart training completely you need to delete the checkpoint folder or the
checkpoint files.
Estimators use checkpoints from the start if they are present.
Checkpoints will be generated automatically in a tmp folder:

Tensorflow checkpoints
Introduction to Deep Learning
A computation engine that is based on the concept of computation graphs.
We can use TensorFlow in eager or lazy mode.
V1 and V2 are different:
V1 has a Session object that is rarely used now and is considered legacy in V2;
V1 was lazy by default; in V2 we need to use the @tf.function API to have a lazy
graph built;
V2 had the slogan “functions, not sessions”;
Most of the time, at least at the beginning, we will not use raw TensorFlow - we
will use Keras, to which we now turn.

Tensorflow summary
Introduction to Deep Learning
You have a friend who is trying to learn DL with TensorFlow. He wrote the following
code to solve a simple linear regression problem: task . It does not work - for the
data [1000, 2000, 7000], instead of predicting [1250, 2250, 7250] the predictions
are off by more than 200 or a similar amount.
TASK: help your friend:
Generate a list of suggestions of what can be tried (enumerate those suggestions in
the notebook).
Use these suggestions and try them yourself.
GOAL: train the network to predict the values almost exactly (1) in ~500 training
steps (2).
HINT: when generating the list of suggestions, think about two categories of
suggestions: hyperparameter tuning (learning rate, optimizers, others) and data
adjustments.
SPOILER (don't look if you want to tackle the problem on your own):
https://archive.ph/5tSH4

Tensorflow summary
Introduction to Deep Learning
What is Keras?

In short: it's a bunch of methods that call the framework underneath to perform
an ML task. Keras does not execute the neural network - it calls its backend.
There is also an R version, but the most popular implementation is in Python.
Schematically, Keras sits on top of other frameworks; the most popular combo is
TF + Keras.
Open source: https://github.com/keras-team/keras

Keras intro and installation


Introduction to Deep Learning

Installation
The most popular option is to use Keras with TensorFlow as the backend. We will use
that.
However, let's note that standalone Keras also makes sense if we wanted to try
Theano and other backends. This way we could compare the frameworks.

Note: installing standalone Keras for usage with TensorFlow is not recommended.
Just install TensorFlow and Keras will be available, see:
https://stackoverflow.com/a/68397927/1964707

Keras intro and installation


Introduction to Deep Learning

Test the installation

Keras intro and installation


Introduction to Deep Learning

We have two ways to create models with Keras: sequential and functional.
The sequential model is used for simpler model creation. Straightforward (a
simple list of layers), but limited to single-input, single-output stacks of
layers (as the name gives away). Essentially think of an MLP for now (although you
can create more advanced NNs).
This is how it would be defined:

However there are alternative syntaxes: https://keras.io/guides/sequential_model/


Keras sequential model
Introduction to Deep Learning
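A minimal sketch of a sequential model for single-feature regression (layer sizes are illustrative assumptions):

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(1,)),                   # single-featured input
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),                     # linear output for regression
])
model.compile(optimizer="adam", loss="mse")
model.summary()                                # inspect layers and param counts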
The Functional API is an easy-to-use, fully-featured API that supports
arbitrary model architectures - the one in the screenshot.
This is the Keras "industry strength" model.
Simple usage:

There is also the layer subclassing option - so a 3rd option, see:


https://keras.io/guides/making_new_layers_and_models_via_subclassing/

Keras functional model


Introduction to Deep Learning
It's easy to define the shape of the input data and add layers with the sequential
model. And it's enough for the vast majority of simple tasks.
Where does the sequential model fall short?

The functional API allows (see the sketch below):

Keras functional model


Introduction to Deep Learning
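A minimal sketch of a multi-input model, which the sequential API cannot express (the names and sizes are illustrative assumptions):

from tensorflow import keras

# two separate inputs merged into one model - something the
# sequential model cannot express
inp_a = keras.Input(shape=(4,), name="features_a")
inp_b = keras.Input(shape=(2,), name="features_b")
merged = keras.layers.concatenate([inp_a, inp_b])
hidden = keras.layers.Dense(16, activation="relu")(merged)
output = keras.layers.Dense(1, name="target")(hidden)

model = keras.Model(inputs=[inp_a, inp_b], outputs=output)
model.compile(optimizer="adam", loss="mse")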
Choosing layer and neuron counts, paper:
https://www.heatonresearch.com/2017/06/01/hidden-layers.html

Another: https://stats.stackexchange.com/a/1097/162267

Keras neuron count


Introduction to Deep Learning
A loss function (aka cost or error function) is a measure of "how good" a neural
network did with respect to its given training sample and the expected output. It
may also depend on variables such as weights and biases.
It is a number (scalar), not a vector, because it rates how well the neural
network did as a whole.
The cost is what is being minimized during the backpropagation step.
Common cost functions:
Quadratic cost (mean squared error, maximum likelihood, and sum squared error,
residual sum of squares)
Cross-entropy cost (Bernoulli negative log-likelihood and Binary Cross-Entropy,
Categorical Cross Entropy)
Exponential cost, Hellinger distance, etc.:
https://stats.stackexchange.com/a/154880/162267
Choosing the right objective function for the right problem is important: your
network will take any shortcut it can, to minimize the loss; so if the objective
doesn’t fully correlate with success for the task at hand, your network will end up
doing things you may not have wanted. Imagine a stupid, omnipotent AI trained via
SGD, with this poorly chosen objective function: “maximizing the average well-being
of all humans alive.” To make its job easier, this AI might choose to kill all
humans except a few and focus on the well-being of the remaining ones—because
average well-being isn’t affected by how many humans are left. That might not be
what you intended! (This is more important for reinforcement learning, ship game
example)
Just remember that all neural networks you build will be just as ruthless in
lowering their loss function - so choose the objective wisely, or you’ll have to
face unintended side effects.
Keras loss functions
Introduction to Deep Learning
Fortunately, when it comes to common problems such as classification, regression,
and sequence prediction, there are simple guidelines you can follow to choose the
correct loss. For instance, you'll use:
binary crossentropy for a two-class classification problem
categorical crossentropy for a many-class classification problem
mean squared error for regression problems
connectionist temporal classification (CTC) for sequence-learning problems
(speech recognition https://distill.pub/2017/ctc/ )
… and so on.
Only when you’re working on truly new research problems will you have to develop
your own objective functions, you can do that with keras.
Keras provides the following loss functions: https://keras.io/api/losses/
It also provide the capability to implement custom loss functions:
https://keras.io/api/losses/#:~:text=Creating-,custom%20losses

Keras loss functions


Introduction to Deep Learning
We have delayed our discussion of optimizers until we encountered one.
What is an optimizer?
The loss function is a measure of the model's performance. The optimizer helps
improve the weights of the network in order to decrease the loss. Optimizers are
algorithms that minimize the loss function - the core of the neural network
learning process. There are different optimizers available, but the most common
ones are Stochastic Gradient Descent (SGD) and Adam.
Video intro: https://www.youtube.com/watch?v=mdKjMPmcWjY
Common optimizers
Batch gradient descent, i.e. GD
Stochastic Gradient Descent (SGD)
Mini-batch GD
… etc: https://ruder.io/optimizing-gradient-descent/
How do optimizers work mathematically?
The math differs per optimizer, but the basic idea is simple to explain.
Viz: https://ruder.io/content/images/2016/09/contours_evaluation_optimizers.gif
You can find the full list of Keras provided optimizers here:
https://keras.io/api/optimizers/#available-optimizers
Other frameworks also have them, so we will refer to the documentation.
Keras optimizers
Introduction to Deep Learning
Models have features for visualization.

Explanation:
We can give names to layers and to the model.
Params are split into trainable and non-trainable (non-trainable params will be
seen when we discuss transfer learning); use_bias=False will remove some params.
Shape (None, 1) means we have a single-featured dataset of unbounded length.
Objective: you should know how a network architecture looks just by reading the
code.
Keras model visualization
Introduction to Deep Learning
Callbacks - objects that can perform actions at various stages of training (e.g.
at the start or end of an epoch, before or after a single batch, etc. - the
training lifecycle). Example: early stopping.
Callbacks enable checkpoints - you can save a model in some state and use it to
train on new data or to resume training. This is useful when the weight count is
huge.

Keras callbacks & checkpoints


Introduction to Deep Learning
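A minimal sketch of wiring these callbacks into training (model, X_train and y_train are assumed to exist as in the earlier examples; the checkpoint filename is an arbitrary choice):

from tensorflow import keras

callbacks = [
    # stop when validation loss stops improving, keep the best weights
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    # checkpoint: persist the best model seen so far to disk
    keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]
model.fit(X_train, y_train, validation_split=0.2, epochs=100,
          callbacks=callbacks)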
Keras summary
Summary
What Keras is and how it relates to TensorFlow.
Sequential and functional API - when to use which.
Optimizers and loss functions (what makes Adam so advanced; the difference between
SGD and mini-batch SGD (if we use batch_size)).
Callbacks (most importantly early stopping).
Layer/param calculations (how many layers we need to have for some problems).

Introduction to Deep Learning

TensorBoard is a browser-based visualization tool available with TensorFlow


Purpose: providing the measurements, tracking and visualizations during the machine
learning workflow.
Enables tracking experiment metrics like loss and accuracy, visualizing the model
graph and more.
Using TensorBoard:
Using it with raw TensorFlow is a bit different than with Keras; we will do it with
Keras.
Define the log file location → the log files are what TensorBoard will build the
graphs from.
Add the Keras callback: callbacks=[tensorboard_callback]
Train and run TensorBoard.
TensorBoard can be run directly in Google Colab, or it can be used on your
personal computer.
Features:
“Graphs” → Legend to understand what is what
Comparing several runs of the model, trace inputs
TPU compatibility
Alternatives: MLFlow, Neptune.ai: https://neptune.ai/blog/mlflow-vs-tensorboard-vs-
neptune-what-are-the-differences
It is important to evaluate model training visually, to observe the stability and
potential of the learning process.
Tensorboard
Introduction to Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1s3Koj3Uvpu-OJwpDFlzNb7Pz8W0ORbin.pptx ---


Artificial Intelligence
Computer vision and image classification
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


How does Transfer Learning work?
01
02
Why does Transfer Learning work?
Computer vision and image classification
00
Transfer Learning
Famous CNNs for Image Classification
03
04
Decision factors for using TL
05
Further explorations
06
Practical project 7
Isaac Newton once said: “If I have seen further it is by standing on the shoulders
of Giants.”
… this is the main idea of transfer learning. Let's discuss.


Transfer Learning
Computer vision and image classification
Definition: the practice of reusing (a part of) a pre-trained (by professionals)
neural network that solves a problem similar to yours, involving the freezing of
the lower layers of the pre-trained network and the re-training of the higher
layers to fit the problem at hand!
Although we will be using transfer learning for image classification problems
here, its applicability is much broader, to the point where we can say that most DL
applications could potentially benefit from transfer learning.
Transfer Learning
Computer vision and image classification
We will reuse a trained NN for our problem.
This pre-trained network will usually be a well-known network that comes from
researchers who discovered a particular architecture that has been very
successful at a specific ML/DL task.
Parts that can be reused:
reusing the architecture → we can choose to just take (part of) the architecture
and completely retrain the network with our data. A change will usually be needed
at the last layer, for example when the number of classes for classifying objects
is different. This is the primitive way; most people will not consider this
transfer learning, as we are only mimicking the architecture.
reusing the architecture + the pretrained weights of the model (lower layers) →
take the architecture and the weights trained on some dataset and use both the
architecture and the weights - this is the real meaning of transfer learning.
How does Transfer Learning work?
Computer vision and image classification
Reusing generally involves something similar to the following steps (a Keras sketch
follows after this list):
You encounter a problem (I want to classify some images).
You obtain the/some data for your problem.
You check if a well-known model exists that is trained on data similar to yours
(sometimes we have architectures trained on several different datasets:
VGG16+CIFAR10 weights vs. VGG16+CIFAR100 weights vs. VGG16+ImageNet weights).
You check if a well-known model works for your dataset out of the box (usually it
will not).
You then take a well-known NN architecture and freeze the lower layers (switch off
weight updates).
If necessary you remove the top layer (or more) (e.g. 1000 ImageNet-1K categories
to 10 CIFAR-10 categories).
You train the upper layers on the data you obtained (new data that the network has
not seen).
You tune the model (the upper layers) … tune, retrain, tune, retrain (an iterative
process).
You repeat the same process for other well-known architectures until you get the
best or a satisfactory result.
How does Transfer Learning work?
Computer vision and image classification
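A minimal Keras sketch of the freeze-and-replace-the-head recipe above (VGG16, the input shape and the 10-class head are illustrative choices, not requirements):

from tensorflow import keras

# pre-trained convolutional base, ImageNet weights, classification head removed
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                       # freeze the lower layers

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),  # e.g. 10 new classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then train only the new head on your own data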
Benefits
When reusing the model with pre-trained weights you are using a model that was
trained on a huge corpus of data, which you might not have. This is one way to use
a DNN when data is absent. The fact that you are reusing the “intelligence” gained
from a very large corpus of data is often mentioned as the biggest benefit of
transfer learning, as it reduces the need to have a lot of data for our specific
problems! Remember: the biggest players in the ML/DL market are the ones with
access to the largest amounts of data (Microsoft → GitHub Copilot), so TL provides
democratization.
Time savings on developing / tuning the NN architecture (weight initialization,
activation and cost functions, number of layers and neurons in them, etc.) - you
can just use the architecture without the weights and without implementing the
architecture yourself, just by downloading it.
Time savings in training - training a few layers is much faster than training the
entire network. The initial epochs when training should immediately show high
accuracy (>70%). If the network is not expanded much (just the head is replaced)
then the accuracy of a frozen network will never increase by much. Try unfreezing
more layers to free up the capability of the model to learn from your data.
Often it's also more precise / accurate.
All these benefits are at the development level!
How does Transfer Learning work?
Computer vision and image classification
Constraints
The problem you are trying to solve needs to be similar/the same as the problem
that the network was designed to solve. If it's an entirely different/unique
problem then the reusability will not be possible; however many problems,
especially computer vision problems, do not exhibit such uniqueness!
The more similar the data and the problem, the better the reuse should be.
Transfer learning models available in popular frameworks like Keras and Pytorch
are usually “heavy-weights” optimized for accuracy (and server environments), so
you need to check if they would even work for an IoT (TinyML) problem or with the
hardware constraints you have.
How does Transfer Learning work?
Computer vision and image classification
Reusing the network works because
Lower layers ARE NOT specific to your problem - they are data-specific (images of a
set of items). They can be frozen and not touched (remember progressive
generalization).
Higher layers are problem-specific. They can be retrained for your problem. At the
very least you will likely need to retrain the last layer, since the number of
classes might be different.
Why does Transfer Learning work?
Computer vision and image classification
We have 3 general possibilities when doing TL:
Use only the well-known architecture - training from zero, a big loss from the
beginning, takes long.
Use architecture + weights with freezing (how much to freeze should be determined
by trials) - training should not take long, the loss will be small/very small from
the beginning, improvements might be minimal (depending on the head added).
Use architecture + weights without freezing (i.e. fine-tuning, lt.
“datreniravimas”). We get a network that is more adapted to our data; training can
take longer than with freezing, but training starts from a lower error.
Why does Transfer Learning work?
Computer vision and image classification
Keras offered models: https://keras.io/api/applications/
Pytorch: https://pytorch.org/vision/stable/models.html and
https://pytorch.org/vision/master/models.html#
Resnet variations are probably the most popular.
Famous CNNs for Image Classification
Computer vision and image classification
Some of the terms used in the visual:
ILSVRC - ImageNet Large Scale Visual Recognition Challenge
ImageNet - The ImageNet project is a large visual database designed for use in
visual object recognition software research. More than 14 million images have been
hand-annotated by the project to indicate what objects are pictured and in at least
one million of the images, bounding boxes are also provided. ImageNet contains more
than 20,000 categories with a typical category, such as "balloon" or "strawberry",
consisting of several hundred images. The database of annotations of third-party
image URLs is freely available directly from ImageNet, though the actual images are
not owned by ImageNet.
Note: calculate how well a random guesser would do on this challenge (assuming
balanced classes).
Famous CNNs for Image Classification
Computer vision and image classification
These can be summarised with two questions:
Is our dataset bigger or smaller than the dataset the model we want to reuse was
trained on?
Are the images similar or not? (For example, ImageNet has cats, dogs and other
common items; if our problem involves microscopic bacteria, we will not be able to
reuse the models trained on it.)
Decision factors for using TL
Computer vision and image classification
From this we can enumerate 4 cases:
The new data set is small, new data is similar to original training data.
The new data set is small, new data is different from the original data.
The new data set is large, new data is different from original training data.
The new data set is large, new data is similar to original training data.
Ref for the visuals: https://thekhblog.wordpress.com/2017/03/28/transfer-learning/
Decision factors for using TF
Computer vision and image classification
What is the current state of transfer learning for tabular data?
What is the current state of transfer learning for NLP w/ Transformers, is it
popular there?
What is the current state of transfer learning for sequence prediction / sequential
data problems?
Can we reuse a famous CNN architecture trained for classification for a regression
problem (bounding box regression)?
What is the smallest pre-trained model offered by Keras and Pytorch. Is it better
than something that we could train from scratch (good candidate for personal
project)?
Further explorations
Computer vision and image classification
Take Covid data from this article:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7372265/ … or any other place you can
find a credible source of covid lung imaging datasets.
Train a model that can classify people as having covid and not having covid based
on the images.
Almost no requirements on the performance of the model, anything >50% (random
guessing) would be acceptable!
No requirements for the usage of a framework: you can use Keras, Pytorch or Fast.ai
- or even all.
No requirements to use transfer learning (however it is recommended).
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 7 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the Colab notebook (double check the share options of the
notebook) or a GitHub link with the Jupyter notebook code when finished for review
and evaluation.

** Another option is to take the model from the article above (both the model and the
data are provided) and improve its sensitivity and specificity. Additionally,
you can research how helpful data science techniques were for providing solutions
to various problems over the pandemic (the status as of mid-2020 was that advanced
DS/ML/DL techniques were not very helpful - hopefully later on there were some
changes).
Practical project 7
Computer vision and image classification
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1vEaHSfz_V-bce30RGQDoYLIJOQ-00w10.pptx ---


Artificial Intelligence
E2E ML Platforms
2024
Lecturer
Mindaugas Bernatavičius
2 Level
1 Chapter
Today you will learn
Neptune.AI
01
02
Databricks
E2E ML Platforms
00
MLFlow
Others
03
After learning ML what to learn next:
DL (this course)
AutoML
Tools & Platforms / MLOps / Operationalization / Horizontal movement.
Beyond the basics
E2E ML Platforms
Tracking can be done in Excel (the lean way), in a .md file tracked inside
the project's git repo, or even inside the same file you write the code in
(simplest).
In fact, for a small company I would almost always recommend the simplest possible
approach and only abandon it when there is a pretty big need to do so (a benefit
gained). There is rarely a strong case for abandoning simple tools.
Very often companies introduce tools just because some developers
(mostly those that implement key tasks) want to use them. Their motivation is often
not productivity (anything from not knowing how to think in the most lean way
possible, to not seeing the need for lean engineering/business, to CV/portfolio-driven
development).
MLFlow
E2E ML Platforms
Initial launch:
Follow the following tutorial:
https://mlflow.org/docs/latest/getting-started/intro-quickstart/index.html
The installation with pip install mlflow will install 600MB+ of dependencies in
your virtual environment, but this shows how integrated MLflow is with the other
tools, as it comes with scipy, numpy, pandas and scikit-learn by default.
If you see the error ModuleNotFoundError: No module named 'pkg_resources', please
run: pip install setuptools
After the install, run mlflow ui or mlflow server --host 127.0.0.1 --port 8080 to
test that everything works OK.

Experiment tracking functionality


Create a project (Pycharm, VSC, notebook).
Push metrics to the server you want (local, or a central one that the team is
using).
Run experiments, push the results, inspect and compare them.
Choose columns you want to see for a quick global inspection and comparison.

Separate MLflow server


Launch the server in another directory (or on some Linux virtual server in the cloud
or on premises).
You will need to pip install mlflow on the server (or you can dockerize the
project and have a docker container for the MLflow server)

Model serving for production


Create a new project (Pycharm, VSC, notebook).
Load the model from the MLflow server (on application load or otherwise), see:
https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model
After this, if the model is loaded on app restart, you can just restart the app and
the newest model will be served
MLFlow
E2E ML Platforms
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset and split
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters and train the model
params = {
    "solver": "liblinear",
    "max_iter": 10,
    "multi_class": "auto",
    "random_state": 8888,
}

lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set and calculate metrics
y_pred = lr.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')

import mlflow

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the evaluation metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1", f1)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = mlflow.models.infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )
Model serving for production
Create a new project (Pycharm, VSC, notebook).
Load the model from the MLflow server (on application load or otherwise), see:
https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model
After this, if the model is loaded on app restart, you can just restart the app and
the newest model will be served
MLFlow
E2E ML Platforms
import mlflow
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")
best_runs = mlflow.search_runs(filter_string="metrics.accuracy >= 1",
                               search_all_experiments=True)
print(f"Runs:\n{best_runs}")

# Pick the run with the highest accuracy among the matches
# (here the filter already restricts matches to accuracy == 1.0)
run_id = best_runs.loc[best_runs['metrics.accuracy'].idxmax()]['run_id']
print(f"Best run id: {run_id}")

loaded_model = mlflow.pyfunc.load_model(f"runs:/{run_id}/iris_model")
predictions = loaded_model.predict(X_test)

result = pd.DataFrame(X_test, columns=datasets.load_iris().feature_names)
result["actual_class"] = y_test
result["predicted_class"] = predictions

print(f"Prediction: {result[:4]}")
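
Since the quickstart also registered the model under a name, it can be loaded from the model registry by name instead of by run id - a minimal sketch (the version number is an assumption):

import mlflow

mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Load version 1 of the registered model ("tracking-quickstart" above)
model = mlflow.pyfunc.load_model("models:/tracking-quickstart/1")
predictions = model.predict(X_test)  # X_test as in the snippet above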

Quickstart: https://docs.neptune.ai/usage/quickstart/#__tabbed_1_1
Mostly used as a cloud service, free to try and learn.
Self-hosting is more complicated than with MLflow (it seems to require Kubernetes).
Neptune.AI
E2E ML Platforms
Managed Apache Spark + more
Hosted data lakehouse
Main competitor: Snowflake
Focused on notebooks
Good presentation: https://www.youtube.com/watch?v=02DBOfYrYT0
Ref: https://www.youtube.com/watch?v=QNdiGZFaUFs
Databricks
E2E ML Platforms
//
Databricks
E2E ML Platforms
//
Databricks
E2E ML Platforms
Demo: creating a free account for learning - Community Edition. When registering,
note that they remove "." and "+":
https://community.databricks.com/s/feed/0D53f00001jxRjLCAU so [email protected]
becomes [email protected]
Databricks
E2E ML Platforms
This article compares a lot of e2e solutions:
https://www.netguru.com/blog/machine-learning-tools-comparison , namely:
Weights & Biases
MLFlow
Neptune.ai
Databricks
Sagemaker
Backup: https://web.archive.org/web/20231204194946/https://www.netguru.com/blog/
machine-learning-tools-comparison
Others
E2E ML Platforms
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

E2E ML Platforms
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 18M8CLfqXepkK9nDyGBt7S-SXHG47ci2k.pptx ---


Artificial Intelligence
CNN interpretability with CAMs
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Techniques
01
02
Visualizing intermediate activations / features
CNN interpretability with CAMs
00
Interpretability - DNN v CNN
Visualizing convnet filters
03
04
Visualizing heatmaps of class activation
05
Further explorations
A fundamental problem when building a computer vision application (or any deep
learning app) is that of interpretability: why did your classifier think a
particular image contained a fridge, when all you can see is a truck?
Most DL models are black boxes, however CNNs proved to be an exception - since 2013
a wide array of techniques has been developed for visualizing and interpreting
these representations.
Interpretability - DNN v CNN
CNN interpretability with CAMs
The following are the most useful techniques:
Visualizing intermediate convnet outputs (intermediate activations) - useful for
understanding how successive convnet layers transform their input, and for getting
a first idea of the meaning of individual convnet filters. We can also track how
they transform as the network trains, or track statistical parameters, like
histograms.
Visualizing convnet filters - useful for understanding precisely what visual
pattern or concept each filter in a convnet is receptive to.
Visualizing class activation heatmaps (CA(h)M) in an image - useful for
understanding which parts of an image were identified as belonging to a given
class, thus allowing you to localize objects in images.
More?
It would be naive to expect that we would understand how the network works from one
feature map or one conv filter visualization, but if we were to perform a
comparative analysis (misclassifications vs. correct classifications), then we
might observe anomalies.
Techniques
CNN interpretability with CAMs
What it looks like:
Visualizing intermediate activations / features
CNN interpretability with CAMs
What it looks like:
Visualizing convnet filters
CNN interpretability with CAMs
What can we infer from them?
We can compare the filters between models to get better insight
We can track the changes during training
We can check whether there is some kind of artifact introduced into the feature
maps for misclassifications, and in general compare them to correct
classifications.
More?
Visualizing convnet filters
CNN interpretability with CAMs
This technique is useful for understanding which parts of a given image led a
convnet to its final classification decision. This technique can be used when
debugging a misclassification.
The general class of similar techniques is called CAM - class activation
map/heatmap
A class activation heatmap is a 2D grid of scores associated with a specific output
class, computed for every location in any input image, indicating how important
each location is with respect to the class under consideration.
There are multiple implementations of CAMs. One implementation is described in the
paper: “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based
Localization.” but you can certainly find more (ref:
https://arxiv.org/abs/1610.02391 )
Grad-CAM consists of taking the output feature map of a convolution layer, given an
input image, and weighting every channel in that feature map by the gradient of the
class with respect to the channel. Intuitively, one way to understand this trick is
to imagine that you're weighting a spatial map of "how intensely the input image
activates different channels" by "how important each channel is with regard to the
class", resulting in a spatial map of "how intensely the input image activates the
class."
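
A minimal Grad-CAM sketch along these lines (Keras; the model, the preprocessed input batch img and the last conv layer name are assumptions - e.g. "conv5_block3_out" for ResNet50):

import tensorflow as tf

def grad_cam(model, img, last_conv_layer_name, class_index=None):
    # Model that maps the input to (last conv feature map, predictions)
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(img)
        if class_index is None:
            class_index = tf.argmax(preds[0])
        class_score = preds[:, class_index]
    # Gradient of the class score w.r.t. the conv feature map
    grads = tape.gradient(class_score, conv_out)
    # Channel importance = spatial mean of the gradients
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weight each channel of the feature map by its importance and sum
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    cam = tf.nn.relu(cam)                     # keep only positive influence
    cam = cam / (tf.reduce_max(cam) + 1e-8)   # normalize to [0, 1]
    return cam.numpy()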
Visualizing heatmaps of class activation
CNN interpretability with CAMs
Visual schematic

DEMO
Visualizing heatmaps of class activation
CNN interpretability with CAMs
Apply CAM to MNIST (possibly on a custom CNN)
** Research, find and launch other CAM implementations (e.g. faster-grad-cam and
others)
** Create a demonstration where you purposefully inject co-occurrences (cat and dog
in one picture) and thus forcefully introduce bias. Then train a classification
model and prove that CAMs help troubleshoot misclassifications!
Do Vision Transformers provide the capability of extracting CAMs? Is it easier?
Further explorations
CNN interpretability with CAMs
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

CNN interpretability with CAMs


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1gr17ZsJFPk-9qbqLpgzCjrU93OUUUMku.pptx ---


Artificial Intelligence
Computer vision and image classification
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Hyperparameter tuning for CNN
01
02
Wide and narrow convolutions
Computer vision and image classification
00
Pytorch CNN
Keras CNN
03
Demo: Let’s build a Pytorch CNN using the mnist dataset
Let’s test it with our own images, drawing a number and uploading it from the
computer.
We will see that even though the picture below looks like a “3” to us, it might
look like a “9” to the computer. We will discuss how to analyze the data when a CNN
misbehaves.
Pytorch CNN
Computer vision and image classification
Let’s compare our FCFFNN and our CNN using MNIST dataset
Let’s compare the GPU vs. CPU performance of collab on MNIST dataset
Even with a small number of epochs (50) we are achieving 95+ % accuracy.
The same accuracy takes 10000 epochs for the FCFFNN.
But the FCFFNN is faster to train per epoch (because it’s much simpler!)
Additional things to explore:
Compare total training time for CNN and FCFFNN/DNN for a given acc/recall/f1.
Derive the coefficient.
You could investigate more complex image sets - the difference should be even
bigger!
Pytorch CNN
Computer vision and image classification
We can perform the following tuning adjustments to any CNN:
Change the kernel size of conv layers
Change the kernel size of pooling layers
Padding adjustment
Conv kernel count (more feature maps)
Stride adjustment
Batch normalization
Activation functions
Different pooling layers (max, avg and so on)
Optimizer
Layer count
Batch size
Dropout
Etc.
Not all of these can be tuned independently (those in red) - let's understand how
kernel size is calculated and what padding is.
CNN hyperparameter tuning
Computer vision and image classification
When we pass an image through a CNN, the result after applying filters can be
larger or smaller than the original image. A wide convolution is the result of
applying padding.

Without padding the formula is:

out_w = img_w - filter_w + 1, when stride = 1

Or generally, with padding p and stride s:

out_w = floor((img_w - filter_w + 2p) / s) + 1

It applies to convolution and pooling layers.
This is called receptive field arithmetic (remember the term to be able to google
it)
Wide and narrow convolutions
Computer vision and image classification
//
Wide and narrow convolutions
Computer vision and image classification
What does it look like with padding?

There are many forms of padding:
https://pytorch.org/docs/stable/nn.html#padding-layers
When do we need padding? When important features in the image are near the edges.
Usually we don't try to exceed values of a couple of pixels for padding (2-5).
(A quick shape check follows below.)
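
A quick sanity check of the receptive-field arithmetic above, in plain Pytorch (shapes assume a 28x28 MNIST-like input):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)            # (batch, channels, H, W)

narrow = nn.Conv2d(1, 8, kernel_size=5, stride=1, padding=0)
wide   = nn.Conv2d(1, 8, kernel_size=5, stride=1, padding=2)

print(narrow(x).shape)  # torch.Size([1, 8, 24, 24]): 28 - 5 + 1 = 24
print(wide(x).shape)    # torch.Size([1, 8, 28, 28]): (28 - 5 + 2*2)/1 + 1 = 28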
Wide and narrow convolutions
Computer vision and image classification
We can also tune the vertical and horizontal stride. Ranges to consider:
1 (for Conv)
smaller than kernel size (for Conv)
same as kernel size (for Pooling)

We can also choose to apply batch normalization. Many advanced CNNs use it.
Stride and batch-norm
Computer vision and image classification
Let’s tune the following hyperparameters and see how the network performs:
Different Activation functions: Sigmoid, Relu, Tanh, Elu.
Different Pooling Layers: MaxPool2d, AvgPool2d, LPPool2d, more:
https://pytorch.org/docs/stable/nn.html#pooling-layers
Convolution kernel size: can’t be changed arbitrarily. Need to calculate the
correct size.
Add additional convolution layers.
CNN hyperparameter tunning
Computer vision and image classification
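
A small Pytorch sketch matching the list above, where the activation and pooling layers are swappable (sizes assume 28x28 MNIST inputs; the architecture itself is an illustrative assumption):

import torch.nn as nn

def make_cnn(activation=nn.ReLU, pooling=nn.MaxPool2d):
    # padding=2 keeps 28x28 through each 5x5 conv; each pool halves H and W
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=5, padding=2),   # 28x28 -> 28x28
        activation(),
        pooling(kernel_size=2),                       # 28x28 -> 14x14
        nn.Conv2d(16, 32, kernel_size=5, padding=2),  # 14x14 -> 14x14
        activation(),
        pooling(kernel_size=2),                       # 14x14 -> 7x7
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 10),
    )

# Swap hyperparameters without touching the architecture code
model = make_cnn(activation=nn.Tanh, pooling=nn.AvgPool2d)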
Let’s create an tune CNN with Keras
Flowers photo classification:
https://www.tensorflow.org/tutorials/images/classification
Fashion mnist classification
German traffic sign classification
Keras CNN
Computer vision and image classification
Summary:
We created and tuned a CNN
Hyperparameter tuning is similar to FCFFNN, we just have to take into account the
convolution / pool dimensions
Gains in model accuracy can be obtained by clever pre-processing techniques
(sharpening, contrast increase and so on). We often reduce RGB images to grayscale,
unless the objects in the pictures are only differentiable by colors (I probably
would not do that for parrot classification).
We also compared CNN with FCFFNN/DNN
Questions:
What is the default stride of the convolution kernel in Pytorch?
What is the default stride of the pooling filter in Pytorch?
How can you pass a parameter indicating how many kernels need to be applied to a
CNN in Pytorch?
Which is usually better, many smaller kernels or a few large kernels?
Summary and questions
Computer vision and image classification
Student Questions:
If we see a chaotic learning curve (error decrease visualization), can we use model
checkpointing (checkpoint callback) as a viable solution to save the best model? In
a sense we can, however we want a repeatable, stable learning process to be able to
reach very similar results each time. We can use model checkpointing to reach the
maximum after we know a good hyperparameter set, but not when the model starts
wobbling after 2-3 epochs and we grab checkpointing as a band-aid. A bit more
reading about this to get you started investigating:
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-
model-performance/
We saw great results from classifying the MNIST dataset easily; in fact, I have not
seen such great results without much tuning. It seems the results improved. An
interesting question can be asked - can a given model improve (reach better
validation accuracy) because framework developers changed something? What are the
things that framework developers could change that would make a difference?
Default parameters for: LR, bias (on/off), weight initializers (they are random by
default), etc. (a good question to explore).
Summary and questions
Computer vision and image classification
Imagine getting a data science task - compare the complexity of image datasets:
prove that a given image dataset is "more complex" than another dataset. How would
you do it? If a given dataset is harder to learn for a NN with architecture X, does
that necessarily mean that it is more complex? What more deterministic ways could
you discover?
Further explorations
Computer vision and image classification
Transfer learning
What’s next
Computer vision and image classification
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1OIaJe19QItSyxNgT6E36-Ic3ZlDllpab.pptx ---


Artificial Intelligence
Recommender Systems
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Keras example
01
02
Further explorations
Recommender Systems
00
Transformer-based recommendation systems
Practical Project 11
03
We can apply the attention mechanism to our recommendation system
This was described here: https://arxiv.org/abs/1905.06874 (Behavior Sequence
Transformer … in Alibaba)
Transformer-based recommendation systems
Recommender Systems
The MovieLens dataset is used in this Keras example:
https://keras.io/examples/structured_data/movielens_recommendations_transformers/
NOTE: this demonstrates how to implement research papers, which is close to the
highest level of achievement anyone can reach in our course. You are always
welcome to attempt it.
Keras example
Recommender Systems
Research additional SOTA architectures
Research other architectures that we have seen in the recommendation system
taxonomy. Write an article / blogpost on what the tradeoffs and use cases for each
are
Further explorations
Recommender Systems
For this part take any task we have seen:
collaborative filtering (w/ movielens dataset) or
content-based filtering (Keras w/ movielens dataset) and
implement whichever one you want but with the goodbooks dataset.
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 11 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the Colab notebook (double check the share options of the
notebook) when finished for review and evaluation.

More advanced / bonus project: implement the Behavior Sequence Transformer and
compare it with the model we had (for the comparison, provide your own
argument/opinion on which one is more precise and, more interestingly, why).
Practical Project 11
Recommender Systems
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Recommender Systems
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1srYdFcEZCzKoz6t4mrHNe5MZsbty75UX.pptx ---


Artificial Intelligence
Generative Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


//
01
02
//
Generative Deep Learning
00
//
//
03
//
05
04
//
06
//
07
//
Instance segmentation → unet → embeddings → cross attention →
https://www.youtube.com/watch?v=sFztPP9qPRc
Intro
Generative Deep Learning
//
Further explorations
Generative Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Generative Deep Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1BSMSSYW9gVNUtFgD9xiHZBybNzRv1_gv.pptx ---


Artificial Intelligence
Generative Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Concept vectors for image editing
01
02
VAEs
Generative Deep Learning
00
Sampling from latent image spaces
VAE implementation in Keras
03
Equilibrium requirement
05
04
GANs
06
Further explorations
07
Practical Project 12
One of the most popular and successful applications of creative/generative AI today
is image generation: new faces, places & so on.
Three main techniques:
VAEs - variational autoencoders
GANs - generative adversarial networks
LDMs - latent diffusion models
Flow models (?)
SOTA: https://paperswithcode.com/task/image-generation
The techniques aren't specific to images - you could develop latent spaces of
sound, music, or even text, using GANs and VAEs - but in practice, the most
interesting results have been obtained with pictures, and that's what we'll focus
on.
Latent diffusion models (LDMs) - capable of "stable diffusion" (text-to-image
tasks) - can also be mentioned here (2022 hype; even George Hotz (geohot) used them
as a marketing tool to get his tinycorp company going).
Services: DALL-E, Stable Diffusion, Midjourney (closed source)
Sampling from latent image spaces
Generative Deep Learning
The key idea of image generation is to develop a low-dimensional latent space of
representations (which, like most things in deep learning, is a vector space), where
any sample can be mapped to a "valid" image: an image that looks like the real
thing.
The module capable of realizing this mapping, taking as input a latent point and
outputting an image (a grid of pixels), is called a generator (in the case of GANs)
or a decoder (in the case of VAEs).
Once such a latent space has been learned, you can sample points from it, and, by
mapping them back to image space, generate images that have never been seen before.
These new images are the in-betweens of the training images.
GANs and VAEs are two different strategies for learning such latent spaces of image
representations, each with its own characteristics.
VAEs are great for learning latent spaces that are well structured, where specific
directions encode a meaningful axis of variation in the data.
GANs generate images that can potentially be highly realistic, but the latent space
they come from may not have as much structure and continuity.
Sampling from latent image spaces
Generative Deep Learning
Taken from Deep Learning with Python, Chollet
//
Sampling from latent image spaces
Generative Deep Learning
Taken from Deep Learning with Python, Chollet
A concept vector is a part of an embedding space (think word or categorical
embeddings).
Given a latent space of representations, or an embedding space, certain directions
in the space may encode interesting axes of variation in the original data. In a
latent space of images of faces, for instance, there may be a smile vector, such
that if latent point z is the embedded representation of a certain face, then
latent point z + s is the embedded representation of the same face but smiling.
Once you’ve identified such a vector, it then becomes possible to edit images by
projecting them into the latent space, moving their representation in a meaningful
way and then decoding them back to image space. In the case of faces, you may
discover vectors for adding sunglasses to a face, removing glasses, turning a male
face into a female face, and so on.
… so we can see how powerful embeddings are.
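
Schematically, the editing operation is just vector arithmetic in the latent space - a sketch in which encoder, decoder, face_image and the discovered smile_vector are all assumed to exist:

import numpy as np

z = encoder.predict(face_image[None, ...])[0]        # image -> latent point
z_edited = z + smile_vector                          # move along the concept axis
edited_face = decoder.predict(z_edited[None, ...])[0]  # latent point -> image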
Concept vectors for image editing
Generative Deep Learning
See: https://arxiv.org/vc/arxiv/papers/1609/1609.04468v2.pdf
Discovered around the start of 2014 by two groups of researchers independently.
Image editing via concept vectors - key/google'able phrase.
A new take on autoencoders (... so if you want deeper understanding then read up on
them) - a type of network that aims to encode an input to a low-dimensional latent
space and then decode it back - one that mixes ideas from deep learning with
Bayesian inference.
Most commonly, you'll constrain the code to be low-dimensional and sparse (mostly
zeros), in which case the encoder acts as a way to compress the input data into
fewer bits of information (we could imagine something like this being useful for
transmitting data, if fast).
Classic autoencoders don't lead to particularly useful or nicely structured latent
spaces and aren't much good for compression, hence dimensionality reduction (this
touches on how DL is used for unsupervised learning: until now we used NN (which is
unsupervised (not to be confused with kNN, which is supervised)) in part 8 in
tandem with DNNs, now we use DNNs to perform UL tasks).
Non-linear dimensionality reduction (while PCA is linear):
https://arxiv.org/abs/2205.11673 and https://towardsdatascience.com/how-
autoencoders-outperform-pca-in-dimensionality-reduction-1ae44c68b42f
VAEs add statistical mechanisms to autoencoders to augment their functionality and
are useful [need proof from original papers]
VAEs
Generative Deep Learning
Ref: https://towardsdatascience.com/difference-between-autoencoder-ae-and-
variational-autoencoder-vae-ed7be1c038f2
VAEs
Generative Deep Learning
A VAE, instead of compressing its input image into a fixed code in the latent
space, turns the image into the parameters of a statistical distribution: a mean
and a variance. Essentially, this means we’re assuming the input image has been
generated by a statistical process, and that the randomness of this process should
be taken into account during encoding and decoding.
The VAE then uses the mean and variance parameters to randomly sample one element
of the distribution, and decodes that element back to the original input. The
stochasticity of this process improves robustness and forces the latent space to
encode meaningful representations everywhere: every point sampled in the latent
space is decoded to a valid output.
VAEs
Generative Deep Learning
Because epsilon is random, the process ensures that every point that's close to the
latent location where you encoded input_img (z_mean) can be decoded to something
similar to input_img, thus forcing the latent space to be continuously meaningful.
Any two close points in the latent space will decode to highly similar images.
Continuity, combined with the low dimensionality of the latent space, forces every
direction in the latent space to encode a meaningful axis of variation of the data,
making the latent space very structured and thus highly suitable to manipulation
via concept vectors (even if this is unclear, we can appreciate that this is not
intuitive).

The parameters of a VAE are trained via two loss functions:

a reconstruction loss that forces the decoded samples to match the initial inputs
a regularization loss that helps learn well-formed latent distributions and reduces
overfitting to the training data. For the regularization loss, we typically use an
expression (the Kullback-Leibler divergence) meant to nudge the distribution of the
encoder output toward a well-rounded normal distribution centered around 0. This
provides the encoder with a sensible assumption about the structure of the latent
space it's modeling.
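
A minimal sketch of the reparameterization trick and the two losses just described, in the style of Chollet's Keras example (names and shapes are illustrative assumptions):

import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Draws z = z_mean + exp(0.5 * z_log_var) * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Inside a custom training step, the total loss would combine (sketch):
# reconstruction_loss = tf.reduce_mean(tf.reduce_sum(
#     tf.keras.losses.binary_crossentropy(data, reconstruction), axis=(1, 2)))
# kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
#     1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
# total_loss = reconstruction_loss + kl_loss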
VAEs
Generative Deep Learning
Most of the tutorials online demonstrate VAEs that generate the MNIST dataset (like
the original paper).
You can find the code for one here: https://github.com/fchollet/deep-learning-with-
python-notebooks/blob/master/chapter12_part04_variational-autoencoders.ipynb
The model has 3 parts:
An encoder network that turns a real image into a mean and a variance in the latent
space.
A sampling layer that takes such a mean and variance, and uses them to sample a
random point from the latent space.
A decoder network that turns points from the latent space back into images.
Some interesting parts / main ideas:
Convolutional strides are used for downsampling feature maps instead of max
pooling. In general, strides are preferable to max pooling for any model that cares
about information location (... this is an important, generally applicable
statement) - that is to say, where stuff is in the image - and this one does, since
it will have to produce an image encoding that can be used to reconstruct a valid
image.
This is an example of a model that isn't doing supervised learning (an autoencoder
is an example of self-supervised learning, because (i) a loss function is used (if
it were unsupervised, no labels or error functions would exist) and (ii) it uses
its inputs as targets).
It is common to implement this by subclassing and adding loss trackers as
variables. So no loss is passed when the model is compiled or fitted.
The grid of sampled digits shows a completely continuous distribution of the
different digit classes, with one digit morphing into another as you follow a path
through latent space. Specific directions in this space have a meaning: for
example, there are directions for "five-ness," "one-ness," and so on.
VAE implementation in Keras
Generative Deep Learning
Generative adversarial networks (GANs), introduced in 2014 by Goodfellow et al.
(link to the paper: https://arxiv.org/pdf/1406.2661.pdf ) are a mechanism for
learning latent spaces of images. They enable the generation of fairly realistic
synthetic images by forcing the generated images to be statistically almost
indistinguishable from real ones.
There are some examples of services created from this architecture:
https://affinelayer.com/pixsrv/ ;; https://phillipi.github.io/pix2pix/ And of
course there are numerous videos showing the impressive final results of
generated face images of people who do not exist: https://www.youtube.com/watch?
v=BIZg_PPuj_0
StyleGAN (2018, Nvidia) is an implementation of the GAN architecture.
StyleGAN 2
Gender Swap GANs
Image colorization: https://github.com/zhaoyuzhi/VCGAN
GANs
Generative Deep Learning
An intuitive way to understand GANs is to imagine a forger trying to create a fake
Picasso painting. At first, the forger is pretty bad at the task. He mixes some of
his fakes with authentic Picassos and shows them all to an art dealer. The art
dealer makes an authenticity assessment for each painting and gives the forger
feedback about what makes a Picasso look like a Picasso. The forger goes back to
his studio to prepare some new fakes. As time goes on, the forger becomes
increasingly competent at imitating the style of Picasso, and the art dealer
becomes increasingly expert at spotting fakes. In the end, they have on their hands
some excellent fake Picassos. That’s what a GAN is: a forger network and an expert
network, each being trained to best the other. As such, a GAN is made of two parts:
Generator network - takes as input a random vector / randomly sampled noise vector
(a random point in the latent space), and decodes it into a synthetic image
Discriminator network (or adversary) takes as input an image (real or synthetic),
and predicts whether the image came from the training set or was created by the
generator network.
The generator network is trained to be able to fool the discriminator network, and
thus it evolves toward generating increasingly realistic images as training goes
on: artificial images that look indistinguishable from real ones, to the extent
that it’s impossible for the discriminator network to tell the two apart.
Meanwhile, the discriminator is constantly adapting to the gradually improving
capabilities of the generator, setting a high bar of realism for the generated
images. Once training is over, the generator is capable of turning any point in its
input space into a believable image.
Unlike VAEs, this latent space has fewer explicit guarantees of meaningful
structure; in particular, it isn’t continuous.
Remarkably, the generator never sees images from the training set directly; all the
information it has about the data comes from the discriminator.
GANs
Generative Deep Learning
//
GANs
Generative Deep Learning
Let's see a summary video that might explain things better:
https://www.youtube.com/watch?v=ZnpZsiy_p2M
GANs
Generative Deep Learning
Researchers distinguish between several versions of GANs:
Simple ("vanilla") GANs
Deep Convolutional GANs (DCGANs)
Conditional GANs (CGANs)
Wasserstein GAN (WGAN)
CycleGANs - image-to-image problems
More on that:
https://heartbeat.fritz.ai/introduction-to-generative-adversarial-networks-gans-
35ef44f21193
https://machinelearningmastery.com/impressive-applications-of-generative-
adversarial-networks/
GANs
Generative Deep Learning
Image colorization
Deepfakes
Generate fake images
Combine real and fake images into one vector and do the same for labels
Add some randomness to the label distribution
Train the Discriminator with the generated images
Construct another random vector and pass it to the generator
Immediately get the predictions from the discriminator and compute how accurate it
was.
Update the generator's trainable weights according to how poorly the generator did
when generating fake images (we had all labels as "REAL", so the only way the
discriminator returned "FAKE" is if the generator did a poor job - that is why we
penalize it). (See the sketch below.)
Schematic implementation
Generative Deep Learning
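
A compressed sketch of the loop above (Keras; generator, discriminator, a combined gan model with the discriminator frozen inside it, batch_size, latent_dim and a batch of real_images are all assumed to exist):

import tensorflow as tf

# 1) Sample random latent vectors and generate fake images
z = tf.random.normal((batch_size, latent_dim))
fake_images = generator(z)

# 2) Combine real and fake images; label fakes 1 and reals 0
combined_images = tf.concat([fake_images, real_images], axis=0)
labels = tf.concat([tf.ones((batch_size, 1)),
                    tf.zeros((batch_size, 1))], axis=0)

# 3) Add some randomness to the labels - a common GAN trick
labels += 0.05 * tf.random.uniform(tf.shape(labels))

# 4) Train the discriminator on the mixed batch
d_loss = discriminator.train_on_batch(combined_images, labels)

# 5) Train the generator via the combined model: all-"REAL" targets
#    penalize the generator whenever the discriminator spots its fakes
z = tf.random.normal((batch_size, latent_dim))
misleading_labels = tf.zeros((batch_size, 1))
g_loss = gan.train_on_batch(z, misleading_labels)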
More interesting points:
The generator is a deconvolutional "network" (or at least part of it is)
The discriminator has a Dense(1) layer at the end + sigmoid ⇒ (0,1).
Generator and Discriminator have separate loss metrics and optimizers (unlike a VAE)
At the very first iteration, the generator executes first (inference), then
discriminator weights are updated, then generator weights are updated
GANs are notoriously hard to train; there is a "bag of tricks" approach, not a
"guidelines and rules" approach.
GAN implementation in Keras
Generative Deep Learning
A GAN is a system where the optimization minimum isn't fixed, unlike in any other
training setup in most deep learning problems. Normally, gradient descent consists
of rolling down hills in a static loss landscape. But with a GAN, every step taken
down the hill changes the entire landscape a little. It's a dynamic system where
the optimization process is seeking not a minimum, but an equilibrium between two
forces. For this reason, GANs are notoriously difficult to train - getting a GAN to
work requires lots of careful tuning of the model architecture and training
parameters. This is made even more difficult because training is slow, due to the
large networks and high-dimensional images involved.
In short: GANs are difficult to train, because training a GAN is a dynamic process
rather than a simple gradient descent process with a fixed loss landscape. Getting
a GAN to train correctly requires using a number of heuristic tricks, as well as
extensive tuning: https://github.com/soumith/ganhacks (these tricks need
verification and testing!)
Equilibrium requirement
Generative Deep Learning
Find other datasets that are used with GANs for face generation or non-face-related
tasks
Answer questions for VAE:
What is the main difference between classic autoencoders and VAEs - is it the usage
of the statistical mean and variance of the encoding?
Why are mean and variance (log variance) used; what's the benefit?
Find a VAE example with more realistic image data - what can be achieved?
Make a GAN that is as simple as possible - remove layers from the generator and
discriminator. Train it with simple datasets (grayscale images of faces or other).
Does it make sense to use the MNIST dataset with a GAN? It should generate a
slightly modified version of any particular number, so it should make sense. Try it.
Further explorations
Generative Deep Learning
For this part - create visual art with your own image(s) by applying either deep
dreaming or neural style transfer techniques learned in this part (take into
account that deep dreaming works best with images that are similar to the ones the
transfer learning model was trained on - similar, but not the same. Take resizing
into consideration).
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 12 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the Colab notebook (double check the share options of the
notebook) when finished for review and evaluation.
Practical Project 12
Generative Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Generative Deep Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1QiGY05MMYlFI4voSiTeb6kkuS5cj85bt.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Unit Testing
Pytest
01
02
03
Unittest
Python Crash Course
00
Testing
05
Mocking
06
Job Scheduling: cron
07
Job Scheduling: windows task scheduler
08
Job Scheduling: jenkins
Testing
Testing - the activity people perform to ensure that the metrics and behavior of
software meet expectations.
Some argue that automated tests are not part of testing (notably James Bach). The
argument is that testing is a critical activity that is not repetitious but
investigatory / exploratory by definition.
Expectations need to be known - either explicitly defined by the client
(validation) or derived via common sense / technical-expert knowledge
(verification).

Regardless of the definition, it is a fact that test automation is a hugely
important field in software engineering and is performed at almost every software
company.
Python Crash Course
Testing
Test levels (corresponding to programming abstractions):
Component / unit - classes tested via their public methods (public API) or just
functions. Performed w/o connecting to the database, filesystem or external
services, in order to be fast, provide fast feedback, and only fail when the logic
is incorrect, not due to externals.
Integration - several classes or functions together. There is no standard
definition of what constitutes an integration test. It can be close to the system
level or close to the unit level. Sometimes the persistence layer and external
services are exercised, sometimes not.
System / e2e - testing performed by exercising the functionality of the entire
system, mostly from the end-user perspective. The persistence layer and external
services are definitely exercised.
Acceptance - testing with the client or a representative of the client (the product
owner in scrum). Alpha and Beta testing are examples.

The difference can be understood by thinking about the creation of an automobile
or a plane:
Unit: valve springs and the timing belt are tested individually (mechanical engineers)
Integration: the engine is assembled and tested on a testbed (automotive engineers)
System: the car is driven in various conditions (rain, temperature, altitude,
speeds) (test drivers)

Python Crash Course


Testing
Test types:
Functional - testing what the application does and whether it's correct.
Non-functional - testing how the application does what it does (how fast, how
securely, how usable the system is).

Putting levels and types together:


Python Crash Course
Unit testing is usually automated. Automated testing or test automation is the
process of writing code that calls the main logic of your program (functions),
gives predefined inputs to it, and verifies that the expected output is produced.
We can impose even more structure on a unit test using the AAAT / GWTT patterns -
the fundamental parts of any automated test:
arrange / act / assert / teardown
given / when / then / teardown

Since code is testing other code, why don't we test the tests? Because unit tests
should be as simple as mathematical axioms. Namely, unit tests should be independent
from each other:
No data should be created (or left over after the test)
No data should be used that is not created by the test
No other side effects / no global, persistent side effects

Benefits:
Helps you design better - especially if you do test-after, test-first or test-
driven development. To do unit testing you need units!
Helps you refactor with confidence, as when the tests pass after refactoring you
know you haven't made a mess
Helps you be confident in your code, as if the tests pass you know you did at least
OK
Helps you debug code faster, as the error is pinpointed right away
Helps you be more productive in the long and medium run (not so much in the short run):
https://medium.com/swlh/why-invest-in-unit-testing-8f1bdc2d688e
Provides "executable documentation" - executable documentation is usually in sync
with the code, while non-executable documentation can become inconsistent
Protects against regressions

Drawbacks
Requires you to write more code, and when code is rewritten the test code also needs
to change
Need to know and learn more concepts and more tooling
Unit Testing
Python Crash Course
Unittest
A built-in testing module in Python. No install needed.
Available since 2001, based on JUnit (xUnit architecture)
Has all the testing-framework tools needed:
assertions
test runner
reporting capabilities
mocking capabilities
We are going to implement and test a PhoneBook class, reqs:
ability to add a name with an associated phone number
ability to retrieve a phone number via a name
validity: the phone book can't contain numbers with the same beginning
We will use TDD
Launching
Command line: python -m unittest test.test_phonebook.PhoneBookTest (don't forget
__init__.py if you are launching from a different folder)
Pycharm test runner
Python Crash Course
import unittest

class PhoneBook:
    def __init__(self):
        self.dict = {}

    def add(self, name, num):
        self.dict[name] = num

    def find(self, name):
        return self.dict[name]

class PhoneBookTest(unittest.TestCase):
    def test_find_by_name(self):
        # given / arrange
        phonebook = PhoneBook()
        phonebook.add("Jonas", "+370-111-1111")

        # when / act
        number = phonebook.find("Jonas")

        # then / assert
        self.assertEqual("+370-111-1111", number)
        # teardown

    def test_fail1(self):
        self.assertEqual(1, 2)

    def test_fail2(self):
        self.assertEqual(1, 2)
Unittest
Negative testing (negative path) - expecting that when incorrect input is given, an
appropriate error / exception will be thrown (the opposite of the happy path).

Skipping tests
Extracting a test fixture.
A test fixture is just code that supports your test methods.
For example setUp(self) and tearDown(self)
These methods will usually be called by the test framework
tearDown() will be called even if an exception is thrown in the test
The test will not be called if an error is thrown in the setUp() method.
Python Crash Course
def test_when_desired_name_not_in_phonebook(self):
    # given / arrange
    phonebook = PhoneBook()
    phonebook.add("Jonas", "+370-111-1111")

    # then / assert / act
    with self.assertRaises(KeyError):
        phonebook.find("Petras")

    # teardown

class PhoneBookTest(unittest.TestCase):
    def setUp(self) -> None:
        self.phonebook = PhoneBook()

    def tearDown(self) -> None:
        pass

    def test_find_by_name(self):
        self.phonebook.add("Jonas", "+370-111-1111")
        number = self.phonebook.find("Jonas")
        self.assertEqual("+370-111-1111", number)

    def test_when_desired_name_not_in_phonebook(self):
        self.phonebook.add("Jonas", "+370-111-1111")
        with self.assertRaises(KeyError):
            self.phonebook.find("Petras")

    # @unittest.skip("WIP")
    def test_empty_phonebook_is_consistent(self):
        self.assertTrue(self.phonebook.is_consistent())

    @unittest.skip("Not used")
    def test_fail1(self):
        self.assertEqual(1, 2)

    @unittest.skip("Not used")
    def test_fail2(self):
        self.assertEqual(1, 2)
Unittest
Bad unit test design

Better test design

Follow AAAT / GWTT
Multiple setup steps are acceptable and common
Multiple assertions are acceptable, but less common (see if the test case can't be
split in that case)
Test naming conventions:
https://dzone.com/articles/7-popular-unit-test-naming
https://stackoverflow.com/questions/20513596/pep8-naming-convention-on-test-classes
Python Crash Course
Unittest
Code and additional remarks:
Test discovery: https://stackoverflow.com/a/15630454/1964707
Be sure to use appropriate assertions

Python Crash Course


import unittest

class PhoneBook:
    def __init__(self):
        self.dict = {}

    def add(self, name, num):
        self.dict[name] = num

    def find(self, name):
        return self.dict[name]

class PhoneBookTest(unittest.TestCase):
    def test_find_by_name(self):
        # given
        phonebook = PhoneBook()
        phonebook.add("Jonas", "+370-111-1111")

        # when
        number = phonebook.find("Jonas")

        # then
        self.assertEqual("+370-111-1111", number)

        # teardown
Unittest
Exercise:
Create a class Calculator in src/ directory (use the methods defined below)
Write tests for the calculator in the test/ directory w/ the unittest library: for
methods: subtract() and divide()
Think about how many tests do you need for each method and why (BVA, EQP - black
box techniques, if you developed the function under test (FUT / AUT / SUT) you
might not need it).
Python Crash Course
Pytest
Pytest is an alternative to unittest
The unittest module is not "pythonic", as it was designed to be similar to the
xUnit family of tools. A more pythonic alternative was created when the reasons to
stay close to the xUnit patterns were no longer relevant. So pytest is not part of
the xUnit family.
Not included in the standard Python distro, so it needs to be installed w/ pip.
Launching: pytest test\test_phonebook.py or using a Pycharm configuration (need to
create it manually)
With pytest you need to import fewer modules. For example, we can use the default
assert Python keyword for assertions.
More:
https://realpython.com/pytest-python-testing/
https://docs.pytest.org/en/6.2.x/
Python Crash Course
class PhoneBook:
    def __init__(self):
        self.dict = {}

    def add(self, name, num):
        self.dict[name] = num

    def find(self, name):
        return self.dict[name]

    def get_all_names(self):
        return self.dict.keys()

def test_find_by_name():
    # given
    phonebook = PhoneBook()
    phonebook.add("Jonas", "+370-111-1111")

    # when
    number = phonebook.find("Jonas")

    # then
    assert "+370-111-1111" == number

import pytest
from src.phonebook import PhoneBook

def test_find_by_name():
    phonebook = PhoneBook()
    phonebook.add("Jonas", "+370-111-1111")
    number = phonebook.find("Jonas")
    assert "+370-111-1111" == number

def test_phonebook_contains_all_names():
    phonebook = PhoneBook()
    phonebook.add("Jonas", "+370-111-1111")
    phonebook.add("Petras", "+370-111-1112")
    assert "Jonas" in phonebook.get_all_names() \
        and "Petras" in phonebook.get_all_names()

def test_missing_name_raises_keyerror():
    phonebook = PhoneBook()
    phonebook.add("Jonas", "+370-111-1111")
    with pytest.raises(KeyError):
        # "Petras" was never added, so find() raises KeyError
        # (the original looked up the name it had just added, which would not raise)
        phonebook.find("Petras")
Pytest
Python Crash Course
Pytest
Test fixtures are implemented more concisely, as functions passed to tests.
All available at: pytest --fixtures
We can supply a new object before each test
We can also clean up after (delete a file)
And inject a temporary directory - tmpdir. This is another (built-in) test fixture
that the fixture we defined is using.
Using markers:
@pytest.mark.skip
Custom markers can be defined:
https://docs.pytest.org/en/6.2.x/example/markers.html
Html reports: pip install pytest-html && pytest --html=report.html or python -m
pytest test/ --html=report.html
To print to stdout use: pytest -s
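
A minimal sketch of these fixture ideas (the PhoneBook import and its use of tmpdir are illustrative assumptions):

import pytest
from src.phonebook import PhoneBook

@pytest.fixture
def phonebook(tmpdir):
    # Supply a fresh object before each test; tmpdir is pytest's built-in
    # temporary-directory fixture, unique to this test invocation
    print(tmpdir)  # could hold a cache file the PhoneBook writes to
    phonebook = PhoneBook()
    yield phonebook
    # cleanup code (e.g. deleting files) would go here, after the yield

def test_find_by_name(phonebook):
    phonebook.add("Jonas", "+370-111-1111")
    assert phonebook.find("Jonas") == "+370-111-1111"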
Python Crash Course
Pytest
Test doubles:
an xUnit category.
Mock, spy, fake and others, see:
https://martinfowler.com/articles/mocksArentStubs.html#TheDifferenceBetweenMocksAndStubs

Sometimes called mocks generically, but "mock" means a specific test double.
They are useful when you want to make your test independent of the infrastructure
by providing a mock object instead of a real one representing the infrastructure
(repository). Or to provide other features when testing (like tracking how many
times something was called).
Python Crash Course
Pytest
Mocking:
Data obtained from data.csv with the following data:

Jonas
Petras
Jonas
Antanas
Mindaugas
Python Crash Course
import csv
from unittest.mock import MagicMock

class PhoneBookRepository:
    def __init__(self):
        self.file = '../src/data.csv'

    def get_all(self):
        print("File will be opened!")
        with open(self.file, newline='\n') as csvfile:
            reader = csv.reader(csvfile, delimiter=',')
            data = list(reader)
        return data

class PhoneBookService:
    def __init__(self, repo: PhoneBookRepository):
        self.repo = repo

    def get_most_popular_name(self):
        # 0 .. initial implementation
        # return self.repo.get_all()

        # 1 .. final implementation
        flat_list = [item for sublist in self.repo.get_all() for item in sublist]
        from statistics import mode
        return mode(flat_list)

# pbs = PhoneBookService(PhoneBookRepository())
# print(pbs.get_most_popular_name())

def test_given_single_name_in_repo_service_returns_that_name_with_mocking():
    # arrange
    pbr = PhoneBookRepository()
    pbr.get_all = MagicMock(return_value=[["Jonas"]])
    pbs = PhoneBookService(pbr)

    # act
    res = pbs.get_most_popular_name()

    # assert
    assert res == "Jonas"
    assert pbr.get_all.call_count == 1

def test_given_a_common_name_in_repo_service_returns_that_name_with_mocking():
    # arrange
    pbr = PhoneBookRepository()
    pbr.get_all = MagicMock(return_value=[["Jonas"], ["Krambalkis"], ["Petras"],
                                          ["Antanas"], ["Krambalkis"]])
    pbs = PhoneBookService(pbr)

    # act
    res = pbs.get_most_popular_name()

    # assert
    assert res == "Krambalkis"
    assert pbr.get_all.call_count == 1

def test_given_multiple_names_with_same_count_in_repo_service_returns_first_name_with_mocking():
    # arrange
    pbr = PhoneBookRepository()
    pbr.get_all = MagicMock(return_value=[
        ["Jonas"], ["Krambalkis"], ["Petras"], ["Antanas"], ["Petras"],
        ["Krambalkis"]])
    pbs = PhoneBookService(pbr)

    # act
    # pbs.get_most_popular_name()  # this would provoke a 2nd call to get_all()
    res = pbs.get_most_popular_name()

    # assert
    assert res == "Krambalkis"
    assert pbr.get_all.call_count == 1

Doctest
Another library, helpful for testing the docstring documentation inside your code.
It uses special syntax in the docstrings to recognize where a test starts, and runs it.
See: https://docs.python.org/3/library/doctest.html
Python Crash Course
def factorial(n):
    """Return the factorial of n, an exact integer >= 0.

    >>> [factorial(n) for n in range(6)]
    [1, 1, 2, 6, 24, 120]
    >>> factorial(30)
    265252859812191058636308480000000
    >>> factorial(-1)
    Traceback (most recent call last):
        ...
    ValueError: n must be >= 0

    Factorials of floats are OK, but the float must be an exact integer:
    >>> factorial(30.1)
    Traceback (most recent call last):
        ...
    ValueError: n must be exact integer
    >>> factorial(30.0)
    265252859812191058636308480000000

    It must also not be ridiculously large:
    >>> factorial(1e100)
    Traceback (most recent call last):
        ...
    OverflowError: n too large
    """

    import math
    if not n >= 0:
        raise ValueError("n must be >= 0")
    if math.floor(n) != n:
        raise ValueError("n must be exact integer")
    if n+1 == n:  # catch a value like 1e300
        raise OverflowError("n too large")
    result = 1
    factor = 2
    while factor <= n:
        result *= factor
        factor += 1
    return result

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Job Scheduling: cron


Cron is a task / job scheduler for Linux distros, configurable via the command line.
Cron syntax: https://crontab.guru/#*_*/5_*_*_* . Every minute: * * * * * Every
five minutes: */5 * * * * Every hour: 0 * * * * or @hourly
Cron commands in Linux: crontab -l and crontab -e
You can try out Linux servers cheaply @ www.digitalocean.com
You can redirect the error and standard out to different files
use >> for append and > for (over)write: python3 script.py 2>>err.log 1>>out.log
* * * * * python3 /home/mindaugas/.../script.py 2>>/home/mindaugas/.../err.log
1>>/home/mindaugas/.../out.log
Python Crash Course
import random

rand_num = random.randint(0, 1)

if rand_num == 0:
    print("It's 0!")
else:
    raise Exception("It's 1!")

Job Scheduling: windows task scheduler


GUI task / job scheduler for Windows.
You can open it from the start menu, just write: "task scheduler"
Configure it to run every minute
Create a .bat file:

@ECHO OFF
"C:\U\.\AppData\Local\Programs\Python\Python39\python.exe" "C:\U\.\Desktop\Projects\CAAI\CAGitDemo\test\cmd_script.py"
REM PAUSE
EXIT

Command: C:\Users\..\Desktop\Projects\CAAI\CAGitDemo\test\command.bat
Arguments: 1>>C:\Users\Mindaugas\Desktop\Projects\CAAI\CAGitDemo\test\out.log
2>>C:\Users\Mindaugas\Desktop\Projects\CAAI\CAGitDemo\test\err.log
We will use the same script as with Linux!
Run the task in cmd to check if it works before scheduling
You can enable the history of the task to debug better (see next slide)
Refs:
https://www.windowscentral.com/how-create-automated-task...
https://datatofish.com/python-script-windows-scheduler/
Python Crash Course
Job Scheduling: windows task scheduler
Log with debugging info
Python Crash Course
Job Scheduling: jenkins
Web app for performing all kinds of tasks: running tests, jobs, reports and so on.
It can also be used for periodically recurring tasks.
Python Crash Course
Job Scheduling: jenkins
Automating periodic tasks with Jenkins
Creating a new freestyle project
Execute Windows Batch Command:

cmd /Q /C "C:\Users\Mindaugas\AppData\Local\Programs\Python\Python39\python.exe C:\Users\Mindaugas\Desktop\Projects\CAAI\CAGitDemo\test\cmd_script.py" 1>>C:\Users\Mindaugas\Desktop\Projects\CAAI\CAGitDemo\test\out.log 2>>C:\Users\Mindaugas\Desktop\Projects\CAAI\CAGitDemo\test\err.log
Python Crash Course
import time
import random

rand_num = random.randint(0, 1)
time.sleep(5)

if rand_num == 0:
    print("It's 0!")
else:
    raise Exception("It's 1!")

Job Scheduling: jenkins


Adding periodicity:
Python Crash Course
Job Scheduling: team city
TBD
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1e8OdbHCykOxvEptI55oIwOXtysddUyIa.pptx ---


Artificial Intelligence
Generative Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Content Loss
01
02
Style Loss
Generative Deep Learning
00
Introduction
Gram matrix
03
Style Transfer Model
05
04
General Process
06
Further explorations
In addition to DeepDream and text generation, another major development in deep-learning-driven image modification is neural style transfer, introduced by Leon Gatys et al. in the summer of 2015.
The neural style transfer algorithm has undergone many refinements and spawned many
variations since its original introduction, and it has made its way into many
smartphone photo apps. For simplicity, we will focus on the formulation described
in the original paper.
Definition (see side pic).
Introduction
Generative Deep Learning
Style essentially means textures, colors, and visual patterns in the image
(microstructure), at various spatial scales;
Content is the higher-level macrostructure of the image.
For instance, blue-and-yellow circular brushstrokes are considered to be the style
(using Starry Night by Vincent Van Gogh), and the buildings in the Tübingen
photograph are considered to be the content.
The idea of style transfer, which is tightly related to that of texture generation,
has had a long history in the image-processing community prior to the development
of neural style transfer in 2015. But as it turns out, the deep-learning-based
implementations of style transfer offer results unparalleled by what had been
previously achieved with classical computer-vision techniques, and they triggered
an amazing renaissance in creative applications of computer vision.
Introduction
Generative Deep Learning
The key notion behind implementing style transfer is the same idea that’s central
to all deep-learning algorithms: you define a loss function to specify what you
want to achieve, and you minimize this loss. You know what you want to achieve:
conserving the content of the original image while adopting the style of the
reference image. If we were able to mathematically define content and style, then
an appropriate loss function to minimize would be the following:

loss = distance(style(reference_image) - style(generated_image)) +
       distance(content(original_image) - content(generated_image))

Here, distance is a norm function such as the L2 norm, content is a function that takes an image and computes a representation of its content, and style is a function that takes an image and computes a representation of its style. Minimizing this loss causes style(generated_image) to be close to style(reference_image), and content(generated_image) to be close to content(original_image), thus achieving style transfer as we defined it.
A fundamental observation made by Gatys et al. was that deep convolutional neural
networks offer a way to mathematically define the style and content functions.
Let’s see how.
Introduction
Generative Deep Learning
As you already know, activations from earlier layers in CNN contain local
information (think: “small features”) about the image, whereas activations from
higher layers contain increasingly global, abstract information.
Formulated in a different way, the activations of the different layers of a convnet
provide a decomposition of the contents of an image over different spatial scales.
Therefore, you’d expect the content of an image, which is more global and abstract,
to be captured by the representations of the upper layers in a convnet. A good
candidate for content loss is thus the L2 norm between the activations of an upper
layer in a pretrained convnet, computed over the target image, and the activations
of the same layer computed over the generated image.
This guarantees that, as seen from the upper layer, the generated image will look similar to the original target image. Assuming that what the upper layers of a convnet see is really the content of their input images, this works as a way to preserve image content.
So essentially we pass the content image and the generated image and compare the
activations at some high layer in the CNN architecture.
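A minimal content-loss sketch, in the spirit of the Keras/TensorFlow style-transfer examples (which layer's activations you pass in is your choice, e.g. an upper VGG19 layer):

import tensorflow as tf

def content_loss(content_features, generated_features):
    # Squared L2 distance between the activations of one upper layer,
    # computed over the content image and the generated image.
    return tf.reduce_sum(tf.square(generated_features - content_features))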
Content Loss
Generative Deep Learning
The content loss only uses a single upper layer, but the style loss as defined by Gatys et al. uses multiple lower layers of a convnet: you try to capture the appearance of the style-reference image at all spatial scales extracted by the convnet, not just a single scale.
For the style loss, Gatys et al. use the Gram matrix of a layer’s activations: the
inner product of the feature maps of a given layer. This inner product can be
understood as representing a map of the correlations between the layer’s features.
These feature correlations capture the statistics of the patterns of a particular
spatial scale, which empirically correspond to the appearance of the textures found
at this scale.
Hence, the style loss aims to preserve similar internal correlations within the
activations of different layers, across the style-reference image and the generated
image. In turn, this guarantees that the textures found at different spatial scales
look similar across the style-reference image and the generated image.
Style Loss
Generative Deep Learning
Feature maps alone are not enough to capture style: in “Starry Night” the circles mostly appear in yellow - these features are highly correlated.
Correlations between feature maps allow computing a more abstract notion of style.
These correlations are captured by using Gram matrices.
Explanation with references to papers: https://www.youtube.com/watch?v=Elxnzxk-AUk - however, note that this video does not explain the underlying reasons and does not give an intuition on why Gram matrices capture the high-level property of feature-map correlation. For this we would need to delve into the maths or some simplified examples (MWEs).
MWE example: extract 4 feature maps from a conv2d(4, kernel_size=(3, 3)) layer - these are “decorrelated”/separate, so we need to combine them somehow; then compute the Gram matrix and apply the computed matrix to the style image to see what you get.
Ref: https://www.tensorflow.org/tutorials/generative/style_transfer#calculate_style
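A minimal Gram matrix sketch, following the TensorFlow tutorial referenced above (feature maps assumed to have shape (batch, height, width, channels)):

import tensorflow as tf

def gram_matrix(feature_maps):
    # Channel-to-channel correlations: sum the outer product of channel
    # activations over every spatial location, then normalize.
    result = tf.linalg.einsum('bijc,bijd->bcd', feature_maps, feature_maps)
    shape = tf.shape(feature_maps)
    num_locations = tf.cast(shape[1] * shape[2], tf.float32)
    return result / num_locations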
Gram matrix
Generative Deep Learning
In short, you can use a pretrained convnet to define a loss that will do the
following:
Preserve content by maintaining similar high-level layer activations between target
content image and generated image. The convnet should “see” both the target image
and the generated image as containing the same things.
Preserve style by maintaining similar correlations within activations for both low-
level layers and high-level layers. Feature correlations capture textures: the
generated image and the style-reference image should share the same textures at
different spatial scales.

This is the general process:


Set up a network that computes VGG19 layer activations for the style-reference
image, the target image, and the generated image at the same time.
Use the layer activations computed over these three images to define the loss
function described earlier, which you’ll minimize in order to achieve style
transfer.
Set up a gradient-descent process to minimize this loss function.
IMPORTANT: during the training process for this problem the weights in the CNN will remain constant. WE DO NOT TRAIN THE NETWORK IN THIS PROBLEM! WE MODIFY THE TARGET IMAGE until the differences in style + content are the lowest possible.
General Process
Generative Deep Learning
There are many implementations, see:
https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/
chapter12_part03_neural-style-transfer.ipynb
https://www.tensorflow.org/tutorials/generative/style_transfer

The interesting parts are:


Defining the different weights → It’s always important to think about which levers we can pull to obtain different results from the neural network - these are hyperparameters. They dictate the contribution of each separate loss to the total loss - increasing a weight’s magnitude will make, for example, style more important when calculating the loss
ExponentialDecay() → something we saw with Fast.ai, now we see with Keras. We could
run the model w/o it to see what happens.
combination_image = tf.Variable(preprocess_image(base_image_path)) → combination
image is first obtained from the base image! It’s not a blank slate from the start!
Style Transfer Model
Generative Deep Learning
# Weights of the different loss components
total_variation_weight = 1e-6
style_weight = 1e-6
content_weight = 2.5e-8
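How the weights above enter the total loss (a sketch only; the three loss values below are placeholder constants standing in for the real content, style and total-variation terms computed from VGG19 activations):

import tensorflow as tf

total_variation_weight = 1e-6
style_weight = 1e-6
content_weight = 2.5e-8

# Placeholder loss terms (in the real example these come from the network):
c_loss, s_loss, tv_loss = tf.constant(3.0), tf.constant(2.0e4), tf.constant(1.0e5)

loss = (content_weight * c_loss
        + style_weight * s_loss
        + total_variation_weight * tv_loss)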
Use a well (pre)trained network.
The loss function is combined from style and content losses.
The CNN’s lower layers are used to extract low-level features - style.
The CNN’s higher layers are used to extract high-level features - content.
The total loss is a combination of the two.
The Gram matrix is used to combine the separate feature maps obtained from the lower layers of the CNN: gram_matrix(style_feature_maps)
We do not train the network; we apply the gradients to the target_image! Network weights do not change - the optimization changes the image.
Summary
Generative Deep Learning
How many things can we change in the context of Neural Style Transfer to obtain different-looking target images? (TL model, layers for style (count and which ones), layers for content (count and which ones), epoch count, the dataset the TL model was trained on). Also parameters like style_weight and content_weight.
How can we optimize the NST network? Can we even compare which one is better? A generated target image is hard to compare to other similar target images, and it does not make sense to ask what the current SOTA is on NST of your cat and some Picasso image. But it does make sense to compare several models and see both what loss they achieve and how fast they achieve it. The loss is a precise measurement of how much style and content the target image reflects. Thus we can compare models and find a better one.
Questions
Generative Deep Learning
What were the results of style transfer in the past, what are the classical
approaches (prior to 2015)?
You can easily experiment on your own images.
To understand Gram matrix, create some MWE’s.
Fast style transfer: research on other options to transplant style from another
image. For example fast style transfer can be achieved by first spending a lot of
compute cycles to generate input-output training examples for a fixed style-
reference image, using the method outlined here, and then training a simple convnet
to learn this style-specific transformation (... or should it be encoder-decoder /
GAN / Transformer). Once that’s done, stylizing a given image is instantaneous:
it’s just a forward pass of this small convnet.
Use a different neural network for style extraction and content extraction (vgg16
vs xception)? Experiment.
See also: https://arxiv.org/abs/2008.03706
Styling a video: https://github.com/lengstrom/fast-style-transfer + with custom NST
Explore this:
https://www.tensorflow.org/hub/tutorials/tf2_arbitrary_image_stylization
Multiple style images (https://reiinakano.com/arbitrary-image-stylization-tfjs/ ).
If the loss is additive why can’t we add another additive term for style_img2?
Explore.
Is NST repeatable? If neither the neural network parameters nor the images change, will you always get the same repeatable result?
Further explorations
Generative Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Generative Deep Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1SXB41c2zRsLKLGJowk3m6RqHRU621JBH.pptx ---


Artificial Intelligence
Python Crash Course
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Variables
Text processing
01
02
03
Operators
Python Crash Course
00
Comments
04
User input (input)
05
Environment variables
Comments
Python Crash Course
Comments:
help when debugging and experimenting with code
help you remember what you intended to do after some time has passed
help others understand the code (documentation)
are not executed - the interpreter ignores them
Python comments start with #
Multiline comments can also be used.
Ctrl+/ - key combination (hotkeys, comment shortcut)

Ref: https://www.w3schools.com/python/python_comments.asp
Variables
Python Crash Course
What variables are:
A variable is a piece of memory that has a name, a value and a type.
A metaphor for understanding them: a box containing an item. To use that item, we call the variable by its name.
In most languages values are referenced by name, but there are exceptions where we access a value by its position in a collection (more on that in later lectures), i.e. with the help of an index.
Ref: https://www.w3schools.com/python/python_datatypes.asp
x = “Jonas”; x[0] vs. “Jonas”[0] - interchangeability between a literal value and a variable name.
We can find out the data type of a variable with the type() function.

Scalar variables: integer, float, complex, null object, bool. More about numbers: https://www.w3schools.com/python/python_numbers.asp

Most important: int, float, dict, bool, str, list

Today we will also talk about strings; in the next lecture, about collections (Set, Tuple, etc.). Of course, you can also create “user defined types” with object-oriented programming mechanisms. We will talk about those soon as well.
Variables
Python Crash Course
Integer:
- ~unlimited size (2 ** 128)
- many of the basic numeric literal forms can be represented: 10, 0b10, 0x10, 0o10
- we can convert numbers with the int() function (bin() if we want a binary representation): int(3.8) and int("3.8") - why does 3.8 become 3? (note: int("3.8") actually raises a ValueError; int(float("3.8")) truncates to 3)
- we can also convert numbers from other number systems: int("1000", 2) → 8
- the reverse conversion is possible with special functions: bin(8)

More about variable conversion (casting):

https://www.w3schools.com/python/python_casting.asp
Pseudo-random number generation: https://www.w3schools.com/python/python_numbers.asp (at the bottom)
Variables
Python Crash Course
A digression about number systems
Variables
Python Crash Course
Float (floating point numbers)
IEEE 754 floats, 53 bits of binary precision.
Full arithmetic is defined for such 3-part numbers (sign, exponent, mantissa).
We can use scientific notation: 3e8 → 300000000.0
print(30_00000_00000_00000_00000.0) → 3e21 → 3 * 10^21
Operator semantics:
<int> + <float> → <float>
<float> + <int> → <float>
<str> + <int> → ??? (in JS?)
<int> / <int> → ???
NaN - “not a number” acronym
Dealing with imprecision / handling money, example: 0.1 + 0.2 == 0.3 ⇒ False
Round both sides to some acceptable precision
Use decimal: https://docs.python.org/3/library/decimal.html Decimal('0.1')
Convert to integer if appropriate
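A quick check of the imprecision and the Decimal fix:

from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                   # False - binary floats
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True - exact decimal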
Variables
Python Crash Course
Complex
Python is a “scientific language” - much liked in the scientific community.
A dedicated type exists for representing complex numbers (with an imaginary part).
The imaginary part is written with j or J - the convention comes from electronics, where “i” is reserved for current.
Not only initialization but also operations are supported. Let’s try: 4-3j + -8+2j
More: https://www.tutorialspoint.com/complex-numbers-in-python
Variables
Python Crash Course
None
used the same way as null in other languages
the is operator is used to check for it
Variables
Python Crash Course
Bool
George Boole
True and False
other values passed to the bool constructor can be “truthy” or “falsy” (implicit boolean conversion)
More: https://www.w3schools.com/python/python_booleans.asp
Variables
Python Crash Course
Identifiers and reserved words:
Identifiers can be a combination of lowercase letters (a to z), uppercase letters (A to Z), digits (0 to 9) or an underscore _. Names like myClass, var_1, print_this_to_screen(), _do_something() are all valid examples.
Since Python 3 is more or less Unicode compliant, other languages can be used for names - but this is strongly discouraged; use ASCII names (we will discuss what that means in a moment).
Names made of 2 or more words: snake_case for variables and functions, PascalCase for classes: https://visualgit.readthedocs.io/en/latest/pages/naming_convention.html
We cannot use keywords as names for variables, functions (str(), list()) and other things.
Variables
Python Crash Course
A digression about ASCII, Unicode and UTF-8:
ASCII - a 7-bit encoding holding only English characters and other essential symbols.
Unicode is a charset: numbers assigned to the characters of all languages. It solved the many-encodings problem that appeared as all countries started using computers.
UTF-8 (Unicode Transformation Format – 8-bit) - the encoding used to represent Unicode in computer memory.
Uses 1 to 4 bytes to encode a single character; multibyte, variable-width.
ASCII compatible - ASCII characters fit in a single byte in both the ASCII and UTF-8 encodings.
Unicode represents more than 149K+ characters. Even emoji! 🤑
Character codes have the "U+" prefix.
Alt codes: alt+0128 for the euro symbol, alt+0036 for the dollar symbol.
More: https://www.joelonsoftware.com/2003/...
Operators
Python Crash Course
A possible classification:
Arithmetic operators (more Python-specific: ** , // , %)
Assignment operators (more Python-specific: there is no i++ / ++i; multiple assignment is convenient; you can swap values with a, b = b, a)
Comparison operators (==, != , >, < , <= , >=)
Logical operators (more Python-specific: and, or, not)
Identity operators (more Python-specific: is, is not - we saw them when checking for None)
Membership operators (will discuss with loops and sets): in
Bitwise operators (mostly for low-level programming or micro-optimizations, when we treat variables not by their data type but extract the bits from them (operate on them at the bit level)): & , | Ref: https://stackoverflow.com/questions/2096916/real-world-use-cases-of-bitwise-operators
Ref: https://www.w3schools.com/python/python_operators.asp

Operator associativity and precedence:

Intuitive, as in mathematics.
If you are not sure (or think others will not understand) - add parentheses.
A confusing case: -3 ** 2 vs. (-3) ** 2 (see the sketch below)
Ref: https://www.programiz.com/...
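A quick check of the confusing precedence case and the swap idiom:

print(-3 ** 2)     # -9: ** binds tighter than unary minus
print((-3) ** 2)   # 9
a, b = 1, 2
a, b = b, a        # swap without a temporary variable
print(a, b)        # 2 1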
Text processing
Python Crash Course
Text in Python is represented by String-type variables:
Immutable (check with the id() function); Unicode strings since Python 3: python -c "print(\"Берн\")"
Can be iterated like lists (iterable) and sliced, since they are sequence types.
Strings in Python have a rich API in the standard library, so there is no need to write your own methods or use third-party ones.
More about Strings: https://www.w3schools.com/python/python_strings.asp
Subtleties of string formatting: https://realpython.com/python-string-formatting/ https://www.w3schools.com/python/ref_string_format.asp
Methods / features worth knowing (a few are shown in the sketch below):
split()
partition() - returns three values: the left part, the separator (often: _) and the right part
join() - preferred over concatenation, see: https://stackoverflow.com/a/3055541/1964707
len()
index()
count()
"Hello, World!"[7:12:2]
"Hello, World!"[::-1]
upper() / lower() / capitalize()
strip()
format() → less often used since the appearance of f-strings
… we will talk about slicing and iteration in more detail next time.
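A few of the methods above in action:

s = "Hello, World!"
print(s.split(", "))              # ['Hello', 'World!']
print("2024_01".partition("_"))   # ('2024', '_', '01')
print("-".join(["a", "b", "c"]))  # a-b-c
print(s[7:12:2])                  # Wrd
print(s[::-1])                    # !dlroW ,olleH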
Text processing
Python Crash Course
The print function:
Can do a lot of things and is not so simple.
Displays several values at once.
Lets you define the separator between the values (parameters) given to it [...sep=” “]
Lets you add a string after the last value with end=””
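For example:

print("a", "b", "c", sep="-", end="!\n")   # prints: a-b-c!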
Text processing
Python Crash Course
Escape sequences:
Python has a handy feature: universal newline support - newlines are translated to/from the platform convention (\n vs \r\n) automatically.
Formatting minilanguage: https://learnpython.com/blog/python-string-formatting/
Raw strings:
To avoid heavy use of escape sequences we can use raw strings.
WYSIWYG principle, syntax: r’’
User input (input)
Python Crash Course
For our programs to be interesting and useful, we cannot only manipulate values written in the code itself (literals). We have to get the data we will process from somewhere. We will talk about data from files and from the internet later; for now, let's get user input. User input can be accessed from two places:
parameters passed to the interpreter - for this we use the sys package. We get them as a list of parameters.
functions created specifically for this: input("What to ask the user"). We get the input as a string-type variable.
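Both sources side by side (the script name is hypothetical):

import sys

# python script.py Jonas  ->  sys.argv == ['script.py', 'Jonas']
print(sys.argv)

name = input("What is your name? ")  # input() always returns a str
print("Hello,", name)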
Environment variables
Python Crash Course
We may have environment variables specified on the computer; they
are configuration values that survive reboots and are administered at the operating system level
are in a sense an operating systems (course) concept, but are needed by programmers quite often
are called: environment variables
are accessible with the os package
are supported by all operating systems
are described here: https://www.nylas.com/blog/making-use-of-environment-variables-in-python/
help when our program / script runs on different computers and we want to modify the program's behavior accordingly. In closed environments they are sometimes even used to set passwords or service usernames.
matter in the Google Colab environment, as we will see as the course picks up speed.
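Reading them with the os package (MY_API_KEY is a hypothetical variable name):

import os

print(os.environ.get("PATH"))                   # normally set on every OS
print(os.environ.get("MY_API_KEY", "not set"))  # default when the variable is missing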
Course plan
You can get familiar with it using this link
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
You will find the tasks, past slides and so on

--- Content from 1qUHcI_qmxfsMWS_ll1kKeFyUYKQi5Dh-.pptx ---


Artificial Intelligence
Sequential Data Analysis
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Intro to stock prediction w/ DL
01
02
FCFFNN for stock market prediction?
Sequential Data Analysis
00
Appetizer questions
RNN LSTM for stock prediction
03
04
RNN GRUs for stock prediction
Further explorations
06
07
Practical project 9
05
Comparing RNN LSTM and GRU
Who is the most successful investor in the stock markets?
Is it true that stock markets are unpredictable? Why is it true / false?
Does the stock market price only depend on the previous price?
How successful, at the limit, can we be with stock predictions? Does this question relate to the first question?
What is your minimal goal for stock prediction? Beating HFT, or something long term?
Appetizer questions
Sequential Data Analysis
About RenTech and Jim Simons:
Growth of the fund: 66% – or 39.1% after fees – between 1988 and 2018. Only a
single year in the negative, only 2 years growth below 10%.
How do they do that? Only they know precisely, but people have guestimated and
analyzed hiring trends to reverse engineer what kind of mathematical models they
use: https://quantivity.wordpress.com/2011/05/08/manifold-learning-differential-
geometry-machine-learning/#more-5397
Then several approaches which were/are being used are described, like Hidden Markov
models (e.g. one of earliest hires was Leonard Baum, inventor of Baum-Welch
algorithm!), speech recognition (for news tracking), high frequency trading and
more agnostic machine learning techniques for finding short-lived patterns in
financial time series.
About James Simons (pronounced: saɪmənz): https://www.youtube.com/watch?
v=QNznD9hMEh0
Do you know why the RenTec algorithm is secret? → The observer effect.
RenTec's secrecy, technology, research, and ability to change and innovate are fascinating. After the initial company was a failure, presumably because they did not have a systematic approach to improving their algorithms, they created RenTec with a systematic approach, which gave results; Simons hired for intelligence, not "skill". They expanded from commodities, to local US stocks, to international stocks. If you want to know more: https://www.youtube.com/watch?v=xkbdZb0UPac
Buffett might be more luck than skill, while this algorithm is no luck (fair in a sense). An example of algorithmic trading.
Appetizer questions
Sequential Data Analysis
The financial industry was one of the first industries to embrace the use of
machine learning and deep learning in its investment analysis and operations to add
value to their customers. Prior to machine learning, deep learning, and the entire
“Quant” revolution in the 2000’s up until now, analysts and investors relied on
less technologically reliant techniques. Fundamental and technical analysis reigned
supreme and, although they still make up a big part of the analysis, they’re now
combined with forecasts and analysis done by computers using statistical learning
techniques.
As most people know, the stock market is a place where people buy and sell stocks.
The act of buying and selling these stocks (i.e. trading) takes place in physical
and virtual environments called “Exchanges”. These exchanges are houses for indexes
(commonly known ones are the Dow Jones Industrial Average and NASDAQ Composite and
S&P 500). The exchanges are where the price of stocks that make up the indexes are
set.
There are many factors that can make the price of a stock fluctuate.
Daily news reports bearing good or bad news about the current or future prospects
of a company is arguably one of the most influential factors driving daily price
fluctuations (this is why NLP is important for stock prediction as well).
A company’s profitability, revenue growth and future expansion prospects are
indicators that can set long term and short term price changes.
Values / trends / patterns in the past.
Deep Learning and other statistical models can only account for so much and are usually better suited for short/medium-term forecasting rather than long term (e.g. years).
What differentiates machine learning techniques and technical analysis, is that
technical analysis only looks at the charts, while with machine learning can be
used for technical analysis, but also with the help of NLP we can add exogenous
knowledge, like negative news about the company. So we are not only looking at the
charts!
Other concepts to know: algorithmic trading, high-frequency trading (a type of
algorithmic), automated trading, quant, liquidity.
Intro to stock prediction w/ DL
Sequential Data Analysis
Spectrum:
Technical analysis vs. fundamental analysis (in the middle - technical + exogenous data).
Based on active involvement: win every interaction (invest before a rise, sell before a fall) vs. passive investment based on history.
Risk axis: risky vs. non-risky (conservative investment) (based on volatility). Example: high volatility stocks - TSLA. Low volatility: AAPL, S&P 500, Dow Jones. Bonds vs. Bitcoin.
Always check whether the risk-profit tradeoff holds (do not invest in what you do not understand - because you can't reason about risks in that domain).
Intro to stock prediction w/ DL
Sequential Data Analysis
We will introduce this topic using an example of sequence data, say the stocks of a
particular firm. A simple machine learning model, or an Artificial Neural Network,
may learn to predict the stock price based on a number of features, such as:
the volume of the stock
the opening value
the high value that day
the difference between high and low (engineered feature)
combination of all of these (and engineered features)
etc.
In conventional feed-forward neural networks, all datapoints are considered to be
independent. Can you see how that’s a bad fit when predicting stock prices? The NN
model would not consider the previous stock price values – not great!
RNNs can work, but not well on long sequences - as we discussed before.
LSTMs / GRUs can help in that case! That however does not mean that you should just jump straight into a GRU implementation - creating simpler models first can give you "a feel for the data" and make you more familiar with it, while also giving you a benchmark to exceed. It is recommended to start with a simple model (a data-preparation sketch follows below).
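A minimal data-preparation sketch (the price series here is a synthetic stand-in) that turns a price sequence into (lookback window, next value) pairs usable by all the models below:

import numpy as np

def make_windows(prices, lookback=30):
    # Each sample: 'lookback' consecutive prices; target: the next price.
    X, y = [], []
    for i in range(len(prices) - lookback):
        X.append(prices[i:i + lookback])
        y.append(prices[i + lookback])
    return np.array(X)[..., np.newaxis], np.array(y)  # (samples, lookback, 1)

X, y = make_windows(np.sin(np.linspace(0, 20, 500)))  # stand-in series
print(X.shape, y.shape)  # (470, 30, 1) (470,)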
FCFFNN for stock market prediction?
Sequential Data Analysis
LSTMs are a derivative of the Recurrent Neural Network (RNN). An RNN is an adequate model for a short time horizon of perhaps a week to a month, depending on the resolution of your data (30 lags). Going further than that, the RNN is unlikely to produce reliable forecasts due to the Vanishing Gradient Problem (the weight update signal, which we can call the learning signal, will vanish when traveling across many timesteps in an RNN).
A good general LSTM explanation is provided here:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
What we need to remember:
Various classical techniques (Markov models, STL based forecasting, maybe even
FFT).
ARMA / ARCH models have constraints on stationarity - RNNs do not, but might
perform better with it.
FCFFNNs do not have mechanisms for long term dependencies.
RNNs - short term memory
LSTMs/GRUs - should be best for long-term sequences (see the model sketch below).
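A minimal Keras LSTM sketch for one-step-ahead forecasting (layer sizes and the 30-lag window are assumptions, matching the windowing sketch earlier):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(30, 1)),  # 30 timesteps, 1 feature (price)
    layers.LSTM(64),
    layers.Dense(1),              # next-day price
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, y, epochs=10, validation_split=0.1)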
RNN LSTM for stock prediction
Sequential Data Analysis
An LSTM will take the data through what it calls “cells” as you can see in the
diagram above depicted in the middle rectangle. It is through this cell state that
the LSTM can choose to remember or forget things.
An LSTM will have 3 different dependencies according to the information it
receives:
The previous cell state (i.e. the information that was present in the memory after
the previous time step).
The previous hidden state (i.e. this is the same as the output of the previous
cell).
The input at the current time step (i.e. the new information that is being fed in
at that moment).
RNN LSTM for stock prediction
Sequential Data Analysis
In short: LSTM has gates that help it forget or remember (update) short term and
long term states. It has two states rather than one.
To the domain of financial asset price prediction, these dependencies can be
explained as below:
The Previous Cell State – trend of the stock at the previous day.
The Previous Hidden State – price of the stock at the previous day.
The Input at the Current Time Step – the current stock price.
In the diagram above, the horizontal line c_t denotes the cell state, which is the memory of the LSTM. As mentioned immediately above, this relates to the trend of the stock on the previous day (1). That information flows into the cell and is processed with the other information that flows in. The hidden-state line, h_t-1, in our case of stock prediction contains the previous time step's information, i.e. the previous day's stock price (2). The input at the current time step, x_t, is the current stock price (3). Using the information from the previous stock price (Hidden State) and the current price combined with the previous day's trend (Cell State), the LSTM will generate an output.
Source: https://www.exxactcorp.com/blog/Deep-Learning/forecasting-stock-prices-
with-lstm-an-artificial-recurrent-neural-network-rnn
RNN LSTM for stock prediction
Sequential Data Analysis
Let’s compare GRUs for the same dataset.
And remember the differences between LSTM and GRU once more.
RNN GRUs for stock prediction
Sequential Data Analysis
SimpleRNN cells don't have a gating mechanism - 0 gates.
LSTMs have 3 gates, GRUs have 2 gates.
GRUs are usually faster to train, due to their simpler internal structure.
See: https://arxiv.org/pdf/1412.3555v1.pdf - this paper is often praised for
explaining GRUs clearly and providing strong evidence that in many types of
problems GRUs are preferable to LSTMs. Also see this discussion for summary of the
points: https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-
lstm (... each of these points could be verified by you, providing grounds for
practical tasks if you have interest).
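For comparison, the LSTM sketch from earlier with the recurrent cell swapped for a GRU (everything else assumed unchanged):

from tensorflow import keras
from tensorflow.keras import layers

model_gru = keras.Sequential([
    layers.Input(shape=(30, 1)),
    layers.GRU(64),   # 2 gates instead of the LSTM's 3 - fewer parameters
    layers.Dense(1),
])
model_gru.compile(optimizer="adam", loss="mse")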
Comparing RNN LSTM and GRU
Sequential Data Analysis
You can implement a stock price forecasting model with PyTorch. For example, this post both implements it in PyTorch and provides a reasonable argument for the usage of GRUs rather than RNNs and LSTMs: https://medium.com/swlh/stock-price-prediction-with-pytorch-37f52ae84632 - needs factchecking.
Additionally, implement the same stock forecasting with Keras and compare which framework you like better.
You can also implement forecasting for other stocks, or for cryptocurrencies, commodities and so on.
Implement a stock price forecasting model with Fastai.
Check tsai: https://github.com/timeseriesAI/tsai
Further explorations
Sequential Data Analysis

Further explorations
Sequential Data Analysis
For this part create a stock market (or crypto, commodities (gas, oil prices) or
other time series) forecasting model for the stocks we have not forecasted in the
class. Can be historical, old data or it can be modern. Obtain data from any source
you like.
You should use at least one of: RNNs, LSTMs, GRUs (but comparative approach is
recommended).
Write a short paragraph on what you learned while implementing a solution for this
specific task (not part 9 of the course, just the task) (5 sentences / ideas
minimum).
Please provide a link to the collab notebook/github link (double check the share
options of the notebook) when finished for review and evaluation.

Alternative: Kaggle competition related to RNNs/Sequential Data


Practical project 9
Sequential Data Analysis
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Sequential Data Analysis


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1T35pr8PSt_3N4fY0qc62HIsAYIjIvoX_.pptx ---


Artificial Intelligence
Natural Language Processing
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Tokenization
01
02
Attention mechanisms
Natural Language Processing
00
Intro
Transformers
03
04
LLMs
Starting in 2017, a new model architecture started overtaking recurrent neural networks across most natural language processing tasks - the Transformer.
Encoder-decoder (not RNNs - a DNN) + positional encoding + attention (many layers) = transformer.
Transformers were introduced in the seminal paper “Attention is all you need” by Vaswani et al. in 2017. The gist of the paper is right there in the title: as it turned out, a simple mechanism called “neural attention” could be used to build powerful sequence models that didn’t feature any recurrent layers or convolution layers.
Attention has become one of the most influential ideas in deep learning ever; without it there would be no BERT, no GPT-2 and no GPT-3, hence no GitHub Copilot (OpenAI Codex, a modified version of GPT-3), no Meta OPT-175B model, no NLLB-200.
Encoder-decoder models are good for short sequences in a seq2seq setup. Attention mechanisms outperform them when we are using longer sequences. Let us turn to attention mechanisms, as they are a continuation of the encoder-decoder architectures.
Attention mechanisms are older than transformers, however, and are used in many more applications than just sequence modeling; see the Wikipedia article on attention: https://en.wikipedia.org/wiki/Attention_(machine_learning)
It is commonly held that the attention mechanism was introduced by Bahdanau et al. (2014), see: https://arxiv.org/abs/1409.0473
Attention mechanisms
Natural Language Processing
https://arxiv.org/pdf/1409.0473.pdf
Working principle
Attention enhances some parts of the input data while diminishing other parts — the idea is that the network should devote more focus to the small but important parts of the data.
The attention mechanism lives inside a neural network, and like ALL mechanisms inside a neural network it deals with matrices of (usually) real numbers (in and out).
This is a familiar idea if you think about analogies from what we have seen before:
max/avg pooling in convnets - keeps one feature; all-or-nothing attention in a sense.
TF-IDF - assigns scores by word importance; continuous attention, since the values are real numbers.
There are many forms of attention discovered (Keras has 3 attention layer types, able to represent 4 forms of attention commonly discussed in the literature).
Crucially, the attention mechanism can be used to make features context-aware. Word embeddings are vector spaces that capture the “shape” of the semantic relationships between different words (king - man + woman = queen). In an embedding space, a single word has a fixed position - a fixed set of relationships with every other word in the space. But that’s not quite how language works: the meaning of a word is usually context-specific: when you mark the date, you’re not talking about the same “date” as when you go on a date (the embedding for date will be the same). Clearly, a smart embedding space would provide a different vector representation for a word depending on the other words surrounding it.
Attention mechanisms
Natural Language Processing
Disadvantage of embeddings and advantage of attention mechanisms. This also
explains (in part) why problem specific embeddings are the way to go if you have
enough data.
Working principle (cont.)
The purpose of self-attention is to modulate the representation of a token (~change
the vector representing a word) by using the representations of related tokens in
the sequence. This produces context-aware token representations.
“The train left the station on time.” Now, consider one word in the sentence:
station. What kind of station are we talking about? Space station, bus station,
radio station?
Self-attention allows the input to interact with itself, computing the importance of each word based on the other words in the sentence. A simple form of self-attention could be imagined as a simple dot product between word vectors obtained from embeddings (see the sketch below).
A simple kind of attention here can be defined as a reweighting matrix, see: https://www.youtube.com/watch?v=yGTUuEx3GkA
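A toy dot-product self-attention sketch over word vectors (illustration only - real attention layers add learned projections):

import numpy as np

def self_attention(X):
    # X: (seq_len, dim) matrix of word vectors
    scores = X @ X.T                               # pairwise similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # context-aware vectors

X = np.random.randn(5, 8)       # 5 "words", 8-dimensional embeddings
print(self_attention(X).shape)  # (5, 8)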
Attention mechanisms
Natural Language Processing
Working principle (cont.)
The generalized self-attention model uses the terminology of keys, values and queries - K, V, Q, represented as matrices with weights; the attention weights / scores are learned during backprop, see: https://www.youtube.com/watch?v=tIvKXrEDMhk
This generalized self-attention model can be expanded to a multi-head attention model by adding the same mechanism that defines a self-attention multiple times, see: https://www.youtube.com/watch?v=23XUv0T9L5c
Notebook containing an interactive attention visualization: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC
The number of heads in a multi-head attention layer is a hyperparameter that is often >= 2. This allows the model to learn different parts of the context to pay attention to and explore different possibilities. The layers above should be able to learn which attention block is more useful. Like adding additional convolutions in a CNN, this will make the model more powerful but also more expensive to train, see: https://discuss.huggingface.co/t/what-does-increasing-number-of-heads-do-in-the-multi-head-attention/1847
Note: there seems to be little guidance as to where the attention layers need to be added in the network architecture - they need to be added before the neural network layers with weights for learning (see the usage sketch below).
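A Keras usage sketch (self-attention: query = value = key; the shapes are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 10, 64))             # (batch, seq_len, embed_dim)
mha = layers.MultiHeadAttention(num_heads=2, key_dim=64)
context_aware = mha(query=x, value=x, key=x)  # same shape: (2, 10, 64)
print(context_aware.shape)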
Attention mechanisms
Natural Language Processing
Additional explanations:
1st: https://www.youtube.com/watch?v=SysgYptB198 (intuition)
2nd: https://www.youtube.com/watch?v=quoGRI-1l0A
3rd: https://www.youtube.com/watch?v=eMlx5fFNoYc (one of the best more technical
videos)
For complete understanding of attention mechanisms the usual path of theory → usage
→ mastery of usage and then implementation from scratch is probably the best way.
Attention mechanisms
Natural Language Processing
How many “r” letters are in the word strawberry is often hard for an LLM to answer … due to tokenization.
Attr. Andrej Karpathy: https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing#scrollTo=pkAPaUCXOhvW and video: https://www.youtube.com/watch?v=zduSFxRajkE
A lot of the weirdness with LLMs is related to the way that text is tokenized and fed into them. Context lengths and attention matrix sizes depend on the tokenization scheme. So both the accuracy and the computational performance are impacted just by that!
Tokenizers are considered to be models in modern NLP; saying we “train the tokenizer” is valid and meaningful.
So BPE (byte pair encoding) is the basis, but there are many ways to do BPE; let’s turn to those:
Tiktoken - what OpenAI uses (gpt-2 & gpt-4 (cl100k_base)). Runs on bytes representing Unicode points: https://github.com/openai/tiktoken
SentencePiece is what the Llama (1, 2) and Mistral series use (it can also do BPE, but not only). See: https://github.com/meta-llama/llama/blob/main/requirements.txt . Llama3 uses tiktoken: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py
What is the difference between cl100k_base and gpt-2? cl100k_base means a 100K vocabulary, base variant.
Security implications - one of the attack vectors for compromising the alignment of LLMs is through the tokenization mechanism. If you know how the tokenizer works you can provoke failures of the LLM or even bypass restrictions.
Please reverse the string ".DefaultCellStyle" - Bing freaks out.
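A quick tiktoken sketch (the exact subword split depends on the vocabulary, so the printed ids are illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)              # a few subword ids - not one token per letter
print(enc.decode(tokens))  # strawberry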
Advanced Tokenization
Natural Language Processing
Like recurrent neural networks (RNNs), Transformers are designed to handle
sequential data, such as natural language, for tasks such as translation and text
summarization, etc.
However, unlike RNNs, Transformers do not require that the sequential data be
processed in order (don’t use RNN cells). For example, if the input data is a
natural language sentence, the Transformer does not need to process the beginning
of it before the end. Due to this feature, the Transformer allows for much more
parallelization than RNNs and therefore reduced training times. Are transformers
better than Attention + RNNs? We would need to check the literature if anyone
proved it or provided a compelling argument, but such research probably still does
not exist. Factually, currently DNNs with Attention are winning, but it might be
simly because: bigger problem requires more data, more data requires bigger model
to represent it. So if you need a complex model, you can tune it much faster if
it’s trainable faster.
Transformers have rapidly become the model of choice for NLP problems, replacing
older recurrent neural network models such as the long short-term memory (LSTM).
Since the Transformer model facilitates more parallelization during training, it
has enabled training on larger datasets than was possible before it was introduced.
This has led to the development of pretrained systems such as BERT (Bidirectional
Encoder Representations from Transformers) and GPT (Generative Pre-trained
Transformer), which have been trained with huge general language datasets, such as
Wikipedia Corpus, and can be fine-tuned to specific language tasks.
The Transformer ecosystem is very rich.
Closed companies - open models; open companies - closed models.
Keras 3 introduced Transformer-specific functionality: https://keras.io/api/keras_nlp/modeling_layers/transformer_encoder/

Transformers
Natural Language Processing
Explanation of the original transformer architecture:
https://www.youtube.com/watch?v=EXNBy8G43MM
https://www.youtube.com/watch?v=TQQlZhbC5ps
https://www.youtube.com/watch?v=wjZofJX0v4M (not the best, simplistic)

Implement yourself resources:


Keras, F. Chollet: https://colab.research.google.com/github/fchollet/deep-learning-
with-python-notebooks/blob/master/chapter11_part04_sequence-to-sequence-
learning.ipynb#scrollTo=aVwQhk11VW2S
Find an implementation with tensorflow + keras here (step by step):
https://www.tensorflow.org/text/tutorials/transformer
A simplified, lower level implementation can be found here:
https://medium.com/@max_garber/simple-keras-transformer-model-74724a83bb83
Series classification (not translation) example with Conv1D can be found here:
https://keras.io/examples/timeseries/timeseries_transformer_classification/
Transformers
Natural Language Processing
The simplest way to use recent language models is to use the excellent transformers
library, open sourced by HuggingFace.
It provides many modern neural net architectures (including BERT, GPT-2, RoBERTa,
XLM, DistilBert, XLNet and more) for Natural Language Processing (NLP), including
many pretrained models.
It relies on either TensorFlow or PyTorch.
Best of all: it's amazingly simple to use.
See https://huggingface.co/transformers/model_doc/gpt2.html?highlight=gpt2
LLM leader board: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
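A minimal transformers usage sketch (the model weights are downloaded on first run; the prompt is arbitrary):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Attention is all", max_length=20, num_return_sequences=1)
print(out[0]["generated_text"])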
Transformers
Natural Language Processing
Are there any advantages of a recurrent encoder-decoder (no attention) compared to a Transformer? Explore the computational performance characteristics of Transformers - is an LSTM/GRU encoder-decoder less scalable on a GPU than a Transformer of similar accuracy?
Research different forms of attention and their use cases / tradeoffs. Take Keras: it has 3 attention layers: AdditiveAttention() (Bahdanau-style attention), MultiHeadAttention() (Vaswani, “Attention is all you need”; if query, key and value are the same, then this is self-attention) and Attention() (dot-product attention layer, a.k.a. Luong-style attention).
Does HuggingFace offer a Transformer trained for code/SQL completion? If there is one - try it!
Further explorations
Natural Language Processing
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Natural Language Processing


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1mMqbhGLEU5aGDaISQZ8GViIVigYKk3TV.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Scraping
Scrapy
01
02
03
BeautifulSoup
Python Crash Course
00
Regex
04
Scraping client side rendered pages
05
Requests-Html (optional)
06
Selenium
07
Complex cases
Regex
Regular Expression (RegEx / RegExp) is a sequence of characters that defines a search pattern. It can be used for finding matching strings, extracting the matched part, splitting a string on the matched part, allowing the string to be processed further only if a match was found, and also for data validation.
Elementary examples: check that a person's name does not have any digits in it, check that some id matches a pattern AA-DDDD, extract all emails from a large corpus of text, validate that incoming data is a MAC address.
To learn regex is to understand the metacharacter syntax and semantics in isolation, then chain the metacharacters into more complex patterns.
Regex patterns can be simple: \d{5} … or complex: (?i)(select(?!_)|(?<!u)top|(?<!%20)from(?!s)|(?<!&)limit|(?<!gmt)offset|(?!s)
Regex patterns are strings. A valid string is a valid regex of itself (disregarding incorrect metacharacter usage). Why? Because if you search for a string literal - a word - in some text, you should be able to find the word by that word itself.
Regex processing is performed by the regex engine (inside a programming language) - schematically:
Python Crash Course
Regex
Python Crash Course
There are multiple implementations of regex engines: JS, Golang RE2 and the standard feature-rich PCRE and PCRE2 implementations. All of them perform the same basic operations and support the same basic features, but the differences come in advanced feature support and performance; JS did not support the lookbehind mechanism for the longest time: https://stackoverflow.com/a/3950684/1964707
Python has the re module and the regex module, of which the latter is preferable: https://stackoverflow.com/a/7066413/1964707 . Bindings for a very performant alternative called the RE2 engine are also available.
Regex
Python Crash Course
For learning, experimentation and prototyping I would recommend the tool:
https://regex101.com
Let’s see a trivial example - check if there are two spaces in the text.
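The same check ported to Python (the standard re module is enough here):

import re

text = "Lorem  ipsum dolor"
print(bool(re.search(r" {2}", text)))  # True - two consecutive spaces found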
Regex
Python Crash Course
Task: your friend / colleague comes in with some text output from an old unmaintained app, like:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book... It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and recently with desktop publishing software Aldus PageMaker.

… the task is to count the sentences and to clean them up as a list of sentences.
To solve this you can try a plain Python string split. However, you will see it does not work as well as you might think:
You might think "I will just split on a dot and a space", but what if there were more spaces? How about an exclamation mark or a question mark at the end of a sentence? The solution becomes complicated.
Regex
Python Crash Course
A hard or intermediate problem can be made much simpler using regex:

How you should solve the problem:

Solve it first in the most comfortable, simplified environment. For me, regex101 is that environment.
Then port the solution to your programming language (see the sketch below).
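A possible port to Python (the pattern is one reasonable choice among several):

import re

text = ("Lorem Ipsum is simply dummy text!  It has survived five centuries... "
        "Was it popularised in the 1960s?")
# Split on runs of '.', '!' or '?' followed by whitespace:
sentences = [s.strip() for s in re.split(r"[.!?]+\s+", text) if s.strip()]
print(len(sentences), sentences)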
Regex
Python Crash Course
Regex is also used for productivity - most IDEs and editors support search and
replace functionality that can be used with regex.
This point is often left out and not mentioned in tutorials, but it can be very
useful.
How would you clean up a text like that so that only human readable text would
remain? By hand? What if there were 10K lines?

Array
(
[0] => Lorem Ipsum is simply dummy text of the printing and typesetting
industry
[1] => Lorem Ipsum has been the industry's ever the 1500s, when an unknown
printer took a galley of type and scrambled it to make a type specimen book
[2] => It has survived not only five centuries, but also the leap into
electronic typesetting, remaining essentially unchanged
[3] => It was popularised in the 1960s with the of Letraset sheets Lorem Ipsum
passages, and recently with desktop publishing software Aldus PageMaker
)

Let’s try:
Array → ?
Array|\( → ?
… anything else?
Array\n|\(\s|\)| +\[\d\]\s=>\s+

Regex
Python Crash Course
Usage in script and console / terminal usage w/ grep:

ping 1.1.1.1
ping 1.1.1.1 | grep bytes=
ping 1.1.1.1 | grep -o bytes=
ping 1.1.1.1 | grep -oP 'bytes=\d{1,}'
ping 1.1.1.1 | grep -oP '(?<=time=)\d+ms'
Getting our IP from the console:

curl http://ipecho.net/
curl http://ipecho.net/ | grep -o '<h1>.*</h1>'
curl http://ipecho.net/ -s | grep -oP '(?<=<h1>Your IP is ).*(?=</h1>)'
while true; do curl http://ipecho.net/ -s | grep -oP '(?:\d{1,}\.){3}\d{1,}' ; done

The example generalizes easily. What if we want to monitor a change in IP, or when some service becomes available? Regex is also helpful for filtering only the output we are interested in.
Regex
Python Crash Course
Code search
Regex
Python Crash Course
Mechanisms / regex syntax:
Anchors — ^ $ : ^a|b$ , ^a.+b$ , ^$ – empty line match
Quantifiers — * + ? {} : {1,} == + , {0,} == * , {3} , {0,1} == ?
OR operator / char classes — | [] : ^(?:2|3|4|5|6).+1$ == ^[2-6].+1$ also [^ABC]
Character classes (shortform) — \d \D \w \W \s : \d == [0-9] == [0123456789]
Metacharacters — $ ^ . [ ] ( ) \ ? : need to be escaped \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Escape sequences — \\ \t \s \n \r : TBD
Flags — /i /m /g /u : regulate the behavior of the regex engine
Posix bracket expressions — [] : [:digit:] (use with the python regex module)
Boundaries — \b and \B : [A-Za-z]\b . Match only at word ending/beginning

… a bit more advanced concepts

Grouping and capturing — () (?:) : ^\[(\d+)\] (?:GET|POST|PUT|DELETE).+$

Back-references — \1 \2 : (a)(b)\2\1 , (.{1})(.{1})(?:.+)?\2\1
Named references : (?P<x>pattern) → create, (?P=x) → use
Greedy and lazy matching - ? : later in the slides
Look-ahead and look-behind — (?=) and (?<=) : later in the slides
Regex
Python Crash Course
Anchors — find a string that starts with (^) and ends with ($). When you say ^\[ERROR\]$ (brackets escaped - unescaped [] would denote a character class) you are expressing the idea that the whole line should be [ERROR] and nothing else.
Find the lines that start with spaces, when there are spaces in the middle as well:

1. Antanas, 55
2. Petras, 99
3. Jonas, 66

Quantifiers let you specify repetitions of patterns:

* → Match zero or more times.
+ → Match one or more times.
? → Match zero or one time.
{ n } → Match exactly n times.
{ n, } → Match n times or more.
{ n, i } → Match from n to i times.
Regex
Python Crash Course
The OR operator lets you specify alternatives:
| → pipe
[] → char class
[X-Y] → char class with range syntax
Example:
[A-Z] → all capital letters, a range. Logically equivalent to (A|B|...|Z), although the step count might be different
ab[cde] → logically equivalent to ab(?:c|d|e) , matches: abc, abd, abe

Digression: we can't say that we know what "^" is in regex, because it is contextual - it can mean at least 3 things: ^A[^AB]\^$
Regex
Python Crash Course
Backreferences - allow us to use previous partial matches in our regex pattern.
Example: match only lines where the year the log record was created is also mentioned in the text of the log record (line) (checked in Python below)
Maximum 9 backreferences: abcdefghiihgfedcba ⇒ (.)(.)(.)(.)(.)(.)(.)(.)(.)\9\8\7\6\5\4\3\2\1
GET 2011-01-11 sdjfsdkfkds2011fdvfgfd
GET 2011-01-11 sdjfsdkfkds2001fdvfgfd
POST 2011-01-11 sdjfsdkfkds2011fdvfgfd
GET 2011-01-11 sdjfsdkfkds2021fdvfgfd
GET 2011-01-11 sdjfsdkfkds2011fdvfgfd
POST 2011-01-11 sdjfsdkfkds2011fdvfgfd

(\d{4}).+\1
(?:GET|POST) (\d{4})(?:-\d{2}){2} \w+\1\w+
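Checking the first pattern in Python against two of the sample lines:

import re

lines = [
    "GET 2011-01-11 sdjfsdkfkds2011fdvfgfd",  # the year repeats in the record text
    "GET 2011-01-11 sdjfsdkfkds2001fdvfgfd",  # it does not
]
for line in lines:
    print(bool(re.search(r"(\d{4}).+\1", line)))  # True, then False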
Regex
Python Crash Course
Named references allow adding names to references (using named captured groups):
for symmetric patterns ^(?P<first>\w)(?P<second>\w)(?P=second)(?P=first)$
for repeating patterns (?P<first>.)(?P<second>.)(?P=first)(?P=second), but we would
prefer: (\w\w){2,}
You can refer to a named reference in the same pattern.
Ref: https://www.regular-expressions.info/refext.html
Example:
aa Tarzan "TarzanTarzan" "Tarzan" aaa "Jonas"
(?<=\")(?P<x>Tarzan)(?P=x)(?=\")
Regex
Python Crash Course
Greedy vs. lazy quantifiers
We have a comment section on a news site (delfi.lt?). We want to detect people
trying to post HTML tags in our comment section: <p>Labas</p>; however, we want to
just strip the tags - we will allow the text between the tags.
In this context it is not the regex that is greedy (or lazy) but the quantifiers!
Specifically: +, *
By default these quantifiers are greedy - they match as much as possible while the
rest of the pattern still allows a match. Try: <.*>
To make it lazy: <.*?> <.+?> …. or <.{1,}?>
Text: aa Tarzan "Tarzan" Tarzan <h1>"Tarzan"</h1> <p>aaa Tarzan</p> "Tarzan"
Usually we want the global flag (find all) to be on when using lazy quantifiers - to
find all matches.
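
A small sketch in Python, using the sample text from this slide:

import re

text = 'aa Tarzan "Tarzan" Tarzan <h1>"Tarzan"</h1> <p>aaa Tarzan</p> "Tarzan"'

# Greedy: .* runs to the last '>' it can find
print(re.findall(r'<.*>', text))    # ['<h1>"Tarzan"</h1> <p>aaa Tarzan</p>']

# Lazy: .*? stops at the first possible '>', i.e. the individual tags
print(re.findall(r'<.*?>', text))   # ['<h1>', '</h1>', '<p>', '</p>']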
Regex
Python Crash Course
Lookaheads and lookbehinds (lookarounds) - only match pattern if before and/or
after we have another pattern (subpatterns)
(?=) → look-ahead (add after the pattern): {pattern}(?={pattern_ahead})
(?<=) → look-behind (add before the pattern): (?<={pattern_behind}){pattern}
(?!) → negative look-ahead - match the pattern only if the string does not match
the subpattern specified in the lookahead
(?<!) → negative look-behind - match the pattern only if the string does not
match the subpattern specified in the lookbehind
Question: what is the difference between just specifying quotes " and looking
around for quotes? Answer: lookarounds do not include the subpatterns they contain
in the match (they are zero-width).
Example1: Text: aa Tarzan "Tarzan" Tarzan "Tarzan" aaa Pattern: (?<!\")Tarzan(?!\")
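A quick check of Example 1 in Python (a sketch):

import re

text = 'aa Tarzan "Tarzan" Tarzan "Tarzan" aaa'

# Match Tarzan only when NOT surrounded by quotes; lookarounds are zero-width
print(re.findall(r'(?<!")Tarzan(?!")', text))   # ['Tarzan', 'Tarzan'] - the unquoted ones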
Example2: For comparison (with a backreference): (?<!")Tarzan|Tarzan(?!") vs. (?
<!")(Tarzan)|\1(?!") - this is not equivalent to the prev.
Example3:
Regex
Python Crash Course
(cont.)
Example 4: lookbehinds do not support variable-length patterns (at least in the
Python regex flavor):
Pattern: (?:(?<=<.>)|(?<=<..>)).*?(?=</.{1,}>) - 308 steps (... or this one:
(?<=>)[a-zA-Z\"]+? ?[a-zA-Z\"]+?(?=<) - 163 steps. Could we compare the generality of
these two patterns?)
Text: aa Tarzan "Tarzan" Tarzan <h1>"Tarzan"</h1> <p>aaa Tarzan</p> "Tarzan"
Goal: match only content between tags, make it as general as possible

(?<=<[a-z]?>).*?(?=</.{1,}>)
(?<=<[a-z]{1,2}>).*?(?=</.{1,}>)
(?<=<\w{1,}>).*?(?=</.{1,}>)
(?<=<\w+>).*?(?=</.{1,}>)
(?<=<(\w\w|w)>).*?(?=</.{1,}>)
(?<=<[a-z]?>) = (?<=<[a-z]>)|(?<=<>) (equivalent)
Regex
Python Crash Course
Short note on performance:
Main principle 1: use regex only when you must - when the pattern is complex.
Main principle 2: be as precise as possible.
Avoid regex for simple tasks - often, not using regex at all is the best way to
optimize regex.
Tips to improve the performance of the regex itself are in the brown box,
see: https://www.youtube.com/watch?v=EkluES9Rvak
The hardest thing for a regex engine is establishing that a string does not match
the pattern fully: when the beginning matches, the engine must go back and check
whether a match can start from the next character. In the general case this is
called backtracking, and it can be catastrophic for performance: https://www.regular-
expressions.info/catastrophic.html … ReDoS
One good way to optimize regex is to count the steps it takes:
Regex
Python Crash Course
Understanding complex regexes:
You get a complex regex, how can you understand it to fix it or enhance it?
We can fix 3 things in regex: FPs, FNs and performance.
Steps to take (divide and conquer):
Split the alternatives (|) and other separables
Isolate the parts (you need to know what can not be semantically separated, atomic
syntax units)
Create / synthesize / find data for TP - the more data the better
Create / synthesize / find data for FP/FN - where the current regex does not work
Correct the part / piece that should match but does not, or vice versa
Reassemble the regex from the pieces
Example 1 (JS): (?<!{[^}]*)Tarzan(?=\s+{[^}])
Example 2 (): ^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
Could you create matches for the example regex above?
And non-matches that differ by only one symbol (but not length)?
Why is the last dot ( . ) included?
Can we make this better (will it work for .co.uk)?
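
A sketch that exercises the email pattern from Example 2; the test strings below are my own, crafted as one match plus near-misses:

import re

email_re = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')

candidates = [
    "jonas.petras+spam@example.co.uk",   # matches: the last class allows extra dots
    "jonas.petras+spam@example",         # no dot after the domain -> no match
    "jonas petras@example.com",          # space is not in the first class -> no match
]
for c in candidates:
    print(c, "->", bool(email_re.match(c)))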
Regex
Python Crash Course
Unicode and regex
Ref: https://www.regular-expressions.info/unicode.html
Pattern: [д-я]+ , text: дсвфдвфбф
Pattern: [弛她]+ , text: 弛她弛她弛她弛她弛她弛她弛她
Regex
Python Crash Course
Exercises:
https://alf.nu/RegexGolf
https://regexone.com/
https://www.hackerrank.com/domains/regex

Reference materials:
https://www.regular-expressions.info - probably no. 1 resource.
https://regexone.com/references/python
https://developers.google.com/edu/python/regular-expressions
https://www.w3schools.com/python/python_regex.asp
Regex
Python Crash Course
Interesting example:
https://alf.nu/RegexGolf
Split the regex and understand the parts
Combine them in various ways
Understand the problem (we match all the cases with even number of x’s and then do
a negative lookahead)
To match an even number of x’s: ^(..+)\1$ and then just add the lookahead
This regex: ^(.+)\1$ matches every string that is a doubling of its first half
(hence of even length):
Regex
Python Crash Course
Interesting example 2:
Match only lines which contain the string “as” at the word boundary exactly 3 times
Example string:
Mindaugas Jonas Antanas Stasė
Mindaugas Antanas Stasė
Jonas Antanas Stasė Mindaugas
Regex: .+(?:(as\b).+?\1\b.+?\1\b)(?:.+)?
Regex
Python Crash Course
Regex and python:
Ref: https://realpython.com/regex-python/ → re.DEBUG
Ref: https://realpython.com/regex-python-part-2/
Ref: https://www.w3schools.com/python/python_regex.asp
Intro to 3rd party regex module:
https://learnbyexample.github.io/py_regular_expressions/regex-module.html
A note on re.compile() - you don’t need to do it, there are no significant
performance benefits:
https://realpython.com/regex-python-part-2/#why-bother-compiling-a-regex
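
A minimal tour of the re API referenced above (pattern and text are illustrative):

import re

log = "2011-01-11 ERROR disk full"

m = re.search(r'(\d{4})-(\d{2})-(\d{2}) (\w+)', log)
print(m.group(0))        # whole match: '2011-01-11 ERROR'
print(m.group(1, 4))     # ('2011', 'ERROR')

print(re.findall(r'\d+', log))          # ['2011', '01', '11']
print(re.sub(r'ERROR', 'WARN', log))    # substitution

# re.compile(..., re.DEBUG) prints the parsed pattern tree - handy for learning
re.compile(r'\d{4}-\d{2}-\d{2}', re.DEBUG)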
Scraping
Python Crash Course
Web Scraping is the activity of gathering data from a web page w/o using an API,
almost always done via a robotic process / code that is called a web scraper / bot /
scraping bot. It can be thought of as automated browsing (crawling) + data
gathering / copying.
Unlike screen scraping, which only copies pixels displayed onscreen, web scraping
extracts the underlying HTML code and, with it, data stored in a database (or we can
process the data using OCR technologies). The web scraper can then replicate
entire website content elsewhere.
Why do it?
There is money in web scraping
Every webpage is just data (almost … games are an exception, some tools might not
be considered data) - analysis
There is no available WEB API
Usecases:
HiQ Labs and talent/HR analytics - who might be at risk of leaving the company?
competitive automated pricing and cheapest product search (autoplius / autogidas /
ebay / aliexpress / amazon / aroudas.lt)
pricing history (prove that companies raise prices before holiday discounts)
lead generation - search for business / people that need your service (skelbiu.lt,
skelbimai.lt), looking for jobs and sending cv automatically
keyword research - scrape competitor websites to compare keywords for your website
or google to see improvements after SEO optimization
… science, product research, job descriptions, stock analysis, news analysis,
facebook/twitter user tracking/profile building (“vatnik detector”).
search engines are fundamentally web scrapers (or at least crawlers; crawling is
just visiting each page in a website to gather info about links).
aggregation/combinations of previously mentioned services…
Example:
Extract data from all the car selling sites in lithuania for a particular model,
sort them by price, send a newsletter each morning.
Students often implement a job-search stats website (or a script).
Python Crash Course
Scraping
Python Crash Course
There are many libraries available (python is probably the leading language in
“tool making/automation” programming):
Requests + Beautiful Soup → Beautiful Soup is like ElementTree library
for XML, but its for HTML
Scrapy → for serious projects.
Selenium / Puppeteer / Playwright → CSR apps.
Scraping
Python Crash Course
Scraping is a multi-part process:
→ making requests (... until you get the response: auth, redirects …)
→ parsing and processing/gathering the information
→ saving it / use it
→ selector tuning
… using it

Note on HTTP when web scraping - what you need to know:


URL structure: schema, domain, port, path, query string, fragment
Redirects - sometimes there are accidental redirects to handle. Question: when
creating scrapers, should we avoid redirects where possible or not? Why?
Error pages / error response codes - sometimes we need to abort or slow down the
scraping if we see error pages / error response codes
User Agent spoofing - sometimes there are primitive protections in place that check
the user agent (we already saw that)
Knowledge that a form with GET can be replaced with direct call to the URL:
https://blog.mindaugas.cf/?s=Python

Selectors
The biggest part of the work when scraping is tuning selectors!
CSS selectors: tag, #id, .class, other attributes :
https://www.w3schools.com/cssref/css_selectors.php
XPath
Determining the selector can be made easier by using the browser:
Changing the style with the browser for improved targeting
Getting the selectors with chrome

Scraping
Python Crash Course
CSS selector vs Xpath selector
Selectors need to be balanced: precise but not too specific in order to not be
affected by changes in the UI structure or design.
Compare: /html/body/div[1]/div/div/main/div/article/a/div[2]/div[2]/span/span/
span/span vs. //*/div[2]/span/span/span/span
Scraping
Python Crash Course
Sometimes to bypass some protections you need to disable javascript (google slides
website itself, serves as a good demo).

Scraping
Python Crash Course
Autoplius sometimes opens the page in another tab to disallow inspecting the
request details (presumably). To work around that, remove the target="_blank"
attribute in the HTML - the page will then open in the same tab.

Scraping
Python Crash Course
The flow of scraping is usually like this: you open the page > get familiar with
it > start scripting the scraper > .... so exploration is key.
Sometimes when scraping tools are involved you must save the returned HTML to a
file and test the selectors there, in order to not inadvertently start blasting the
site with requests from a bot that is still under development.
Also, certain websites can detect that a robot is browsing and return a different
version of the site, like Google. If you start tuning your selectors in the
browser, those selectors can be completely useless!
You can check that by issuing a curl request:
curl https://www.google.com/ -o goo.html
curl https://uzt.lt/ -D - (does not work in google collab, but does from regular
PC)

Scraping
Python Crash Course
The legality and ethics of scraping:
Laws are vague, but you must seek legal advice if you want to do large-scale
scraping as a commercial activity.
The legality depends on your exact web scraping application - why you scrape and
what you do with the data after scraping.
Note that if you interpret web scraping as automated browsing - which in many cases
it is - you could hire someone to just go through the webpages and save the data in
a database. It would accomplish the same thing. So there is nothing inherently
unethical in web scraping. However, if you think it would be unethical for a person
to do something you are doing in an automated way, then it might be unethical, but
still not because it’s automated (e.g. stealing songs and publishing them as your
own).
If you are scraping a public website for non-commercial use and you are not
claiming that the content is yours, then you should be fine. If you are breaking
into private networks to scrape, then you are in legal trouble, but not because of
scraping.
Do not (D)DoS the website with your scraper.
See: https://prowebscraper.com/blog/is-web-scraping-legal/
Scraping
Python Crash Course
See (hiq vs. linkedin): https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn ,
newest: https://techcrunch.com/2022/04/18/web-scraping-legal-court/
Nord Pool case: https://www.nordpoolgroup.com/en/About-us/terms-and-conditions-for-
useofwebsite/
Video about the legality of web scraping: https://www.youtube.com/watch?
v=8GhFmQPZAlo

Web scraping in Lithuanian: “Automatinis nuskaitymas” (?)


BeautifulSoup
Python Crash Course
HTML parsing library. It allows you to interact with HTML in a similar way to how
you interact with a web page using developer tools. The library exposes a couple of
intuitive functions you can use to explore the HTML you received.
The version recommended is v4 although v3 is still available.
See: https://realpython.com/beautiful-soup-web-scraper-python/
By itself does not support XPATH, but you can use CSS selectors, see:
https://stackoverflow.com/a/46471325/1964707
Note that when scraping you might see different results in different environments -
if the page is fetched from Colab (United States) it might work one way, but
launching it from Lithuania might be completely different! You need to be able to
troubleshoot that!

import requests
from bs4 import BeautifulSoup

# Fetch the search results page and parse the returned HTML
URL = "https://www.google.com/search?q=python+jobs"
resp = requests.get(URL)
soup = BeautifulSoup(resp.content, 'html.parser')
print(soup)

# Save the parsed HTML to a file for offline selector tuning
with open('goog.html', 'w', encoding='utf-8') as f:
    print(soup, file=f)

BeautifulSoup
Python Crash Course
Demo: attribute selector.
Exercise:
Find a website - any website.
Use descendant CSS selector to select the element of your choosing
Use attribute selector to select the element of your choosing
Use Xpath to select the element of your choosing
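
A small sketch of these selection styles, assuming the goog.html file saved earlier (the selectors are illustrative; lxml is an extra dependency used here only because bs4 itself has no XPath):

from bs4 import BeautifulSoup

with open('goog.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Descendant CSS selector: any <a> somewhere inside a <div>
print(soup.select('div a')[:3])

# Attribute selector: anchors whose href starts with http
print(soup.select('a[href^="http"]')[:3])

# XPath via lxml instead of bs4
from lxml import html
tree = html.parse('goog.html')
print(tree.xpath('//a/@href')[:3])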
Scrapy
Python Crash Course
Scrapy is
a framework** for crawling / scraping websites and extracting information
it is used for complex scrapping applications
it has a CLI for creating and managing scrapy projects
it also has a template to build HTML spiders
also a shell for troubleshooting and refining selectors
supports CSS and XPath selectors (unlike BS4)
has middleware code for cookies, caching, redirects, rate limiting.

Why scrapy (for us in this course)?


concrete skill to have on your resume
other features like: robots.txt honoring
extensibility - many spiders, custom plugins
scrapy cli
see: https://app.pluralsight.com/guides/crawling-web-python-scrapy

** Note that a framework usually implies that it’s bigger and more complete than a
library. However this is not necessarily the case. What is the case is that with a
framework you write the code “inside” it according to the conventions of that
framework and then the framework calls your code (“inversion of control”). With
libraries it’s the reverse - you call the functionality you need from the library.
That is the main difference.

Scrapy
Python Crash Course
Scrapy workflow:
… general prep steps: check if API available, robots.txt, terms of use and so
on.
Install - globally or in a venv
Create project - frameworks usually generate some project structure according to
the conventions of the framework
Analyze website of interest - find the site and choose the information that is
interesting to you (product name and qty, price)
Generate spider skeleton - your spider should inherit from scrapy.Spider (check if
https is used)
Do a test crawl or runspider - have to be inside the project directory where the
scrapy.cfg is to launch these commands
Refine the selectors - using scrapy shell or browser
Implement the parse() method – to get the necessary data you want.
Run, inspect, adjust the selectors until the result is satisfying (one of the most
costly steps in a scraping project is selector tuning, in addition to the
page-visiting algorithm for complex cases (visiting each item when information is
lacking on the listing page)).
Save the results if needed to any format you like (multiple feed exports are
supported: https://docs.scrapy.org/en/latest/topics/feed-exports.html).
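
A minimal spider sketch matching this workflow; it targets the quotes.toscrape.com scraping sandbox, so the selectors and field names below are specific to that demo site:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # one yield per item keeps the fields of an item together
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # follow pagination if a next page exists
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it from inside the project directory with: scrapy crawl quotes -o quotes.json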
Scrapy
Python Crash Course
Scrapy architecture:
See: https://docs.scrapy.org/en/latest/topics/architecture.html?
highlight=architecture
Engine → invokes and processes everything, the coordinator
Spiders → how the content is parsed and extracted
Item pipelines → post-processing of extracted items (cleaning, validation, storage)
Scrapy
Python Crash Course
Installation
pip install scrapy → (can be system level install)

Launching scrapy:
scrapy startproject <project name> → generates the project
write the spider that will be in the spiders directory
scrapy crawl <site_name> -o <file_name>.[csv/json/xml/pickle]
we’ll see more in the code

Other commands:
scrapy → help
scrapy version → version
scrapy bench → bench local webserver (2400p/min is normal)
scrapy settings → settings, returns nothing at first
scrapy view <url> → download and view page locally (only .html is
downloaded)
scrapy fetch --nolog <url> > file.html → download page locally
… you can download the full page (almost full page) via “save as” in the browser.
Scrapy
Python Crash Course
The difference between the runspider and crawl commands? (runspider runs a
standalone spider file; crawl runs a named spider from within a scrapy project)
Scrapy
Python Crash Course
Saving to excel?

pip install scrapy-xlsx


FEED_EXPORTERS = { 'xlsx': 'scrapy_xlsx.XlsxItemExporter'}
scrapy crawl zappos -o zappos_2.xlsx
Scrapy
Python Crash Course
Scrapy shell: useful for prototyping and selector tuning:
scrapy shell https://blog.mindaugas.cf → opens the shell, makes the
request and parses it
scrapy shell -s USER_AGENT="ua" <URL> → specify the user agent when starting
the scrapy shell
response, response.meta → more:
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response
response.css('tag.class').get() → experimenting with css selectors
in the shell
view(response) → open the cached page in the browser, for
investigation and playing around
request.headers → check the request headers; note that
autocomplete is also available. See docs also.
scrapy shell file:///<file> → investigate a downloaded page
locally using scrapy shell (problematic on Windows)
fetch(<url>) → reloads the shell with the new
page
shelp() → see the available objects
spider → check which spider was used for opening
the page
print("A") → we can use python functions inside
ctrl+D → quit

Important selector tricks:

response.css('css-selector') → select an element, returns a selector
response.css('css-selector').get() → get the element itself
response.css('css-selector::text').get() → get the text inside the HTML
element
response.css('css-selector::text').getall() → the text for all elements
targeted by the selector (... or use extract())
response.css('css-selector::attr(src)').getall() → get the src attribute of
the selected html tags (... or use .css().attrib['some-attr'])
response.xpath('//img/@src').getall() → use an xpath selector: all image
elements at and under the root, extracting the src attr

Scrapy
Python Crash Course
Scrapy
Python Crash Course
Note - when scraping items from products/inventory/articles pages (pages that
display multiples of the same kind of thing), ensure you are mapping all the pieces
of information for one item correctly. If you select them independently you might
end up with 100 titles but 97 prices - the question “where did the mismatch come
from” can take a lot of time to figure out.

It is better to use iterative matching/scraping/selecting - go through each item
and independently match each piece of information. Compare the code:

def parse(self, response):
    # selecting each field independently - counts can silently get out of sync
    brands = response.css(Selectors.brands_selector).getall()
    product_names = response.css("#products > article > div > a > dl > dd[itemprop='name']::text").getall()
    price = response.css("#products > article > div > a > dl > dd > span[itemprop='price']::text").getall()
    for _brand, _product_name, _price in zip(brands, product_names, price):
        yield {
            "brand": _brand,
            "product_name": _product_name,
            "price": _price,
        }

def parse(self, response):
    # iterating item by item - each field is matched within a single product
    for product in response.css("article"):
        yield {
            "name": product.css("dd[itemprop='name']::text").extract()[0],
            "by": product.css("dd[itemprop='brand'] > span::text").extract()[0],
            "color": product.css("dd[itemprop='color']::text").extract()[0],
        }

Scrapy
Python Crash Course
Advanced usage:
Disobeying robots.txt - robots.txt is a file that can be found at the root of the
domain, e.g. youtube.com/robots.txt . This file contains instructions for bots on
how to parse/scrape/crawl the website and index its contents. It’s mainly used by
friendly robots like googlebot, bingbot, yahoobot, etc. Although it is advisable to
follow its instructions, there are certain exceptions - it is sometimes impossible
to obtain the information needed w/o disobeying. Scrapy obeys the robots.txt file by
default; we can switch that off in the settings. See the next page for examples!
Extensions: Throttling with AutoThrottle :
https://docs.scrapy.org/en/latest/topics/autothrottle.html
Items - classes representing something that is extracted:
project_root/project/items.py.
Item Pipelines - used for dropping empty values, cleaning, validating, storing:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Bypassing the european cookie consent pages:
https://stackoverflow.com/questions/32623285/how-to-send-cookie-with-scrapy-
crawlspider-requests
Client side rendered (CSR) pages - two choices splash and playwright integration:
https://docs.scrapy.org/en/latest/topics/dynamic-content.html#pre-rendering-
javascript

Scraping client side rendered pages


Python Crash Course
We will call a page client side rendered (CSR) when javascript injects additional
content after the page has already loaded.
Web apps created with Angular, React, and Vue often work this way. SPA -
single page application.
View page source will disclose that the page is such a page. Also look for XHR and
fetch() requests in the network tab and minimal response for the initial request
(almost no HTML). Also check what happens when you disable javascript.
They can be hard to scrape because the scraper sees the initial HTML page w/o
modifications from JS - so not the page we see.
JS has to be executed in order for us to see that content! But bs4 and scrapy are
not JS engines and not browsers!
We can use requests-html, selenium (many language bindings), puppeteer (not
available for python directly, but there is a port: pyppeteer), or playwright (more
for testing than general automation) for this kind of work.
Start with this:
https://github.com/MindaugasBernatavicius/AngularCRUDWithJsonServer
You will need git bash to launch this
Node and npm: https://nodejs.org/en/
git clone https://github.com/MindaugasBernatavicius/AngularCRUDWithJsonServer
cd AngularCRUDWithJsonServer
npm install
ng serve
json-server --watch data/db.json
Another app:
https://github.com/MindaugasBernatavicius/ItemListLocalStorage
https://mindaugasbernatavicius.github.io/ItemListLocalStorage/
A complex page: https://www.iseecars.com/cars-for-sale
More examples: https://sportland.lt/ , https://infinite-scroll.com/demo/full-page
Requests-Html (optional)
Python Crash Course
Requests-html works with CSR web pages because it uses (as far as I recall)
puppeteer’s python port called pyppeteer, which is capable of interpreting JS.
Puppeteer is essentially a Google Chromium project, a headless browser you can run
on a command line and script. So it is definitely capable of running JS, and so is
pyppeteer, and so is requests-html.
ref: https://requests.readthedocs.io/projects/requests-html/en/latest/
Selenium
Python Crash Course
Selenium is full fledged browser automation tool, commonly used for test
automation, task automation, scraping - can do almost everything.
It supports all major browsers, while requests-html supports only Chromium (is
that still true?).
It is more powerful than requests-html but harder to start with.
Selenium supports multiple language bindings, so you can use other languages as
well. This means that learning it could be useful because you could use the
understanding you obtain in one language to learn another just by reimplementing
the projects!
We will use selenium web driver out the 3 main parts in the ecosystem:
https://www.selenium.dev/projects/

You need a driver for the browsers you are going to use, they can be found in many
places:
https://www.selenium.dev/downloads/
https://github.com/mozilla/geckodriver/releases (gecko driver is the firefox one)
https://chromedriver.chromium.org/downloads (OLD. with chrome driver you have to
use version that is identical to browser version)
https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-
downloads.json NEW LINK for chrome drivers
Complex cases
Python Crash Course
There are some things that are quite hard to scrape; common obstacles / advanced
cases:
Choose the right tool: requests + bs4, scrapy, or selenium / requests-html when you
have a CSR. Target dictates!
If you don’t hit the elements you see in the browser with the scraper, try using
incognito mode, this will let you see page closer to what your scraping tool sees.
First load problem (cookie acceptance dialogs). Change window size so the elements
would be visible.
Implant the cookies if needed (scrapy and selenium do that automatically).
Optimization with selenium - inject cookie acceptance policy to skip the acceptance
dialog.
JPEG, Video scraping is a thing: https://www.bloomberg.com/graphics/2019-tesla-
model-3-survey/, https://stackoverflow.com/a/37821542/1964707
User agent spoofing - a simple trick that can get through a lot of simple
protections
User agent rotation - a more advanced trick that can get through more advanced
protections
Handling form data (form sending GET vs. POST, sending a CSRF token with POST
requests, selenium handles CSRF)
Authentication, re-authentication - you need to know how to login to reach some
data.
Captchas - most of the time impossible to bypass, even with machine learning
(recaptha). But can be partially automated, just let the client receive
notification that he needs to pass the captcha. Or try automated services:
https://github.com/madmaze/pytesseract , https://www.deathbycaptcha.com/ ,
https://anti-captcha.com/ . Windows API for mouse movement. Robot arm:
https://www.ebay.com/sch/i.html?
_from=R40&_trksid=p4432023.m570.l1313&_nkw=robot+arm&_sacat=0
Rotating IPs with proxies, with scrapy it’s effortless, more work with other
frameworks, see: https://free-proxy-list.net/ , see: https://www.zyte.com/smart-
proxy-manager/ . See: https://stackoverflow.com/a/59410739/1964707
Robots.txt honoring (how to find and interpret).
Complex cases
Python Crash Course
Continued …
Prefer APIs if they exist, even if they are not free
Cache the page / reuse the same variable to avoid superfluous http requests
Crawl / scrape during low-load hours or throttle the requests (web apps have a
nocturnal/diurnal cycle)
Rotate crawling patterns (especially if you get blocked, countermeasure)
Copyright - do not republish or redistribute the content
Constantly changing HTML (which we don’t control) - can be solved using selectors
as configuration. Generalize your selectors: //#x/ul/li[5]/p vs. li[5]/p ← the
second one is less likely to change, the chain is shorter.
Pagination and infinite scroll (pinterest) are also cases you might encounter
Honeypot traps: sometimes display: none is used to lure the bot to click some
invisible link and then block the ip (selenium handles this automatically because
visibility matters).
Scraper poisoning (scraped data poisoning) /w display: none - fake data inserted
invisible for the user.
Rotating selectors - randomly generated classes / id’s. Use tag selectors,
hierarchies of them.
Headless mode for server environments and massive scraping.
Screenshots.
Screensizes.
Questions
Python Crash Course
We scraped data from multiple e-shops for the same items, and the names are
different in each shop. How do we compare names/strings in that case? Compare:
https://www.benu.lt/sezono-svarbiausi-produktai/bioderma-apsauginis-kremas-nuo-
saules-raustanciai-odai-naturalaus-atspalvio-photoderm-ar-spf50-30-ml
https://camelia.lt/odos-ir-plauku-kosmetika/151078-apsauginis-kuno-kremas-bioderma-
photoderm-ar-spf50-nuo-paraudimo-30-ml.html
Get structured information: brand, item name, amount in the bottle.

Dumb algorithms: diff between sentences based on the number of matching words or
matching characters.
Stemming from NLP could be tried: https://towardsdatascience.com/text-cleaning-
methods-for-natural-language-processing-f2fc1796e8c7

Questions
Python Crash Course
Possible to run in linux server w/o a windowing system? YES:
https://stackoverflow.com/questions/68283578/how-can-i-run-selenium-on-linux

Course plan
You can get familiar with the program here
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/
Python Crash Course
Detailed course plan
You will find tasks, past slides, etc.

--- Content from 15SQY4Unk9TXKFgRwEwY_hThgbK501c7B.pptx ---


Artificial Intelligence
Python Crash Course
2024
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Wide and long data format
01
02
Multilevel indexes
Python Crash Course
00
Joins
Time series
03
Cleaning data
04
05
06
Correlation
Exercises
07
Practical Task P3
Joins
Just like in SQL, we have the ability to join two distinct dataframes.
Some call this mechanism “SQL style joins”, as it has some parallels with the
joins used in the relational model.
SQL discussion about Joins can be found here:
https://docs.google.com/presentation/d/1yaHEZojmi3CcDDAZIItjSC1oekxTMflV/
More info on merge: https://stackoverflow.com/questions/53645882/pandas-merging-101

Methods:
df1.join(df2) → join on index
df1.merge(df2) → join on any column in multiple ways (more versatile than
join)
indicator=True,
how='left',
validate='m:m'
etc.
pd.concat([df1, df2]) → concatenate top-to-bottom, like a “union”
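
A small sketch of these join flavors on made-up frames:

import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Jonas", "Petras", "Antanas"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [1000, 1500, 900]})

# left join on a column; indicator adds a _merge column showing match provenance
merged = employees.merge(salaries, on="emp_id", how="left",
                         indicator=True, validate="1:1")
print(merged)

# concat stacks frames top-to-bottom, like a SQL UNION ALL
print(pd.concat([employees, employees], ignore_index=True))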

Python Crash Course


Wide and long data format
No standard naming - some call it wide / long some short / fat, see:
https://en.wikipedia.org/wiki/Wide_and_narrow_data
Narrow / long format is called EAV: https://en.wikipedia.org/wiki/Entity
%E2%80%93attribute%E2%80%93value_model
Let’s distinguish between identifiers and measured variables - names and data! Also
dependent and independent variables (the speed of a free-falling object depends on G
and the coefficient of drag) - temp = f(t) (temperature is dependent on time).
Which one is better? We mostly use the wide format, but why? A couple of rules:
Columns should not be constructed from data values (like 2001)
All variable names should be in the columns, not in data (like variable)
Melting and pivoting:
Pivoting refers to moving from long to wide format (not the same as in excel, where
a pivot table is a summary statistics table, essentially a grouping aggregate).
Could you create a pivoting algorithm?
Melting refers to moving from wide to long format

Python Crash Course


Wide and long data format
Python Crash Course
Wide and long data format
Example

Additional examples:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
https://dfrieds.com/data-analysis/melt-unpivot-python-pandas.html

Python Crash Course


Multilevel indexes
Also known as “hierarchical indexes”.
When a single-level index can’t identify a row uniquely, we can use multilevel
indexes. When index values are duplicated it is hard to select by index,
essentially rendering the notion of an index meaningless. This is where multi-level
indexes come in.
Multi indexes have levels, levels can be changed / swapped.
Additional information - see how multilevel indexes allow for representation of N-
dimensional data in N-#idx-dimensional structure:
https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-
indexing.html - we can query across all dimensions w/o “hacks”
Api:
creating a multiindex: df.set_index(['year', 'id'])
you need to sort index if you want to range-index a multi index
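
A minimal sketch of the API above, on toy data:

import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "id": [1, 2, 1, 2],
    "sales": [10, 20, 12, 25],
})

mi = df.set_index(["year", "id"]).sort_index()   # sort to enable range selection
print(mi.loc[2020])           # select by the outer level
print(mi.loc[(2020, 2)])      # select by both levels
print(mi.swaplevel().head())  # levels can be swapped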

Python Crash Course


Multilevel indexes

Python Crash Course


Multilevel indexes
What are the benefits? Any performance benefits?

Python Crash Course


Time series
Pandas was developed in the context of financial modeling, so as you might expect,
it contains a fairly extensive set of tools for working with dates, times, and
time-indexed data.
We will talk about time series data much more in part “Sequential Data Analysis”,
but let’s plant the seeds early so that next time we get to this topic we would
have the basics down.
We need to understand that we talk about two measurements when discussing time:
points in time and periods of time. Just as int + int ⇒ int, point in time − point
in time ⇒ period. We can use the points in time as indexes (independent variables)
and tie data values / measurements to them: a stock price variation graph.
Pandas actually differentiates that even further:
https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-
series.html#Pandas-Time-Series-Data-Structures
Because Pandas was developed largely in a finance context, it includes some very
specific tools for financial data. For example, the accompanying pandas-datareader
package (installable via pip install pandas-datareader), knows how to import
financial data from a number of available sources, including Yahoo finance, Google
Finance, and others. Here we will load Google's closing price history.
There are TS specific operations implemented by Pandas, like: resampling, shifting
and window operations. See more about resample(). vs. asfreq():
https://stackoverflow.com/a/54051553/1964707
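
A small sketch of a time-indexed series and the operations just mentioned (synthetic data):

import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=10, freq="D")   # points in time
ts = pd.Series(np.arange(10), index=idx)

print(ts.resample("3D").mean())   # aggregate into 3-day periods
print(ts.asfreq("3D"))            # just pick every 3rd point, no aggregation
print(ts.shift(1).head())         # shifting, e.g. for day-over-day changes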
Python Crash Course
Time series

Python Crash Course


Time series
Python Crash Course
Cleaning data
Cleaning data is a very important part of data pre-processing before analysis /
modeling can begin.
The data may contain: empty cells, data in wrong format, wrong data, duplicates.
Techniques: outlier removal, normalization, data imputation, format adjustments,
duplicate removal … model creation for data cleaning (subject to compound errors).
Ref: https://www.w3schools.com/python/pandas/pandas_cleaning.asp
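
A compact sketch of several of these techniques on a made-up frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, np.nan, 12.0, 10.0, 9999.0],              # empty cell + outlier
    "date": ["2021-01-01", "2021-01-02", "bad", "2021-01-01", "2021-01-05"],
})

df = df.drop_duplicates()                                     # duplicate removal
df["price"] = df["price"].fillna(df["price"].median())        # imputation
df["date"] = pd.to_datetime(df["date"], errors="coerce")      # wrong format -> NaT
df = df[df["price"] < 1000]                                   # crude outlier removal
print(df)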

Python Crash Course


Computational tools
Pandas provides a useful capability of displaying the correlation between variables
- a correlation matrix.
Correlation lies on a spectrum from perfectly negative (−1) through zero to
perfectly positive (+1).
Both perfect positive and perfect negative correlation between two variables imply
a deterministic relationship between them (for some models we need to track /
eliminate that dependence).
Correlation fallacy: correlation does not imply causation - 89% of people that get
sunburns and are admitted to the hospital also eat ice cream. However, we would not
say that ice cream causes sunburn - in fact, sun and high temperature cause both
(post hoc ergo propter hoc).
More on causal inference in data science:
https://www.youtube.com/watch?v=dFp2Ou52-po
See: https://www.w3schools.com/python/pandas/pandas_correlations.asp
There are additional tools available, see:
https://pandas.pydata.org/pandas-docs/version/1.2.0/user_guide/computation.html
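
A sketch of the sunburn / ice cream confounder with synthetic data:

import numpy as np
import pandas as pd

n = 100
temp = np.random.uniform(15, 35, n)
df = pd.DataFrame({
    "temperature": temp,
    "icecream_sales": temp * 3 + np.random.normal(0, 5, n),   # driven by temperature
    "sunburns": temp * 0.5 + np.random.normal(0, 2, n),       # also driven by temperature
})

# Both columns correlate with temperature AND with each other -
# a confounder, not ice cream causing sunburns
print(df.corr())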

Python Crash Course


Exercises
Please complete the test and exercises to verify pandas knowledge:
Quiz: https://www.w3schools.com/python/pandas/pandas_quiz.asp
Exercises: https://www.w3schools.com/python/pandas/pandas_exercises.asp

Python Crash Course


Practical Task P3
Complete this notebook:
https://github.com/MindaugasBernatavicius/DeepLearningCourse/blob/master/
03_Pandas/PP3_Pandas.ipynb
You will need to launch it, import all the libs and data (sometimes maybe even
think about how to do it) and perform the actions described.
After completing it please provide a link with the completed tasks (link to google
collab or github repository with the notebook).
Python Crash Course
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1jppuksqiiplJbuud-C_aXBWJJr8Zxh2S.pptx ---


Artificial Intelligence
Python Crash Course
2023
Lecturer
Mindaugas Bernatavičius

2 Level
1 Chapter
Today you will learn
Bytes
Dict
01
02
03
List
Python Crash Course
00
Collections
04
Tuple
05
Set
06
Protocols
07
Control Structures: Loops, Conditions
08
Comprehensions
Collections
Python Crash Course
Collections are sets of elements accessible under a single name - a variable
representing not one but many values. For now we will not talk about the Python
Collections module; that is a separate topic. For now we define the word
“collections” loosely, as an antonym to scalar variables.
We can distinguish the following collections:
Strings
Bytes
List
Dict
Tuple
Set
… since we have already talked about Strings, we will start with bytes.
Bytes
Python Crash Course
Similar to a String variable - if a string is a sequence of Unicode characters,
bytes are sequences of bytes.
We can store binary data.
We can store text encoded with a fixed-width encoding, e.g. ASCII.
From time to time we will encounter library methods that return bytes as their
output - for example file and network data.
Literal form: b'aaa'
Many string operations are supported - indexing, slicing, iterating.
If we have a byte array and want to turn it into a string, we must know which
encoding was used to encode the bytes.
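
A tiny sketch of these points:

# Bytes literal, indexing, slicing
data = b'labas'
print(data[0])       # 108 - indexing a bytes object yields an int
print(data[1:3])     # b'ab' - slicing yields bytes

# To turn bytes into str you must know the encoding that was used
raw = 'ąžuolas'.encode('utf-8')
print(raw.decode('utf-8'))    # 'ąžuolas'
# raw.decode('ascii') would raise UnicodeDecodeError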
List
Python Crash Course
A very commonly used, mutable data structure representing a “set” of elements.
Unlike strings, lists are mutable, so append() and other methods operate on the
same list.
Indexed from 0 and ordered: elements stay in the order you put them in.
We can store heterogeneous types as well as duplicates.
A replace operation is just assigning another value to a list position (lst[5] =
“Jonas”)
There is also the del operator and the remove() function.
We can unpack values from a list into named references (variables): a, b = [1, 2]
or even name, num1, num2 = sys.argv
Methods to remember: sort() vs. sorted() (more on the next slide), reverse() vs.
reversed(), append(), insert(pos, val), extend([list2]), pop(), clear() etc. …
help(list).
To access values “from the end” we use negative indexes, cf.: list[len(list) - 1]
== list[-1], but note list[0] == list[-0] - beware of OBOBs (off-by-one bugs)!
List repetition and concatenation with * and +
More: https://www.w3schools.com/python/python_lists.asp

Implementation:
CPython implements lists as an array of pointers with overallocation, O(1)
access
More: https://stackoverflow.com/questions/3917574/how-is-pythons-list-
implemented .
Do you know whether this is a cache-friendly data structure?

Copying lists:
the difference between reference and value
list2 = list1[:]
list2 = list1.copy()
list2 = list(list1) // can be used not only with lists.
copies are shallow, so copied lists still see the same objects; if you need a
“deep copy”, use the copy module.
[0] * 9 also copies the reference, not the values, so after changing an object in a
nested list the change will be visible in all elements: s = [ [1] ] * 5 ; s[2].append(6)
NOTE: copy.deepcopy() - we can inspect its source code and it will be Python, but
the code of the copy() method will be C.
List
Python Crash Course
Sorting:
The sort() method changes the original list (an in-place sort).
Uses the timsort algorithm (more about this in the lecture on algorithms).
Two arguments: key and reverse.
As key we can pass a callable object, and the interpreter will apply that callable
to each item in order to compare all members with one another.
E.g.: list_of_words.sort(key=len)
The sorted() function returns a sorted copy of the list.
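
A short sketch of shallow vs. deep copying and the two sorting APIs:

import copy

inner = [1]
lst1 = [inner, inner]
lst2 = lst1.copy()            # shallow: same inner objects
lst2[0].append(2)
print(lst1)                   # [[1, 2], [1, 2]] - change visible through lst1

lst3 = copy.deepcopy(lst1)    # deep: fully independent
lst3[0].append(3)
print(lst1)                   # unchanged

words = ["pear", "fig", "banana"]
print(sorted(words, key=len))       # returns a sorted copy: ['fig', 'pear', 'banana']
words.sort(key=len, reverse=True)   # sorts in place
print(words)                        # ['banana', 'pear', 'fig']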
Dict
Python Crash Course
A fundamental data structure in Python:
key:value pairs - { k1 : v1 , k2 : v2 }
similar to a map (Java/C#), an associative array (PHP), or a JSON object (JS) in
other languages.
Mutable, but duplicate keys cannot be used - they must be unique.
Since 3.7, added items keep their insertion order (ordered), which is not a big
deal, since “addressing” is not by index but by key.
The magic property: constant-time access O(1) - implementation: a hashtable. Do you
know what makes the hashtable structure special? As for collisions, Python uses an
open addressing collision resolution strategy.
Methods: keys(), values(), items(), pp(), update() (why we need .update() -
https://stackoverflow.com/a/70773868/1964707 )
Copying is shallow by default as well.
Keys can only be immutable data types: str, int, float, byte, tuple,
datetime.datetime
More: https://www.w3schools.com/python/python_dictionaries.asp
Tuple
Python Crash Course
An immutable sequence of objects:
The literal form can be initialized with parentheses or without them entirely;
there is also the tuple() constructor.
Characteristics: ordered, unchangeable / immutable, allows duplicates
We can use the * and + operators to grow by repetition and to join two tuples (by
the way, we can do this with strings too).
Again, nesting is possible and there are plenty of API methods for various
operations.
Tuple unpacking - we assign tuple values to named references: a, b, c = ("apple",
"banana", "cherry")
Ref: https://www.w3schools.com/python/python_tuples.asp

Usage:
Dict (multi)key for fast dict access
Fast access / iteration (but not if you need modification)
Unpacking a = (a, b) // a = a, b // a, b = b, a
Heterogeneous collections: [("Max", "Weber", 55), ("Min", "Power", 44)]

Implementation details:
https://rushter.com/blog/python-lists-and-tuples/

Compare the properties of Tuple and List (on the side). There is also a convention
to use tuples for heterogeneous collections, ref:
https://stackoverflow.com/a/1708610/1964707
Set
Python Crash Course
An unordered collection of unique elements.
Initialized with { } (if there are values) or with the set() constructor; note,
however: python -c "print(type({}))" .
Elements must be immutable; the set itself is mutable.
Common usage: efficiently dropping duplicates / keeping only the unique values of a
list/tuple.
We can iterate, and we can add to a set with .add() / .update().
… remove with: .remove() (raises an error when the item is missing) and .discard()
(does not raise)
… what we cannot do is address a member with index notation.
Ref: https://www.w3schools.com/python/python_sets.asp
Set algebra - Venn diagrams:
.union() - a commutative operation. Result: set
.intersection() - is it commutative? Result: set
.symmetric_difference() - a commutative operation. Result: set
.issubset() - is it commutative? Result: bool
.issuperset() - is it commutative? Result: bool
.isdisjoint() - is it commutative? Result: bool
higher level: union - symmetric difference = intersection
practical example: 2 groups of buyers, 1 - used a coupon, 2 - bought >$1000 (see
the sketch below)
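
The buyer-groups example as a sketch:

coupon_users = {"Jonas", "Petras", "Ona"}    # group 1: used a coupon
big_spenders = {"Petras", "Ona", "Stasė"}    # group 2: bought > $1000

print(coupon_users | big_spenders)   # union - everyone
print(coupon_users & big_spenders)   # intersection - did both
print(coupon_users ^ big_spenders)   # symmetric difference - did exactly one
print(coupon_users - big_spenders)   # difference - coupon only
print({"Petras"}.issubset(coupon_users))   # True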
Protocols and generalization
Python Crash Course
Protocol: https://mypy.readthedocs.io/en/stable/protocols.html
For an object to support a certain protocol it does not have to implement an
interface or inherit from an abstract or concrete class. It just has to support a
certain operation:

This comes from here: https://docs.python.org/3/library/collections.abc.html - this
answers the question of which method you should implement to support, e.g., the
Sized protocol (__len__)
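
A minimal sketch of duck-typed protocol support (the class is made up):

from collections.abc import Sized

class Playlist:
    # Supports the Sized protocol simply by implementing __len__ -
    # no inheritance or interface declaration required
    def __init__(self, songs):
        self._songs = list(songs)

    def __len__(self):
        return len(self._songs)

p = Playlist(["song1", "song2"])
print(len(p))                 # 2 - len() works because __len__ exists
print(isinstance(p, Sized))   # True - the ABC recognizes it structurally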
Protocols and generalization
Python Crash Course
Ref: https://stackoverflow.com/questions/1708510/list-vs-tuple-when-to-use-each
Control structures
Python Crash Course
Python has the following control structures:
if, else and elif conditionals
while loops
for loops
….

Complexity can arise because all of these structures can be nested.
However, remembering the Zen of Python (PEP20), note that such code is not
Pythonic: “flat is better than nested” - it is better not to nest too “deeply”.
Avoid code like this:
The if statement
Python Crash Course
The if statement enables conditional code execution, where one piece of code can be
left completely untouched by the interpreter, or, depending on the conditions, one
piece of code is executed instead of another:
Schematically: if <expr>: - this is the condition head.
Then comes indentation; 4 spaces are recommended.
Everything that is indented is the if body.
If the condition <expr> evaluates to False or falsy, the code in the if body will
NOT be executed.
If the condition <expr> evaluates to True or truthy, the code in the if body will
be executed.
Using it in command-line one-liners is tricky, though simpler on Linux than on
Windows:
https://stackoverflow.com/questions/2043453/executing-multi-line-statements-in-the-
one-line-command-line
Else - used when “one or the other” piece of code must be executed.
Elif - when a 3rd, 4th, … nth option appears; written between if and else.
Always be mindful to cover all the ranges of values in your conditionals ( … not
really OBOBs, but similar)
… if age < 15
… elif age > 15 || elif age >= 15
Multiple chained if’s (flat) is better than nested if’s.
Eliminating nested if’s using data structures:
https://colab.research.google.com/drive/19unAvFJZu1qPrPHrYO7xLP1DWKgs4qA-?
usp=sharing

Python has no ternary operator; it is imitated and:

called a “conditional expression”
used for single-line code and because it is handy in comprehensions
e.g.: state = "nice" if is_nice else "not nice"
very thorough: https://blog.finxter.com/python-one-line-ternary/
The if statement
Python Crash Course
It is important to be able to restructure conditional code (and conditionals
combined with loops), aka if restructuring or, more generally, code restructuring.
Ref: https://www.youtube.com/shorts/Zmx0Ou5TNJs
Ref:
The for loop
Python Crash Course
Python has 2 mechanisms for cyclic execution - for and while. The for loop is
similar to for-each in other languages.
The loop header ends with a colon; the body is separated by indentation (a block).
Pseudocode for iterating a list: for item in items: print(item)
Pseudocode for iterating a dictionary: for key in dictionary: print(key, ":",
dictionary[key])
More: https://www.w3schools.com/python/python_for_loops.asp

The range and enumerate functions:

What if we want to print a list or dict with each element's position? List:
range(), dict: enumerate(). It is advisable to use enumerate for lists as well,
although not everyone does so in practice.
The range constructor actually returns a range object. It is a separate Python
type, with no literal form.
A half-open range is used: range(0, 5) → [0 … 4]
Using range with 3 args gives a ‘step’. Exercise: list every third element of a
list.

The break and continue statements:

Break - terminates the nearest enclosing loop entirely - execution continues from
the point where the enclosing loop ends (“jumps out of the loop”)
Continue - terminates the current iteration of the nearest enclosing loop -
execution continues with the next loop iteration (an iteration is one revolution /
one repetition).
Find all even numbers: can be done with continue (keep going even though we found a
matching member).
Find the first odd number: break (once one is found, there is no need to continue).
A sketch of these mechanisms follows below.
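
A sketch of range, enumerate, break and continue:

nums = [7, 4, 9, 2, 5]

# enumerate gives position + value
for i, n in enumerate(nums):
    print(i, n)

# range with a step: every third element
for i in range(0, len(nums), 3):
    print(nums[i])            # 7, then 2

# find all even numbers with continue
evens = []
for n in nums:
    if n % 2 != 0:
        continue
    evens.append(n)           # ends up as [4, 2]

# find the first odd number with break
for n in nums:
    if n % 2 != 0:
        print(n)              # 7
        break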
The for loop
Python Crash Course
Performance:
The most important rule - eliminate all unnecessary operations from the loop. E.g.:
you perform an operation on all members but have an if condition for the first one.
In that case it is better not to have a special case for the first member in the
conditional, because the check of whether the currently processed member is the
first one will run on every iteration (this should be verified with a disassembler).
The while loop
Python Crash Course
Python also has a while loop, although it has no do .. while:
Meant to repeat an action as long as some condition holds.
The syntax and semantics are very similar to while loops in other languages.
Although the condition after while is converted to bool as if the bool()
constructor had been called, writing while c: where c is a number and decrementing
it with c -= 1, thus exploiting truthy and falsy values, is “unpythonic”, because
“explicit is better than implicit” and “readability counts”.
While / else is used to perform an action once the while condition stops holding.
More: https://www.w3schools.com/python/python_while_loops.asp
Let’s combine what we’ve learned
Python Crash Course
Let’s use libraries to extract text from publicly available resources.
Comprehensions
Python Crash Course
This is a way to define list, set and dict structures that is:
easy to read, short and compact
reads like natural language
e.g.: [len(word) for word in words] → creates a list.
filter, generate, map
set comprehensions work very similarly

Dict comprehensions may look more complex, but the idea is the same:

The syntax differs: { expr(key) : expr(val) for key, val in dict.items() }
It can be used to invert key and value: { v : k for k, v in dict.items() }
(restrictions apply - keys must be immutable/hashable)
And we can iterate over a list inside the comprehension but build a dictionary:
{ word[0] : word for word in words }

Advice: do not use complex comprehensions and do not produce side effects in them
(console, files, db, netw.)
Comprehensions
Python Crash Course
Comprehensions can be filtered:
Comprehensions
Python Crash Course
Comprehensions
Python Crash Course
Comprehensions can be “multi-output”:
Comprehensions
Python Crash Course
Comprehensions can be nested:
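
The comprehension variants above in one sketch:

words = ["apple", "banana", "cherry", "fig"]

print([len(w) for w in words])            # map: [5, 6, 6, 3]
print([w for w in words if len(w) > 4])   # filter: long words only
print({len(w) for w in words})            # set comprehension: {3, 5, 6}
print({w[0]: w for w in words})           # dict: first letter -> word

d = {"a": 1, "b": 2}
print({v: k for k, v in d.items()})       # inverted dict: {1: 'a', 2: 'b'}

# nested: flatten a list of lists
grid = [[1, 2], [3, 4]]
print([x for row in grid for x in row])   # [1, 2, 3, 4]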
Course plan
You can get familiar with the program here
Additional information
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
You will find tasks, past slides, etc.
--- Content from 14rORcWrL6EoB6NqNdktcm2D-I2GaaIhl.pptx ---
Artificial Intelligence
Computer vision and image classification
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


What is computer vision, applications
01
02
How do humans see
Computer vision and image classification
00
Structure of Part 7
Deep learning for computer vision
03
04
Image preprocessing techniques (scikit-image & pytorch)

05
06
Image representation in CV
Questions and what’s next
In this section we have 3 main goals:
- learn how images are represented and preprocessed in ML
- learn about a new type of neural network called the Convolutional Neural Network
(including how to use it)
- learn a new technique called transfer learning!
In future lectures you will see that we will use CNNs for Inverse Image Search
(part 8), and we will revisit some of the same topics in Advanced Computer
Vision (object detection and segmentation, part 13)
After this part you will know how to:
Prepare images for training a ConvNet
Train a FCFFNN for image classification and compare it to a ConvNet
Understand why a FCFFNN is not sufficient (this is a common interview question)
Train and optimize a ConvNet (in all 3 popular frameworks: TFK, Torch, Fast.ai)
Use transfer learning to improve your models
Structure of Part 7
Computer vision and image classification
People ingest ~85% of information about the world through their eyes.
What if we could make computers that understood visual information and acted upon
it?
That is precisely what computer vision is about - making computers see things and
act on that information.
Computer vision is an interdisciplinary scientific field that deals with how
computers can gain high-level understanding from digital images or videos. From the
perspective of engineering, it seeks to understand and automate tasks that the
human visual system can do (and maybe even can’t do - going above and beyond).
Computer vision tasks include: processing, analyzing and understanding digital
images, extraction of high-dimensional data from the real world in order to produce
numerical or symbolic information and even actions / decisions based on the
information.
Computer vision data can take many forms, such as video sequences, views from
multiple cameras, multi-dimensional data from a 3D scanner or medical scanning
device (CT scan, CAT scan, x-ray scan, MRI scan). Spectrum: greyscale 28x28 images
to huge 3D images and videos.
The technological discipline of computer vision seeks to apply its theories and
models to the construction of computer vision systems.
What is computer vision
Computer vision and image classification
Sub-domains of computer vision include scene reconstruction, event detection, video
tracking, object recognition, 3D pose estimation, motion estimation, visual
servoing (servo motors), 3D scene modeling, and image restoration, image
classification.
Computer vision is not a single thing! There are many interesting problems that are
being solved with a variety of different techniques.
What is computer vision
Computer vision and image classification
Applying filters and other image processing techniques. They can be used by
computer vision, but only as an intermediary step in understanding the image.
Certainly not design, working with photoshop or photography (cameras can use
mechanisms derived from computer vision).
Don’t confuse image processing / retouching / digital art creation / graphic design
and computer vision.
What computer vision is NOT
Computer vision and image classification
Barcode scanner → Amazon go: https://towardsdatascience.com/how-the-amazon-go-
store-works-a-deep-dive-3fde9d9939e9
Parking meter → Pay-by-Plate: https://viso.ai/computer-vision/automatic-
number-plate-recognition-anpr/
Medical diagnostics → cancer recognition / prediction.
Applications
Computer vision and image classification
https://hackernoon.com/a-brief-history-of-computer-vision-and-convolutional-neural-
networks-8fe8aacc79f3
Most important milestones:
Cats experiments (~1950)
1957 images to numbers
3D to 2D scanining (~1960)
1982 established that vision is hierarchical
CNN (1985-1989, LeCun applies backprop to Fukushimas CNN)
AlexNet (2012) - not the first in any respect, but probably the most impactful
historically: ILSVRC.
Remember. The target in AI is:
Human level performance
Super human performance
Deep learning for computer vision
Computer vision and image classification
A CNN that beat its competition.
Designed by Alex Krizhevsky and published with Ilya Sutskever and Krizhevsky's
doctoral advisor Geoffrey Hinton.
AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge on
September 30, 2012. The network achieved a top-5 error of 15.3% (top-1 and top-5
error rates explained: https://stats.stackexchange.com/a/156515/162267 ), more than
10.8 percentage points lower than that of the runner up.
Many consider this the starting point of modern computer vision boom.
Deep learning for computer vision
Computer vision and image classification
The high level process is the same (supervised learning):
Take a set of images
Label them / preprocess them
Create CNN
Train it using backprop
Tune and train again
Additional concepts we will need are:
Image representation in CV
Image pre-procesing techniques
Deep learning for computer vision
Computer vision and image classification
Constructivist vision - we see in 2D, but construct a 3D model of the world.
Our brain constructs, derives 3D from 2D based on:
Size of objects, remembered or inferred from similar objects
Their ordering in space (closer further)
most importantly stereoscopy (2x2D images produce stereoscopy).
Do we need two cameras for depth in CV? Not really. Link
The brain constructs a lot of what we see - colors in the peripheral vision, no
blind spot.
Colors. There are several color systems: RGB, CMYK
We see RGB: https://www.youtube.com/watch?v=l8_fZPHasdo
How do we see?
Computer vision and image classification
continued…
How do we see?
Computer vision and image classification
Remember, all the qualitative data in DL needs to be reduced to quantitative data.
Images are comprised of pixels.
Each pixel is comprised of RGB values (if it’s a colored image, multichannel image)
or just one value if it is a gray-scale image (single channel image).
We represent the set of all the pixels comprising the image as tensors, in both the
grayscale and RGB cases. However, in the RGB case the 3rd dimension has 3 elements.
So an image of size 6px by 6px will be represented as a 3D tensor.
Image representation in CV
Computer vision and image classification
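A minimal sketch in NumPy (with random, made-up pixel values) of both cases:

import numpy as np

# Grayscale (single channel): one intensity value per pixel.
gray = np.random.randint(0, 256, size=(6, 6), dtype=np.uint8)
print(gray.shape)  # (6, 6) - can also be stored as (6, 6, 1)

# RGB (multichannel): the 3rd dimension holds the 3 channel values.
rgb = np.random.randint(0, 256, size=(6, 6, 3), dtype=np.uint8)
print(rgb.shape)   # (6, 6, 3)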
Because an image can be represented as 3D tensor, a bunch of images is a 4D tensor.
We usually feed 4D tensors to NNs when working with image data.
Tensors need to be of the same size, so all images need to be of the same size and
have the same number of color channels (for a NN with a single input layer).
What does this tensor of image data mean:
Image representation in CV
Computer vision and image classification
The answer is: it depends, but often it's like in the picture below.
Note that the number of images is the first number in all frameworks. The meaning of
the other numbers depends on the framework you are using!
Image representation in CV
Computer vision and image classification
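A small sketch of the two common batch layout conventions (channels-last is the
TensorFlow / Keras default, channels-first is the PyTorch convention):

import numpy as np

# A batch of 32 RGB images, 64x64 pixels each.
batch_nhwc = np.zeros((32, 64, 64, 3))  # channels-last (TensorFlow / Keras default)
batch_nchw = np.zeros((32, 3, 64, 64))  # channels-first (PyTorch convention)

# In both conventions the first number is the number of images in the batch.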
We could learn to classify images and perform even more demanding tasks with
already preprocessed images from well known datasets (MNIST (28x28x1, 70K
symmetric, handwritten digits, 10 class), fashion MNIST (28x28x1, 70K symmetric,
clothes, 10 class), CIFAR-10/100 (32x32x3, 60K symmetric, common items, 10/100
class), Imagenet, Traffic Signs, etc.). But these techniques will be necessary when
doing work on new images.
Image preprocessing can improve your models. Sometimes it can even enable them to
work!
We will discuss 6 common techniques. There are way more!
Image preprocessing techniques
Computer vision and image classification
Uniform aspect ratio: the ratio of the width to the height of an image or screen.
Since it’s a ratio, that means it’s division: width / height = aspect ratio.
Uniform image size: CNN uses feature maps to extract features from images.
You might need to rescale the images according to the feature maps that you are
going to use (downscaling, sometimes upscaling).
Image preprocessing techniques
Computer vision and image classification
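A minimal resizing sketch with Pillow; the file name is hypothetical:

from PIL import Image

img = Image.open("example.jpg")        # hypothetical input file

# Force a fixed 224x224 input size (may distort the aspect ratio).
img_resized = img.resize((224, 224))

# Alternative that preserves the aspect ratio: shrink in place, then pad/crop.
img.thumbnail((224, 224))              # modifies img in place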
Mean image subtraction. To make models more robust, so that they work with
noisy images, we can use the mean image technique. Calculate the mean image and then
subtract it from all images. An additional benefit of this technique is making
uneven backgrounds in the images more even - reducing the chance that our model will
take lighting and background color as a very important feature.
Calculate the mean image, inspect it.
We can subtract the mean image from all our images (2 techniques):
More explanations: Link Link
There are two techniques for mean image:
Very good Q&A, let's read it: Link
Image preprocessing techniques
Computer vision and image classification
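A minimal sketch of both mean-subtraction techniques, assuming a hypothetical float
image array:

import numpy as np

# images: a hypothetical array of shape (n_images, height, width, channels).
images = np.random.rand(100, 32, 32, 3).astype(np.float32)

# Technique 1: subtract the full mean image (pixel-wise mean over the dataset).
mean_image = images.mean(axis=0)              # shape (32, 32, 3)
centered = images - mean_image

# Technique 2: subtract only the per-channel mean (one scalar per channel).
channel_mean = images.mean(axis=(0, 1, 2))    # shape (3,)
centered_per_channel = images - channel_mean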
Perturbed images - perturbation also enables feeding the network more images with
slight variations. This makes the model more robust.
Normalization of image inputs - normalization is also needed for images!
Image preprocessing techniques
Computer vision and image classification
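A minimal normalization sketch on a hypothetical batch of images:

import numpy as np

# Hypothetical batch of raw 8-bit images.
images = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype(np.float32)

# Scale raw pixel values from [0, 255] to [0, 1].
images_01 = images / 255.0

# Or standardize: zero mean, unit variance (here computed per channel).
mean = images_01.mean(axis=(0, 1, 2))
std = images_01.std(axis=(0, 1, 2))
images_std = (images_01 - mean) / std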
Dimensionality reduction - cropping, greyscaling, downsampling! An example of
dimensionality reduction. PCA is not usually applied when working with CNNs [TODO::
research on benefits - https://www.google.com/search?
q=The+Effect+of+the+Principal+Component+Analysis+in+Convolutional+Neural+Network+fo
r+Hyperspectral+Image+Classification ,
https://ijaers.com/uploads/issue_files/22IJAERS-07202264-Application.pdf ]. But
image classification with FCFFNN can use PCA.
Sometimes, it would make your model more robust. For example: what if you want your
classification model to work on images in the dark? It must not be learning based
on color, just shape.
Data augmentation - when you train your network you can add more data to the
training set by perturbing, flipping (one of the safest, except when asymmetries
apply), adding noise, rotating, shifting (geometric translation & crop) and
applying other techniques.
This will make your model more robust and give you more data. So the definition of
augmentation is: a way of generating new data from existing data by applying some
transformations to the data. Be careful not to create data that can not exist!
Data augmentation is achieved by applying transformations to images. We need to
distinguish that transformations can be applied w/o augmenting the dataset
(transformations "in place"). Be careful not to assume augmentation when using
frameworks. Usually augmentation needs to be specifically programmed, example in
Pytorch:
https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_illustra
tions.html#resize and
https://keras.io/api/layers/preprocessing_layers/image_augmentation/
random_brightness/
Image preprocessing techniques
Computer vision and image classification
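A small augmentation pipeline sketch with torchvision transforms (the input file
name is hypothetical); each call to the pipeline yields a slightly different image:

import torchvision.transforms as T
from PIL import Image

img = Image.open("example.jpg")      # hypothetical input file

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # safe unless asymmetries matter
    T.RandomRotation(degrees=10),
    T.ColorJitter(brightness=0.2),
    T.RandomResizedCrop(size=224),   # geometric shift + crop
])

augmented = augment(img)             # a new perturbed variant on every call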
You can find a list of real augmentation techniques performed in this article:
https://machinelearningmastery.com/best-practices-for-preparing-and-augmenting-
image-data-for-convolutional-neural-networks/
Demos:
Resizing and rescaling
Cropping and denoising
Augmentation
Normalization
ZCA whitening and decorrelation
Image transformations with Pytorch
Normalization for entire dataset per channel
Image preprocessing techniques
Computer vision and image classification
A note on ZCA whitening
Image preprocessing techniques
Computer vision and image classification
Decorrelation
Image preprocessing techniques
Computer vision and image classification
Pytorch image transformations
A good illustrative reference:
https://pytorch.org/vision/main/auto_examples/plot_transforms.html#sphx-glr-auto-
examples-plot-transforms-py
Keras transformations
Keras chose not to include many preprocessing functions. They rely on the Python
ecosystem and packages like Pillow. See:
https://keras.io/api/layers/preprocessing_layers/image_preprocessing/ and
https://keras.io/api/layers/preprocessing_layers/image_augmentation/
Fast.ai transformations
Ref: https://fastai1.fast.ai/vision.transform.html (old documentation is still
better) and new one is here: https://docs.fast.ai/vision.augment.html
Image preprocessing techniques
Computer vision and image classification
Swirling, pixel attacks and making networks more robust.
Ref: https://arxiv.org/pdf/1710.08864.pdf
Ref: https://ipsjcva.springeropen.com/articles/10.1186/s41074-019-0053-3
Image preprocessing techniques
Computer vision and image classification
Next:
We will learn image classification with FCFFNNs
Understand their inadequacies and compare them to Convolutional Neural Networks
that solve the problems with FCFFNNs.
Questions:
Can you pass two different sized images to a CNN? Yes, but they will take different
paths until they are reduced to the same feature vector at the fully connected
layer, ref: https://www.quora.com/How-do-you-create-a-CNN-in-TensorFlow-that-takes-
2-different-sized-images-as-input-Do-these-two-images-pass-through-different-paths-
in-the-network-and-come-together-in-a-fully-connected-layer
Questions and what’s next
Computer vision and image classification
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1kABVa0GR4xa0aPjHJX5dx3xjHDJLgHQ9.pptx ---


Artificial Intelligence
Neural Networks for Tabular Data
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Classification
01
02
Bottlenecking
Neural Networks for Tabular Data
00
Regression
//
03
04
//
05
06
07
//
//
08
//
//
09
//
We have already talked about regression extensively.
The specifics for tabular data:
Remember how to encode data for regression - nominal data: one-hot encoding,
ordinal data: label or ordinal encoding.
Use embeddings instead of encoding if you can and the model is not good enough.
Numerical data - scaling - normalize or standardize, especially for SVMs and DNNs.
Remember the distributional properties of targets - if the model underperforms,
check the distribution and use an appropriate loss function for it.
Compare to other models.
Use semi-automated tuning like GridSearch.
Regression
Neural Networks for Tabular Data
A simple NN that can perform regression can also perform classification, we just
need to add a non-linear function, like sigmoid (like we did in part 05) or
softmax.
If we have a binary classifier, then we have only two categories (spam or ham). And
we can have a multiclass classifier (buy, sell, hold).
Whatever classification model you are trying to build the model will output a
probability score for each category that you are predicting (for each discrete
value).
Softmax can be used for binary and multiclass classification.
Classification
Neural Networks for Tabular Data
Output activation for classification - we will use softmax or log softmax.
Why is log softmax more stable? The range of output values is smaller (log
probabilities instead of probabilities), which helps avoid floating-point overflow
and underflow.
Classification
Neural Networks for Tabular Data
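A small PyTorch sketch of the stability difference - with deliberately large
logits, a naive softmax overflows while log_softmax stays finite:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])  # deliberately large values

# Naive softmax overflows: exp(1000) is inf in float32, so inf / inf = nan.
naive = torch.exp(logits) / torch.exp(logits).sum()
print(naive)                           # tensor of nan values

# log_softmax applies the log-sum-exp trick internally and stays finite.
print(F.log_softmax(logits, dim=1))    # tensor([[-2.4076, -1.4076, -0.4076]])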
A measure of how different two probability distributions are can be obtained with
binary cross entropy.
Classification
Neural Networks for Tabular Data
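A minimal NumPy sketch of the binary cross entropy computation (the example labels
and probabilities are made up):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.4])
print(binary_cross_entropy(y_true, y_pred))  # ~0.338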
We evaluated our regression model using the R^2 score, how about classification
models?
We will use a confusion matrix. Which is just an arrangement of predictions vs.
actual values.
We have 4 different metrics: accuracy, precision, recall and specificity.
All 4 of these metrics range over [0, 1]; the higher each of them is, the better
the model.
There are many interesting things about the confusion matrix:
https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Classification
Neural Networks for Tabular Data
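A minimal sketch of these metrics with scikit-learn (made-up labels); specificity
has no dedicated scikit function, so it is computed from the confusion matrix:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))    # (tp + tn) / total = 0.75
print(precision_score(y_true, y_pred))   # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))      # tp / (tp + fn) = 0.75
print(tn / (tn + fp))                    # specificity = 0.75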
Accuracy is not always enough!
To guard against this, check the class distribution - is one category much more
likely than the other?
Classification
Neural Networks for Tabular Data
Precision: when the model predicts yes - how often is it correct? TP/Predicted P =
100/110 = 0.91
Sensitivity / recall: true positive rate - when it's actually positive, how often do
we predict positive? TP/Actual P = 100/105 = 0.95
Specificity: true negative rate - when it’s negative how often do we predict
negative. TN/Actual N = 50/60 = 0.83. Same as 1 - FPR
If specificity, accuracy, sensitivity and precision are high, we have a good model.
Additionally there is f1 score and confusion matrix.
F1 score, etc - refer to the slides we already discussed this topic in:
https://docs.google.com/presentation/d/1v-QtRy61cARo3E4Bq0dqmJb94WE97UBS/edit?
usp=sharing&ouid=100991015018856999546&rtpof=true&sd=true
Classification
Neural Networks for Tabular Data
A confusion matrix can be NxN for a multiclass classification problem.
There are simple intuitive tests you can do to understand your models using CM:
If the diagonal is most prominent - the model predicts things well.
If the data samples for specific categories are close to each other (like 0
(horse) and 1 (dog)), they should be confused by the model more often (than 0
(horse) and 2 (cat)). If categories that, by our own human understanding, should be
clearly separable are confusing to the model - something is pretty wrong.
Classification
Neural Networks for Tabular Data
The concept of type 1 (FP) and type 2 (FN) error - applies to binary classification
(medical diagnosis, legal system, hiring (HR), etc).
Certain problems are much more sensitive to FP vs. FN. I.e.: it's safer to classify
the patient as having HIV even if he does not, because the negative effects of the
treatment are not big (not counting psychological and social effects). Cancer - much
worse if chemo is involved. Ebola - a FN would be even more catastrophic.
In hiring, if HR said the person is good, but they were not (a false positive), it's
much more destructive than the reverse (at least Google thinks so). So if you were
to create a classification model for hiring at Google, you would need to choose a
model biased against type 1 errors (fewer false positives, even at the cost of more
false negatives).
Classification
Neural Networks for Tabular Data
Receiver Operating Characteristic graphs provide a summary for all the thresholds.
AUCs provide a visualization of classification methodologies - which one is better.
Good explanation: https://www.youtube.com/watch?v=4jRBRDbJemM
Classification
Neural Networks for Tabular Data
Reuters dataset has 46 categories. Can we have a layer of 32 neurons before the
last layer using softmax activation?
We should not, due to possible “bottlenecking” - which is a restriction of
information flow in a network due to a layer that is smaller than the subsequent
layers!
This forces the network to compress feature representations (CNN networks employ
this mechanism in conv layers).
See: https://ai.stackexchange.com/a/4887/42808 and
https://stats.stackexchange.com/questions/262044/what-does-a-bottleneck-layer-mean-
in-neural-networks and https://www.baeldung.com/cs/neural-network-bottleneck
Hypothesis for practical work: if we have many features in our dataset and few of
them are predictive, then bottlenecking might improve accuracy (we could try to
prove that).
Bottlenecking
Neural Networks for Tabular Data
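A sketch of the architecture the slide warns about, assuming the Reuters inputs are
vectorized into 10,000 dimensions (that input size is an assumption):

from tensorflow import keras

# The 32-unit layer is narrower than the 46-way output: a bottleneck that
# forces the network to compress its feature representation.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10000,)),  # assumed input size
    keras.layers.Dense(32, activation="relu"),                        # bottleneck layer
    keras.layers.Dense(46, activation="softmax"),                     # 46 Reuters topics
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])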
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Python Crash Course


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 157TimJjqH0j3vwmpaa8osnCaZjfPnGaq.pptx ---


Artificial Intelligence
Introduction to Machine Learning
2024
Lecturer
Mindaugas Bernatavičius

Today you will learn


Ensemble Learning Categories
01
02
Ensemble Learning with Scikit
Introduction to Machine Learning
00
Ensemble Learning
Random Forests
03
04
AdaBoost
05
06
07
Boosting
Gradient Boost
Stacking
Ensemble Learning
Wisdom of crowds - if you sample a large number of people the answer will be better
than from a single expert! Similar principle - one individual predictor, even if it
is the best - will not be as effective as a group.
Ensemble learning - group of predictors is called an ensemble; thus, a technique
when the prediction is made by combining multiple predictors (regressors or
classifiers) is called Ensemble Learning, and an Ensemble Learning algorithm is
called an Ensemble method.
Often winning solutions in Machine Learning competitions involve several Ensemble
methods (famously in the Netflix Prize competition: https://www.youtube.com/watch?
v=coeak1YsaYc also:
http://web.archive.org/web/20211004083913/https://www.netflixprize.com/ ).
Ensemble Learning is one of the most powerful ML techniques we have.
Introduction to Machine Learning
Ensemble learning methods classification:
By model / predictor uniformity:
Homogeneous - when the same type of models is combined (a set of decision trees -
random forest, most often). However, models of the same type usually make the same
type of mistakes (which can be mitigated in several ways we will see later), hence
the next category. Residual / error analysis would tell you that.
Heterogeneous - when different types of underlying models are combined (logistic
regression, SVM, decision tree). Even weak learners can become a strong ensemble if
there are enough of them and they are diverse (heterogeneous). Ideally they are
perfectly independent and make uncorrelated errors (which is not usually possible
since they are trained on the same data). You can expect the accuracy to approach
75% as the count of predictors with 51% individual accuracy reaches 1000 (this can
be proven with a biased coin, 51 (heads) / 49 (tails): the probability that you will
get a majority of tails becomes lower by the law of large numbers; see the
simulation sketch after this slide). With 10K predictors you could expect around
97% accuracy. Often the number of models is odd. [TODO :: need research citations]
By the technique used to combine the intermediate predictions:
Voting - final prediction is made by combining the predictions from each model. If
the majority decides this is called hard voting, if the average is taken - soft
voting. Soft voting often achieves higher performance than hard voting because it
gives more weight to highly confident votes. Often with diverse ensemble.
Bagging & Pasting - when you want to use the same type of models in your ensemble
you can still diversify at the data level. You can train your models on different
samples of the same training set. When sampling with replacement (WR design) - we
call that bagging (aka bootstrap aggregating), without replacement (WOR design) -
pasting. Replacement is explained well here:
https://en.wikipedia.org/wiki/Sampling_(statistics)#Replacement_of_selected_units .
It is understood in the literature that both bagging and pasting produce higher
bias for individual predictors than if they were trained on all the data, but the
aggregate usually has lower variance and similar bias. Both bagging and pasting can
be done in parallel on all CPU cores; they are very popular methods because they
scale well. Bootstrapping introduces a bit more diversity in the subsets that each
predictor is trained on, so bagging ends up with a slightly higher bias than
pasting; but the extra diversity also means that the predictors end up being less
correlated, so the ensemble's variance is reduced. That is why bagging is often
preferred, however both should be tried using CV.
Boosting - an ensemble learning technique that combines several weak learners in a
sequence. Two popular methods: AdaBoost and GradientBoost.
Stacking - when a metalearner is trained on top of weak learners, as opposed to a
simple rule deciding on the final result.

Ensemble Learning Categories


Introduction to Machine Learning
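A quick simulation sketch of the biased-coin argument above (numbers vary slightly
per run):

import numpy as np

rng = np.random.default_rng(42)

def ensemble_accuracy(n_predictors, p=0.51, n_trials=100_000):
    # Number of correct votes per trial follows a Binomial(n_predictors, p).
    correct_votes = rng.binomial(n_predictors, p, size=n_trials)
    # Majority voting is correct when more than half of the votes are correct.
    return (correct_votes > n_predictors / 2).mean()

print(ensemble_accuracy(1_000))    # ~0.75
print(ensemble_accuracy(10_000))   # ~0.97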
Voting (hard & soft), bagging and pasting, boosting, stacking.
Ensemble Learning Categories
Introduction to Machine Learning
Ensemble learning and scikit:
There is a wrapper class called: VotingClassifier() that takes a list / tuple of
classifiers and supports hard and soft voting.
In scikit you need all the models in the ensemble to have predict_proba() method
for soft voting to work.
If ensemble has SVC - pass parameter probability=True - training will take longer,
but since the accuracy is often higher it’s often used.
Note: the better your individual predictors are tuned the better the ensemble will
perform. If you have weak learners - dummies - you need to increase their count.
Demo: soft and hard voting on moons dataset.
BaggingClassifier() and BaggingRegressor() are for bagging. If you want pasting
just set bootstrap=False, specify n_jobs parameter to tell how many CPU cores to
use (-1 for all available).
You can also pass the parameter oob_score=True, which instructs scikit to provide
training evaluation scores calculated on the out-of-bag training data or out-of-bag
training instances (remember: when bagging, the same sample can be selected for
different predictors, but not all samples will be selected). Each predictor never
sees the oob instances during training, so it can be evaluated on these instances
without the need for a separate validation set. You can evaluate the ensemble
itself by averaging out the oob evaluations of each predictor. In general the oob
score should be lower than the accuracy on test data.
There is also feature sampling implemented in scikit w/ the BaggingClassifier().
Each predictor will be trained on a random subset of the input features with
max_features and bootstrap_features parameters. This technique is sometimes used
when you are dealing with high-dimensional inputs (such as images). Sampling both
training instances and features is called the Random Patches method, keeping all
training instances but sampling features is called Random Subspaces method.
Demo: bagging, pasting and oob.
Ensemble Learning with Scikit
Introduction to Machine Learning
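A minimal sketch of soft voting and of bagging with oob evaluation on the moons
dataset (hyperparameters are illustrative):

from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Soft voting: SVC needs probability=True so predict_proba() is available.
voting = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("svc", SVC(probability=True)),
                ("dt", DecisionTreeClassifier())],
    voting="soft")
voting.fit(X_train, y_train)
print("voting:", voting.score(X_test, y_test))

# Bagging with out-of-bag evaluation; bootstrap=False would give pasting instead.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                        oob_score=True, n_jobs=-1, random_state=42)
bag.fit(X_train, y_train)
print("oob:", bag.oob_score_, "test:", bag.score(X_test, y_test))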
A Random Forest is a meta-estimator model, an ensemble of Decision Trees.
Generally trained via the bagging method (or sometimes pasting), often max_samples
set to the size of the training set for each subtree.
Instead of building a BaggingClassifier + DecisionTreeClassifier, you can instead
use RandomForestClassifier class, which is more convenient and optimized for
Decision Trees (also, RandomForestRegressor). RandomForestClassifier has most
hyperparameters of a DecisionTreeClassifier + hyperparameters of a
BaggingClassifier to control the ensemble itself.
The Random Forest algorithm introduces extra randomness when growing trees -
instead of searching for the very best feature when splitting a node it searches
for the best feature among a random subset of features, usually achieving higher
bias, but lower variance.
See: https://www.youtube.com/watch?v=v6VJ2RO66Ag and https://www.youtube.com/watch?
v=J4Wdy0Wc_xQ
We choose the number of features for each sub-tree with the max_features {“auto”,
“sqrt”, “log2”} hyperparameter.
Demo: scikit RandomForestClassifier
Extremely Randomized Trees ensemble - extra-trees. Not only considers a subset of
features, but also randomizes the thresholds for tree splitting (the algorithm no
longer searches for the optimal values). The training is faster due to that and it
produces an even more diverse model. Available via ExtraTreesClassifier() in scikit.
Feature Importance - easy to obtain with trees and random forests. Scikit-Learn
measures a feature’s importance by looking at how much the tree nodes that use that
feature reduce impurity on average (across all trees in the forest), it’s computed
automatically and can be accessed via feature_importances_ variable. Would feature
importance be available using ExtraTreesClassifier()? You can try to find out!
Demo: feature importance determination.

Random Forests
Introduction to Machine Learning
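A minimal feature importance sketch on the iris dataset (illustrative
hyperparameters):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(iris.data, iris.target)

# Impurity-based importances, averaged across all trees (they sum to 1).
for name, score in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {score:.3f}")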
Boosting (hypothesis boosting) - an ensemble method that combines several weak
learners into a strong learner by training predictors sequentially, each trying to
correct its predecessor.
There are many boosting methods available, but by far the most popular are AdaBoost
(Adaptive Boosting) and Gradient Boosting.
img ref: https://towardsdatascience.com/what-is-boosting-in-machine-learning-2244aa196682
Boosting
Introduction to Machine Learning
AdaBoost is a forest of stumps with differing importance.
AdaBoost
Introduction to Machine Learning
Pays more attention to data that the predecessor underfitted, so new predictors
focus more and more on the hard cases.
The initial predictor is often a decision tree (or even a "stump" - a tree w/ 2
leaves).
The initial weight of each training instance is w(i) = 1 / m, m - size of the
training set
The nth predictor is trained and its error rate is computed
The weight of the predictor, based on how wrong it was, is computed
The weight of misclassified training instances for the (n+1)th predictor is
increased.
Subsequent predictors are then trained, predictions made on the training set,
updates for predictor importance and relative weights of misclassified training
instances are adjusted, and so it repeats.
Once all predictors are trained, the ensemble makes predictions very much like
bagging or pasting, except that predictors have different weights depending on
their overall accuracy on the weighted training set - so simply compute predictions
of all the predictors and weigh them using the predictor weights. The predicted
class is the one that receives the majority of weighted votes.
This technique cannot be parallelized, or only partially (needs clarification).
See: https://www.youtube.com/watch?v=LsK-xG1cLYA
AdaBoost
Introduction to Machine Learning
Scikit-Learn uses multiclass version of AdaBoost called SAMME (Stagewise Additive
Modeling using a Multiclass Exponential loss function).
If the predictors can estimate class probabilities (predict_proba()), Scikit-Learn
can use a variant of SAMME called SAMME.R (for “Real”), which relies on class
probabilities rather than predictions and generally performs better.
DEMO: AdaBoostClassifier(algorithm="SAMME.R", learning_rate=0.5) and
algorithm="SAMME", learning_rate=0.5
For a discussion about learning rate see next slides about Gradient Boosting.
AdaBoost
Introduction to Machine Learning
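A minimal AdaBoost sketch with 200 stumps, mirroring the demo above (note: the
algorithm parameter is deprecated in recent scikit-learn versions):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 stumps (max_depth=1), trained sequentially with re-weighted instances.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
                         algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))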
Gradient Boosting works by sequentially adding predictors to the ensemble, each one
correcting its predecessor, but instead of tweaking the instance weights at every
iteration, this method tries to fit the new predictor to the residual errors made
by the previous predictor.
When decision trees are used as base predictors they are called Gradient Boosted
Trees - a term that is commonly used. Ensemble’s predictions gradually get better
as trees are added to the ensemble during the GBM (gradient boosted models)
learning process.
The learning_rate hyperparameter scales the contribution of each tree. If you set
it to a low value, such as 0.1, you will need more trees in the ensemble to fit the
training set (especially if the set is complex), but the predictions will usually
generalize better, as research has indicated. This is a regularization technique
called shrinkage.
See: https://www.youtube.com/watch?v=jxuNLH5dXCs (classification) and
https://www.youtube.com/watch?v=3CC4N4z3GJc (regression).
Demo: Gradient Boosted Regression with chained DecisionTreeRegressors (manually
created GBRT), then GradientBoostingRegressor().
Problem with Gradient Boosting - it can easily overfit and you have a tradeoff
between learning rate and number of trees - with a small learning rate and large
number of trees it might not converge, and increasing the learning rate might then
cause overfitting. To deal with that we can use early stopping - a technique popular
in Deep Learning.
Early stopping is a technique used to reduce overfitting by prematurely stopping
the training process when an optimal solution is reached. Using scikit you can
implement early stopping by utilizing (note: in deep learning this will actually
stop the training process):
staged_predict() method: returns an iterator over predictions made by the ensemble
at each stage of training (with one tree, two, etc.)
warm_start=True, which allows incremental (online) learning. This approach does not
require training a large number of predictors and then checking what the optimal
number is, but rather stops when the error / accuracy does not improve for several
training cycles.
Demo: early stopping. Code can be found here: https://github.com/ageron/handson-
ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb
GradientBoostingRegressor class supports a subsample hyperparameter, which
specifies the fraction of training instances to be used for training each tree. If
subsample=0.25 - each tree is trained on 25% of the training instances, selected
randomly. The technique trades a higher bias for a lower variance. It also speeds
up training considerably. This is called Stochastic Gradient Boosting (see:
https://jerryfriedman.su.domains/ftp/stobst.pdf )
Optimized implementation of Gradient Boosting - XGBoost (Extreme Gradient
Boosting). Offers: automatic early stopping, great performance and scalability.
Demo: XGBoost (can be found in the same resource). Other libraries:
https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db
There are even more implementations: https://www.youtube.com/watch?v=yO6gJM_t1Bw
comparison: https://www.kaggle.com/faressayah/xgboost-vs...
Student question: can we gradient boost SVM? Yes: https://www.quora.com/Can-I-use-
boosting-algorithms-with-SVM-as-weak-learner and
https://link.springer.com/chapter/10.1007/978-3-642-02326-2_51
Gradient Boost
Introduction to Machine Learning
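A minimal sketch of the manually chained GBRT demo mentioned above, assuming toy
quadratic data - each tree fits the residuals of the previous ones:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: noisy quadratic.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

# Each subsequent tree is fit to the residual errors of the current ensemble.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
y2 = y - tree1.predict(X)                      # residuals of tree 1
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, y2)
y3 = y2 - tree2.predict(X)                     # residuals of tree 1 + tree 2
tree3 = DecisionTreeRegressor(max_depth=2).fit(X, y3)

# The ensemble's prediction is the sum of the trees' predictions.
X_new = np.array([[0.2]])
print(sum(tree.predict(X_new) for tree in (tree1, tree2, tree3)))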
Great video confirming what we have been talking about:
https://www.youtube.com/watch?v=5CWwwtEM2TA (Can one do better than XGBoost? -
Mateusz Susik)
I wonder how scikit-learn would compare against these.
Gradient Boost
Introduction to Machine Learning
Stacking is an ensemble technique where a metalearner (or blender) is trained to
predict the correct value based on all outputs from the ensembled weak learners.
Searchable terms: stacked ensemble, stackable ensemble, blending machine learning.
The difference from voting or boosting is that the final decision is not based on
the majority, weighted majority or similar - it is learned!
To train the blender we use a holdout set. Good question: why is it trained on a
holdout dataset, and does this affect important properties of ML models (e.g., is
it more susceptible to bad test-train splits when the statistical properties of the
test dataset differ from those of the train dataset)?
Sometimes several blenders are trained producing layered blenders. Process: split
the training set into three subsets: the first one is used to train the first
layer, the second one is used to create the training set used to train the second
layer (using predictions made by the predictors of the first layer), and the third
one is used to create the training set to train the third layer (using predictions
made by the predictors of the second layer). Once this is done, we can make a
prediction for a new instance by going through each layer sequentially.
Scikit now supports (~2020) stacking using sklearn.ensemble.StackingClassifier.
Also supported are multilayered stacks of blenders. See:
https://scikit-learn.org/stable/modules/ensemble.html#stacking
Ref: https://www.analyticsvidhya.com/blog/2021/08/ensemble-stacking-for-machine-
learning-and-deep-learning/ and
https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-
models-win-data-science-competitions/
It is also worth noting that neural network models can also be stacked and
different topologies created.
Stacking
Introduction to Machine Learning
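A minimal StackingClassifier sketch (the choice of estimators is illustrative):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The final_estimator (blender) is trained on the weak learners' predictions
# via internal cross-validation, rather than a simple voting rule.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))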

Stacking
Introduction to Machine Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Introduction to Machine Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 14JQWoj7nxYBbix7kmgujN0eOx8_PFu9X.pptx ---


Artificial Intelligence
Recommender Systems
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Dot product distance
01
02
Cosine similarity
Recommender Systems
00
Similarity measurement
User-feature vector
03
04
MovieLens dataset
Summary
07
08
Further explorations
06
When building a real system
05
Content-based Movie Recommender
In content based recommendation systems we want to recommend new items based on the
properties of the items that the user bought/installed previously, essentially
recommending "similar" items. But what does it mean for two items to be similar? Is
an anime movie about world war II closer to a documentary about that same subject
or another anime movie about vampires? We need some similarity metric.
What will the metric calculate and in what space? The modern approach is to use
trained embeddings as a uniform Euclidean feature space, or we can use predefined
features as vectors. In this feature space we can define several metrics: dot
product, cosine similarity.
Similarity measurement
Recommender Systems
One distance calculation metric
Dot product distance
Recommender Systems
Another distance calculation metric
Cosine similarity
Recommender Systems
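A minimal NumPy sketch of both metrics on hypothetical tag vectors:

import numpy as np

def dot_similarity(a, b):
    return np.dot(a, b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical tag-weight vectors for two movies.
movie_a = np.array([1.0, 0.0, 2.0, 0.5])
movie_b = np.array([0.8, 0.1, 1.5, 0.0])

print(dot_similarity(movie_a, movie_b))     # sensitive to vector length
print(cosine_similarity(movie_a, movie_b))  # in [-1, 1], length-invariant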
Ref: https://en.wikipedia.org/wiki/MovieLens
About the datasets: https://en.wikipedia.org/wiki/MovieLens#Datasets
Other datasets: https://github.com/caserec/Datasets-for-Recommender-Systems
MovieLens dataset
Recommender Systems
Take all the movies + tags that the MovieLens dataset has
Encode tags as a numeric vector
Merge the dataframes appropriately
Calculate TF-IDF for tags
we need a consistent-width encoding scheme so we can compare the movies across N
dimensions
we could use OHE or CountVectorization, but TF-IDF will give us the benefit of
down-weighting the unimportant tags
since a tag can not be added to a movie more than once, tag importance depends only
on how frequent the tag is across the other movies - the more popular the tag, the
less specific it is, hence the less weight it gets (only IDF matters)
you could obviously try other schemes to see which one is best - TF-IDF carries
more information than OHE or CountVectorization, but less than embeddings or
embeddings + positional encoding (positional encoding is not needed here)
Calculate movie2movie similarity based on cosine distance (try dot prod.)
Test the similarity calculation using the assumption that movies in the same series
should be the most similar (Terminator, Toy Story, Avengers)
Then personalize the recommendations by adjusting the tag weights according to the
user's preferences.
Content-based Movie Recommender
Recommender Systems
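A minimal sketch of the TF-IDF and cosine-similarity steps on hypothetical stand-in
data (the real MovieLens tag table would be loaded from its CSV files):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the MovieLens tag data: one tag string per movie.
movies = pd.DataFrame({
    "title": ["Terminator", "Terminator 2", "Toy Story"],
    "tags": ["robot time-travel action",
             "robot time-travel sequel",
             "animation toys pixar"],
})

tfidf = TfidfVectorizer()
tag_matrix = tfidf.fit_transform(movies["tags"])   # movie-2-tag matrix

sim = cosine_similarity(tag_matrix)                # movie-2-movie matrix
print(pd.DataFrame(sim, index=movies["title"], columns=movies["title"]))
# Sanity check: Terminator should be most similar to Terminator 2.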
So we took some features - those were obtained by feature engineering techniques,
maybe accumulated over many years - turned them into numbers and constructed
the item-similarity matrix. We already had a recommender system by that point, but
maybe not a very precise one - not a personalized one!
When the user visits a page for a movie, we could recommend similar movies at the
bottom of the page. But we went further and described a category of recommendation
systems - personalized recommendation systems. So we have a "similar items"
recommender already.
We personalized the recommendations by scaling the features by user preferences.
When is this useful? Well, similar movies can be recommended on a singular
movie page. But on the title page, it is very common to display a "Recommended for
you" section. This section needs personalized recommendation.
Content-based Movie Recommender
Recommender Systems
We saw how to generate predictions; what to do next?
One would take the calculated user profiles for all users and save those into the
database (could it be a vector DB?)
Take the TF-IDF encoded values and put them into the database for each movie.
When the user likes a new movie, recalculate the user profile vector.
When the user logs into the application show different sections on the page:
top recommendations based on popularity
when the user accesses a movie page show a "similar items" section
show a "recommended for you" section (somewhere else maybe)
How to test if your recommendations are effective?
A/B testing. For some users we show naive recommendations (random) and for others
we show the content based ones. We could use 3 groups: control group, random
recommendations, CB. We can measure: clicks or items sold, time on site, time on
each item page that was recommended, time spent on site (a non-specific
measurement).
Risks: need to control for performance - the pages need to be rendered in approx.
the same time; control for sample sizes - 50%/50%.
Incorrect test: showing recommendations vs. showing nothing - this only tests if
the recommendations are effective, not which recommendation engine is more
effective.
Which one should come first - determining which recommendation system is the most
effective or whether we need a recommendation system at all? I would argue that
first we need to know which system is best and then whether it's needed at all.
A/B testing different versions (iterative improvements) to see which one is more
effective.
This topic touches on experimental design (statistics), hypothesis testing
(statistics), p-values, margin of error.
When building a real system
Recommender Systems
We implemented a content based recommender
We saw how TF-IDF can be useful for encoding categorical features (movie tags)
because it penalizes commonly used tags
We applied cosine distance to obtain a movie-2-movie (item similarity) matrix from
the movie-2-tag (item-2-feature) matrix to get a recommender for the "similar
items" section
We calculated the coefficients to scale the tag importance by the user's evaluation
of a specific movie and then used that to get the user-tag (user-feature) vector
And then we used cosine similarity to get the highest scoring movies (user-movie
vector) that were our recommendations for the "recommended for you" section
(personalized).
Summary
Recommender Systems
Research more datasets that could be used for recommender systems (we have the
MovieLens dataset and the Goodbooks dataset we will see in the next lecture)
Kaggle recommender datasets?
Test how different similarity metrics will perform for the content based
recommender w/ 13Kx1.1K dims?
KNN based recommender: https://towardsdatascience.com/a-marijuana-recommendation-
system-using-tf-idf-and-k-nn-bd472242cb3f
Decision Tree based recommender: https://medium.com/@jwu2/building-a-collaborative-
filtering-model-with-decision-trees-56256b95cb03
Further explorations
Recommender Systems
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Recommender Systems
Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 12hK4NTrpkKtNaKkSGxGnoVxouf1DvGBe.pptx ---


Artificial Intelligence
Introduction to Machine Learning
2024
Lecturer
Mindaugas Bernatavičius

Today you will learn


Dimensionality Reduction
01
02
Principal Component Analysis
Introduction to Machine Learning
00
Unsupervised Learning
Other techniques
03
04
K-Means Clustering
05
06
07
Clustering
Semi-supervised learning
Practical Project 4
Unsupervised Learning
Yann LeCun - “if intelligence was a cake, unsupervised learning would be the cake,
supervised learning would be the icing on the cake, and reinforcement learning
would be the cherry on the cake.”
Somewhat paradoxical: supervised learning currently achieves much more recognition
and is more popular, but most datasets are unlabeled. For example, identifying a
defective product from its picture. Wouldn't it be great if we did not need labeled
data? That is precisely what unsupervised learning is meant to provide.
There are subtypes of tasks that we can do in an unsupervised manner:
Dimensionality Reduction - reduce the feature space making supervised algorithms
faster (and in rare cases more accurate).
Clustering - grouping similar items.
Anomaly detection - differentiate normal and abnormal.
Density Estimation - estimate the probability density function of a process that
generated some data. Ref:
https://scikit-learn.org/stable/modules/density.html#kernel-density-estimation
Association rule mining/learning - TBD
So let’s have some cake, starting from dimensionality reduction.
Image reference:
https://pythonnumericalmethods.berkeley.edu/notebooks/chapter25.01-Concept-of-
Machine-Learning.html
Introduction to Machine Learning
Dimensionality Reduction
Many Machine Learning problems involve thousands features for each training
instance (for example image data). They make training slow, and can also make it
much harder to find a good solution - the curse of dimensionality.
Most pixels in images are useless (like in the MNIST dataset, 28x28=784), some
features in tabular data might also be useless.
Reducing dimensionality comes at a loss of information, so in general it is only
useful when you need it and it's not to be used lightly - it makes the ML pipeline
more complex, less maintainable, and creates an additional tuning step. In some
cases it might make the model more accurate, but in general it will not, therefore
you can consider dimensionality reduction as a technique for performance improvement
[?].
Another very important usage of dimensionality reduction techniques is data
visualization. Reducing the number of dimensions down to 2-3 makes it possible to
plot a condensed view of a high-dimensional training set on a graph and often gain
some important insights by visually detecting patterns, such as clusters -
essential to communicate your conclusions.
An interesting detail rarely mentioned in the literature - we can anonymize data
using PCA (credit card transaction data for a fraud classifier, although this is
limited as it is possible to recover a large part of the original data), and
dimensionality reduction, in so much as it decreases the size of data, can be
considered a pseudo data compression mechanism.
Commonly mentioned mathematical example: if you pick two points randomly in a unit
square (1x1 length), the distance between these two points will be, on average,
roughly 0.52. If you pick two random points in a unit 3D cube, the average distance
will be roughly 0.66. But what about two points picked randomly in a 1,000,000-
dimensional hypercube? The average distance, believe it or not, will be about
408.25. As a result, high-dimensional datasets are at risk of being very sparse:
most training instances are likely to be far away from each other. This also means
that a new instance will likely be far away from any training instance, making
predictions much less reliable than in lower dimensions. Theoretically, increasing
the training instance count would mitigate that, but with a feature space of 100 we
would need fantastically huge datasets, making this an impossible proposition.
Note: this is not the same as feature-importance-based feature selection /
elimination.
Introduction to Machine Learning
Dimensionality Reduction
In most datasets, training instances are not spread out uniformly across all
dimensions. Many features are almost constant, while others are highly correlated.
As a result, all training instances lie close to a much lower-dimensional subspace
of the high-dimensional space.
So having a 3D dataset, we can project the features onto a 2D plane, keeping the
axes that account for the most variation intact and removing the axis / dimension
that accounts for the lowest variation.
Or we can (try it yourself) think of an example of 2D data projected to 1D.
However projection is not always the optimal solution - like with the swiss roll
dataset. This is because the swiss roll is a manifold - a 2D shape that is twisted
in 3D space.
Image ref: https://www.cnblogs.com/yangxiaoling/p/10658376.html
This is a geometric interpretation of data.
Introduction to Machine Learning
Many dimensionality reduction algorithms work by modeling the manifold on which the
training instances lie - called Manifold Learning.
Relies on the manifold assumption - most real-world high-dimensional datasets lie
close to a much lower-dimensional manifold.
This assumption is very often empirically observed - for example, with the MNIST
dataset you have way fewer degrees of freedom than if you were to generate any
random greyscale image - all handwritten digits are centered, connected, and have
shapes that are similar or close - this gives the ability to squeeze the manifold
to lower dimensions w/o losing the important information.
Dimensionality Reduction
Introduction to Machine Learning
Most popular dimensionality reduction technique, by far.
PCA identifies the axis with the largest amount of variance in the training set and
uses it to transform the data (preserving maximum variance).
Selecting the axis that preserves the maximum amount of variance will likely lose
less information than other projections. Another way to phrase it: it is the axis
that minimizes the mean squared distance between the original dataset and its
projection onto that axis.
The axis that accounts for the largest variance is called principal component 1 -
PC1. PC2 is always perpendicular to PC1, and so on until PCn.
See: https://www.youtube.com/watch?v=HMOI_lkzW08
You can calculate the PCA by using SVD - singular value decomposition. NumPy’s
svd() function can obtain all the principal components of the training set. We can
then extract the two unit vectors that define the first two PCs. PCA assumes that
the dataset is centered around the origin, so subtracting the mean should be done.
Code example:

import numpy as np

# X is assumed to be an (n_samples, n_features) data matrix.
X_centered = X - X.mean(axis=0)        # PCA assumes data centered at the origin
U, s, Vt = np.linalg.svd(X_centered)   # singular value decomposition
c1 = Vt.T[:, 0]                        # unit vector defining the first PC
c2 = Vt.T[:, 1]                        # unit vector defining the second PC
See: https://www.youtube.com/watch?v=FgakZw6K1QQ : things to remember - the sum of
squared distances of the projected points from the origin along a principal
component is its eigenvalue, SVD is used for calculating the PCA plot, each PC is a
linear combination of the initial dimensions, and the data must be centered.
PCA allows us to reduce the dimensionality of any dataset down to any number of
dimensions (not just 1), while preserving as much variance as possible; a scree
plot can advise us when it is safe to do so.
Principal Component Analysis
Introduction to Machine Learning
Research question: if data varied across x axis less than across PC1 does it
necessarily mean that it will vary less across y axis than PC2?
Principal Component Analysis
Introduction to Machine Learning
Numerous libraries can perform PCA; you can even code it by hand (though you would
want to understand the math behind it first).
We are going to use scikit, as it has the PCA transformer: https://scikit-
learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Note, scikit automatically centers but does not scale the data when performing PCA.
It is very important to scale your features if they are originally on different
scales (if you want to find out a person's life expectancy, and have yearly salary
in 10's of thousands but weight in 10's, the salary component will
disproportionately contribute to PC1 if not scaled).
The components_ attribute holds the transpose of the vectors defining the principal
component, so to get the vectors we use: pca.components_.T[:, 0]
Also running PCA multiple times on slightly different datasets may result in
different results. In general the only difference is some axes may be flipped - so
you will see the same values, but a minus sign before the numeric value.
Demo: PCA with scikit
Another useful piece of information is the explained variance ratio of each
principal component, available via the explained_variance_ratio_ variable. The
ratio indicates the proportion of the dataset’s variance that lies along each
principal component. A result of array([0.84248607, 0.14631839]) indicates that
84.2% of the original dataset's variance lies along PC1 and 14.6% along PC2,
which leaves around 1% for the other principal components. If you have 3D data and
request all 3 PCs, the ratios should add up to 100%; here they do not, because we
are reducing the number of dimensions to two.
Demo: explained variance ratio with scikit.
About PCA in data pipeline:
- if you are using PCA for model performance improvement you should
standardize/scale the dataset before performing PCA.
- if you are doing it for visualization (for example, for clustering) - it might be
even better not to standardize, but you have to check and see.
- feature engineering/feature elimination then dimensionality reduction.
- PCA requires the full data (sample PCA can give different results than population
PCA)
Principal Component Analysis
Introduction to Machine Learning
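A minimal sketch of the two demos above, using the iris dataset as a stand-in:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 features

pca = PCA(n_components=2)             # scikit centers the data automatically
X2d = pca.fit_transform(X)

print(pca.components_.T[:, 0])        # unit vector of PC1
print(pca.explained_variance_ratio_)  # roughly [0.92, 0.05] for unscaled iris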
As PCA is capable of reducing any number of dimensions, we have to choose. How to
choose the right number of dimensions? Two rules:
If you are reducing dimensionality for visualization - 2 or 3 dimensions are the
only choices.
If you are reducing dimensionality for improved training performance - choose the
number of dimensions that would account for at least 95% of the variance if
possible (the more, the better). This is just a rule of thumb, so you will have to
decide.
There are a few ways of achieving this:
checking when np.cumsum(pca.explained_variance_ratio_) becomes too low,
plotting the explained variance against the number of dimensions using this
methodology,
the better way: PCA(n_components=0.95) - set n_components between 0.0 and 1.0,
indicating the ratio of variance to preserve.
Demo: choosing dimensionality.
PCA for data compression
After dimensionality reduction - training set takes less space. MNIST, preserving
95% variance with only 20% original size!
This is a reasonable compression ratio, and you can see how this size reduction can
speed up a classification algorithm.
It is possible to reverse PCA if you know the component count of the original data.
You will not get the ideal original back: 1 - explained_variance_ratio_ will be
lost. The information that is lost is called the reconstruction error. Use the
inverse_transform() method.
Demo: PCA for data compression
PCA as preprocessing step - remember: you will need to transform all the X-data -
all the datapoints you are trying to predict. Call pca.fit() or pca.fit_transform()
on X_train. Then call only transform() on X_test as you probably don’t want PCA to
shift to the projection based on X_test when the model is trained.
If you plan to use the PCA weights for a long time, you will need to export them
and load them during application startup (same applies to Scalers). Simplest way to
do that is to serialize the PCA object (and Scalers), see: https://scikit-
learn.org/stable/modules/model_persistence.html
Demo: PCA for visualization - good starting point:
https://www.geeksforgeeks.org/implementing-pca-in-python-with-scikit-learn/
Demo: PCA as a preprocessing step in “real” problem (see: and / or this:
https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/ )
Principal Component Analysis
Introduction to Machine Learning
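A minimal compression sketch, assuming MNIST is fetched via OpenML (it downloads on
first run, so this takes a while):

from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X = fetch_openml("mnist_784", version=1, as_frame=False).data

# Keep the number of components that preserves 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])             # ~150 components instead of 784

# Decompress back to 784 dimensions; the difference from the original
# is the reconstruction error.
X_recovered = pca.inverse_transform(X_reduced)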
Many are available, some in scikit. Some popular ones:
Randomized PCA: svd_solver="randomized", used for performance improvements, an
approximation technique of the full SVD based solution.
Incremental PCA: batched PCA that can determine principal components w/o loading
all of the data into memory (out of core).
Kernel PCA (kPCA) - same kernel trick (as discussed with SVMs). Remember a problem
with kernel SVMs - how to choose the kernel? Same problem here and there is no
error function to guide the decision - a common approach to choose the best kernel
is to add kernel PCA into a 2-step pipeline and perform GridSearch on kernel
parameters observing which ones give us the best supervised learning results.
Locally Linear Embedding (LLE) - a powerful nonlinear dimensionality reduction
technique that doesn't rely on projections. Works by first measuring how each
training instance linearly relates to its closest neighbors, then looking for a
low-dimensional representation of the training set where these local relationships
are best preserved. Good at unrolling twisted manifolds when there is not too much
noise. Available: sklearn.manifold.LocallyLinearEmbedding
Random Projections - might sound unbelievable but random projection can be
effective. Available: sklearn.random_projection.
Multidimensional Scaling (MDS) - a form of non-linear dimensionality reduction.
Reduces dimensionality while trying to preserve the distances between the
instances.
Isomap - creates a nearest neighbor based graph then reduces dimensionality
preserving geodesic distance.
t-Distributed Stochastic Neighbor Embedding (t-SNE) - Reduces dimensionality while
trying to keep similar instances close and dissimilar instances apart (not so for
PCA). Mostly used for visualization, as we will do in future parts of this course.
Available: sklearn.manifold.TSNE.
Linear Discriminant Analysis (LDA) - classification algorithm. During training
learns most discriminative axes between classes that can be used to define a
hyperplane for projection. Advantage - it will keep classes as far apart as
possible (not so for PCA). Good to use before running another classification
algorithm.
Learned Embeddings - deep learning magic, that we will demystify in future
lectures.
As you can see this is a big topic. PCA and t-SNE we saw and will see again. The
next one to get more familiar with would be LDA and only then LLE, although knowing
PCA and t-SNE will get you a long way.
Other techniques
Introduction to Machine Learning
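A minimal t-SNE visualization sketch on the scikit digits dataset:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()
# Reduce 64-dimensional digit images to 2D, keeping similar instances close.
X2d = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

plt.scatter(X2d[:, 0], X2d[:, 1], c=digits.target, cmap="tab10", s=5)
plt.colorbar()
plt.show()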
Clustering (cluster analysis) - the task of grouping observations so that members
that are more similar are assigned to the same group. As a substitute for an
everyday metaphor, let's state the principle of clustering in a simplified manner:
"you don't need an expert (highly trained ML model) to tell that two things are
similar - you might not know the breed of a dog, but you know most dogs when you
see them and can say that they are different from most if not all cats".
Like in classification, each instance gets assigned to a group, but unlike
classification, clustering is an unsupervised task, harder than classification. You
might see that in the side image - it is easy to distinguish iris setosa, but how
about that large cloud at the top? How many clusters do you see? You will be
surprised that advanced clustering algorithms make very few mistakes when
clustering this dataset. What helps them? More dimensions/features!
No universal definition of cluster - different algorithms capture different kinds
of clusters:
Some algorithms look for instances centered around a particular point, called a
centroid.
Others, for continuous regions of densely packed instances (of any shape).
Some algorithms are hierarchical, looking for clusters of clusters.
… and there are more. Clustering is a big topic, a good candidate for further
exploration.
Clustering
Introduction to Machine Learning
Data mining - the activity of applying machine learning models to big data and
extracting information from the data this way.
Clustering
Introduction to Machine Learning
Applications of clustering:
When exploring datasets - EDA. Run a clustering algorithm when analyzing what you
have and then exploring clusters separately.
Social network analysis - cluster analysis to identify communities, suggest missing
connections between people.
Biology - find groups of genes with similar expression patterns (you have
expression patterns as data, find clusters).
Recommendation systems - identify products or media that might appeal to a user by
clustering the items (same cluster - similar items.) Obviously this would be just
one way to do it out of many.
Marketing - find segments of consumers by their purchasing or browsing behavior,
analyze how they respond to marketing campaigns and targeted discounting campaigns.
Even dimensionality reduction - for some applications we can replace the original
data of dimension N with the cluster affinity metrics (how well the instance fits
its cluster), which is K dimensional and usually K << N. Example: only use the
cluster centers when classifying new instances instead of the entire dataset!
Anomaly detection - low affinity score (distance from the center is high) can
indicate outlier. Same for outlier elimination.
Semi-supervised learning - if we have few labels we can propagate same labels to
all instances of the cluster.
Search engines - cluster and return similar items by cluster (similar news articles
in something like google news).
To segment an image by clustering pixels according to their color then replacing
the color with the mean color of the cluster. You can read more about it here:
https://www.kdnuggets.com/2019/08/introduction-image-segmentation-k-means-
clustering.html

Clustering
Introduction to Machine Learning
K-Means algorithm - simple, quite powerful and pretty fast. Proposed by Stuart
Lloyd at Bell Labs in 1957, published outside of the company in 1982. In 1965,
Edward W. Forgy had published virtually the same algorithm, so K-Means is sometimes
referred to as Lloyd–Forgy / Lloyd, also naive k-means.
How does it work?
Pick a number K (problematic, as we will discuss shortly)
K centroids initially placed randomly (initialization is problematic and will be
discussed more later)
The random placement is chosen by taking k instances as the centers.
Calculate and update the centroids to minimize the sum of distances from each
datapoint.
Repeat the process until the centroids stop moving (the sum of distances is
minimal).
The algorithm is guaranteed to converge (stop oscillating), but is not guaranteed
to converge optimally.
See: https://www.youtube.com/watch?v=4b5d3muPQmA and https://www.youtube.com/watch?
v=yR7k19YBqiw
Advantages:
Easy to understand
Easy to interpret and verify the results (if dimensionality is low)
Although possible to encounter slow convergence, usually one of the fastest
clustering algorithms (if you encounter slow convergence tune the initialization of
centroids).
Disadvantages:
Sensitive to the initial conditions (initial centroids; even data permutations can
give different results). Solutions are not stable.
Needs tuning for K
Sensitive to clusters w/ varying sizes, different densities, or nonspherical shapes
K-Means Clustering
Introduction to Machine Learning
Scikit and KMC
We can use the make_blobs() function to generate simple cluster data for
experimentation.
Demo: Generating data
Scikit offers a k-means clustering algorithm - sklearn.cluster.KMeans - where you
specify the number of clusters k the algorithm must find. In some cases it is
obvious from looking at the data what K should be, but in many cases it is not; we
will discuss how to choose it soon.
Demo: Simple KMC
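A minimal sketch of the two demos above, with blob parameters chosen purely for
illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# generate 300 points around 4 centers (assumed values - tweak freely)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)     # hard cluster assignment for each instance
print(kmeans.cluster_centers_)     # the learned centroids
print(kmeans.inertia_)             # sum of squared distances to closest centroid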
If you see that instances are incorrectly assigned you can try soft clustering -
which gives each instance a score per cluster; the score can be the distance
between the instance and the centroid (the transform() method measures the
Euclidean distance from each instance to every centroid). Soft clustering can be
used as a dimensionality reduction technique, although that is an advanced topic.
This is an alternative to hard clustering - where the algorithm assigns a single
cluster - which we discussed above.
Demo: Hard Clustering vs Soft Clustering
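A short sketch of the contrast, reusing the fitted kmeans from the sketch above;
X_new is an assumed name for any new data points:

labels_hard = kmeans.predict(X_new)   # hard clustering: one cluster id per instance
distances = kmeans.transform(X_new)   # soft clustering: Euclidean distance to every
                                      # centroid, shape (n_instances, k)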
K-Means Clustering
Introduction to Machine Learning
Centroid initialization methods
K-Means++ - default in scikit. This is an improvement to k-means with a smarter
initialization step that tends to select centroids that are distant from one
another, and this improvement makes the K-Means algorithm much less likely to
converge to a suboptimal solution (init='k-means++').
Run KMC with random initialization multiple times, keep the best solution (use the
init="random" parameter). Which solution is best? kmeans.inertia_ is the
performance metric used by scikit. The algorithm is run n_init times (10 by
default, each with different centroid seeds) and the best score is kept.
kmeans.score(X) returns negative inertia because of the "greater is better" rule
in scikit for scoring methods, like score.
Manual initialization - you can pass approximate initialization for the centroids
by visual inspection:

kmeans = KMeans(n_clusters=5,
                init=np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]]),
                n_init=1)

Demo: centroid initialization. Let’s observe how random initialization can produce
suboptimal solutions.
Choosing K:
Choosing K is not so simple, as inertia is not a good performance metric when
trying to choose k: it keeps getting lower as we increase k (the more clusters
there are, the closer each instance will be to its closest centroid, lowering the
inertia). It can still be plotted against k and the "elbow" of the curve used as a
rough heuristic, but that will not necessarily give the optimal solution.
More precise approach (more computationally expensive) is to use the silhouette
score (sklearn.metrics.silhouette_score()), which is the mean silhouette
coefficient over all the instances. An instance’s silhouette coefficient is equal
to (b – a) / max(a, b), where a is the mean distance to the other instances in the
same cluster (i.e., the mean intra-cluster distance) and b is the mean nearest-
cluster distance (i.e., the mean distance to the instances of the next closest
cluster, defined as the one that minimizes b, excluding the instance’s own
cluster). Silhouette coefficient can vary between –1 and +1:
close to +1 means that the instance is well inside its own cluster and far from
other clusters,
close to 0 means that it is close to a cluster boundary,
close to –1 means that the instance may have been assigned to the wrong cluster.
Demo: Choosing K
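A minimal sketch of choosing K with the silhouette score; the candidate range 2..8
is an arbitrary choice, and X is assumed to be the data from the earlier demos:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = silhouette_score(X, km.labels_)   # mean silhouette coefficient

best_k = max(scores, key=scores.get)              # K with the highest mean score
print(scores, best_k)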

K-Means Clustering
Introduction to Machine Learning
PCA can be performed before doing the clustering - this helps visualize the
clusters on high-dimensional data, since PCA allows us to reduce the
dimensionality. Note that feature scaling (normalization or standardization) is
usually applied before clustering, as well as before PCA. We do not want our
clusters to be determined just because the values are on a different scale; we want
them to be determined by geometric distance and closeness to each other, in a
situation where the scale of the features does not matter (one of many reasons is
that the absolute values are not important - the differences between values are:
earning 10K€ in 2025 does not mean the same thing that earning 10K€ would have
meant in 2018. This is an important, general point in data analysis - the values
themselves do not matter, the comparison does).

Original data → Scaling (standardization / normalization, if needed (often it is))
→ PCA transform → PC elimination → Clustering → … → Instance cluster membership as
a feature (feature engineering) → Test/train split → Modeling
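A minimal sketch of the scale → PCA → cluster portion of that flow; the component
count and cluster count are assumptions for illustration:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),               # scale first, so no feature dominates by scale
    PCA(n_components=2),            # keep 2 PCs (assumed), eliminate the rest
    KMeans(n_clusters=5, n_init=10, random_state=42),
)
cluster_ids = pipe.fit_predict(X)   # usable later as an engineered feature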
K-Means Clustering
Introduction to Machine Learning
The variations of data are numerous and so many clustering methods are developed to
address the numerous situations we can get into with data. It is advisable to test
multiple clustering algorithms when clustering is an important step in the ML
pipeline of the product.
Other methods
Introduction to Machine Learning
4th learning type (together with supervised, unsupervised, reinforcement).
Semi-supervised learning is an approach to machine learning that combines a small
amount of labeled data with a large amount of unlabeled data during training; it is
a special instance of weak supervision. Goal of SSL: increase the amount of labeled
data we have in a semi-automatic way by incorporating the most confidently
predicted labels into the supervised/labeled dataset and retraining (iterative
refinement).
In short: mostly label generalization / propagation.
What we need to know is the definition and some examples:
ML model is trained on a small labeled dataset, its predicted labels are applied to
the unlabeled data, then the model is retrained on the larger or complete dataset
(pseudo-labeling).
Supervised ML model trained on a small dataset and combined with another model
(KMC) to propagate learned labels across each cluster.
See more: https://machinelearningmastery.com/what-is-semi-supervised-learning/ ,
there are not that many good short videos on this, this might do for now:
https://www.youtube.com/watch?v=b-yhKUINb7o&
Do not confuse this with self-supervised learning.
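A minimal pseudo-labeling sketch using scikit's built-in self-training wrapper -
one possible implementation of the idea above, not the only one; the 90% of hidden
labels and the 0.8 confidence threshold are assumed values:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.9] = -1   # -1 marks "unlabeled" for scikit

# iteratively adds the most confidently predicted labels and refits
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
clf.fit(X, y_partial)
print(clf.score(X, y))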
Semi-supervised learning (SSL)
Introduction to Machine Learning
TBD
Anomaly Detection
Introduction to Machine Learning
This practical project - PP4, requirements:
Choose any dataset (can be found in kaggle, scikit inbuilt datasets, or find them
anywhere on the internet, or even use generated - any source is acceptable. The
only requirement for the generated datasets is that they must have noise >=0.3 in
them)
Choose a problem you want to solve. The problem must be a supervised learning
problem - regression or classification.
Use k-fold cross validation (CV) and/or GridSearch/RandomizedSearch to tune a
model/ensemble until you get the best possible result, and perform model selection.
Include proof of the process of tuning and model selection (for example two cells
where you tried some model parameters and another cell where you tried others).
Expect questions: why certain parameters or parameter ranges were chosen.
Visualize results - please be sure to visualize the final predictions: decision
boundary (classific.) or the regression line (regr.) need to be added.
Provide a scoring metric - choose as appropriate (rmse/mse/mae/sse/r2_score, f1
score/auROC and so on).
Provide a short paragraph (5-15 sentences) on why you think the parameters you
found were the best. Was it related to data distribution, shape, a clear decision
boundary, dimensionality, etc.?
Note - if you have participated in the class exercises you can just use the same
code, just include all the missing pieces to satisfy the requirements.
Provide a Google Colab link (or GitHub link with .ipynb file).
* Extra bonus: tune the model with and without PCA (eliminating a few dimensions) -
answer the question: was PCA useful for your specific situation to achieve higher
accuracy. Why?

** SMOTE experimentation using scientific paper or synthetic datasets.


Practical Project 4
Introduction to Machine Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Introduction to Machine Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 159OhBt7LXpNHJ0-QbhHzaLpInkSoRDx4.pptx ---


Artificial Intelligence
Sequential Data Analysis
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Structure of Part 9
01
02
Sequential data - definitions
Sequential Data Analysis
00
Intro
Identifying time series data
03
04
Tasks with time series data
Storing time series data
07
08
Generating time series data
06
Obtaining sequential data
Visualizing time series data
09
10
Preprocessing and EDA for time series data
05
Time series data patterns
We finished talking about computer vision tasks for now - but what a journey it
was!
In this part of the course we are going to talk about processing sequential data
and RNNs. We will define what both of those terms mean in great detail.
The problem we will focus on: time series data prediction in general and stock
price prediction in particular.
So in summary 3 main topics: sequential data, RNNs and stock price prediction.
Intro
Sequential Data Analysis
However we will see more material / problems than this. What follows is an outline
of this part of the course:
Talk about the properties of sequential data. We will define what it is and the
types of sequential data there are.
Identify common tasks performed on sequential data, like: generation,
visualization, trend identification, classification, denoising, smoothing, etc.
After identifying the common problems / tasks with the data we will see some non-
DL/non-ML classical solutions (e.g.: ARIMA models for forecasting). This will give
us the ability to eventually compare RNNs and better understand what we can do with
sequential data.
After seeing some classical techniques we will solve a forecasting problem with a
FCFFNN. This will allow us to compare RNNs with FCFFNNs later on.
Then we will be ready to introduce RNNs, talk about their types and properties,
situations where they should be used.
Then we will delve deeper into Simple RNNs, LSTMs, GRUs.
Finally we will talk about stock market prediction problem and it's solutions.
Structure of Part 9
Sequential Data Analysis
Sequential Data - data where the order of the data matters. For this general kind
of sequential data the time stamp is irrelevant. (Example: DNA / RNA sequence - as
you see, the concept of time is irrelevant, so the order is not temporal. NLP -
sequence of words). Order matters, but time scale does not. Earlier observations
provide information about the future (observations from the previous 10 days can
inform our decisions for tomorrow).
Temporal Sequence - in addition to the order of data, the time stamp also matters.
(Example: Data collected from customers' shopping behaviour, considering their
transaction time stamp as the temporal dimension). Not only ordering, but ordering
in time. Text data can be thought of as sequential non-temporal, or sequential
temporal if you have a way to measure time (audio).
Time Series (TS) - data is in order, the order is time-wise, but what makes it
different is the fixed time difference between occurrences of successive data
points (every x). Examples: the temperature of a surface recorded every 120
seconds, stock market price, ECG diagram. The characteristics that describe this
data are: a temporal dimension (some say not necessarily time) and a fixed interval
in that dimension. Most of the data in engineering and science is time series data.
Some people define any sequential data that is real-valued (as opposed to symbolic,
like a set of words in a sentence) as time series data. TS is arguably the most
common sequential type of data out there, and one of the most common ones in
general.
Sequential data - definitions
Sequential Data Analysis
Examples in a list form
Stock market price - time series data.
DNA sequence - sequential.
ECG - time series.
Video - a stream of pictures with constant rate (framerate) - ?
Human speech - does it depend on the task? For example, predicting the next word in
a sentence given a sentence vs. measuring the rate of words pronounced in a fastest
speaker competition. Pauses between words are important!
Music - we can speak about music in at least two different vocabularies: music
theory and physics. It is hard to talk about it without using one of them. And it
is also task dependent (predict the next note given a bunch of notes vs. generate
the next sound).
Sensor data (weather / pressure in a vessel) - ?
Coin flips recorded with the timestamp when the flip was performed - ? Does a coin
flip provide information about the future coin flips?

The data can be univariate or multivariate


Univariate - single variable plotted against time
Multivariate - multiple variables (codependent or not)
Sequential data - definitions
Sequential Data Analysis
This is data to explain name calling based on 2 variables - is it sequential / TS?
Identifying time series data
Sequential Data Analysis
How about this data?
Identifying time series data
Sequential Data Analysis
Task groups:
Forecasting: understanding the future based on the past, plan, prognosis.
“Retro” tasks: understanding the past (pattern finding, anomalies, filling missing
values, max changes, classification and so on).

Tasks - these are the tasks related to the actual business problem solving (as
opposed to other tasks like generating and visualizing TS data, which are related
to the data, not to the business problem):
Regression - predicting a continuous variable in a time series.
Classification - given a set of time series with class labels, can we train a model
to accurately predict the class of a new time series? Example: ECG cardiograms
classified by different issues of the heart.
Predicting the next word/letter.
Forecasting - financial asset prices in a temporal space, usually referring to
predicting multiple values, not one continuous variable. Forecasting can be
described as a prediction of the future based on the past.
Seasonality detection - detecting cyclical variations in time series data that can
then be used later for interpretation of the data.
Action modeling in sports - predict the next action in a sporting event like
soccer, football, tennis etc.
Speech synthesis - generating speech or text (can be thought of as prediction).
Music composition - https://konstilackner.github.io/LSTM-RNN-Melody-Composer-
Website/
Next image/frame generation - https://arxiv.org/pdf/1502.04623.pdf
Sentiment analysis - usually a classification task (this should be considered
sequential data only if we are not using BOW models)
Anomaly detection - detecting data points that violate the general pattern that the
data follows.
Tasks with time series data
Sequential Data Analysis
These are the patterns TS data can exhibit:
Uptrend / horizontal / steady / downtrend, positive/negative secular/upwards
trend. Example.
Regular variation - seasonality (weekly sales). Example.
Irregular variation - cyclical, peaks and troughs are not regular. Example.
Combination: positive secular trend with seasonality. Example.
Random - no identifiable variation (trends sometimes are not counted). Example.

Question - what is the difference between seasonal and cyclical: A seasonal pattern
exists when a series is influenced by seasonal factors (e.g., the quarter of the
year, the month, or day of the week). ... A cyclic pattern exists when data exhibit
rises and falls that are not of fixed period. An example you can remember -
short-term and long-term credit cycles - we "know" that these cycles exist, but we
don't know when each cycle will end/begin.
Time series data patterns
Sequential Data Analysis
Explanatory video: https://www.youtube.com/watch?v=rPrJ7sSbTqM
Explanatory SO: https://stats.stackexchange.com/a/234601/162267
VU slides:
http://web.vu.lt/mif/a.buteikis/wp-content/uploads/2019/02/Lecture_03.pdf
Time series data patterns
Sequential Data Analysis
Let’s practice to identify patterns
Time series data patterns
Sequential Data Analysis
Let’s practice to identify trends
Time series data patterns
Sequential Data Analysis
Let’s practice to identify trends
Time series data patterns
Sequential Data Analysis
When TS Analysis is not used
Values are constant.
Values are generated using functions (except when you don't know that).
In short - we are interested in real world data when performing analysis.
Time series data patterns
Sequential Data Analysis
Some raw datasets:
The UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php
The UEA and UCR Time Series Classification Repository:
http://www.timeseriesclassification.com/
Governmental:
https://www.ncdc.noaa.gov/cdo-web/datasets - publishes a variety of time series
data relating to temperatures and precipitation at granularities as fine as every
15 minutes for all weather stations across the country.
https://www.cdc.gov/flu/weekly/overview.htm - publishes weekly flu case counts
during the flu season.
https://fred.stlouisfed.org/ - economic time series data
https://www.comp-engine.org/# - “self organizing database of time-series data” has
more than 25,000 time series databases, almost 140 million individual data points:
“website allows you to upload time-series data and interactively visualize similar
data that have been measured by others”
https://cran.r-project.org/web/packages/Mcomp/index.html (R only)
Note: It can be difficult to learn model tuning on these datasets because they
present complicated problems. However, they are good for learning cleaning,
visualization, etc. For example, many economists spend their entire careers trying
to predict the unemployment rate in advance of its official publication, with only
limited success.
For some easier datasets we have these:
Python library and a company: https://www.quandl.com/, now nasdaq (see:
https://github.com/Nasdaq/data-link-python) - this enables faster experimentation
with the concepts.
Things like Google Trends and bitcoin / ethereum exchange rate monitoring tools are
also ways we interact with time series data. APIs are the usual choices. There are
APIs that give data in time intervals (for the entire time interval) and there are
APIs that give only current data (you will need to collect that data yourself over
a time period).
Table rows - not hard to generate; images - very hard to generate manually; time
series - not hard (for simple cases).
Obtaining sequential data
Sequential Data Analysis
Time series data can be stored in many ways, just to name a few:
File without a format at all.
File with a custom format (.lisp.7z)
File formatted to standard formats: csv (text), xls (binary), xlsx (binary).
APIs (SOAP, REST, GQL)
Different general databases (Mysql, Postgresql)
Specific databases for TS
Dedicated TSDB solutions: InfluxDB
Traditional relational databases like PostgreSQL can be used as time series
databases with plugins, like: https://www.timescale.com/
Other nosql options: cassandra (more often used for its multimaster clustering
capabilities), mongo
Storing TS data @scale is a complex topic (as anything @scale), if you need to do
it I would recommend starting with Postgresql + Timescale, researching influx (it
was worse than Timescale in 2016, we researched it in company I worked in) and
other nosql solutions only if you really, really need them.
Recommended reading: https://stackoverflow.com/questions/27002903/efficiently-
storing-time-series-data-mysql-or-flat-files-many-tables-or-files and
https://medium.com/@neslinesli93/how-to-efficiently-store-and-query-time-series-
data-90313ff0ec20
Storing time series data
Sequential Data Analysis
Generating data is an important skill to have:
https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-
data-scientists-915896c0c1ae
When you are training a model, testing a database you need simple data with
controlled noise injected into the data (or any other phenomenon carefully
crafted), simulate different patterns and so on.
Ways of generating the data:
By hand / hardcoded - simple list or dictionary
Numpy / pandas. For example: pd.date_range(start='1/1/2018', end='01/08/2018',
freq='H')
Utilities like TimeSeriesMaker: https://github.com/mbonvini/TimeSeriesMaker or
TimeSynth: https://github.com/TimeSynth/TimeSynth
… and more that can be found on the internet. Sometimes from research papers.
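A minimal sketch of generating a synthetic series with numpy/pandas - a trend plus
daily seasonality plus controlled noise; all the constants are arbitrary choices:

import numpy as np
import pandas as pd

idx = pd.date_range(start='1/1/2018', end='01/08/2018', freq='H')
t = np.arange(len(idx))
values = (0.05 * t                              # upward trend
          + 10 * np.sin(2 * np.pi * t / 24)     # 24-hour seasonality
          + np.random.normal(0, 2, len(idx)))   # injected noise
series = pd.Series(values, index=idx)
print(series.head())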
Generating time series data
Sequential Data Analysis
Visualization plays an important role in time series analysis and forecasting.
Plots of the raw sample data can provide valuable diagnostics to identify temporal
structures like trends, cycles, and seasonality that can influence the choice of
model. We will see 6 different plots:
Line Plots.
Histograms and Density Plots.
Box and Whisker Plots.
Heat Maps.
Lag Scatter Plots.
Autocorrelation Plots.
The three most closely associated with time series data are in bold above.
The focus is on univariate time series, but the techniques are just as applicable
to multivariate time series, when you have more than one observation at each time
step.
Visualizing time series data
Sequential Data Analysis
Box and Whisker Plots - plot draws a box around the 25th and 75th percentiles of
the data that captures the middle 50% of observations.
IQR - interquartile range.

In the context of stock pricing, similar-looking charts mean a different thing: the
candlestick chart.
Visualizing time series data
Sequential Data Analysis
Box and whisker plots do indicate skewness.
Visualizing time series data
Sequential Data Analysis
Lag Scatter Plots - a datapoint at time t has a particular value, and it depends on
the point at t-1. The points prior to a particular point are called lags.
We can plot the values at t and t+1 on the x and y axes for each graph point - a
lag plot.
Pandas has lag plotting functionality: lag_plot(series, lag=n)
When trying to understand lag plots, we can imagine a sliding window of width equal
to the lag size moving over the time series; call the window covering times t and
t+1 w(t, t+1). If, going from w(1, 2) to w(2, 3), both values (y at t and y at t+1)
are increasing, then we have a positive correlation, and the lag plot will have an
increasing slope;
if the values are decreasing going from w(1, 2) to w(2, 3), then we have a negative
correlation and the lag plot will have a negative slope (see above).
If the time interval is 1 year, then the biggest correlation will be at lag=1 or
lag=365, the largest negative correlation at lag=180 (approx). Closest to 0 at 90
(not tested).
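A quick sketch of the pandas lag plot mentioned above, assuming the synthetic
hourly series generated earlier:

import matplotlib.pyplot as plt
from pandas.plotting import lag_plot

lag_plot(series, lag=1)    # scatter of y(t) on the x axis vs y(t+1) on the y axis
plt.show()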
Visualizing time series data
Sequential Data Analysis
Autocorrelation Plots - two variables are said to be positively correlated if when
one increases the other increases as well, and negatively correlated if when one
increases the other decreases. The magnitude of the correlation is at most -1 / 1.
Correlation values, called correlation coefficients, can be calculated for each
observation and different lag values. Once calculated, a plot can be created to
help better understand how this relationship changes over the lag. This type of
plot is called an autocorrelation plot or autocorrelation function plot (ACP or ACF
plot) and Pandas provides this capability built in, called the
autocorrelation_plot() function.
Importantly - autocorrelation plots (and the pACF plots which we will see later)
answer the question: how much data do I need to collect until I have captured all
of the behaviors of the system (well, at least the predictable behaviors)? Think of
a sprinkler w/ a rotation speed of 1 min per rotation. If it has constant water
flow and pressure, after 1 minute there will be no additional information we can
capture.
Partial autocorrelation plots - is a summary of the relationship between an
observation in a time series with observations at prior time steps with the
relationships of intervening observations removed. The partial autocorrelation at
lag k is the correlation that results after removing the effect of any correlations
due to the terms at shorter lags. PACF only describes the direct relationship
between an observation and its lag.
Another important thing why PACF and ACF's are important - we will be able to
choose AR, MA or ARMA models based on PACF and ACF distributions.
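A minimal sketch of both plot types: the first call is the pandas built-in
mentioned above; the ACF/PACF pair uses statsmodels, one common choice, assuming it
is installed:

import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

autocorrelation_plot(series)    # pandas built-in ACF-style plot
plot_acf(series, lags=48)       # correlation at each lag, up to 48 (assumed)
plot_pacf(series, lags=48)      # only the direct relationship at each lag
plt.show()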
Visualizing time series data
Sequential Data Analysis
As with any data analysis task, cleaning and properly processing data is often the
most important step of a data pipeline. Fancy techniques can’t fix messy data
(GIGO). Most data analysts will need to find, align, scrub, and smooth their own
data either to learn time series analysis or to do meaningful work in their
organizations. As you prepare data, you’ll need to do a variety of tasks, from
joining disparate columns to resampling irregular or missing data to aligning time
series with different time axes. We will start with trend detection.
Some techniques:
Smoothing (Hodrick–Prescott) and filtering
Downsampling (common task in time series preprocessing)
Enrichment (adding context e.g. adding GPS coordinates from Wi-Fi positioning
system)
Rolling window statistics
Missing / corrupt data handling: deletion (you can't delete just one point - it
would break the fixed interval!), imputation, interpolation
Missing / corrupt data handling with forecasting
Stationarity: https://www.youtube.com/watch?v=oY-j2Wof51c - stationary time series
is one that has fairly stable statistical properties over time, particularly with
respect to mean and variance. Synonymous with “stability”.
STL - Seasonal and Trend decomposition using Loess (next lecture)
Preprocessing and EDA for time series data
Sequential Data Analysis
Explore additional time series / sequential data preprocessing techniques, like
smoothing techniques.
Additional time series visualization techniques, like time series heatmaps,
calendar heatmaps, see:
http://www.columbia.edu/~sg3637/blog/Time_Series_Heatmaps.html
Further explorations
Sequential Data Analysis
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Sequential Data Analysis


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1nl6lwf0T7mDt3L6Wr8i6uS7m3wrEi39u.pptx ---


Artificial Intelligence
Introduction to Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Definition of Deep Learning
01
02
Definition of Neural Network
Introduction to Deep Learning
00
Introduction to part 5
Artificial Neuron Model
03
04
Functions
05
06
07
Scalars, vectors, matrices & tensors
BSF, Linear
08
Non Linear Functions, Sigmoid, TanH
Creating Perceptron from numpy-scratch
09
Creating Perceptron with scikit
In this lecture series we will build a foundation for what is to come next
according to the curriculum.
We will cover all the concepts and definitions that will be used in the rest of the
course.
We will also use the foundation we have built in previous parts - we know many
datasets, regression and classification tasks, evaluation metrics. In other words
we will be able to compare deep learning approach to more “classical” ML
approaches.
Solve some practical challenges. With these we will try not to go too deep into the
topics that will be covered in depth further on (Image processing, NLP and so on).
We will introduce tools we will use in future sections: Tensorflow, Keras, Pytorch,
Fast.ai.

The basic structure will be this:


A more static look at Deep Learning - we will focus on the structures or the
components of deep learning. We’ll answer the questions like: what are the
component parts of deep learning. How do the parts fit into a whole? These
structures include: Neurons / perceptrons, Functions (activation and loss),
Hyperparameters.
Then we will get more "dynamic" - we will cover the processes, like learning,
tuning, backpropagation and forward propagation.
Finally, in this overview part we will talk about types of neural networks, their
applicability, and cloud offerings that can make product creation and prototyping
faster.
Introduction to part 5
Introduction to Deep Learning
What is deep learning?
Deep learning is a form of artificial intelligence that uses a type of machine
learning algorithm(s) called an artificial neural network(s) with multiple hidden
layers which learns hierarchical representations of the underlying data in order to
make predictions given new data (progressive generalization).
If you were to imagine all of machine learning as one country, deep learning would
be a province in that country that dedicates itself to the study of neural
networks. So, an informal and simplified definition we can keep in our minds is:
deep learning is the study and application of deep neural networks in their various
forms.
Definition of Deep Learning
Introduction to Deep Learning
There is sometimes a distinction between Neural Networks and Deep Learning. Neural
networks can be shallow or deep. Deep learning concerns itself with Deep NNs.
Definition of Deep Learning
Introduction to Deep Learning
What is a Neural Network? An artificial neural network (ANN) is a machine learning
algorithm based on a very crude approximation of a biological neural network in a
brain. Artificial neural networks work quite differently than real biological
neural networks; however, they were inspired by their biological counterpart.
The analogy between a deep neural network and the brain comes from:
neural networks are built from layers that can represent / learn increasingly
abstract "concepts" from simple ones. Studies suggest that the mammalian visual
system works in this "layered" way
synaptic connections - like our brain, a neural network is composed of neurons that
are also in a way connected (each neuron communicates with a set of other neurons)
massive interconnectivity - the connection count can be very big. Each neuron in a
neural network can connect to every neuron in the next layer - a fully connected
layer.
Definition of Neural Network
Introduction to Deep Learning
Illustrations from: Deep Learning Illustrated: A Visual, Interactive Guide to
Artificial Intelligence, see: https://www.amazon.com/Deep-Learning-Illustrated-
Intelligence-Addison-Wesley/dp/0135116694
Neural Networks
A set of connected mathematical “neurons” is called a neuronal network.
Composed of edges (connections) and nodes (neurons) (graph, DAG).
Have an input layer, output layer and >= 1 hidden layer - deep neural network.
There are also neural networks that are “shallow”, rarely used.
To train it we use two steps:
Forward Propagation - cost computed
Back Propagation - cost minimized
We represent weights and data as vectors, matrices, tensors.
We will abbreviate neural network(s) as NN(s).
Definition of Neural Network
Introduction to Deep Learning
One important thing to note. In classical ML we specify the features that we think
are important.
This is not how DL models work. In DL it is the job of the neural network to find
the most important features through the process of "learning".
This type of learning, where the important parameters for prediction are not
pre-specified, is called representation learning.
This becomes more apparent when discussing Convolutional Neural Networks, a type of
neural network created for image processing, where we will be able to draw heatmaps
on images showing where the CNN pays the most attention.
See more: https://www.quora.com/What-is-representation-learning-in-deep-learning
Now let’s build our understanding of neural networks from the ground up.
Definition of Neural Network
Introduction to Deep Learning
Artificial Neuron Model
Although there are similarities between the brain and a neural network, there are
also differences - biological neurons are assumed to either fire or not (binary);
in an ANN that is not the case. This difference is usually not stressed, but it is
important: because ANNs have continuous-range activations, we can use the process
of backpropagation.
A single neuron in ANN discussions is called a perceptron.
The structure of the perceptron in general terms: inputs of predefined form
(constant feature count) come in → they are multiplied and summed with the
trainable weights + a bias term → the result of this is passed through an
activation function → whatever the result of this function is, it is propagated to
other neurons/perceptrons.
As you can see a single artificial neuron is nothing else than two functions -
summation and activation.
We’ll discuss different types of activation functions soon, for now, let’s combine
many of these artificial neurons.
Artificial Neuron Model
Introduction to Deep Learning
Because we're working with computers rather than a biological brain, we represent
this network of nodes using computationally efficient data structures.
Scalars: variables like, 1 or 5010.956
Vectors: one-dimensional arrays of values.
Matrices: two-dimensional arrays of values.
Tensors: arrays with an arbitrary number of dimensions. In mathematics lingo it is
a generalization of the concept of a matrix into higher dimensions. The order or
tensor rank is the dimension of the tensor; a 3rd-order tensor is shown in the side
image.
Groups of (... groups of) tensors are also tensors. They can go up to any dimension
you like (remember: a 3D tensor can store an RGB image, a video is a group of
images - a 4D tensor; if you split this video into segments, a group of such
segments will be a 5D tensor, and so on).
What do we represent this way in the artificial neuron model? The weights and the
input and output data will be tensors, flowing though the network of connected
artificial neurons.
Scalars, vectors, matrices & tensors
Introduction to Deep Learning
Before discussing how neurons connect into networks as structures comprising the
deep learning as a topic we can discuss what types of functions we will need to
differentiate in deep learning.
We have 3 big topics here:
Activation functions
Loss / error / cost functions
Optimizers

Before discussing anything in particular we need to note that there is a huge
variety of activation functions, and the optimizer topic is also a bit complex.
Their number is big, as they are hyperparameters that we can choose.
But there is also good news - the error / loss function comes directly from the
classic machine learning models we talked about: r^2 and mean squared error used
for regression, and log loss / cross entropy loss used for classification are
common ones.
Functions
Introduction to Deep Learning
What is an activation function?
A function attached to each neuron in the network; it determines the output.
Activation functions also help normalize the output of each neuron to a range
between 0 and 1 or -1 and 1.
Activation functions must be computationally efficient because they are calculated
across millions of neurons for each data sample.
Saturation and active region - an activation function has a portion with a defined
slope, called the active region. For the neuron to be able to learn, the values
should be in the active region, or the neuron is said to be dead. This is because
we need to calculate derivatives of the function w.r.t. each of its contributions
to the total error. Hence the function needs to be differentiable.

Function classes:
(Binary) Step Function
Linear Function
Non linear
Sigmoid
Hyperbolic Tangent
Rectified Linear Unit - ReLU
Others...

Ref: https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-
activation-functions-right/

Activation functions
Introduction to Deep Learning
Binary step function aka Heaviside step function
threshold-based activation function. If the input value is above or below a certain
threshold, the neuron is activated and sends exactly the same signal to the next
layer.
problems: does not allow multi-value outputs, so it cannot support classifying the
inputs into one of several categories, or making continuous predictions for
regression. This makes a single neuron less powerful than it potentially might be
(if y can only be 1 or 0, a NN of 6 neurons has ~2^6 output states; if y is
continuous there are far more, so the power to represent different states of the
world is increased). Second: it is not differentiable at 0.
Linear
y = mx + b
It takes the inputs, multiplied by the weights for each neuron, and creates an
output signal proportional to the input. In one sense, a linear function is better
than a step function because it allows multiple outputs, not just yes and no.
Problems: backpropagation (gradient descent) cannot be used to train the model -
the derivative of the function is a constant and has no relation to the input, X,
so it is not possible to understand which weights in the input neurons can provide
better prediction. Also, all layers of the neural network collapse into one - with
linear activation functions, no matter how many layers the neural network has, the
last layer will be a linear function of the first layer (because a linear
combination of linear functions is still a linear function). So a linear activation
function turns the neural network into just one layer. A neural network with linear
activation functions is simply a linear regression model (no matter the size,
assuming all activations are the same).
BSF, Linear
Introduction to Deep Learning
Modern neural network models primarily use non-linear activation functions. They
allow model to create complex mappings between the network’s inputs and outputs,
which are essential for learning and modeling complex data, such as images, video,
audio, and data sets which are non-linear or have high dimensionality.
They allow backpropagation because they have a derivative function which is related
to the inputs.
They allow “stacking” of multiple layers of neurons to create a deep neural
network. Multiple hidden layers of neurons are needed to learn complex data sets
with high levels of accuracy.
The representational power of the models is huge compared to linear or BSF
functions. Way more internal states.
Common functions falling into this category
Sigmoid
Hyperbolic Tangent (tanh)
Rectified Linear Unit - ReLU (next time)
Leaky ReLU (next time)
Parametric ReLU and others (next time)
Softmax (next time)
Radial Basis Function - RBF (next time)
Swish (next time)
Non Linear Functions, Sigmoid
Introduction to Deep Learning
One of the most popular non-linear activation functions.
We will understand the vanishing gradient and the output not being zero-centered as
problems in the future. For now let's note that these are problems for the
algorithm used to train the neural network - backpropagation with gradient descent.
A vanishing gradient will not allow large inputs to be translated into proportional
outputs, and the absence of zero-centeredness (i.e. the y intercept is not at 0, so
the mean of the y values is guaranteed to be non-zero (positive)) can cause
gradient descent to "zig-zag" as described here:
https://stats.stackexchange.com/q/237169/162267
Non Linear Functions, Sigmoid, TanH
Introduction to Deep Learning
Abbreviated Tanh, stands for hyperbolic tangent.
Very popular.
A variation of the sigmoid function, sometimes called “shifted sigmoid”
Non Linear Functions, Sigmoid, TanH
Introduction to Deep Learning
We will create a simple perceptron w/ a sigmoid activation function to obtain a
result that ranges (0, 1).
The demo is mostly derived from here: https://www.youtube.com/watch?v=kft1AJ9WVDk
but we will go in more depth.
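A minimal numpy sketch in the spirit of that demo; the toy data and iteration count
are assumptions, and the target simply copies the first input column:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

rng = np.random.default_rng(1)
w = 2 * rng.random((3, 1)) - 1               # random weights in [-1, 1)

for _ in range(10000):
    out = sigmoid(X @ w)                     # forward pass: weighted sum + sigmoid
    error = y - out
    w += X.T @ (error * out * (1 - out))     # update using sigmoid's derivative

print(out.round(3))                          # outputs approach the targets, in (0, 1)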
Creating Perceptron from numpy-scratch
Introduction to Deep Learning
Note, parentheses mean non-inclusivity “( )“ and these “[]” mean inclusivity. We
say that sigmoid function asymptotically approaches 0 and 1, but never actually
reaches those values, that’s why we don’t write [0,1].
In scikit they have a Perceptron class. It is essentially a linear classifier.
See: https://scikit-learn.org/stable/modules/linear_model.html#perceptron
Also:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.h
tml
We can see that the scikit perceptron is capable of classifying the digits dataset
- so it is more powerful than one might expect. It appears this is not a "real"
perceptron, as it should not theoretically be that powerful - how can we verify
that? Simply compare this to LogisticRegression results.

After creating a single neuron - a perceptron - we will next turn to training it.
We will discuss loss functions, more activation functions and so on next time.
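A minimal sketch of that comparison on the digits dataset; the split and max_iter
are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

perc = Perceptron().fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

# comparable scores would suggest the Perceptron acts as a plain linear classifier
print(perc.score(X_te, y_te), logreg.score(X_te, y_te))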
Creating Perceptron with scikit
Introduction to Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Introduction to Deep Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1R0NN0lobnLImQ5WnnuhHjRCNsD4x6B5g.pptx ---


Artificial Intelligence
Generative Deep Learning
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Examples
01
02
Common task categories
Generative Deep Learning
00
Intro and definitions
Business Usecases
03
Sampling strategy
05
04
Text Generation
06
Further explorations
Two aspects of deep learning:
automating what was not possible with other techniques.
generating new content. With this, generative deep learning differentiates itself
from classical ML.
The potential of artificial intelligence to emulate human thought processes goes
beyond passive tasks such as object recognition and mostly reactive tasks such as
driving a car. It extends well into creative activities. In recent history:
In the summer of 2015, we were entertained by Google’s DeepDream algorithm turning
an image into a psychedelic mess of dog eyes and pareidolic artifacts;
in 2016, we used the Prisma application to turn photos into paintings of various
styles (see a few slides forward).
In the summer of 2016, an experimental short movie, Sunspring, was directed using a
script written by a Long Short-Term Memory (LSTM) algorithm—complete with dialogue.
Maybe you’ve recently listened to music that was tentatively generated by a neural
network.
GPT-3 - NLP explosion (text to text).
Stable Diffusion - text to image.

Granted, the artistic productions we’ve seen from AI so far have been fairly low
quality. AI isn’t anywhere close to rivaling human screenwriters, painters, and
composers. But replacing humans was always beside the point: artificial
intelligence isn’t about replacing our own intelligence with something else, it’s
about bringing into our lives and work more intelligence—intelligence of a
different kind. In many fields, but especially in creative ones, AI will be used by
humans as a tool to augment their own capabilities: more augmented intelligence
than artificial intelligence.
Intro and definitions
Generative Deep Learning
A large part of artistic creation consists of simple pattern recognition and
technical skill. And that’s precisely the part of the process that many find less
attractive or even dispensable. That’s where AI comes in. Our perceptual
modalities, our language, and our artwork all have statistical structure. Learning
this structure is what deep-learning algorithms excel at. Machine-learning models
can learn the statistical latent space of images, music, and stories, and they can
then sample from this space, creating new art-works with characteristics similar to
those the model has seen in its training data.
Naturally, such sampling is hardly an act of artistic creation in itself. It’s a
mere mathematical operation: the algorithm has no grounding in human life, human
emotions, or our experience of the world; instead, it learns from an experience
that has little in common with ours. It’s only our interpretation, as human
spectators, that will give meaning to what the model generates. But in the hands of
a skilled artist, algorithmic generation can be steered to become meaningful—and
beautiful. Latent space sampling can become a brush that empowers the artist,
expands the space of what we can imagine. What’s more, it can make artistic
creation more accessible by eliminating the need for technical skill and practice—
setting up a new medium of pure expression, factoring art apart from craft (we can
think about what it means). Iannis Xenakis, a visionary pioneer of electronic and
algorithmic music, beautifully expressed this same idea in the 1960s, in the
context of the application of automation technology to music composition:
It is not easy to imagine the usage of deep learning in music generation (we can
read about it), but abstractly, it is possible that in the space of all possible
melodies, sounds, etc. there are unexplored combinations that deep learning can
help us find (like electronic music unlocked a large space of new sounds and
melodies, thus giving rise to new styles). Dummy example: give a NN pleasant and
unpleasant melodies and ask it to generate something pleasant - tune until
something original is found.
Intro and definitions
Generative Deep Learning
In summary:
AI will probably become an additional form of art rather than replace art/artists
that use other means of expression. Cultural forms will coexist, like they usually
do.
Creating art with the help of computers can be done via brute force, but with
Generative AI we will try to do something better: give some input that will act as
a constraint for the generative model.
Intro and definitions
Generative Deep Learning
https://impakter.com/art-made-by-ai-wins-fine-arts-competition/
Deep Dream https://www.youtube.com/watch?v=3hnWf_wdgzs and
https://www.youtube.com/watch?v=oyxSerkkP4o and https://www.youtube.com/watch?
v=T5jaCr4RAJc
Prisma (based on DeepArt: https://en.wikipedia.org/wiki/DeepArt):
https://www.youtube.com/watch?v=xky0NoxronY and https://www.youtube.com/watch?
v=U5OCXVvXlh0 - https://play.google.com/store/apps/details?
id=com.neuralprisma&hl=en&gl=US
Sunspring: https://www.youtube.com/watch?v=LY7x2Ihqjmc
AWS Deep Composer: https://aws.amazon.com/blogs/machine-learning/collaborating-
with-ai-to-create-bach-like-compositions-in-aws-deepcomposer/
OpenAi Musenet: https://openai.com/blog/musenet/
Newest thing: text-to-music (https://suno.com/). Generation is not the only task
that is interesting in music - we could create a music producer that would
potentially know which song will become very popular/viral (what Timbaland does).
Examples
Generative Deep Learning
Even though the field of generative deep learning is not old, we already have
several different tasks or kinds of tasks we can perform:
Text Generation - usually performed with/using RNNs/Transformers (SSMs/Mamba).
Deep Dreaming - performed with/using CNN.
Neural Style Transfer - performed with/using CNN.
Image/video generation - GAN/Transformers/Diffusion models (text-to-image, text-to-
video).
SORA
Common task categories
Generative Deep Learning
Generative AI can be used in the following situations:
Generating speech with chatbots.
Generating music, text, pictures - original art pieces (NFTs), international
competitions.
Improvement of traditional ML/DL models through generation of new realistic data
(images for cancer detection)
Business Usecases
Generative Deep Learning
Multiple resources online:
LSTM based:
https://keras.io/examples/generative/lstm_character_level_text_generation/
Transformer: https://github.com/fchollet/deep-learning-with-python-notebooks/blob/
master/chapter12_part01_text-generation.ipynb
Miniature GPT:
https://keras.io/examples/generative/text_generation_with_miniature_gpt/
FNET, that is supposed to improve on attention and make the training process faster
(new topic in DL): https://keras.io/examples/nlp/text_generation_fnet/

Important things:
Generative deep learning requires good data cleaning (no longer true?)! In
classification, we did not care that there are tokens like <br>, <hr>, \n, “ ”
(two spaces) and so on. But with text generation the model should not think that
these tokens are human language related.
The last dense layer will output a softmax probability distribution over the
characters, so the dense layer will have len(chars) units:

Dense(len(chars), activation="softmax")
Text Generation
Generative Deep Learning
When generating text, the way you choose the next character is important.
Naive approach - greedy sampling: always choose the most likely next character. But
such an approach results in repetitive, predictable strings that don’t look like
coherent language.
A more interesting approach makes slightly more surprising choices: it introduces
randomness in the sampling process, by sampling from the probability distribution
for the next character. This is called stochastic sampling (recall that
stochasticity is what we call randomness in this field). In such a setup, if e has
a probability 0.3 of being the next character, according to the model, you’ll
choose it 30% of the time.
Sampling probabilistically from the softmax output of the model is neat: it allows
even unlikely characters to be sampled some of the time, generating more
interesting-looking sentences and sometimes showing creativity by coming up with
new, realistic-sounding words that didn't occur in the training data.
Issue with this strategy: it doesn't offer a way to control the amount of
randomness in the sampling process. Why would you want more or less randomness?
Consider an extreme case: pure random sampling, where you draw the next character
from a uniform probability distribution, and every character is equally likely.
This scheme has maximum randomness; in other words, this probability distribution
has maximum entropy. Naturally, it won’t produce anything interesting. At the other
extreme, greedy sampling doesn’t produce anything interesting, either, and has no
randomness: the corresponding probability distribution has minimum entropy.
Sampling from the “real” probability distribution—the distribution that is output
by the model’s softmax function—constitutes an intermediate point between these two
extremes. But there are many other intermediate points of higher or lower entropy
that you may want to explore. Less entropy will give the generated sequences a more
predictable structure (and thus they will potentially be more realistic looking),
whereas more entropy will result in more surprising and creative sequences. When
sampling from generative models, it’s always good to explore different amounts of
randomness in the generation process.
Sampling strategy
Generative Deep Learning
In order to control the amount of stochasticity in the sampling process, we’ll
introduce a parameter called the softmax temperature that characterizes the entropy
of the probability distribution used for sampling: it characterizes how surprising
or predictable the choice of the next character will be.
Given a temperature value, a new probability distribution is computed from the
original one (the softmax output of the model) by reweighting it in the following
way (explanation from F. Chollet, Deep Learning with Python):

NOTE: the assumption is that the reweighted distribution still produces most likely
words and most likely words are the words that make sense, not random words.
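A minimal sketch of that reweighting, following the recipe from F. Chollet, Deep
Learning with Python; preds is assumed to be the model's softmax output for one
generation step:

import numpy as np

def sample_with_temperature(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-9) / temperature    # low T sharpens, high T flattens
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)         # renormalize to a distribution
    return np.random.choice(len(preds), p=preds)  # stochastic sampling

Temperatures above 1 add entropy (more surprising choices); temperatures below 1
remove it (more predictable, repetitive text).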
Sampling strategy
Generative Deep Learning
Compare LSTM and Transformer based generative models.
Implement the remaining tutorials on the same dataset and analyze which approach
has better characteristics, like: best results, easiest to use, easiest to
understand and so on.
Are there any sampling strategies/temperature used for models like GPT-2/GPT-3 when
generating text?
Try to think about why resampling is needed, i.e. why the neural network produces
repetitive sequences on its own. Error propagation? Is this the nature of
representing something complex in a limited space (number of weights)?
Further explorations
Generative Deep Learning
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/

Generative Deep Learning


Detailed course plan
Slides, tasks and so on
Additional information

--- Content from 1db8KfuE2RsbHMWvUiS1uVym6CKSm8Y4d.pptx ---


Artificial Intelligence
Advanced Computer Vision
2021
Lecturer
Mindaugas Bernatavičius

Today you will learn


Classification + Localization
01
02
Drawing bounding boxes
Advanced Computer Vision
00
Intro. Beyond classification
Step by step
03
IoU metric and loss
07
04
Model for bounding box regression
08
Simple approach OpenCV / CVlib
06
Datasets for object detection / localization
09
Multi-class Object Detection w/ RetinaNET
10
Summary
05
Problems with real data
11
Further explorations
We need to know the distinction between the following computer vision tasks
Semantic Segmentation: Given an image, can we classify each pixel as belonging to a
particular class?
Classification + Localization: We were able to classify an image as a cat. Great.
Can we also get the location of said cat in that image by drawing a bounding box
around the cat? Here we assume that there is a fixed number of objects (commonly 1)
in the image.
Object Detection: A more general case of the Classification+Localization problem.
In a real-world setting, we don’t know how many objects are in the image
beforehand. So can we detect all the objects in the image and draw bounding boxes
around them?
Instance Segmentation: Can we create masks for each individual object in the image?
It is different from semantic segmentation. How? If you look in the 4th image on
the top, we won't be able to distinguish between the two dogs using semantic
segmentation procedure as it would sort of merge both the dogs together.
Note, there are some tutorials that sometimes mix these concepts up, for example
mixing object detection and classification + localization. That's why it's
important to know these.
This time we will focus on the Object Detection and Localization (since
localization can be thought of as object detection for single object).
See the slides here: https://www.slideshare.net/darian_f/introduction-to-the-
artificial-intelligence-and-computer-vision-revolution
Intro. Beyond classification
Advanced Computer Vision
A box can be mathematically fully described by four coordinates, let's call them:
(x, y, w, h)
The localization is solved by treating it as a regression problem.

In terms of input and output


The dataset is composed of image { 1.jpg, … , N.jpg } + class {dog, cat, … , N } +
coordinates (x, y, w, h)
X - images, Y - class + coordinates
Classification + Localization
Advanced Computer Vision
If we were to have a neural net that predicts the location of the object we would
need to check if it is correct. Therefore we need to know how to draw bounding
boxes.
We can use matplotlib with certain extensions to draw bounding boxes: import
matplotlib.patches as patches

We also should know how to add a label on the bounding box
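A minimal sketch of drawing a labeled box with matplotlib patches; img and the box
coordinates are assumed placeholders:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(img)                           # img: any HxWx3 image array, assumed loaded
x, y, w, h = 50, 30, 120, 80             # top-left corner + width/height (assumed)
rect = patches.Rectangle((x, y), w, h, linewidth=2, edgecolor='red', facecolor='none')
ax.add_patch(rect)                       # the bounding box
ax.text(x, y - 5, 'cat', color='red')    # the label just above the box
plt.show()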


Drawing bounding boxes
Advanced Computer Vision
We will build a model slowly, taking baby-steps to understand it better. The steps
will be these:
Detect synthetic white shape over black background
Detect a cat over a black background
Detect a cat over natural background
Detect if there is a cat over natural background
Detect objects from multiple classes
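For the first step, the training data can be generated synthetically. A minimal sketch, assuming the 3 regressed numbers are (x, y, size) of a white square; the slides do not specify the exact parametrization:

import numpy as np

def make_sample(img_size=128, rng=None):
    # random white square on a black background; target is (x, y, size) / img_size
    if rng is None:
        rng = np.random.default_rng()
    img = np.zeros((img_size, img_size, 3), dtype=np.float32)
    size = int(rng.integers(16, 48))
    x = int(rng.integers(0, img_size - size))
    y = int(rng.integers(0, img_size - size))
    img[y:y + size, x:x + size, :] = 1.0
    target = np.array([x, y, size], dtype=np.float32) / img_size
    return img, target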

Step by step
Advanced Computer Vision
The model is a simple TF model with the head replaced by a regression head that
outputs 3 numbers.

import tensorflow as tf
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# VGG16 backbone with its classification head removed; we attach our own head
vgg = tf.keras.applications.VGG16(input_shape=[128, 128, 3], include_top=False, weights='imagenet')
x = Flatten()(vgg.output)
x = Dense(3, activation='sigmoid')(x)  # sigmoid keeps the normalized outputs in [0, 1]
model2 = Model(vgg.input, x)
model2.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001))
If this works, we can consider the case where the object is missing from the
picture entirely. We will need a custom loss function if we want to dedicate a
single number in our output to indicating a missing object:

def custom_loss(y_true, y_pred):
    # first 3 outputs: location; last output: 'is the object present' flag
    location_bce = tf.keras.losses.binary_crossentropy(y_true[:, :3], y_pred[:, :3])
    is_there_bce = tf.keras.losses.binary_crossentropy(y_true[:, -1], y_pred[:, -1])
    return location_bce + is_there_bce
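For this loss to line up with the slicing above, the head needs a fourth output for the presence flag. A hedged sketch reusing the same backbone (the model name is ours, not from the slides):

x = Flatten()(vgg.output)
x = Dense(4, activation='sigmoid')(x)  # 3 location numbers + 1 presence flag
model3 = Model(vgg.input, x)
model3.compile(loss=custom_loss, optimizer=Adam(learning_rate=0.001))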
We can extend this argument to more classes and detect dogs, things, etc.
Model for bounding box regression
Advanced Computer Vision
But now we have a problem: the dataset does not have bounding boxes. So, we need to
add them ourselves. This is often one of the hardest and most costly parts of a
Machine Learning project: getting the labels. It's a good idea to spend time
looking for the right tools.

To annotate images with bounding boxes, you may want to use an open source image
labeling tool like VGG Image Annotator, LabelImg, OpenLabeler, or ImgLab, or
perhaps a commercial tool like LabelBox or Supervisely. You may also want to
consider crowdsourcing platforms such as Amazon Mechanical Turk if you have a very
large number of images to annotate. However, it is quite a lot of work to set up a
crowdsourcing platform, prepare the form to be sent to the workers, supervise them,
and ensure that the quality of the bounding boxes they produce is good, so make
sure it is worth the effort. If there are just a few thousand images to label, and
you don't plan to do this frequently, it may be preferable to do it yourself.

Adriana Kovashka et al. wrote a very practical paper about crowdsourcing in
computer vision ( https://arxiv.org/abs/1611.02145 ). I recommend you check it
out, even if you do not plan to use crowdsourcing.
Problems with real data
Advanced Computer Vision
Some of the most popular projects:
VGG Image Annotator (VIA): https://www.robots.ox.ac.uk/~vgg/software/via/
CVAT: https://blog.roboflow.com/cvat/
VoTT: https://blog.roboflow.com/vott/
LabelImg: https://blog.roboflow.com/labelimg/
LabelMe: https://blog.roboflow.com/labelme/
OpenLabeler
ImgLab
LabelBox
Supervisely
Coco annotator: https://www.youtube.com/watch?v=OMJRcjnMMok
Amazon Mechanical Turk: https://www.mturk.com/get-started
Best ones - we usually care about overviews that tell us which tools have the best
features: https://www.folio3.ai/blog/labelling-images-annotation-tool/
DEMO: VGG Image Annotator (VIA)
Problems with real data
Advanced Computer Vision
Here are the datasets with bounding boxes:
https://public.roboflow.com/object-detection
https://lionbridge.ai/datasets/20-best-bounding-box-image-and-video-datasets-for-
machine-learning/
https://cocodataset.org/#home COCO dataset (COCO2017 118k images)
Datasets for object detection / localization
Advanced Computer Vision
The MSE often works fairly well as a cost function to train the model, but it is
not a great metric to evaluate how well the model can predict bounding boxes. The
most common metric for this is the Intersection over Union (IoU): the area of
overlap between the predicted bounding box and the target bounding box, divided by
the area of their union.
In tf.keras, it is implemented by the tf.keras.metrics.MeanIoU class.
IoU is usually an evaluation metric; a good explanation:
https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-
detection/
It can also be used as a loss function: https://arxiv.org/abs/2101.08158
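A minimal sketch of computing IoU for two boxes in (x, y, w, h) format (the helper name is ours):

def iou(box_a, box_b):
    # convert (x, y, w, h) to corner coordinates
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # intersection rectangle (zero area if the boxes do not overlap)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0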
IoU metric and loss
Advanced Computer Vision
A very simple approach is available with OpenCV / CVlib.
We get common object detection (trained on the COCO dataset), face detection and
gender detection: https://github.com/arunponnusamy/cvlib . The COCO-trained
detector recognizes the following categories:
https://github.com/arunponnusamy/object-detection-opencv/blob/master/yolov3.txt
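A minimal usage sketch, assuming a local image file street.jpg (based on the cvlib README):

import cv2
import cvlib as cv
from cvlib.object_detection import draw_bbox

img = cv2.imread('street.jpg')  # hypothetical input image
bbox, labels, conf = cv.detect_common_objects(img)  # YOLO model trained on COCO
out = draw_bbox(img, bbox, labels, conf)  # draw boxes + labels on a copy
cv2.imwrite('street_detected.jpg', out)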
Simple approach OpenCV / CVlib
Advanced Computer Vision
Object detection is a very important problem in computer vision. Here the model is
tasked with localizing the objects present in an image and, at the same time,
classifying them into different categories. Object detection models can be broadly
classified into "single-stage" (YOLO, SSD) and "two-stage" (R-CNN and family)
detectors (see: https://www.jeremyjordan.me/object-detection-one-stage/ ). Two-
stage detectors are often more accurate, but at the cost of being slower (although
it is common to say that single-stage is the way to go).
Here in this example, we will implement RetinaNet, a popular single-stage detector,
which is accurate and runs fast. RetinaNet uses a feature pyramid network to
efficiently detect objects at multiple scales and introduces a new loss, the Focal
loss function, to alleviate the problem of the extreme foreground-background class
imbalance for single-stage object detection (see: https://medium.com/analytics-
vidhya/how-focal-loss-fixes-the-class-imbalance-problem-in-object-detection-
3d2e1c4da8d7 )
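The idea behind the focal loss can be sketched in a few lines (a binary version for illustration, assuming y_pred holds probabilities; the Keras example implements its own multi-class variant):

import tensorflow as tf

def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    # p_t is the predicted probability of the true class;
    # (1 - p_t)^gamma down-weights easy examples, so the rare foreground
    # class is not drowned out by the abundant background class
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
    return -alpha_t * tf.pow(1 - p_t, gamma) * tf.math.log(p_t + 1e-7)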
Ref: https://keras.io/examples/vision/retinanet/
Things to remember about this model:
Feature pyramid network (FPN) w/ ResNet backbone.
RetinaNetBoxLoss + RetinaNetClassificationLoss
IoU loss is also used
Multi-class Object Detection w/ RetinaNET
Advanced Computer Vision
There are many ways to solve object detection and this single notebook does not do
justice to all of them.
Sometimes you need something simple - simple to run, train and explain to clients
or yourself. For this - you might want to consider bounding box regression using a
multihead CNN for classification+localization. But sometimes you need SOTA models
for multiclass object detection in real time - for this consider YOLOv7 (or newer).
Overviews:
A great overview paper can be found here: https://arxiv.org/pdf/1807.05511.pdf
Wiki contains a list of groups of methods:
https://en.wikipedia.org/wiki/Object_detection
Summary
Advanced Computer Vision
If you ask why there is no bounding box regression here, remember that bounding
box regression is considered a single-object classification + localization approach
(not an architecture!). Also, it is not real object detection: the task of
classifying and localizing multiple objects in an image is what is called object detection.
Check if CVlib/OpenCV supports multi-object detection. Create an MWE.
Compare the accuracy (and speed) of the CVLib implementation vs. RetinaNet w/ Keras,
using the same images (if RetinaNet uses ImageNet weights and CVLib was trained on
COCO, then check which categories of items overlap in both datasets - use those).
Further explorations
Advanced Computer Vision
Course plan
You can get familiar with it using this link
https://www.codeacademy.lt/programavimo-kursai/dirbtinio-intelekto-studijos/
Advanced Computer Vision
Detailed course plan
Slides, tasks and so on
Additional information