COMP5310 Notes
USYD 5310

L2-L3: Nominal Data: Values are names; no ordering is implied – e.g. jersey numbers, industry worked in, key experience you have.

Ordinal Data: Values are ordered; no distance is implied – e.g. rank, agreement – central tendency can be measured by mode or median – the mean cannot be defined for an ordinal set – dispersion can be estimated by the Inter-Quartile Range (IQR); the IQR is the difference between the first and third quartiles.
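A minimal sketch of these ordinal summaries in Python (the rank values are made up for illustration):

import numpy as np
from statistics import mode

ranks = np.array([1, 2, 2, 3, 3, 3, 4, 5])   # made-up ordinal ratings

# Central tendency for ordinal data: mode or median (the mean is not defined)
print("mode:", mode(ranks))
print("median:", np.median(ranks))

# Dispersion: IQR = third quartile minus first quartile
q1, q3 = np.percentile(ranks, [25, 75])
print("IQR:", q3 - q1)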
Interval Data: Interval scales provide information about order, and also possess equal intervals – values encode differences – equal intervals between values – no true zero – addition is defined – e.g. Celsius temperature – central tendency can be measured by mode, median, or mean.

Ratio Data: Values encode differences – zero is defined – multiplication is defined – ratios are meaningful – e.g. length, weight, income.

Level of measurement

Measure of central tendency

Measure of Dispersion

Relational data model is the most widely used model today
– Main concept: relation, basically a table with rows and columns
– Every relation has a schema, which describes the columns, or fields

Not all tables qualify as a relation:
– Every relation must have a unique name.
– Attributes (columns) in tables must have unique names. => The order of the columns is irrelevant.
– All tuples in a relation have the same structure, constructed from the same set of attributes.
– Every attribute value is atomic (not multivalued, not composite).
– Every row is unique (can't have two rows with exactly the same values for all their fields).
– The order of the rows is immaterial.

ETL Process: Capture/Extract – Data Cleansing – Transform – Load

The fact table and the dimension relations linked to it look like a star; this is called a star schema.

Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation.

DDL (Data Definition Language)

CREATE TABLE name ( list_of_columns )

DML (Data Manipulation Language): for retrieval of information; also called query language
SELECT … FROM … WHERE
SELECT sitename, commence, organisation
FROM Station JOIN Organisation
ON orgcode = code; (inner join)

SELECT uos_code AS unit_of_study, AVG(mark)
FROM Assessment NATURAL JOIN UnitOfStudy
WHERE credit_points = 6
GROUP BY uos_code
HAVING COUNT(*) > 2

In which time period were all the measurements done?
SELECT MIN(date), MAX(date) FROM Measurement;

How many distinct stations was the temperature measured at?
SELECT COUNT(DISTINCT station)
FROM Measurement WHERE sensor = 'temp';

How many distinct stations were measured for each sensor?
SELECT sensor, COUNT(DISTINCT station)
FROM Measurement
GROUP BY sensor
ORDER BY count DESC;


How many measurements have we done?
SELECT COUNT(*) FROM Measurement;

Self join: lists all film sub-categories and their corresponding parent categories.

List the top five measurements ordered by date in descending order:
SELECT * FROM Measurement ORDER BY date DESC LIMIT 5;

e.g. 1: SELECT * FROM TelescopeConfig
WHERE ( mindec BETWEEN -90 AND -50 ) AND ( maxdec >= -45 ) AND ( tele_array = 'H168' );

e.g. 2: SELECT * FROM TelescopeConfig
WHERE tele_array LIKE 'H%';

Determines the usage of Film categories throughout our database.

EXTRACT(year FROM startDate)

TO_DATE('01-03-2012', 'DD-MM-YYYY')

'2012-04-01' + INTERVAL '36 HOUR'

SELECT gid, band, epoch FROM Measurement
WHERE intensity IS NULL;

5 + NULL returns NULL

lists all Actor nationalities and how many actors are of


each nationality. Only show nationalities with at least 2
associated actors.

lists every Film which has at least five actors playing in it.

Hypothesis Testing
Unpaired or independent: separate individuals
Paired: same individual at different points in time.

Increase the power of a significance test
– Obtain a larger sample
– Larger N means more reliable statistics
– Less likely to have errors
  • Type I: Reject true H0
  • Type II: Fail to reject false H0

Unpaired Student’s t-test


null hypothesis that two population means are equal
Assumes
– The samples are independent
– Populations are normally distributed
– Standard deviations are equal
– Note – Multiply two-tailed p-value by 0.5 for one-tailed
p-value (e.g., to test A>B, rather than A>B OR A<B)
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', alternative='two-sided')
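A small usage sketch (the two samples below are made up; the one-tailed adjustment follows the note above):

import numpy as np
from scipy import stats

a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])  # made-up sample A
b = np.array([4.2, 4.8, 5.0, 4.6, 4.4, 4.9])  # made-up sample B

t_stat, p_two_sided = stats.ttest_ind(a, b, equal_var=True)
p_one_sided = p_two_sided * 0.5  # one-tailed p-value for H1: mean(A) > mean(B)
print(t_stat, p_two_sided, p_one_sided)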

Mann-Whitney U test
Nonparametric version of unpaired t-test
Assumes
– The samples are independent
– Note – N should be at least 20
scipy.stats.mannwhitneyu(x, y, use_continuity=True, alternative=None)

Analysis of variance (ANOVA)


null hypothesis two or more groups have the same
population mean
Assumes:
– The samples are independent
– Populations are normally distributed
– Standard deviations are equal
scipy.stats.f_oneway(*args, axis=0)

Kruskal-Wallis H-test
Nonparametric version of ANOVA
– Assumes samples are independent
– Also called one-way ANOVA on ranks – the ranks of the data values are used in the test rather than the actual data
– Not recommended for samples smaller than 5
– Not as statistically powerful as ANOVA
– Both ANOVA and the Kruskal-Wallis H-test are extensions of the Unpaired Student's t-test and the Mann-Whitney test, used to compare the means of more than two populations.
scipy.stats.kruskal(*args, nan_policy='propagate')

Paired Student's t-test
Null hypothesis: the two population means are equal
Assumes
– The samples are paired
– Populations are normally distributed
– Standard deviations are equal
Multiply the two-tailed p-value by 0.5 for a one-tailed p-value (to test A>B, rather than A>B OR A<B)
scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided')

Wilcoxon signed-rank test
Nonparametric version of the paired t-test
Assumes
– The samples are paired
– Note – Often used for ordinal data, e.g., Likert ratings
– N should be large, e.g., ≥20
scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False, alternative='two-sided', mode='auto')

Holdout method – Splits the data randomly into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained

Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
– Leave-One-Out is a particular form of cross-validation:
• k folds where k = # of tuples, for small-sized data

Tutorial 6: Compute significance for H1: sys1 > sys2
Determine which classifier is better (paired t-test):
stats.ttest_rel(sys1_scores, sys2_scores).pvalue*0.5

• Would you expect this variation in a real experiment?
Note: Average scores should only change if the sample is not fixed, or if folds are sampled randomly.

• What does this variation say about the reliability of experiments?
The variation highlights the fact that we always need to be careful generalising results to unseen data. It also highlights the importance of selecting samples that are representative of the population.

• How can we increase reliability?
Significance testing helps us quantify reliability. Larger sample sizes help ensure reliability.

Linear Regression
Error/residual: difference between the observed value and the predicted value (y_actual − y_predicted)

R²: ratio of explained variation in y to total variation in y
– Ranges from 0 to 1
– Measures goodness of fit, not precision

Standard error (S): prediction accuracy
– Expressed in units of the response variable

Prediction interval: range that should contain the response value of a new observation
– If the sample size is large enough, a useful rule-of-thumb is that approximately 95% of predictions should fall within ±2S of the predicted value
– Suppose S = $2k and the requirement is predictions within $5k: S must be <= $2.5k to produce a sufficiently narrow 95% prediction interval
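A minimal sketch of fitting a simple linear regression and reading off R² and the standard error S (scipy; the data points are made up):

import numpy as np
from scipy import stats

x = np.array([50, 60, 75, 80, 95, 110, 120])        # made-up predictor
y = np.array([150, 172, 205, 210, 255, 285, 310])   # made-up response

res = stats.linregress(x, y)
y_pred = res.intercept + res.slope * x
residuals = y - y_pred

r_squared = res.rvalue ** 2                           # goodness of fit, 0 to 1
s = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))    # standard error of the regression

# Rough rule-of-thumb 95% prediction interval: y_pred +/- 2*s
print(r_squared, s)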

Multiple linear regression P482 (L9 – p25)

If α is small, gradient descent can be slow.
If α is too large, gradient descent might overshoot the minimum.

Batch gradient descent: slow but more accurate; costly.
Stochastic gradient descent: fast, but may not converge to the minimum; useful when the training set is large.

Gradient Descent
Make sure features are on a similar scale.
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.

Logistic Regression
Example:
Convex cost function: Cross-Entropy (Log Loss) P509 (L9 – p50)
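A tiny illustration of the cross-entropy (log loss) cost on made-up labels and predicted probabilities:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0])           # made-up binary labels
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # made-up predicted probabilities

# Cross-entropy / log loss: -(1/N) * sum( y*log(p) + (1-y)*log(1-p) )
log_loss = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(log_loss)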

Tutorial:
If R = 0.39: The value here is 0.329. This suggests that our model only partly explains the data, so there must be other factors at play.

If R = 0.755: Yes, r_squared indicates that our model explains the data reasonably well. But we should look at the standard error as well.
According to the 95% prediction interval, how close will
our predictions be to the actual value? What if we
calculate over the test data instead?
Answer: Note we have a fairly small data set (339 in
training, 167 in test). So this value will vary depending on
our split.

Unstructured Data – Naïve Bayes

Tokenisation
– Split a string (document) into pieces called tokens
– Possibly remove some characters, e.g., punctuation

Normalisation
Map similar words to the same token

– Stemming/lemmatisation
– Avoid grammatical and derivational sparseness
– E.g., "was" => "be"
– Lower casing, encoding
– E.g., "Naïve" => "naive"

Text Classification

Term frequency weighting

But the word "close" does not exist in the category Sports, thus p(close | Sports) = 0, leading to p(a very close game | Sports) = 0.

Laplace smoothing
11: how many words are in class Sports
14: how many words are in the whole dataset, without repetition

TFIDF Weighting

Naïve Bayes
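A minimal sketch of the Naïve Bayes scoring idea with Laplace (add-one) smoothing; the tiny training set below is made up in the spirit of the Sports example above:

import math
from collections import Counter

# Made-up tokenised training documents per class
train = {
    "Sports":     [["a", "great", "game"], ["very", "clean", "match"]],
    "Not Sports": [["a", "close", "election"], ["a", "very", "close", "vote"]],
}

vocab = {w for docs in train.values() for doc in docs for w in doc}
total_docs = sum(len(docs) for docs in train.values())

def log_score(tokens, cls):
    # log P(cls) + sum over words of log P(word | cls), with add-one smoothing
    counts = Counter(w for doc in train[cls] for w in doc)
    n_words = sum(counts.values())
    score = math.log(len(train[cls]) / total_docs)            # class prior
    for w in tokens:
        score += math.log((counts[w] + 1) / (n_words + len(vocab)))
    return score

test = ["a", "very", "close", "game"]
print({cls: log_score(test, cls) for cls in train})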

Decision Tree (P522)


Information Gain (IG)
IG calculates effective change in entropy after making a
decision based on the value of an attribute.
IG(Y|X) = H(Y) – H(Y|X)
where Y is a class label
X is an attribute
H(Y) is the entropy of Y
H(Y|X) is the conditional entropy of Y given X
– Higher entropy => higher uncertainty
– Lower entropy => lower uncertainty

• What is the best value of max_depth based on this plot?
max_depth=8. This gives the best generalisation error with lower model complexity and less risk of overfitting.

• Why doesn't generalisation error increase on the right?
The algorithm has other mechanisms to prevent overfitting, and overfitting does not seem to hurt generalisation too much on this data. Nevertheless, decision trees can overfit, so use with caution.

• Would it be useful to collect more training data?


Yes, almost always. However, it looks like both classifiers
are close to their asymptotes. So the benefit might not be
worth the cost. The decision tree would benefit more from
additional data.

• The decision tree has a larger spread between training


and generalisation error. Why is this?
The decision tree suffers more from overfitting. The random forest on this particular data has 0 training error.
This is a bit of a surprise as random forests tend to increase
bias. With high bias, we would expect underfitting which
tends to be characterised by both high training and high
generalisation error. However, random forests generally
also reduce variance enough to cancel out any increase in
bias. Here we end up with a nice generalisation error plot
that seems to be close to its asymptote and not too
different from the training error.
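A minimal sklearn sketch of the kind of comparison discussed in these answers (synthetic data; max_depth=8 simply mirrors the tutorial answer above):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the tutorial data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (DecisionTreeClassifier(max_depth=8, random_state=0),
              RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train acc:", model.score(X_train, y_train),  # training accuracy
          "test acc:", model.score(X_test, y_test))     # generalisation estimate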
Setting up a reliable evaluation (P549 – L10 p29)
Generalization error should model application as closely and reliably as possible
• Sample must be representative
• Larger sample better
Data drift (non-stationary data)

• When is it OK to use the held-out test data from our train/dev/test split?
As little as possible. Ideally only once, for our final generalisation error/accuracy calculation.

Building a good solution
Build a simple model first, evaluate, iterate
Ensembles of predictors often do very well
– random forest (bootstrap many trees; more biased, lower variance; lose the explainability of trees; boosts performance)

Tutorial:
• Does training or generalisation error level out first? Why?
…that higher values of max_depth may lead to overfitting.

L8b – PCA (P446)
Aim: transforming the original data from a high-dimensional space into a lower-dimensional space.

Principal components (PC)
The new variables in the lower-dimensional space correspond to a linear combination of the originals.

PCA helps in
– Visualization
– Uncovering clusters
– Dimensionality reduction
– PCA method is particularly useful when the variables within the data set are highly correlated.
– Correlation indicates that there is redundancy in the data.
– Correlation is captured by the covariance matrix.
– PCA is traditionally performed on the covariance matrix or correlation matrix.
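A minimal sklearn PCA sketch on standardised, made-up correlated data, showing the explained-variance ratios of the principal components:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data with correlated columns
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200), rng.normal(size=200)])

X_std = StandardScaler().fit_transform(X)   # PCA is usually done on standardised data
pca = PCA(n_components=2).fit(X_std)        # keep the first two PCs

print(pca.explained_variance_ratio_)        # share of variance explained by each PC
X_reduced = pca.transform(X_std)            # data projected into the lower-dimensional space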

Covariance Matrix
For three attributes (x, y, z): the covariance between one dimension and itself is the variance.
– cov(x,y) = cov(y,x), hence the matrix is symmetrical about the diagonal
– N-dimensional data will result in an NxN covariance matrix

Covariance Matrix Example

PCA Example
– PCA creates uncorrelated PC variables (eigenvectors) having zero covariations and variances (eigenvalues) sorted in decreasing order.
– The first PC captures the greatest variance, the second greatest variance is the second PC, and so on.
– By eliminating the later PCs we can achieve dimensionality reduction.
– The 1st PC accounts for or "explains" 1.651/3.448 = 47.9% of the overall variability; the 2nd one explains 35.4% of it; the 3rd one explains 16.7% of it.

L8 – Clustering
Group data points into clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
– Distance function specifies the "closeness" of two objects.

Hierarchical clustering: A method of cluster analysis which seeks to build a hierarchy of clusters. It produces a set of nested clusters organized as a hierarchical tree.
– Agglomerative (bottom up), Divisive (top down)
– Agglomerative: initially each point is in its own cluster; merge until a single cluster remains.

Partitional clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

K-Means Clustering
Complexity is O( n * k * i * d )
n = number of points, k = number of clusters, i = number of iterations, d = number of attributes (or dimensions)

Measures of Cluster Validity
– External Index: Measure the extent to which cluster labels match externally supplied class labels (e.g., accuracy, precision, recall, F1-score)
– Internal Index: Measure the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error)
– Relative Index: Compare two different clusterings or clusters (often an external or internal index is used)

Homogeneity ranges from 0 to 1, measuring whether clusters contain data points that are part of a single class (analogous to precision, P = TP / (TP+FP))
Completeness ranges from 0 to 1, measuring whether classes contain data points that are part of a single cluster (analogous to recall, R = TP / (TP+FN))
V-measure is the harmonic mean of homogeneity and completeness (analogous to F1 score = 2PR / (P+R))

Internal: Sum of Squared Error (SSE, Inertia)
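A minimal sklearn sketch computing the SSE (inertia) and average silhouette for a few values of k, on made-up data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # made-up data

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          "SSE (inertia):", km.inertia_,                        # internal index
          "avg silhouette:", silhouette_score(X, km.labels_))   # higher is better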


Silhouette Coefficient Example P38

Using Silhouettes to choose k
A high average silhouette indicates points are far away from neighbouring clusters.

Pre-Processing for Clustering
Data cleansing / data transformation / data normalisation / dimensionality reduction / choice or projection of dimensions

L7 Association Rule Mining (P343)

Application of ML:
Creating and using models that are learned from data
– Predicting whether an email is spam or not
– Discovering hidden rules in complex datasets
– Predicting whether a credit card transaction is fraudulent
– Predicting tumour cells as benign or malignant

Supervised vs. Unsupervised Learning
Supervision: The training data are accompanied by labels indicating the class of the observations.
Unsupervised learning (e.g. clustering and association rules)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the aim of
  • Establishing the existence of classes or clusters in the data
  • Discovering hidden patterns or rules

Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}
A k-itemset is an itemset containing k items (e.g. {Milk, Bread, Diaper} is a 3-itemset).

Support count (σ), Support (s)
A frequent itemset has s ≥ min_support

An association rule is an implication of the form X→Y, where X and Y are itemsets, e.g. {Milk, Diaper}→{Beer}
Confidence (c): rules must have c ≥ min_conf
Confidence of {Beer}→{Diaper} = support count of {Beer, Diaper} / support count of {Beer}
Confidence of {Diaper}→{Beer} = support count of {Beer, Diaper} / support count of {Diaper}

Mining Association Rules
1. Frequent itemset generation – Generate all itemsets with s ≥ min_support
2. Rule generation – Generate high-confidence rules from each frequent itemset – Each rule is a binary partitioning of a frequent itemset
Easy! But brute-force enumeration is computationally prohibitive.
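A small sketch computing support and confidence for the {Milk, Diaper} → {Beer} example over made-up transactions:

transactions = [  # made-up shopping baskets
    {"Milk", "Bread", "Diaper"},
    {"Milk", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Diaper", "Beer"},
    {"Bread", "Milk", "Coke"},
]

def support_count(itemset):
    # Number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = support_count(X | Y) / len(transactions)     # s(X -> Y)
confidence = support_count(X | Y) / support_count(X)   # c(X -> Y)
print(support, confidence)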
Apriori Principle
If an itemset is frequent, then all of its subsets are also frequent.

Anti-monotone property of support:
If an itemset is infrequent, then its supersets are also infrequent.

Gradient Descent
Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regards to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum!

Concretely, you start by filling θ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
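A minimal sketch of batch gradient descent for linear regression with the MSE cost (made-up data; the learning rate alpha is a value chosen just for illustration):

import numpy as np

# Made-up data roughly following y = 4 + 3x
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.standard_normal(100)

X = np.c_[np.ones(100), x]      # add a bias column
theta = rng.standard_normal(2)  # random initialization
alpha = 0.1                     # learning rate (too small -> slow, too large -> may diverge)

for _ in range(1000):
    gradients = (2 / len(y)) * X.T @ (X @ theta - y)  # gradient of the MSE cost
    theta -= alpha * gradients                        # step in the direction of descending gradient

print(theta)  # should end up close to [4, 3]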