COMP5310 Notes
Level of measurement
Ratio Data – Values encode differences – Zero is defined –
Multiplication defined – Ratio is meaningful – E.g. length,
weight, income
– Every relation must have a unique name.
– Attributes (columns) in tables must have unique names.
=> The order of the columns is irrelevant.
– All tuples in a relation have the same structure;
constructed from the same set of attributes
– Every attribute value is atomic (not multivalued, not
composite).
– Every row is unique (can't have two rows with exactly
the same values for all their fields)
– The order of the rows is immaterial
Measure of Dispersion
In which time period were all the measurements done?
SELECT MIN(date), MAX(date) FROM Measurement;
At how many distinct stations was the temperature measured?
SELECT COUNT(DISTINCT station)
FROM Measurement WHERE sensor = 'temp';
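The two queries above can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column layout of Measurement and the four sample rows are assumptions made up for illustration, not data from the course.

```python
import sqlite3

# In-memory database; the table layout and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Measurement (station TEXT, sensor TEXT, date TEXT, value REAL)"
)
rows = [
    ("S1", "temp", "2012-01-03", 21.5),
    ("S1", "level", "2012-01-04", 1.2),
    ("S2", "temp", "2012-02-10", 19.0),
    ("S2", "temp", "2012-03-01", 18.4),
]
conn.executemany("INSERT INTO Measurement VALUES (?, ?, ?, ?)", rows)

# Time period covered by all measurements (ISO dates sort correctly as text).
period = conn.execute("SELECT MIN(date), MAX(date) FROM Measurement").fetchone()

# Number of distinct stations that recorded temperature.
n_temp_stations = conn.execute(
    "SELECT COUNT(DISTINCT station) FROM Measurement WHERE sensor = 'temp'"
).fetchone()[0]
```

DISTINCT matters here: station S2 has two temperature rows but is counted once.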
e.g. 2: SELECT * FROM TelescopeConfig
WHERE tele_array LIKE 'H%';
TO_DATE('01-03-2012', 'DD-MM-YYYY')
Determines the usage of Film categories throughout our
database
Lists every Film which has at least five actors playing in it.
Increase the power of a significance test
– Obtain a larger sample
– Larger N means more reliable statistics
– Less likely to have errors
– Type I: Reject true H0
– Type II: Fail to reject false H0
Hypothesis Testing
Unpaired or independent: separate individuals
Paired: same individual at different points in time.
Mann-Whitney U test
Nonparametric version of unpaired t-test
Assumes
– The samples are independent
– Note – N should be at least 20
scipy.stats.mannwhitneyu(x, y, use_continuity=True,
alternative='two-sided')
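A minimal sketch of calling the Mann-Whitney U test on two independent samples. The two samples are made-up numbers (20 values each, in line with the note on N), and the call assumes a SciPy version in which alternative accepts 'two-sided'.

```python
from scipy import stats

# Two hypothetical independent samples with 20 observations each.
x = [float(i) for i in range(20)]
y = [i + 10.5 for i in range(20)]

# Two-sided test: are the two samples drawn from the same distribution?
result = stats.mannwhitneyu(x, y, use_continuity=True, alternative="two-sided")
u_statistic = result.statistic
p_value = result.pvalue
```

A small p_value suggests the two samples differ in location; no normality assumption is needed because only ranks are used.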
Kruskal-Wallis H-test
Nonparametric version of ANOVA
– Assumes samples are independent
– Also called one-way ANOVA on ranks, as the ranks of the data
values are used in the test rather than the actual data
– Not recommended for samples smaller than 5
– Not as statistically powerful as ANOVA
– ANOVA extends the unpaired Student's t-test, and the
Kruskal-Wallis H-test extends the Mann-Whitney test, to
comparisons of more than two populations.
scipy.stats.kruskal(*args, nan_policy='propagate')
Paired Student's t-test
Null hypothesis: the two population means are equal
Assumes
– The samples are paired
– Populations are normally distributed
– Standard deviations are equal
Multiply two-tailed p-value by 0.5 for one-tailed p-value
(to test A>B, rather than A>B OR A<B)
scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate',
alternative='two-sided')
Determine which classifier is better. (paired t-test)
stats.ttest_rel(sys1_scores, sys2_scores).pvalue*0.5
• Would you expect this variation in a real experiment?
Note: Average scores should only change if the sample is not
fixed, or if folds are sampled randomly.
• What does this variation say about reliability of
experiments?
The variation highlights the fact that we always need to be
careful generalising results to unseen data.
It also highlights the importance of selecting samples
that are representative of the population.
• How can we increase reliability?
Significance testing helps us quantify reliability. Larger
sample sizes help ensure reliability.
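The halving rule for the paired t-test can be checked with a short sketch. The per-fold scores are invented for illustration, and the alternative="greater" option assumes SciPy 1.6 or later. Halving is only valid when the observed difference points in the hypothesised direction (here sys1 > sys2).

```python
from scipy import stats

# Hypothetical scores of two classifiers on the same eight folds.
sys1_scores = [0.81, 0.84, 0.79, 0.88, 0.83, 0.86, 0.80, 0.85]
sys2_scores = [0.78, 0.80, 0.77, 0.85, 0.82, 0.81, 0.79, 0.80]

# Two-tailed paired t-test, then the one-tailed p-value by halving.
two_sided = stats.ttest_rel(sys1_scores, sys2_scores)
one_tailed_p = two_sided.pvalue * 0.5

# The same one-tailed test requested directly (tests sys1 > sys2).
greater = stats.ttest_rel(sys1_scores, sys2_scores, alternative="greater")
```

Because the t statistic is positive for this data, one_tailed_p and greater.pvalue agree.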
Wilcoxon signed-rank test
Nonparametric version of paired t-test
– Assumes
– The samples are paired
– Note – Often used for ordinal data, e.g., Likert ratings
– N should be large, e.g., ≥20
Linear Regression
Error/residual: difference between the observed and predicted
values $(y_{\text{actual}} - y_{\text{predicted}})$
Tutorial:
If R = 0.39: The value here is 0.329. This suggests that our
model only partly explains the data so there must be other
factors at play.
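A small sketch of how residuals and a goodness-of-fit value can be computed for a straight-line fit. It assumes the tutorial value refers to the coefficient of determination R²; the five data points are made up.

```python
import numpy as np

# Hypothetical observations, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_predicted = slope * x + intercept

# Residual = observed minus predicted, one per data point.
residuals = y - y_predicted

# R^2 = 1 - SS_res / SS_tot: share of the variance explained by the model.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

An r_squared near 1 means the line explains most of the variation; a low value, as in the tutorial, suggests other factors are at play.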
Tokenisation
– Split a string (document) into pieces called tokens
– Possibly remove some characters, e.g., punctuation
Normalisation
Map similar words to the same token
– Stemming/lemmatisation
– Avoid grammatical and derivational sparseness
– E.g., “was” => “be”
– Lower casing, encoding
– E.g., “Naïve” => “naive”
Text Classification
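The tokenisation and normalisation steps listed above can be sketched with the standard library alone. The regular expression, the accent-stripping approach, and the tiny LEMMAS lookup table are all assumptions chosen for illustration; a real pipeline would normally use a proper stemmer or lemmatiser.

```python
import re
import unicodedata

# Hypothetical lemma lookup; stands in for a real lemmatiser.
LEMMAS = {"was": "be", "were": "be", "is": "be"}

def tokenise(text):
    """Split a string into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

def normalise(token):
    """Lower-case, strip accents, then map to a lemma if one is known."""
    token = token.lower()
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return LEMMAS.get(token, token)

tokens = [normalise(t) for t in tokenise("Na\u00efve Bayes was simple!")]
```

Here "Naïve" and "naive" end up as the same token, which reduces sparseness in later counting steps.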
Term frequency weighting
But the word “close” does not appear in the category Sports, thus
$p(\text{close} \mid \text{Sports}) = 0$, leading to
$p(\text{a very close game} \mid \text{Sports}) = 0$
Laplace smoothing
11: how many words in class Sports
Naïve Bayes
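A minimal sketch of Laplace (add-one) smoothing for a word likelihood in Naïve Bayes. The toy corpus below is an assumption for illustration, chosen so that class Sports contains 11 words as in the note above; the 14-word vocabulary is likewise assumed.

```python
from collections import Counter

def laplace_likelihood(word, class_tokens, vocabulary, alpha=1.0):
    """p(word | class) = (count + alpha) / (class size + alpha * |vocabulary|)."""
    counts = Counter(class_tokens)
    return (counts[word] + alpha) / (len(class_tokens) + alpha * len(vocabulary))

# Hypothetical training text: 11 tokens in Sports, 9 in the other class.
sports_tokens = "a great game very clean match a clean but forgettable game".split()
other_tokens = "the election was over it was a close election".split()
vocabulary = set(sports_tokens) | set(other_tokens)  # 14 distinct words

# Without smoothing (alpha=0) the unseen word gets probability zero.
p_unsmoothed = laplace_likelihood("close", sports_tokens, vocabulary, alpha=0.0)

# With add-one smoothing: (0 + 1) / (11 + 14) = 1/25.
p_smoothed = laplace_likelihood("close", sports_tokens, vocabulary)
```

With smoothing, a single unseen word no longer forces the whole sentence probability to zero.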
Hierarchical clustering: A method of cluster analysis which
seeks to build a hierarchy of clusters. It produces a set of
nested clusters organized as a hierarchical tree
Agglomerative (bottom up), Divisive (top down)
Agglomerative
Initial: each point in its own cluster; merge until a single
cluster remains
Partitional clustering: A division of data objects into
non-overlapping subsets (clusters) such that each data object
is in exactly one subset
Covariance Matrix
three attributes (x,y,z):
The covariance between one dimension and itself is the variance
– cov(x,y) = cov(y,x) hence matrix is symmetrical about the
diagonal
– N-dimensional data will result in NxN covariance matrix
Covariance Matrix Example
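The covariance matrix properties listed above can be checked numerically. This is a sketch with made-up data for three attributes (x, y, z), not the example from the lecture.

```python
import numpy as np

# Four hypothetical observations of three attributes (columns x, y, z).
data = np.array([
    [2.0, 1.0, 4.0],
    [3.0, 5.0, 2.0],
    [5.0, 4.0, 1.0],
    [6.0, 8.0, 0.5],
])

# rowvar=False: each column is a variable, each row an observation.
cov = np.cov(data, rowvar=False)

# Sample variances (ddof=1) should match the diagonal of the matrix.
variances = data.var(axis=0, ddof=1)
```

For three attributes the result is a 3x3 matrix, symmetric about the diagonal, with the variances on the diagonal.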
PCA Example
– PCA creates uncorrelated PC variables (eigenvectors)
having zero covariance with each other, with variances
(eigenvalues) sorted in decreasing order.
– The first PC captures the greatest variance, the second
PC captures the second greatest variance, and so on.
– By eliminating the later PCs we can achieve
dimensionality reduction.
K-Means Clustering
Group data points into clusters such that
– Data points in one cluster are more similar to one another
Complexity is O( n * k * i * d )
n = number of points, k = number of clusters, i = number of
iterations, d = number of attributes
– External Index: Measure the extent to which cluster
labels match externally supplied class labels (e.g., accuracy)
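A minimal sketch of the standard k-means loop (Lloyd's algorithm), which also shows where the O(n * k * i * d) cost comes from: each of the i iterations computes distances between n points and k centroids in d dimensions. The function name, the explicit initial centroids, and the toy points are assumptions for illustration.

```python
import numpy as np

def kmeans(points, init_centroids, n_iter=10):
    """Alternate between assigning points and updating centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two hypothetical, well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
labels, centroids = kmeans(points, init_centroids=[[0, 0], [10, 10]])
```

Passing the initial centroids explicitly keeps the sketch deterministic; in practice they are usually chosen at random, and the result can depend on that choice.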
An important parameter in Gradient Descent is the size of
the steps, determined by the learning rate
hyperparameter. If the learning rate is too small, then the
algorithm will have to go through many iterations to
converge, which will take a long time.
On the other hand, if the learning rate is too high, you
might jump across the valley and end up on the other side,
possibly even higher up than you were before. This might
make the algorithm diverge, with larger and larger values,
failing to find a good solution.
Gradient Descent
function.
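The effect of the learning rate can be seen on a toy problem. This sketch minimises f(w) = w², whose gradient is 2w; the function, the starting point, and the three learning rates are assumptions picked to show the slow, well-behaved, and divergent cases.

```python
def gradient_descent(learning_rate, start=5.0, n_steps=50):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(n_steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient  # step against the gradient
    return w

too_small = gradient_descent(0.01)   # still far from the minimum at w = 0
reasonable = gradient_descent(0.1)   # very close to 0 after 50 steps
too_large = gradient_descent(1.1)    # overshoots further on every step
```

Each step multiplies w by (1 - 2 * learning_rate), so a rate above 1.0 makes |w| grow on this function instead of shrink.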