Ca20 Part02 NLP
§ Concepts
• Basics from linguistics, statistics, and machine learning
§ Methods
• How to develop and evaluate data-driven algorithms
• Standard techniques used in machine learning
• Types of analyses used in computational linguistics
§ Disclaimer
• The basics selected here are far from complete and are only revisited at a high level.
  For a more comprehensive overview, see, e.g., the slides of my bachelor's course "Introduction to Text Mining" (Wachsmuth 2019).
§ Computational linguistics
• Intersection of computer science and linguistics
• Technologies for natural language processing
• Models to explain linguistic phenomena, based on knowledge and statistics
§ Observations
• All applications need to "understand" language → linguistics needed
• None of these applications works perfectly → empirical methods needed
Linguistic text units

Example: "The man sighed. It's raining cats and dogs, he felt." at increasing levels of granularity:
• Phonemes: ð ə m ə n s a ɪ d ɪ t s r e ɪ n ɪ ŋ k æ t s æ n d d ɑ g z h i f ɛ l t
• Morphemes: The man sigh ed It s rain ing cat s and dog s he felt
• Tokens: The | man | sighed | . | It | 's | raining | cats | and | dogs | , | he | felt | .
• POS tags: DT NN VBD . PRP VBZ VBG NNS CC NNS , PRP VBD .
• Phrases: [The man]NP [sighed]VP . [It]NP ['s raining]VP [cats and dogs]NP , [he]NP [felt]VP .
• Clauses: [The man sighed.] [It's raining cats and dogs,] [he felt.]
• Sentences: [The man sighed.] [It's raining cats and dogs, he felt.]
• Paragraphs: [The man sighed. It's raining cats and dogs, he felt.]
§ Lemma
• The dictionary form of a word.
Example: "cat" for "cats", "run" for "ran"
§ Wordform
• The fully inflected surface form of a lemma as it appears in a text.
Example: "cats" for "cats", "ran" for "ran"
§ Stem
• The part of a word(form) that never changes.
Example: "cat" for "cats", "ran" for "ran"
§ Token
• The smallest text unit in NLP: A wordform, number, symbol, or similar.
Example: "cats", "ran", and "." in "cats ran." (whitespace is usually not considered a token)
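To make these units concrete, here is a minimal sketch using NLTK; the library and its data downloads are an assumption, not part of the slides:

```python
# Minimal sketch of token, stem, and lemma (NLTK assumed installed).
# One-time setup (resource names vary by NLTK version):
#   nltk.download("punkt"); nltk.download("wordnet")
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = nltk.word_tokenize("cats ran.")        # ['cats', 'ran', '.']

stemmer = PorterStemmer()
print(stemmer.stem("cats"))                     # 'cat' (stem)
print(stemmer.stem("ran"))                      # 'ran' (the stem never changes)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats", pos="n"))    # 'cat' (lemma of the noun)
print(lemmatizer.lemmatize("ran", pos="v"))     # 'run' (lemma of the verb)
```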
§ Phrase
• A contiguous sequence of related words, functioning as a single meaning unit.
• Phrases often contain nested phrases.
• Types. Noun phrase (NP), verb phrase (VP), prepositional phrase (PP).
Sometimes also adjectival phrase (AP) and adverbial phrase (AdvP).
§ Clause
• The smallest grammatical unit that can express a complete proposition.
• Types. Main clause and subordinate clause.
§ Sentence
• A grammatically independent linguistic unit consisting of one or more words.
Main semantic concepts
§ Lexical semantics
• The meaning of words and multi-word expressions.
Different senses of a word, the roles of predicate arguments, ...
§ Compositional semantics
• The meaning of the composition of words in phrases, sentences, and similar.
Relations, scopes of operators, and much more.
§ Entity
• An object from the real world.
• Named entities. Persons, locations, organizations, products, ...
For example, "Jun.-Prof. Dr. Henning Wachsmuth", "Paderborn", "Paderborn University"
§ Relations
• Semantic. Relations between entities, e.g., organization founded in period.
• Temporal. Relations describing courses of events, e.g., as in news reports.
§ Coreference
• Two or more expressions in a text that refer to the same thing.
• Types. Pronouns in anaphora and cataphora, coreferring noun phrases, ...
Example: "Apple is based in Cupertino. The company is actually called Apple Inc., and they make hardware."
§ Speech acts
• Linguistic utterances with a performative function.
  (more details in the lecture on basics of argumentation)
§ Communicative goals
• Specific functions of passages within a discourse.
• Specific effects intended to be achieved by an utterance.
§ Ambiguity is pervasive
• Phonetic. "wreck a nice beach" (vs. "recognize speech")
• Word sense. "I went to the bank."
• Part of speech. "I made her duck."
• Attachment. "I saw a kid with a telescope."
• Coordination. "If you love money problems show up."
• Scope of quantifiers. "I didn't buy a car."
• Speech act. "Have you emptied the dishwasher?"
§ Other challenges
• World knowledge. "Trump must rethink capital punishment"
§ Possible interpretations
• "I never said she stole my money."
  Stressing different words yields different readings, e.g., stressing "I": someone else said it, but I didn't.
§ Evaluation criteria
• Effectiveness. The extent to which the output of an algorithm is correct.
• Efficiency. The consumption of time (or space) of an algorithm on an input.
• Robustness. The extent to which an algorithm remains effective (or efficient)
across different inputs, often in terms of textual domains.
§ Evaluation measures
• Quantify the quality of an algorithm on a specific task and text corpus.
• Algorithms can be ranked with respect to an evaluation measure.
• Different measures are useful depending on the task.
§ Text corpus
• A collection of real-world texts with known properties,
compiled to study a language problem.
• The texts are often annotated with meta-information.
• Corpora are usually split into datasets for developing (training) and/or
evaluating (testing) an algorithm.
§ Types of annotations
• Ground truth. Manual annotations, often created by experts.
• Automatic. NLP algorithms add annotations to texts.
  (more details in the part on acquisition)
Evaluation of effectiveness in classification tasks
§ Instances in classification tasks
• Positives. The output instances (annotations) an algorithm has created.
• Negatives. All other possible instances.
§ Accuracy
• Used if positives and negatives are similarly important.
• Created instances that are correct are true positives (TP), created but incorrect ones are
  false positives (FP); correct instances that were not created are false negatives (FN),
  and all remaining ones are true negatives (TN).

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
§ Precision, recall, and F1-score
• Used if positives are in the focus.
  Precision (P) = TP / (TP + FP)     Recall (R) = TP / (TP + FN)     F1-score = 2 · P · R / (P + R)
• In multi-class tasks, micro- and macro-averaged values can be computed.
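As a quick sketch, all four measures can be computed directly from the four counts in plain Python; the counts below are hypothetical:

```python
# Sketch: effectiveness measures from the TP, TN, FP, FN counts.
def evaluate(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration:
print(evaluate(tp=40, tn=30, fp=10, fn=20))  # (0.7, 0.8, 0.667, 0.727) approx.
```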
§ Balancing of datasets
• A balanced distribution of target classes in the training set is often preferable.
• Undersampling. Removal of instances from majority classes.
• Oversampling. Addition of instances from minority classes.
• In machine learning, an alternative is to weight classes inversely to their size.
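A minimal sketch of both sampling strategies on labeled instances; the data and helper function are hypothetical (dedicated libraries exist for this):

```python
# Sketch: balance (instance, label) pairs by random under-/oversampling.
import random
random.seed(0)  # for reproducibility

def balance(instances, mode="under"):
    by_label = {}
    for x, y in instances:
        by_label.setdefault(y, []).append((x, y))
    sizes = [len(group) for group in by_label.values()]
    target = min(sizes) if mode == "under" else max(sizes)
    balanced = []
    for group in by_label.values():
        if len(group) >= target:    # undersampling: remove instances
            balanced += random.sample(group, target)
        else:                       # oversampling: duplicate instances
            balanced += group + random.choices(group, k=target - len(group))
    return balanced

data = [("t1", "pos"), ("t2", "pos"), ("t3", "pos"), ("t4", "neg")]
print(balance(data, "under"))   # one "pos" and one "neg" instance
```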
§ Training set
• Known instances used to develop or statistically learn an algorithm.
• The training set may be analyzed manually and automatically.
§ Variable
• An entity that can take on different numeric or non-numeric values.
• Independent. A variable X that is expected to affect another variable.
• Dependent. A variable Y that is expected to be affected by others.
• Other. Confounders, mediators, moderators, ...
§ Scales of variables
• Nominal. Values that represent discrete, separate categories.
• Ordinal. Values that can be ordered/ranked by what is better.
• Interval. Values whose difference can be measured.
• Ratio. Interval values that have an absolute zero.
Descriptive statistics
§ Descriptive statistics
• Measures for summarizing and comprehending distributions of values.
• Used to describe phenomena.
§ Measures of dispersion
• Range. The distance between minimum and maximum in a sample.
• Variance. The mean squared difference between each value and the mean.
• Standard deviation. The square root of the variance.
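All three measures in a few lines, using Python's statistics module; the sample is hypothetical (pvariance/pstdev match the definitions above, which divide by the sample size):

```python
# Sketch: measures of dispersion for a sample of values.
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical sample
print(max(values) - min(values))              # range: 7
print(statistics.pvariance(values))           # variance: 4.0
print(statistics.pstdev(values))              # standard deviation: 2.0
```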
§ Example pipeline
• Extraction of the founding dates of companies
§ Alternatives
• Joint model. Realizes multiple analysis steps at the same time.
• Neural network. Often works on the raw input text.
§ Types of approaches
• Supervised. Training instances with known output used in development.
• Unsupervised. No output labels/values used in development.
... and some others
§ Types of techniques
• Rule-based. Analysis based on manually encoded expert knowledge.
Knowledge includes rules, lexicons, grammars, ...
§ Example
• (0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(19|20)[0-9][0-9]
  matches German dates, such as 8.5.1945 or 30.04.2020.
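A sketch of applying this rule with Python's re module; the surrounding text is a hypothetical example:

```python
# Sketch: rule-based matching of German dates with the expression above.
import re

DATE = re.compile(r"\b(0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(19|20)[0-9][0-9]\b")

text = "Kriegsende am 8.5.1945, Vorlesung am 30.04.2020."
print([m.group(0) for m in DATE.finditer(text)])  # ['8.5.1945', '30.04.2020']
```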
(Figure: parse chart over a four-word input — the spans (1,1) to (4,4) carry the tags N N V N, and the span (1,3) is combined into an NP.)
§ Two-way relationship
• The output information of NLP serves as the input to machine learning.
• Many NLP algorithms rely on machine learning to produce output information.
(Figure: data mining turns input data into output information; machine learning provides the middle steps, from the representation of instances to the generalization over patterns.)
§ Feature value
• The value of a feature of a given input, usually real-valued and normalized.
Example: The feature representing "is" would have the value 0.5 for the sentence "is is a word".
§ Feature type
• A set of features that conceptually belong together.
Example: The relative frequency of each known word in a text (this is often called "bag-of-words").
§ Feature vector
• A vector x(i) = (x1(i), ..., xm(i)) where each xj(i) is the value of one feature xj.
Example: For two feature types with k and l features respectively, x(i) would contain m = k+l values.
• Feature filtering. Keep only features whose counts lie within some defined thresholds,
  e.g., applied to a frequency-ordered vocabulary: "the", "a", ..., "engineeeering". Ng (2018)
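A pure-Python sketch tying these concepts together: building a count-thresholded vocabulary and computing bag-of-words feature vectors (the corpus and thresholds are hypothetical):

```python
# Sketch: bag-of-words feature vectors with count-based feature filtering.
from collections import Counter

corpus = ["is is a word", "a word is a word"]   # hypothetical training texts
MIN_COUNT, MAX_COUNT = 1, 10                    # hypothetical thresholds

# Keep only features whose corpus counts lie within the thresholds.
counts = Counter(w for text in corpus for w in text.split())
vocabulary = sorted(w for w, c in counts.items() if MIN_COUNT <= c <= MAX_COUNT)

def bag_of_words(text):
    """One feature vector x(i): relative frequency of each known word."""
    tokens = text.split()
    freq = Counter(tokens)
    return [freq[w] / len(tokens) for w in vocabulary]

print(vocabulary)                    # ['a', 'is', 'word']
print(bag_of_words("is is a word"))  # [0.25, 0.5, 0.25] -- 'is' gets 0.5
```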
§ Learning process
• Each candidate model y assigns one weight wj to each feature xj.
• y is evaluated on the training data against a cost function L.
• Based on the result, the weights are adapted to obtain the next model.
• The adaptation relies on an optimization procedure.
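A toy sketch of this loop for a linear model with a squared-error cost L, adapted by plain gradient descent (data, learning rate, and iteration count are hypothetical):

```python
# Sketch: weights w_j adapted iteratively against a squared-error cost.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # feature vectors x(i)
ys = [1.0, 2.0, 3.0]                        # known training outputs

w = [0.0, 0.0]          # one weight per feature
learning_rate = 0.1     # a hyperparameter (not optimized in training)

for step in range(200):
    # Evaluate the current model on the training data ...
    grad = [0.0, 0.0]
    for x, y in zip(xs, ys):
        pred = sum(wj * xj for wj, xj in zip(w, x))
        for j in range(len(w)):
            grad[j] += 2 * (pred - y) * x[j] / len(xs)
    # ... and adapt the weights to obtain the next model.
    w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]

print(w)  # converges toward [1.0, 2.0]
```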
§ Hyperparameters
• Many learning algorithms have parameters that are not optimized in training.
• They need to be optimized against a validation set.
Generalization
§ Fitting
• A decision boundary y is learned on the training instances that decides the class of unknown instances.
  (Figure: training instances of two classes, squares and circles, in a feature space X1 × X2,
  separated by a learned decision boundary; an unknown instance is classified by the side it
  falls on. Question: what is an optimal fitting?)
§ Regression
• Assign an instance to the most likely value of a continuous target variable.
  (Figure: a regression model fit to training instances, mapping X1 to a continuous target C.)
§ Ensemble methods
• Meta-algorithms that combine multiple classifiers/regressors.
§ Clustering
• The grouping of a set of instances into a possibly but not
necessarily predefined number of classes.
• The meaning of a class is usually unknown in advance.
• Silhouette analysis. Find the k that maximizes distances between clusters (and balances their size).
  (Figure: cost as a function of the number of clusters, from 1 to |X|, with the best k marked.)
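A sketch of the idea with scikit-learn (assumed available; the instances are hypothetical toy points):

```python
# Sketch: choose the number of clusters k by silhouette analysis.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]  # toy instances

best_k, best_score = None, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # higher = better-separated clusters
    if score > best_score:
        best_k, best_score = k, score

print(best_k)  # 2 for this toy data
```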
§ Euclidean distance

  euclidean(x(1), x(2)) = ( Σi=1..m |xi(1) − xi(2)|² )^½

§ Manhattan distance (aka city block distance)

  manhattan(x(1), x(2)) = Σi=1..m |xi(1) − xi(2)|

(Figures: both distances illustrated between two points in a space X1 × X2.)
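Both distances directly from the formulas, in plain Python:

```python
# Sketch: Euclidean and Manhattan distance between two feature vectors.
import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def manhattan(x1, x2):
    return sum(abs(a - b) for a, b in zip(x1, x2))

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))  # 7.0
```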
§ Semi-supervised learning
• Derive patterns from little training data, then find similar patterns in
unannotated data to get more training data.
§ Reinforcement learning
• Learn, adapt, or optimize a behavior in order to maximize some benefit,
based on feedback provided by the environment.
§ Recommender systems
• Predict missing values of entities based on values of similar entities.
§ Process steps
• Corpus acquisition. Acquire a corpus (and datasets) suitable to study the task.
• Text analysis. Preprocess all instances with existing NLP algorithms, in order to obtain
  information that can be used in features.
• Feature engineering. Identify helpful features on the training set; compute feature vectors
  for each instance on all datasets.
• Machine learning. Train the algorithm on the training set and evaluate it on the validation
  set to optimize hyperparameters. Finally, evaluate on the test set.
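These steps can be sketched end-to-end with scikit-learn (assumed available; the corpus, labels, and model choice are hypothetical placeholders, not the slides' example):

```python
# Sketch: corpus acquisition -> text analysis/features -> learning -> evaluation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts  = ["great movie", "awful plot", "loved it", "boring and bad"] * 10
labels = ["pos", "neg", "pos", "neg"] * 10

# Split the corpus into datasets for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Feature engineering (bag-of-words) and learning combined in a pipeline.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on the held-out test set
```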
§ Domain dependency
• Many algorithms work better in the domain of the training texts than in others.
  (Figure: a classifier trained on domain A and, after domain transfer, applied to domain B.)
§ Efficiency challenges
• Large amounts of data may need to be processed, possibly repeatedly.
• Complex, space-intensive models may be learned.
• Often, several time-intensive text analyses are needed.
§ Robustness challenges
• Datasets for training may be biased.
• Many text characteristics are domain-specific.
• Learned algorithms often capture too much variance (i.e., they overfit).
• Linguistic knowledge from phonetics to pragmatics.
• Empirical methods for development and evaluation.
• Rule-based and statistical (machine-learned) algorithms.
§ Goals of NLP
• Technology that can process natural language.
• Empirical explanations of linguistic phenomena.
• Solutions to problems from the real world.
§ Wachsmuth (2019). Henning Wachsmuth. Introduction to Text Mining. Lecture slides. Winter term, 2019.
https://cs.upb.de/css/teaching/courses/text-mining-w19/