NLP Unit 1: Introduction and Word-Level Analysis
SNLP
UNIT-1
NLP is essentially the way humans communicate with a machine so that it can perform the tasks required of it. This means that the earliest traces of NLP date back to the time when computers were first being developed. Language processing, however, is not limited to computers and can be applied to other machines as well. Around 1950, Alan Turing's machines became famous: during the war Turing had helped build machines to interpret the encrypted signals being sent by the Germans, using logic and other key techniques to understand each message and crack the code. Some of the earliest ideas of natural language processing can be traced to this work of communicating with a machine.
Right after this, researchers started to use machine intelligence to perform several tasks that could make life a lot easier for people. To ensure that a machine performed the tasks it was given, people needed to feed in appropriate data. The 1960s saw a lot of work led by Chomsky and other renowned researchers, who studied language with the aim of developing a universal syntax that people around the world could use to communicate with machines.
Development in this field was not rapid, but there was growing interest in taking it further to automate many manual activities. Progress was slow, and widespread interest in the field really took hold only around the year 1990. The 1990s saw a surge of people developing models using artificial intelligence, and to do so they relied heavily on language processing, which received a boost around that time.
All of this gave people an incentive to use NLP along with machine learning and artificial intelligence to develop probabilistic models that relied heavily on data. Letting machines read and interpret data has kept NLP central right up to the present time. A lot of data came from speech recognition, which allowed people to transfer audio files and store them as data on the internet. In the Internet Age, language processing is used in almost every area around you; social media platforms, which form a major part of life for recent generations, rely heavily on it. In short, without NLP we would not have achieved the technological advancements that we have in recent times.
NLP, as one of the major ways in which a person can communicate with machines to solve problems, comes with a plethora of advantages. However, as advanced as the technology may be, there are always a few challenges that lie ahead and need to be conquered. A little about each of these challenges is given below.
1. Text Summarization
This challenge concerns readers who need to extract information and go through a lot of data in a short time; they would prefer to have a summary of it. AI uses NLP to interpret the text and structure a summary out of the important points. However, many points still get missed. Solving this challenge well would help a great deal in tasks that depend on inspecting large amounts of data.
2. Chatbots
Chatbots are frequently used by people while surfing a website to navigate it and find what they need most. Replies are easy to generate for questions that are asked frequently. However, at a much higher level involving more complex reasoning, not every chatbot implemented with AI can provide reliable answers. The aim is to allow everyone to use this feature to solve their daily problems.
Key Takeaways
NLP has made great advances in the world but still faces many challenges that could limit its use.
Language models (LMs) are used to estimate the likelihood of different words or phrases relative to other words or phrases, in order to improve other NLP applications. Most models based on language modelling use probabilistic inference to calculate the probability of the next word that the user intends to write. A language model can also be used to find the probability of a certain word or phrase in existing data, using a simple relative-frequency estimate:
P(i) = count(i) / n
where 'n' is the total number of occurrences in the given data and 'i' is the instance whose occurrences we need to find.
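A minimal Python sketch of this relative-frequency estimate; the corpus and the word queried here are invented purely for illustration:

from collections import Counter

def relative_frequency(tokens, instance):
    # P(i) = count(i) / n, with n the total number of tokens in the data
    counts = Counter(tokens)
    n = len(tokens)
    return counts[instance] / n if n else 0.0

tokens = "this glass is transparent and this glass is new".split()
print(relative_frequency(tokens, "glass"))   # 2 / 9, roughly 0.22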
Grammar-based Language Model
As the name rightly suggests, this model relies on the grammatical structure that has been given to the machine, together with probabilistic evidence that the sentence has occurred in a similar way before, to make an appropriate prediction. In other words, the way the model interprets a given sentence is grounded in what has happened previously in the data. Let us understand this model further with the help of an example.
To construct or predict the sentence "This glass is transparent" we need to consider all the possible combinations based on the rule given below.
1. P(This) = P(w1)
2. P(glass | This) = P(w2 | w1)
3. P(is | This glass) = P(w3 | w1 w2)
4. P(transparent | This glass is) = P(w4 | w1 w2 w3)
If the model needs to calculate the probability of the above sentence, it multiplies these conditional probabilities together using the chain rule:
P(This glass is transparent) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × P(w4 | w1 w2 w3)
This would give us the probability of the grammatically correct statement which would
then be predicted and shown to the user. Based on this one could build a sentence or
phrase completion model using NLP.
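A minimal sketch of this chain-rule product in Python; the individual conditional probabilities below are invented placeholder values, since the notes do not give real counts:

from functools import reduce

# Assumed (made-up) conditional probabilities for "This glass is transparent";
# in practice each value would be estimated from a corpus.
conditionals = [
    0.16,   # P(This)
    0.04,   # P(glass | This)
    0.60,   # P(is | This glass)
    0.25,   # P(transparent | This glass is)
]

sentence_probability = reduce(lambda acc, p: acc * p, conditionals, 1.0)
print(sentence_probability)   # 0.16 * 0.04 * 0.60 * 0.25 = 0.00096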
Statistical Language Model
As the name suggests, this model is based on statistical inferences made from the data that has been fed into the machine, which then makes the final inferences. The model relies heavily on data and on the input being in a correct and appropriate form. A probability distribution is placed over the required word sequences and is then applied to the textual data to draw inferences from the values assigned to them.
This model also has a probabilistic mechanism for generating text and is therefore known as a 'generative model'. All of the values are sampled from the data that has been entered, to gain appropriate context about everything surrounding the probability distribution. A statistical LM is also context-dependent: the way a certain piece of data is entered and interpreted affects the probabilistic outcome.
This model is extremely useful for quantifying uncertainties that might be lurking in data entered as speech or simply as computer text.
This can be seen clearly in the example below, where the language model needs to predict the next word that a person is going to say or type, in order to save time. Ninety-nine percent of people would agree that "happy" is the correct option. However, the language model does not think the way a normal English-speaking human does: LMs simply use the probability assigned to each candidate word to predict the next one.
Key Takeaways
Grammar-based modelling builds the probability of a sentence up from its smallest units in order to construct or predict the sentence.
Regular expressions form a language that has been developed to help the user specify a search pattern to be found in the data. The pattern needs to be specified to get the desired results.
E.g. if a person wishes to find ten numbers in the data but does not know the exact numbers, he or she can simply specify the pattern that those numbers follow in order to find them.
A regular-expression search can be divided into two parts: the corpus and the pattern.
Corpus: This is the body of text over which the search is run.
Pattern: This is the required pattern entered in the search to obtain the required search results. The basic expressions (operators) used in patterns are given below.
Brackets ("[ ]")
Brackets are used to specify a disjunction of characters. For instance, putting 's' and 'S' inside brackets matches either the lower-case or the capital letter.
E.g. /[sS] = matches either 's' or 'S'
Dash (“-”)
Dash is a regular expression that can specify the range for a particular search. If we
put a dash between two letters it would match all the letters between them.
E.g. /[A-Z] = All upper-case letters between A and Z would match the search.
/[a-z] = All lower-case letters between a and z would match the search.
Caret (“^”)
As in normal syntax, this sign is used to indicate negation in the search, provided it is the first symbol inside the square brackets. Otherwise it is simply treated as a literal 'caret' symbol.
/[^Dd] = indicates a search for any character that is neither 'd' nor 'D'
/[e^] = here the caret is used as a simple caret symbol, since it does not appear first inside the brackets (this matches 'e' or '^')
Question Mark ("?")
The question mark symbol is used to express the optionality of the preceding character in a search. Putting a "?" after a certain letter indicates that the user would like the search to match both the version with that letter and the version without it.
E.g. /colou?r = matches both 'colour' and 'color'
Period (".")
The period (wildcard) symbol "." is used to match any single character at that position in the expression. If the user has an expression that becomes a valid word once one more letter is filled in, the period is very useful.
E.g. /beg.n = will match any word of that shape, such as 'begin', 'began' or 'begun'
Anchors
Anchors are used to match particular positions in the searched text rather than characters: the caret '^' anchors a match to the start of a line and the dollar sign '$' anchors it to the end of a line.
Character Classes
Character classes are shorthand aliases for commonly used bracketed sets, for example \d for any digit, \w for any alphanumeric character or underscore, and \s for any whitespace character.
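A short Python sketch (not part of the original notes) showing the operators above in action with the re module; the sample text and patterns are invented for illustration:

import re

text = "Shells and shells: the colour began, begun, begin to change on line 7."

print(re.findall(r"[sS]hells", text))      # brackets: 's' or 'S'        -> ['Shells', 'shells']
print(re.findall(r"[0-9]", text))          # dash: range of digits       -> ['7']
print(re.findall(r"[^a-z ]", text))        # caret: not lower-case/space -> ['S', ':', ',', ',', '7', '.']
print(re.findall(r"colou?r", text))        # question mark: optional 'u' -> ['colour']
print(re.findall(r"beg.n", text))          # period: any single character -> ['began', 'begun', 'begin']
print(bool(re.search(r"^Shells", text)))   # anchor: start of the string  -> True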
Key Takeaways
Regular Expression is a language that has been developed to mainly match
different patterns in the data.
A Finite State Automaton (FSA) is defined by the 5-tuple
M = { Q, Σ, δ, q0, F }
where Q is the finite set of states, Σ is the input alphabet, δ is the transition function, q0 is the start state and F is the set of final (accepting) states.
Fig. Description 1: The three steps of formation in the process of an FSA.
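A minimal Python sketch of such a 5-tuple in action, using an invented toy automaton that accepts a 'b' followed by one or more 'a's and a final '!'; all state names and transitions here are illustrative, not taken from the notes:

Q  = {"q0", "q1", "q2", "q3"}        # states
S  = {"b", "a", "!"}                 # input alphabet (Sigma)
d  = {                               # transition function (delta)
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q2",
    ("q2", "!"): "q3",
}
q0 = "q0"                            # start state
F  = {"q3"}                          # final (accepting) states

def accepts(string):
    state = q0
    for symbol in string:
        if (state, symbol) not in d:  # no transition defined: reject
            return False
        state = d[(state, symbol)]
    return state in F                 # accept only if we end in a final state

print(accepts("baaa!"))   # True
print(accepts("baaa"))    # False (does not end in a final state)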
Key Takeaways
A Finite State Automaton is fully specified by its set of states, input alphabet, transition function, start state and set of final states.
English Morphology is essentially the process of structuring the language to form new words from base words, which can then be used further in NLP. This type of morphology is also known as 'derivational morphology'. We can achieve it in several ways, as follows:
We can pair the stem (root word) with different suffixes and affixes to make a new word that may be relevant to our language processing. A word derived in this way from a pre-existing one is the product of derivational morphology, and the English language has a great many words that were formed like this. English derivation, however, is complex. The reasons are:
1. Different meanings: A lot of words share the same root word but differ in their true meaning. For example, take the word "conform". A person can add two different suffixes to form the words "conformity" and "conformation"; however, these words have completely different meanings despite stemming from the same root word.
2. Non-uniform effects: A given affix does not always behave the same way across words formed in the same manner. Take for example the verb "summarize": adding the nominalizing suffix "-ation" forms the noun "summarization" with a very similar meaning, but the same pattern does not hold uniformly across the entire plethora of the English language.
Key Takeaways
English Morphology is used by the system in order to structure the entire language and
make some sense of it.
A finite-state transducer (FST) T ⊆ L(input) × L(output) defines a relation between two languages, one for input and one for output. FSTs have several applications, which are given below.
1. Recognizer
The transducer takes a pair of strings and decides whether to accept or reject the pair based on certain criteria.
2. Generator
The transducer can also give a pair of strings (viz. L(input) and L(output)) as output for a certain
language that is required.
3. Translator
This process involves reading a string that the user has supplied as input. The FST then uses morphological parsing to translate the input into its component morphemes.
4. Relater
FSTs are also useful as a medium for relating different kinds of sets to one another.
Lexicon
Many words in a language are built from existing meaningful words, essentially by taking a base word and adding an affix to it; the result is a word formed through morphology. Most such formations are governed by rules, known as "orthographic rules", and these form the main basis on which morphological parsing takes place. Morphological parsing is the process of breaking a complex surface form down into its 'morphemes', the units from which meaningful words are built.
A "Lexicon" is essentially a list of stems (base words) together with the affixes that can combine with them to form new words. A lexicon should therefore list the base forms from which different types of words can be built, as in the sketch below.
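A minimal sketch of a lexicon-driven morphological parse in Python; the stems, suffixes and tags below are invented for illustration, and a real system would use a finite-state transducer rather than plain dictionary lookup:

lexicon_stems = {"dog": "N", "cat": "N", "walk": "V", "conform": "V"}
lexicon_suffixes = {"s": "+PL", "ed": "+PAST", "ity": "+NOM", "ation": "+NOM"}

def parse(word):
    """Return (stem, class, feature) analyses that the lexicon licenses."""
    analyses = []
    for stem, word_class in lexicon_stems.items():
        if word == stem:
            analyses.append((stem, word_class, ""))
        elif word.startswith(stem) and word[len(stem):] in lexicon_suffixes:
            analyses.append((stem, word_class, lexicon_suffixes[word[len(stem):]]))
    return analyses

print(parse("dogs"))        # [('dog', 'N', '+PL')]
print(parse("conformity"))  # [('conform', 'V', '+NOM')]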
Key Takeaways
A “Lexicon” is essentially a list of words that can be used as the stem and affixes that would
suit them to form a new word.
1.7 Tokenization
Before a user proceeds with processing the language, the input text, in whatever form, needs to undergo a process of normalization. This ensures that the text is easy to distinguish and work on before any further processing takes place. The first step of the normalization process is known as tokenization, or 'segmentation' of words.
Word tokenization can be achieved very easily on a UNIX command line for a text that is solely in English. The process of breaking a given string into smaller parts (tokens) is known as tokenization. Some of the basic commands used for this are as follows.
1. “tr”: This command is used to make systematic changes to the characters that are already
part of the input.
2. “sort”: This command as the name suggests can help to sort the lines which are part of the
input in alphabetical order for better understanding and inference.
3. "uniq": This command collapses adjacent identical lines of the input into a single line (and can report how often each line occurred).
A simple method that can be used to start the process of tokenization is to split the text on whitespace. In the Python programming language, one can simply split the raw string with the command raw.split(); the NLTK library also provides dedicated tokenizers. Splitting only on the space character, however, misses other whitespace; using a regular expression lets us match tabs and newlines as well as spaces, as in the sketch below.
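A brief Python sketch of these splitting options; the sample string is invented, and the nltk call is left commented out since it assumes the library and its tokenizer data are installed:

import re

raw = "Tokenization isn't hard.\nIt just\tneeds care."

print(raw.split())                 # plain split on any whitespace
print(re.split(r"[ \t\n]+", raw))  # regex split on spaces, tabs and newlines

# A dedicated tokenizer also separates punctuation and contractions:
# from nltk.tokenize import word_tokenize
# print(word_tokenize(raw))        # e.g. ['Tokenization', 'is', "n't", 'hard', '.', ...]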
Problems with tokenization:
Tokenization may be a great way to analyse and break down complex language structures, but it can turn out to be a difficult task to execute. A single solution may not always serve the purpose as efficiently as one would expect.
Deciding what should be grouped as a token and what should not is key to enabling users to work easily with tokenization.
Keeping some manually tokenized raw data to compare with the output of the tokenizer helps to inspect flaws and finally settle on appropriate tokens.
Contractions pose a particular problem, e.g. "couldn't". When analysing sentence meaning it can be better to break such a word down into two separate tokens.
Key Takeaways
Tokenization, or segmentation, is the first step of text normalization: it breaks raw text into smaller units (tokens) for further processing.
While typing text into the computer, it matters that people type the words correctly. Most of the time words are entered appropriately into the corpus. However, if at some point people do not understand a word being dictated to them, they may attach an incorrect spelling to it. The computer then has to deal with this before the corpus can be examined for further analysis. This is one of the main reasons why the detection of spelling errors is essential for any corpus.
In order to get an idea of what errors could occur, we need to make a list of them. Errors that can occur in a corpus include:
Ignoring grammar rules
There are also different kinds of typing errors that are important to know. The machine has to get a good idea of the following before letting any of the data into the sorting algorithm.
Deletion – a letter is missing from the string.
Substitution – a certain letter or letters in the string need to be replaced by another letter or letters.
Transposition – all the letters are present but in a different order; letters in the string can be swapped with each other to get a meaningful word.
In order to detect spelling issues, there are two models that can be employed, depending on the corpus in question. They are:
Direct Detection
After working through a lot of corrections, the system is trained with a list of all the words that have been corrected before. Running the corpus against this list can catch many commonly misspelled words; words that check out against the list are considered acceptable (see the sketch below).
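A minimal sketch of such a list-based check in Python; the word list and past corrections are invented for illustration:

known_words = {"drink", "milk", "please", "submit", "your", "assignment"}
corrections = {"mlik": "milk", "submitt": "submit"}   # learned from past fixes

def check(tokens):
    report = []
    for token in tokens:
        if token in known_words:
            report.append((token, "ok"))
        elif token in corrections:
            report.append((token, "corrected to " + corrections[token]))
        else:
            report.append((token, "unknown - flag for review"))
    return report

print(check("please drink mlik".split()))
# [('please', 'ok'), ('drink', 'ok'), ('mlik', 'corrected to milk')]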
Key Takeaways
Some of the basic operators that are used in regular expressions are Brackets, Dash, Caret,
Question mark, Concatenation symbols, Anchors, etc.
Just as the name suggests, this topic is about comparing two different strings. Measuring the similarity between two or more phrases (strings) is a major part of NLP. An example can be seen in everyday life: while typing a message on your mobile you notice the autocorrect feature. In short, if the user types a misspelling such as 'kangaro' instead of 'kangaroo', the system tries to work out what exactly the user meant. The user has made a mistake and intended a word similar to the one typed. Based on simple NLP, the system decides that 'kangaroo' is the appropriate spelling and makes the change automatically to save the user time and effort.
At times the spelling is not the only thing that is different about the string. Many a time people
have two strings that differ by only a single word. E.g.
1. “ Andrew is a football player.”
2. “Andrew is a good football player.”
On closer observation, we notice that the two sentences are very similar to each other and differ by only a single word. This tells us that the two strings might refer to the same thing. Now that we have an idea of the similarities that can occur between two strings, we can move on to the definition of "Edit Distance", or simply "Minimum Edit Distance".
Edit Distance is the distance between two strings in terms of their similarities and differences. It
also refers to the minimum number of operations that the user would have to execute to make two
or more strings alike (transform S1 into S2). The operations could include deletion and addition of
words as well as the substitution in certain cases to achieve the end string product.
We can look at the following example to understand the case of the two words 'execution' and 'intention'.
The alignment between the two words is depicted in the figure above. We notice that many letters in one word resemble letters in a very similar position in the word we intend to transform it into, and that the transformation can be made with the help of three distinct operations, viz. deletion, insertion, and substitution. This is a visual depiction of how knowledge of the minimum edit distance can help in text transformation and editing.
We now have a clear idea of exactly what the minimum edit distance is in NLP. However, to calculate it we need to get familiar with an algorithm known as the "Minimum Edit Distance Algorithm", which finds the cheapest sequence of the simple operations mentioned above that gets us to the intended word. A small sketch of this algorithm is given below.
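A compact Python sketch of the minimum edit distance algorithm using dynamic programming; unit costs are assumed for insertion, deletion and substitution, so the distance from 'intention' to 'execution' comes out as 5:

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                        # deletions only
    for j in range(1, m + 1):
        dist[0][j] = j                        # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[n][m]

print(min_edit_distance("intention", "execution"))   # 5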
Key Takeaways
The minimum Edit distance refers to the least number of actions that it takes to convert
string 1 into string 2. We can achieve this through the minimum edit distance algorithm.
N-gram models make such predictions very easily. Users benefit because the next word offered to complete a statement is the one that has appeared most often in that position, thanks to the probability assigned to it. E.g. given the sentence "Please submit your ______.", the next word is far more likely to be 'assignment' than 'house'.
Spellings also need correcting from time to time, for example when someone types quickly. A feature like autocorrect can be based on how often one word occurs after another. E.g. if the sentence "drink mlik" is typed due to some error, the machine will correct it to 'milk', changing the spelling, because the word 'milk' has a much higher chance of coming after the word 'drink'.
Now that we understand the uses of an N-gram model, let us move forward and get a formal understanding of it. An N-gram model essentially predicts the next word from the occurrences of the previous (N-1) words. To understand this better, a few terms are introduced below which describe the type of model we will be working with; they help us figure out the next word by picking the required data from the preceding words.
Bi-Gram Model – This means that there are 2 words (N=2) and we need to go back to (N-1)
= (2-1) = 1 word to predict the next word accurately.
Tri-Gram Model – This means that there are 3 words (N=3) and we need to go back to (N-1)
= (3-1) = 2 words to predict the next word accurately.
To see how the model works, we assign a probability to each word in the model. Depending on the sentence, the user can choose a bigram, trigram, or even higher-order model for more complex strings. To illustrate this we need a 'corpus', i.e. a collection of small sentences. Let us look at a few small sentences to illustrate this property (a small sketch is given below).
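A small Python sketch of a bigram (N = 2) model estimated from a toy corpus; the sentences are invented, and the probabilities are plain maximum-likelihood counts P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1}):

from collections import Counter, defaultdict

corpus = [
    "please submit your assignment",
    "please drink your milk",
    "submit your assignment today",
]

unigrams = Counter()
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def bigram_prob(prev, nxt):
    return bigrams[prev][nxt] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("your", "assignment"))  # 2/3
print(bigram_prob("your", "milk"))        # 1/3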
Key Takeaways
The N-gram model uses probabilities estimated from the corpus that is used to train the model.
When we have a model like the N-gram model, we must know how effective it is in practice as well as in hypothetical cases. If the application that embeds the model improves when the model is changed, we can be confident that its performance is improving. To analyse the model in this way we conduct what is known as 'extrinsic evaluation'.
Extrinsic evaluation works end to end, from the model's input to the application's output. We simply run the whole application twice, with a slight tweak to the model, to get a full picture of how the model behaves, and once both runs are done we compare them. However, the worst part about a test like this is the cost: running an NLP system end to end twice just for testing can be very expensive.
Because of the cost and other factors that make the previous approach accurate but not very practical in the long run, a different method is used. It fixes a metric that is the same for all models, so that once a model is trained we can compare the relevant statistics directly, without embedding the model in any particular application. Such testing of a model is known as 'intrinsic evaluation'.
To use intrinsic evaluation we need to train on our corpus, and we also need what is known as a 'test set'. Just like many other tests in statistics, evaluating an N-gram model relies heavily on a corpus, so the corpus is divided into a 'training set' and a 'test set' whose data is unseen by the system. This lets the user measure how the model handles data it was not trained on, including features that might appear in future datasets.
Once we tune our model repeatedly against the same held-out data, the algorithm becomes tuned explicitly to the features present in that data, and it handles new features that could appear in future datasets poorly. This is why people have come up with the 'development test set', or 'devset', to solve the problem. A common split used in practice is 80% for the training data, 10% for the test data, and the remaining 10% for the development set, as in the sketch below.
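A tiny sketch of that 80/10/10 split in Python; the corpus here is a dummy list, and in practice the data would usually be shuffled before splitting:

corpus = [f"sentence {i}" for i in range(100)]

n = len(corpus)
train = corpus[: int(0.8 * n)]                 # 80% training set
dev   = corpus[int(0.8 * n): int(0.9 * n)]     # 10% development (dev) set
test  = corpus[int(0.9 * n):]                  # 10% test set

print(len(train), len(dev), len(test))         # 80 10 10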
Key Takeaways
An N-gram model is trained on a training set and evaluated on held-out test data, with a development set used for tuning.
1.12 Smoothing
Many times while testing our model we come across words that are unknown to it, or words that are meaningful but occurred in the training data rarely or not at all. Because of this, the algorithm may assign them zero probability, and if that happens our whole result can become null and void, since a product of probabilities containing a zero is zero. The process that lets us assign some probability mass to such events is known as smoothing or discounting. A few ways of smoothing are given below.
Laplace Smoothing
Before we normalize the counts into probabilities, we add 1 to every n-gram count, so all the counts go up by one: e.g. 1 becomes 2, 3 becomes 4, and a count of 0 becomes 1. This is the simplest form of smoothing; it is not of great use in modern N-gram models, but it does a great job of introducing the concepts that are key to learning other types of smoothing.
Let us get an idea of this smoothing with the help of a unigram model,
where w = a word,
c(w) = the count of the word w in the corpus,
N = the total number of words/tokens (after tokenization).
Therefore, the unsmoothed (maximum-likelihood) estimate is P(w) = c(w) / N.
Now, as explained above, Laplace smoothing increases each word count by 1. After making this increment, we also have to adjust the denominator to account for the W distinct word types in the vocabulary.
Therefore, P_Laplace(w) = (c(w) + 1) / (N + W)
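A minimal sketch of this unigram Laplace estimate in Python; the toy sentence is invented for illustration:

from collections import Counter

tokens = "please submit your assignment please submit your report".split()
counts = Counter(tokens)
N = len(tokens)              # total tokens (8)
W = len(counts)              # vocabulary size (5)

def p_laplace(word):
    # P_Laplace(w) = (c(w) + 1) / (N + W)
    return (counts[word] + 1) / (N + W)

print(p_laplace("assignment"))   # seen once:  (1 + 1) / (8 + 5)
print(p_laplace("homework"))     # unseen:     (0 + 1) / (8 + 5)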
Key Takeaways
Smoothing (for example Laplace, or add-one, smoothing) assigns a small amount of probability mass to unseen or rare words so that no event ends up with zero probability.
1.13 Interpolation and Back off – Word Classes, Part-of-Speech Tagging, Rule-
based, Stochastic, and Transformation-based Tagging
We learned about the smoothing that is used to solve the problem of new words receiving no probability. However, there are other problems to take care of, shown below. Some problems that would normally require a trigram can instead be solved using a bigram, and a few bigram problems can likewise be solved using unigram models.
Using less context in this way is a good way to let the model generalize. There are two different ways of doing this: interpolation and back off. In simple linear interpolation, the trigram, bigram and unigram estimates are mixed with fixed weights λ1, λ2, λ3 that sum to 1:
P_interp(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1)
There is also a slightly more complex version of the same idea in which each λ is set according to the context in which it is used. The assumption made here is that when the trigram counts are reliable they should be trusted more than the bigram and unigram counts. Hence, we get a new modified equation as shown below:
P_interp(wn | wn-2 wn-1) = λ1(wn-2 wn-1) P(wn) + λ2(wn-2 wn-1) P(wn | wn-1) + λ3(wn-2 wn-1) P(wn | wn-2 wn-1)
When the higher-order count is non-zero, interpolation still uses the data given by the lower-order n-gram models.
When the higher-order count is non-zero, back off does not use the data given by the lower-order n-gram models; it falls back to them only when the higher-order count is zero.
One can create an interpolated version of a back off algorithm, and likewise a back off version of an interpolated algorithm. A small sketch contrasting the two approaches is given below.
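A rough Python sketch contrasting the two strategies, assuming unigram, bigram and trigram estimates are already available; the λ weights are invented, and a proper back off (e.g. Katz) would also discount probability mass, which is omitted here:

def interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    # Interpolation always mixes all three estimates,
    # even when the trigram estimate is non-zero.
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

def backoff(p_uni, p_bi, p_tri):
    # Back off uses a lower-order estimate only when the
    # higher-order estimate is zero.
    if p_tri > 0:
        return p_tri
    if p_bi > 0:
        return p_bi
    return p_uni

print(interpolated(0.02, 0.10, 0.00))   # still uses unigram and bigram evidence
print(backoff(0.02, 0.10, 0.00))        # falls back to the bigram estimate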
1.13.3 Word Classes
When we look at the topic of word classes, we find that the English language gives a very simple intuition for them. Good knowledge of the different word classes helps us understand the morphology and other lexical analysis happening in language processing. We can start with the simple examples of word classes given below and then move forward.
Noun – N
Verb – V
Adjective – A
Adverb – a
Closed Classes
Class          Tag   Examples
Determiner     DT    the, a, an
Dollar sign    $     $
Pound sign     #     #
Left quote     “     ‘ , “
Right quote    ”     ’ , ”
Comma          ,     ,
Stage One – A lexicon (dictionary) is used to assign each word a list of its potential parts of speech.
Stage Two – A list of hand-written rules is then used to narrow this list down and assign each word a single POS.
Properties
Stochastic tagging is based on statistical inferences drawn from a tagged corpus in order to tag the words that are part of new text.
A training set has to be supplied to the model so that it can learn from instances that have occurred in the past.
A probability is also assigned to words that do not appear in the training corpus.
The training and testing corpora are different from each other.
The simplest version just uses, for each word, the tag that appeared most frequently for it during training on the training corpus (see the sketch below).
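A small sketch of the simplest "most frequent tag" approach mentioned above; the tiny hand-tagged corpus and tag names are invented for illustration:

from collections import Counter, defaultdict

tagged_corpus = [
    [("the", "DT"), ("dog", "N"), ("runs", "V")],
    [("the", "DT"), ("dog", "N"), ("barks", "V")],
    [("a", "DT"), ("run", "N")],
]

tag_counts = defaultdict(Counter)
for sentence in tagged_corpus:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def tag(words, default="N"):
    # Assign each word its most frequent training tag; unseen words get a default tag.
    return [(w, tag_counts[w].most_common(1)[0][0] if tag_counts[w] else default)
            for w in words]

print(tag(["the", "dog", "runs", "quickly"]))
# [('the', 'DT'), ('dog', 'N'), ('runs', 'V'), ('quickly', 'N')]  <- default tag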
Key Takeaways
POS tagging assigns a part-of-speech label to each of the tokens that have been formed after the process of tokenization.
Rules for Rule-Based tagging
The taggers that are used are all driven by knowledge.
The rules involved in finalizing the POS are all made manually.
1.14 Issues in PoS Tagging – Hidden Markov and Maximum Entropy Models
1.14.1 Hidden Markov Model (HMM)
Visible Events
These are the events that we know are happening and can physically or virtually picture. In this case it is the tossing of coins: we are sure that coins are being tossed, with each toss showing either 'heads' or 'tails' as the outcome.
Hidden Events
These are the factors that are completely hidden from the user. In this case they could be the number of coins used in the problem, or the order in which the coins are tossed in case they are different or weighted. One can construct many different models to understand this process better; one form of the model is given below.
Fig. 2 Description: The hidden events during the process, as shown in the diagram.
In an HMM tagger the hidden states are the tags, and the sequence of hidden states generates the outputs: each hidden state (tag) emits an output (a word) that is observable to the user, as sketched below.
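A condensed Python sketch of this idea, with a tiny Viterbi search over two tags; the transition and emission probabilities are invented toy numbers, not values from the notes:

states = ["N", "V"]
start = {"N": 0.6, "V": 0.4}                         # P(tag at the first position)
trans = {("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}           # P(tag_i | tag_{i-1})
emit  = {("N", "dogs"): 0.5, ("V", "dogs"): 0.1,
         ("N", "bark"): 0.2, ("V", "bark"): 0.6}     # P(word | tag)

def viterbi(words):
    # v[t][s] holds the best probability of any tag path ending in state s at time t
    v = [{s: start[s] * emit.get((s, words[0]), 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: v[t - 1][p] * trans[(p, s)])
            v[t][s] = v[t - 1][prev] * trans[(prev, s)] * emit.get((s, words[t]), 1e-6)
            back[t][s] = prev
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for t in range(len(words) - 1, 0, -1):           # trace the best path backwards
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["dogs", "bark"]))   # ['N', 'V'] with these toy probabilities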
1.14.2 Maximum Entropy Markov Model (MEMM)
This model is used as an alternative to the Hidden Markov Model above. In this model, the separate transition and emission probabilities are replaced by a single function that we define. The model tells us the probability of the current state s given the previous state s' and the current observation o: P(s | s', o).
The main difference to become familiar with is that in an HMM the observation depends only on the current state, whereas an MEMM conditions its prediction on the current observation as well as the previous state. This can give better accuracy over the entire tagging process. The difference can also be seen in the diagram given below.
Fig. Description: Comparison between the Hidden Markov Model and the Maximum Entropy
Markov Model
Disadvantages of MEMM’s
1. MEMM models suffer from the label bias problem. Because the transitions leaving each state are normalized locally, they compete only with each other; states with few outgoing transitions therefore pass on most of their probability mass regardless of the observation that follows, which biases the model toward them.
There are a few example words, such as "rib" and "rob", which behave a little differently from other words when fed to the algorithm in a partly observed form. When the input is "r_b" there is one path through states 0-1 and another through states 0-4, and the probability mass ends up split between them. Both paths are equally likely, so the model cannot settle on one fixed word to predict correctly.
HMM and MEMM models are both used for tagging; each has its pros and cons, and the choice depends on the model that needs to be adopted.
Perplexity
One of the methods used to evaluate language models correctly is 'perplexity'. We cannot use raw probabilities directly as a metric when evaluating language models; instead we use a variant known as perplexity (PP). The perplexity of an NLP model on a test set is the probability the model assigns to that set, normalized by the number of words.
Let us say that the test set is W = w1.w2.w3.w4……..wN. Then the perplexity is
PP(W) = P(w1.w2.w3…….wN) ^ (-1/N)
The probability is expanded with the help of the chain rule, followed by the assumption that the model is a 2-gram (bigram) model as given above. The result is the perplexity, which gives us a clear basis for evaluating a model. Under the bigram assumption the formula becomes:
PP(W) = [ Π (i = 1 to N) 1 / P(wi | wi-1) ] ^ (1/N)
The final conclusion is that, because perplexity is based on an inverse probability, the higher the conditional probability of W, the lower the perplexity (PP).
We can also understand perplexity as a 'weighted average branching factor': the number of words that can follow any given word. Let us take a simple example of a 'language' whose vocabulary is the first 20 numbers, so the set is S = {1, 2, 3, 4, …, 20}, where each occurs with probability P = 1/20. If we treat this as a language, the measure of perplexity comes out neatly: as mentioned before, perplexity is an inverse probability, which is precisely why the answer is inverted as shown below.
We know, Perplexity = PP(S) = P(s1.s2.s3…….sN) ^ (-1/N)
= { [ (1/20)^N ] ^ (-1/N) }
= ( 1/20 ) ^ -1
= 20
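A quick numeric check of this digit example in Python; the sequence length N below is arbitrary, since the perplexity of 20 equally likely symbols is 20 regardless of N:

def perplexity(sequence_probability, n_words):
    # PP(W) = P(w1 ... wN) ^ (-1/N)
    return sequence_probability ** (-1.0 / n_words)

N = 50                        # arbitrary length of the test sequence
p_sequence = (1 / 20) ** N    # each of the 20 symbols has probability 1/20
print(perplexity(p_sequence, N))   # 20.0 (up to floating-point error)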
Key Takeaways
A Hidden Markov Model (HMM) is a stochastic tagging model in which the main process (the tag sequence) is hidden and underlies the observed words.
They are based on 2 main factors viz. visible events and hidden events.
References
1. Fundamentals of Speech Recognition by Lawrence Rabiner
2. Speech and Language Processing by Daniel Jurafsky and James H. Martin