
Natural Language Processing 5


Module 1: NLP

Aarti Dharmani
Estimate bigram probabilities
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>

P(I|<s>) =
P(Sam|<s>) =
P(am|I) =
P(</s>|Sam) =
P(Sam|am) =
P(do|I) =
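
As a minimal sketch of how these estimates can be obtained, the snippet below counts unigrams and bigrams in the three sentences above and applies the maximum likelihood estimate P(w | prev) = count(prev, w) / count(prev). The helper name `bigram_prob` is illustrative and not part of the slides.

```python
from collections import Counter

# The three training sentences from the slide, with boundary markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    # Pair each token with its successor to get the bigrams.
    bigrams.update(zip(tokens, tokens[1:]))


def bigram_prob(word, prev):
    """MLE estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]


queries = [("I", "<s>"), ("Sam", "<s>"), ("am", "I"),
           ("</s>", "Sam"), ("Sam", "am"), ("do", "I")]
for word, prev in queries:
    print(f"P({word}|{prev}) = {bigrams[(prev, word)]}/{unigrams[prev]}"
          f" = {bigram_prob(word, prev):.3f}")
```

Printing the raw counts next to each ratio makes it easy to check the hand calculation against the program.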
Given the bigram counts and unigram counts of a dataset

Bigram counts (row = first word, column = second word):

           i     want   to     eat    chinese  food   lunch  spend
i          5     827    0      9      0        0      0      2
want       2     0      608    1      6        6      5      1
to         2     0      4      686    2        0      6      211
eat        0     0      2      0      16       2      42     0
chinese    1     0      0      0      0        82     1      0
food       15    0      15     0      1        4      0      0
lunch      2     0      0      0      0        1      0      0
spend      1     0      1      0      0        0      0      0

Unigram counts:

i      want   to     eat    chinese  food   lunch  spend
2533   927    2417   746    158      1093   341    278
Calculate the probability of a sentence
• P(I want chinese food to eat) = ?

• P(I) × P(want|I) × P(chinese|want) × P(food|chinese) × P(to|food) × P(eat|to) = ?
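
The conditional terms of this product can be read off the count tables. The sketch below is one way to do that in code: each P(w | prev) is the bigram count divided by the unigram count of prev. The first factor, P(I), needs either the total number of tokens or a sentence-start count, and neither appears on the slide, so the code computes only the product of the five conditional terms and leaves P(I) out. The names `cond_prob` and `conditional_product` are illustrative.

```python
words = ["i", "want", "to", "eat", "chinese", "food", "lunch", "spend"]

# Rows are the first word, columns are the second word (same order as `words`).
bigram_counts = [
    [5, 827, 0, 9, 0, 0, 0, 2],     # i
    [2, 0, 608, 1, 6, 6, 5, 1],     # want
    [2, 0, 4, 686, 2, 0, 6, 211],   # to
    [0, 0, 2, 0, 16, 2, 42, 0],     # eat
    [1, 0, 0, 0, 0, 82, 1, 0],      # chinese
    [15, 0, 15, 0, 1, 4, 0, 0],     # food
    [2, 0, 0, 0, 0, 1, 0, 0],       # lunch
    [1, 0, 1, 0, 0, 0, 0, 0],       # spend
]
unigram_counts = [2533, 927, 2417, 746, 158, 1093, 341, 278]

index = {word: position for position, word in enumerate(words)}


def cond_prob(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[index[prev]][index[word]] / unigram_counts[index[prev]]


sentence = "i want chinese food to eat".split()
conditional_product = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p = cond_prob(word, prev)
    print(f"P({word}|{prev}) = {p:.5f}")
    conditional_product *= p

# Multiply this by P(i) once that value is known.
print("Product of the conditional terms:", conditional_product)
```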
Regular Expressions
Regular expressions provide a powerful, flexible, and efficient method
for processing text.
The extensive pattern-matching notation of regular expressions enables
you to quickly parse large amounts of text to:
• Find specific character patterns.
• Validate text to ensure that it matches a predefined pattern (such as
an email address).

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
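
As a small illustration, the email pattern above can be tried out with Python's `re` module. This is only a sketch: the function name `is_valid_email` and the sample addresses are made up for the example, and the pattern is a simplified check rather than a complete validator for real-world addresses.

```python
import re

# The pattern from the slide: local part, "@", domain, ".", 2 to 5 letters.
EMAIL_RE = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")


def is_valid_email(text):
    """Return True if `text` matches the slide's email pattern."""
    return EMAIL_RE.match(text) is not None


for candidate in ["jane.doe@example.com", "jane.doe@example", "jane@example.c"]:
    print(candidate, "->", is_valid_email(candidate))

# The three parenthesised groups capture the local part, domain, and suffix.
match = EMAIL_RE.match("jane.doe@example.com")
print(match.groups())
```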
Elements of Regular Expressions
1. Repeaters ( *, +, and { } )
These symbols act as repeaters: they tell the regex engine how many times the
preceding character (or group) may occur.

2. The asterisk symbol ( * )


It tells the computer to match the preceding character (or set of characters) zero or
more times (with no upper limit).

3. The Plus symbol ( + )


It tells the computer to repeat the preceding character (or set of characters) one or
more times (with no upper limit).
4. The curly braces { … }
It tells the computer to repeat the preceding character (or set of characters) exactly
as many times as the value inside the braces. The form {m,n} allows between m and n
repetitions.
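
The three repeaters can be compared side by side with a short sketch in Python's `re` module. The helper `matches` is an illustrative name; it uses `re.fullmatch`, so the whole string has to fit the pattern.

```python
import re


def matches(pattern, text):
    """Return True if the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None


# * : zero or more b's between a and c
print(matches(r"ab*c", "ac"), matches(r"ab*c", "abbbc"))

# + : at least one b between a and c
print(matches(r"ab+c", "ac"), matches(r"ab+c", "abc"))

# {2} : exactly two b's
print(matches(r"ab{2}c", "abbc"), matches(r"ab{2}c", "abc"))

# {2,3} : two or three b's
print(matches(r"ab{2,3}c", "abbbc"), matches(r"ab{2,3}c", "abbbbc"))
```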

5. Wildcard ( . )
The dot symbol matches any single character (by default, any character except a
newline), which is why it is called the wildcard character.
6. Optional character ( ? )
This symbol tells the computer that the preceding character may or may not be
present in the string to be matched.
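
A brief sketch of the wildcard and the optional character, again using Python's `re` module. The helper `matches` and the sample words are chosen only for illustration.

```python
import re


def matches(pattern, text):
    """Return True if the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None


# . stands for exactly one character of any kind (except a newline).
print(matches(r"c.t", "cat"), matches(r"c.t", "c9t"), matches(r"c.t", "ct"))

# ? makes the preceding character optional: "u" may or may not be present.
print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"),
      matches(r"colou?r", "colouur"))
```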

7. The caret ( ^ ) symbol ( Setting position for the match )


The caret symbol tells the computer that the match must start at the beginning of
the string or line.

8. The dollar ( $ ) symbol


It tells the computer that the match must occur at the end of the string or before \n
at the end of the line or string.
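
The two anchors can be illustrated with `re.search`, which looks for the pattern anywhere in the text unless an anchor pins it down. This is a sketch of Python's behaviour; other regex engines may treat line endings slightly differently. The helper name `found` is illustrative.

```python
import re


def found(pattern, text, flags=0):
    """Return True if `pattern` occurs somewhere in `text`."""
    return re.search(pattern, text, flags) is not None


# ^ : the match must start at the beginning of the string.
print(found(r"^Sam", "Sam I am"), found(r"^Sam", "I am Sam"))

# $ : the match must end at the end of the string,
# or just before a newline that ends the string.
print(found(r"Sam$", "I am Sam"), found(r"Sam$", "I am Sam\n"),
      found(r"Sam$", "Sam I am"))

# With re.MULTILINE, ^ also matches at the start of every line.
first_words = re.findall(r"^\w+", "I am Sam\nSam I am", re.MULTILINE)
print(first_words)
```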
9. Character Classes
A character class matches any one of a set of characters. It is used to match the
most basic element of a language like a letter, a digit, a space, a symbol, etc.
10. [^set_of_characters] Negation:
Matches any single character that is not in set_of_characters. By default, the match
is case-sensitive.

11. [first-last] Character range:


• Matches any single character in the range from first to last.
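
Character classes, negation, and ranges can be seen together in a short sketch. `re.findall` returns every match, so each call below lists the single characters that the class accepts; the sample strings are illustrative.

```python
import re

# [aeiou] : any one of the listed characters
vowels = re.findall(r"[aeiou]", "green eggs")
print(vowels)

# [^aeiou ] : any one character that is NOT a vowel or a space
non_vowels = re.findall(r"[^aeiou ]", "green eggs")
print(non_vowels)

# [0-9] and [a-c] : any one character in the given range
digits = re.findall(r"[0-9]", "room 42b")
first_letters = re.findall(r"[a-c]", "abcdef")
print(digits, first_letters)

# Matching is case-sensitive by default: "S" is not in a-z.
not_lowercase = re.findall(r"[^a-z]", "Sam")
print(not_lowercase)
```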

12. The Escape Symbol ( \ )


To match a literal ‘+’, ‘.’ or other special character, add a backslash ( \ ) before
that character. This tells the computer to treat the following character as a
literal character to search for, not as a regular expression symbol.
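
The effect of the backslash is easiest to see by running the same search with and without it. The sketch below uses Python's `re` module; `re.escape` is a standard-library helper that adds the backslashes automatically, and the sample strings are illustrative.

```python
import re

text = "1+1=2 and 111"

# Unescaped: + is a repeater, so the pattern means "one or more 1s, then a 1".
unescaped = re.findall(r"1+1", text)
print(unescaped)

# Escaped: \+ is a literal plus sign.
escaped = re.findall(r"1\+1", text)
print(escaped)

# re.escape builds a pattern that matches the given text literally.
literal_pattern = re.escape("a.b")
print(re.fullmatch(literal_pattern, "a.b") is not None)
print(re.fullmatch(literal_pattern, "axb") is not None)
```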
13. Grouping Characters ( )
Several symbols of a regular expression can be grouped together to act as a single
unit and behave as a block. To do this, wrap that part of the regular expression in
parentheses ( ).

14. Vertical Bar ( | )


Matches any one of the alternatives separated by the vertical bar (|) character.
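
Grouping and the vertical bar are often used together. The sketch below shows a repeater applied to a whole group, and alternation both inside and outside a group; the helper `matches` and the sample words are illustrative.

```python
import re


def matches(pattern, text):
    """Return True if the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Without a group, + applies only to "b"; with a group, it applies to "ab".
print(matches(r"ab+", "abbb"), matches(r"ab+", "ababab"))
print(matches(r"(ab)+", "ababab"), matches(r"(ab)+", "aba"))

# | chooses between alternatives; the group limits its scope to one letter.
print(matches(r"gr(e|a)y", "grey"), matches(r"gr(e|a)y", "gray"),
      matches(r"gr(e|a)y", "groy"))

# Without a group, | separates the two whole words.
print(matches(r"cat|dog", "dog"), matches(r"cat|dog", "cow"))
```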
Write Regular Expressions for the following cases
1. Mobile number: should start with 8 or 9 and have 10 digits in total
2. First character uppercase, contains lowercase letters, only one
digit allowed in between
3. Email id, say [email protected]
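
One possible attempt at cases 1 and 2 is sketched below, and it can be used to check your own answers. Case 2 is open to interpretation; the pattern here assumes an uppercase first letter, then lowercase letters, with at most one digit that must have lowercase letters on both sides. The names `MOBILE_RE` and `NAME_RE` and the test strings are illustrative, not part of the exercise.

```python
import re

# Case 1: first digit is 8 or 9, followed by exactly nine more digits.
MOBILE_RE = re.compile(r"^[89][0-9]{9}$")

# Case 2 (assumed reading): uppercase letter, lowercase letters,
# then optionally one digit followed by more lowercase letters.
NAME_RE = re.compile(r"^[A-Z][a-z]+([0-9][a-z]+)?$")


def is_mobile(text):
    return MOBILE_RE.match(text) is not None


def is_name(text):
    return NAME_RE.match(text) is not None


for number in ["9876543210", "7876543210", "98765"]:
    print(number, "->", is_mobile(number))

for word in ["Hello", "Hel5lo", "Hel55lo", "hello", "Hello5"]:
    print(word, "->", is_name(word))
```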
