Assignment 2

Vinay Kumar Chandra

Topic: Regular Expression Matching
1. Write more regular expressions that correct false positives
or fit false negatives. Do the best you can using the epatterns
and ppatterns lists in the program. For each regular expression that
you write or extend, list:
o Example(s) of email or phone numbers that match that
pattern.
o Example(s) of the obfuscated text that was matched
from the file.
o A short English description of the expressions the pattern
matches, demonstrating your understanding of regular
expressions.
o The results (TP, FP, FN) before and after you added the
expression.
Answer: For this part, I used 14 regular expressions to parse the
obfuscated email addresses and phone numbers: 10 for emails and 4
for phone numbers.
Emails:
1. '([A-Za-z0-9.]+)\s*&#[A-Za-z0-9.]+;\s*([A-Za-z0-9._]+)\.edu':
o Example: ada@graphics.stanford.edu.
o Matched Obfuscated Text: lID @ domain.edu.
o Description: This pattern captures email addresses in which the
at sign (@) is written as an HTML numeric character reference of
the form &#...; (for example &#64;).
2. '([A-Za-z.]+)\s+@\s+([A-Za-z.]+)\.[A-Za-z]+':
o Example: ullman @ cs.stanford.edu.
o Matched Obfuscated Text: ullman @ cs.stanford.edu.
o Description: This pattern identifies email addresses with
added spaces around the "@" symbol.
3. '([A-Za-z.]+)@([A-Za-z.]+)\.[A-Za-z]+':
o Example: [email protected].
o Matched Obfuscated Text: [email protected].
o Description: Basic pattern for emails with lowercase and
uppercase letters and dots before "@".
4. '([a-z.]+)\b[<][a-zA-Z&; ]+[>].?@([a-z.]+).edu':
o Example: asandra<del>@cs.stanford.edu.
o Matched Obfuscated Text: Lowercase words followed by
open brackets and special characters.
o Description: This pattern detects emails in which a tag in angle
brackets (such as <del>) sits between the user name and the "@",
with a domain ending in .edu.
5. '([A-Za-z0-9.]+)\s*\([\s*A-Za-z0-9.&;#]*["|;}]@([A-Za-z0-9._]+)\.edu':
o Example: Email with quotes and special characters.
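o Description: Finds emails in which the user name is followed by
an opening parenthesis, a run of letters, digits, and entity
characters (&, ;, #), and then one of ", |, ;, or } immediately
before the @domain.edu part.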
6. '^([a-z]+).?\bat\b\s(\W.+).edu+':
o Example: vladlen at <!-- die!--> stanford.
o Matched Obfuscated Text: Lowercase followed by "at" or a
special character.
o Description: Captures obfuscated emails where "at" is used
in place of "@".
7. '(\w+)\b.[A-Z].*\b(stanford).[A-Za-z]+.edu':
o Example: engler WHERE stanford DOM edu.
o Description: Matches emails using uppercase words like
"WHERE" and "DOM" instead of "@" and ".".
8. '([a-z]+).at <!--.+>.(stanford).+edu':
o Example: Email starting with lowercase letters and ending
with special characters.
o Description: Captures emails with unique HTML comments
and spaces in between.
9. '([A-Za-z0-9.]+)\s*at\s*([A-Za-z0-9.]+)\.EDU':
o Example: example at example.EDU.
o Description: This matches emails with "at" instead of "@"
and uppercase ".EDU".
10. '([a-zA-Z0-9]+)\s*[<][a-zA-Z0-9 .]+[>]\s*([a-zA-Z0-9.]+)([a-zA-Z0-9.]+)edu':
o Example: <at symbol> inside email obfuscation.
o Description: Designed to capture special symbols inside the
tags.
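To make the behaviour of these patterns concrete, here is a minimal Python sketch that runs four of them against sample lines. The helper name `extract_email` and the rule of joining the two captured groups as user@domain.edu are assumptions about how the program assembles an address, and the `&#x40;` input for pattern 1 is an assumed raw form of the entity-obfuscated address.

```python
import re

# Four of the email patterns listed above (pattern 7 with its opening quote restored).
ENTITY_AT = r'([A-Za-z0-9.]+)\s*&#[A-Za-z0-9.]+;\s*([A-Za-z0-9._]+)\.edu'
SPACED_AT = r'([A-Za-z.]+)\s+@\s+([A-Za-z.]+)\.[A-Za-z]+'
TAG_BEFORE_AT = r'([a-z.]+)\b[<][a-zA-Z&; ]+[>].?@([a-z.]+).edu'
WHERE_DOM = r'(\w+)\b.[A-Z].*\b(stanford).[A-Za-z]+.edu'


def extract_email(pattern, line):
    """Return 'user@domain.edu' built from the two captured groups, or None.

    Hypothetical helper: the real program may assemble addresses differently.
    """
    match = re.search(pattern, line)
    if match is None:
        return None
    user, domain = match.group(1), match.group(2)
    return f"{user}@{domain}.edu"


samples = [
    (ENTITY_AT, "ada&#x40;graphics.stanford.edu"),  # assumed raw form of the entity
    (SPACED_AT, "ullman @ cs.stanford.edu"),
    (TAG_BEFORE_AT, "asandra<del>@cs.stanford.edu"),
    (WHERE_DOM, "engler WHERE stanford DOM edu"),
]
for pattern, line in samples:
    print(line, "->", extract_email(pattern, line))
```

Each pattern keeps the user name in group 1 and the domain without its .edu suffix in group 2, which is why the helper appends ".edu" itself.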
Phone Numbers:
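1. '.+(\d{3}).[^0-9](\d{3})[^0-9](\d{4})':
o Example: Tel: (123) 456-7890.
o Description: Matches phone numbers separated by non-numeric
characters. The pattern needs at least one character before the
area code and two characters between the area code and the next
three digits, so it does not match a bare 123-456-7890.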
2. '.?(\d{3})[^0-9](\d{3})[^0-9](\d{4})':
o Example: Similar to the above but with optional special
character matching.
o Description: Handles cases where phone numbers are
partially obfuscated.
3. '(\d{3})-(\d{3})-(\d{4})':
o Example: Phone numbers separated by dashes.
4. '(?:[(])([0-9]{3})(?:[)])[ ]*([0-9]{3})-([0-9]{4})':
o Example: (123) 456-7890.
o Description: Matches phone numbers with area codes inside
parentheses.
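The four phone patterns can be exercised together with a short Python sketch. The function name `find_phones` is hypothetical, and normalizing every hit to ddd-ddd-dddd is an assumption based on the format of the phone numbers in the program's output.

```python
import re

# The four phone patterns listed above, copied verbatim.
PHONE_PATTERNS = [
    r'.+(\d{3}).[^0-9](\d{3})[^0-9](\d{4})',
    r'.?(\d{3})[^0-9](\d{3})[^0-9](\d{4})',
    r'(\d{3})-(\d{3})-(\d{4})',
    r'(?:[(])([0-9]{3})(?:[)])[ ]*([0-9]{3})-([0-9]{4})',
]


def find_phones(line):
    """Return the set of numbers found in line, normalized to ddd-ddd-dddd."""
    found = set()
    for pattern in PHONE_PATTERNS:
        # Each pattern has exactly three capturing groups, so findall
        # yields (area, prefix, number) tuples.
        for area, prefix, number in re.findall(pattern, line):
            found.add(f"{area}-{prefix}-{number}")
    return found


print(find_phones("Phone: (123) 456-7890"))
print(find_phones("650-723-1614"))
```

Collecting the results in a set means a number matched by more than one pattern is reported only once.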

2. (Optional) List the examples that you found you could not
match with the current regular expressions with two
extracted parts, ending in .edu. For each example or set of
examples that fit the same pattern, explain briefly why it won’t
work. If you can make expressions in RegExpal that match, but
don’t work in the program to extract the email or phone numbers,
list those here. If you had any false positives in Part 1, include a
discussion here of why that rule generated them.
Answer:
Across all cases, I had 103 true positives (TP), 0 false positives (FP),
and 14 false negatives (FN).
True Positives: The following 103 emails and phone numbers were correctly
matched:

{('ashishg', 'e', '[email protected]'),
('ashishg', 'e', '[email protected]'),
('ashishg', 'p', '650-723-1614'),
('ashishg', 'p', '650-723-4173'),
('ashishg', 'p', '650-814-1478'),
('balaji', 'e', '[email protected]'),
('bgirod', 'p', '650-723-4539'),
('bgirod', 'p', '650-724-3648'),
('bgirod', 'p', '650-724-6354'),
('cheriton', 'e', '[email protected]'),
('cheriton', 'e', '[email protected]'),
('cheriton', 'p', '650-723-1131'),
('cheriton', 'p', '650-725-3726'),
('dabo', 'e', '[email protected]'),
('dabo', 'p', '650-725-3897'),
('dabo', 'p', '650-725-4671'),
('engler', 'e', '[email protected]'),
('eroberts', 'e', '[email protected]'),
('eroberts', 'p', '650-723-3642'),
('eroberts', 'p', '650-723-6092'),
('fedkiw', 'e', '[email protected]'),
('hager', 'p', '410-516-5521'),
('hager', 'p', '410-516-5553'),
('hager', 'p', '410-516-8000'),
('hanrahan', 'e', '[email protected]'),
('hanrahan', 'p', '650-723-0033'),
('hanrahan', 'p', '650-723-8530'),
('horowitz', 'p', '650-725-3707'),
('horowitz', 'p', '650-725-6949'),
('jurafsky', 'p', '650-723-5666'),
('kosecka', 'e', '[email protected]'),
('kosecka', 'p', '703-993-1710'),
('kosecka', 'p', '703-993-1876'),
('kunle', 'e', '[email protected]'),
('kunle', 'e', '[email protected]'),
('kunle', 'p', '650-723-1430'),
('kunle', 'p', '650-725-3713'),
('kunle', 'p', '650-725-6949'),
('lam', 'p', '650-725-3714'),
('lam', 'p', '650-725-6949'),
('latombe', 'p', '650-721-6625'),
('latombe', 'p', '650-723-0350'),
('latombe', 'p', '650-723-4137'),
('latombe', 'p', '650-725-1449'),
('levoy', 'e', '[email protected]'),
('levoy', 'e', '[email protected]'),
('levoy', 'p', '650-723-0033'),
('levoy', 'p', '650-724-6865'),
('levoy', 'p', '650-725-3724'),
('levoy', 'p', '650-725-4089'),
('manning', 'e', '[email protected]'),
('manning', 'e', '[email protected]'),
('manning', 'p', '650-723-7683'),
('manning', 'p', '650-725-1449'),
('manning', 'p', '650-725-3358'),
('nass', 'e', '[email protected]'),
('nass', 'p', '650-723-5499'),
('nass', 'p', '650-725-2472'),
('nick', 'e', '[email protected]'),
('nick', 'p', '650-725-4727'),
('ok', 'p', '650-723-9753'),
('ok', 'p', '650-725-1449'),
('ouster', 'e', '[email protected]'),
('ouster', 'e', '[email protected]'),
('pal', 'p', '650-725-9046'),
('psyoung', 'e', '[email protected]'),
('rajeev', 'p', '650-723-4377'),
('rajeev', 'p', '650-723-6045'),
('rajeev', 'p', '650-725-4671'),
('rinard', 'e', '[email protected]'),
('rinard', 'p', '617-253-1221'),
('rinard', 'p', '617-258-6922'),
('serafim', 'p', '650-723-3334'),
('serafim', 'p', '650-725-1449'),
('shoham', 'e', '[email protected]'),
('shoham', 'p', '650-723-3432'),
('shoham', 'p', '650-725-1449'),
('subh', 'p', '650-724-1915'),
('subh', 'p', '650-725-3726'),
('subh', 'p', '650-725-6949'),
('thm', 'e', '[email protected]'),
('thm', 'p', '650-725-3383'),
('thm', 'p', '650-725-3636'),
('thm', 'p', '650-725-3938'),
('tim', 'p', '650-724-9147'),
('tim', 'p', '650-725-2340'),
('tim', 'p', '650-725-4671'),
('ullman', 'e', '[email protected]'),
('ullman', 'p', '650-494-8016'),
('ullman', 'p', '650-725-2588'),
('ullman', 'p', '650-725-4802'),
('vladlen', 'e', '[email protected]'),
('widom', 'e', '[email protected]'),
('widom', 'e', '[email protected]'),
('widom', 'p', '650-723-0872'),
('widom', 'p', '650-723-7690'),
('widom', 'p', '650-725-2588'),
('zelenski', 'e', '[email protected]'),
('zelenski', 'p', '650-723-6092'),
('zelenski', 'p', '650-725-8596'),
('zm', 'e', '[email protected]'),
('zm', 'p', '650-723-4364'),
('zm', 'p', '650-725-4671')}
False Negatives: The following 14 emails were missed:

{('dlwh', 'e', '[email protected]'),
('engler', 'e', '[email protected]'),
('hager', 'e', '[email protected]'),
('jks', 'e', '[email protected]'),
('jurafsky', 'e', '[email protected]'),
('lam', 'e', '[email protected]'),
('latombe', 'e', '[email protected]'),
('latombe', 'e', '[email protected]'),
('latombe', 'e', '[email protected]'),
('pal', 'e', '[email protected]'),
('serafim', 'e', '[email protected]'),
('subh', 'e', '[email protected]'),
('subh', 'e', '[email protected]'),
('ullman', 'e', '[email protected]')}
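
The TP, FP, and FN sets can be derived with plain set operations once the program's guesses and the gold answers are both available as (file, type, value) tuples. The sketch below is an illustration under that assumption, not the program's actual scoring code; the gold entries are phone tuples taken from the list above, and the '000-000-0000' guess is made up to produce a false positive.

```python
def score(guesses, gold):
    """Split results into true positives, false positives, and false negatives."""
    guesses, gold = set(guesses), set(gold)
    true_pos = guesses & gold    # found and expected
    false_pos = guesses - gold   # found but not expected
    false_neg = gold - guesses   # expected but not found
    return true_pos, false_pos, false_neg


gold = {
    ('ashishg', 'p', '650-723-1614'),
    ('bgirod', 'p', '650-723-4539'),
    ('hager', 'p', '410-516-5521'),
}
guesses = {
    ('ashishg', 'p', '650-723-1614'),
    ('bgirod', 'p', '650-723-4539'),
    ('ashishg', 'p', '000-000-0000'),  # made-up wrong guess
}

tp, fp, fn = score(guesses, gold)
print(len(tp), len(fp), len(fn))
```

Because TP and FN partition the gold set, their counts always add up to the number of gold entries.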

Issues with False Negatives:
1. Domain Variations:
Emails such as [email protected] and [email protected] have
subdomains not accounted for by the regular expressions designed
for .edu domains.
2. Unnecessary Capturing Groups:
The program works with two extracted parts per email (user and
domain). False negatives arose when extra capturing groups in the
regex divided the email address into more parts than that. Email
pattern 10, for instance, has three capturing groups.
3. Complex Domain Structures:
Some emails have subdomains like cs.stanford.edu, which the
current patterns did not fully capture.
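
One way to tolerate any number of subdomain labels while still producing exactly two extracted parts is sketched below. The pattern `SUBDOMAIN_EMAIL` is a hypothetical addition, not one of the 14 patterns above, and whether it would recover any of the listed false negatives is not known, since those addresses are not visible here.

```python
import re

# Hypothetical pattern: group 1 is the user, group 2 is the whole domain
# (one or more dot-separated labels) without the trailing ".edu".
SUBDOMAIN_EMAIL = re.compile(
    r'([A-Za-z0-9._]+)\s*@\s*([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*)\.edu\b'
)


def split_email(line):
    """Return (user, domain) for the first .edu address in line, or None."""
    match = SUBDOMAIN_EMAIL.search(line)
    return match.groups() if match else None


print(split_email("ada@graphics.stanford.edu"))
print(split_email("ullman @ cs.stanford.edu"))
print(split_email("user@example.com"))
```

The non-capturing group `(?:...)` repeats the ".label" part without adding a third capturing group, so the match still splits into just user and domain.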

3. (Optional) Search the web and find a couple of additional
examples of obscured email addresses or phone numbers
and report on them, or try to design a way to obscure an
email address that would be extremely difficult for
ContactFinder to match with a regular expression.
Answer:
One method to obscure an email that is difficult to parse is by using
JavaScript to dynamically generate the email address, for example:

<script>
// Assemble the address at run time so it never appears whole in the page source.
const user = 'name';
const site = 'domain.com';
document.write('<a href="mailto:' + user + '@' + site + '">');
document.write(user + '@' + site + '</a>');
</script>
In this approach, the actual email address is hidden within a script,
making it challenging for regular expression-based scrapers to identify the
email. Another effective method would be to use CAPTCHA, requiring user
interaction to retrieve the email address, which bots typically cannot
bypass. However, this reduces accessibility and can inconvenience users.
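
A short Python sketch illustrates why the script-based approach defeats a pattern-based scraper. It applies email pattern 3 from Part 1 to the same address written as ordinary markup and as page source that assembles it with JavaScript; the variable names and sample strings are illustrative assumptions.

```python
import re

# Email pattern 3 from Part 1, copied verbatim.
EMAIL = re.compile(r'([A-Za-z.]+)@([A-Za-z.]+)\.[A-Za-z]+')

# The address as ordinary markup.
plain_html = '<a href="mailto:name@domain.com">name@domain.com</a>'

# The address as it appears in page source when a script assembles it.
scripted_html = """<script>
const user = 'name';
const site = 'domain.com';
document.write('<a href="mailto:' + user + '@' + site + '">');
document.write(user + '@' + site + '</a>');
</script>"""

plain_match = EMAIL.search(plain_html)
scripted_match = EMAIL.search(scripted_html)

print(plain_match.groups() if plain_match else None)
print(scripted_match)
```

In the scripted source the "@" only ever appears as the quoted string '@', so there is no run of letters directly before it for the pattern to capture.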
