Tutorial 4

This document provides instructions for a series of exercises on screen scraping and extracting email addresses from HTML documents. It begins with defining the tasks of writing functions to do case-insensitive string comparisons and searches. It then asks the student to write functions to extract links, names and emails from sample HTML code provided. Finally, it asks the student to integrate these functions to build up the ability to extract a specific contact's email from a URL by name. The exercises start simply and build complexity, with the goal of creating a program that can search webpages for a person's email address.

Uploaded by

VineetSingh

Screen-scraping

Informatics 1 Functional Programming: Tutorial 4

Due: The tutorial of week 6 (29/30 Oct.)

Please attempt the entire worksheet in advance of the tutorial, and bring with you all
work, including (if a computer is involved) printouts of code and test results. Tutorials
cannot function properly unless you do the work in advance.
You may work with others, but you must understand the work; you can't phone a friend
during the exam.
Assessment is formative, meaning that marks from coursework do not contribute to the
final mark. But coursework is not optional. If you do not do the coursework you are
unlikely to pass the exams.
Attendance at tutorials is obligatory; please let your tutor know if you cannot attend.

Basic Screen Scraper

A screen scraper is a tool used to extract data from web sites, by looking at their source. In this
exercise, you will write one of the most hated screen scrapers: one that extracts email addresses.
Why is it hated? Because people use screen scrapers like that to collect email addresses to send
spam to. However, in this exercise we will show you a useful purpose of the email screen scraper!
We are going to be extracting names and emails from web pages written in HTML (HyperText
Markup Language). For instance, from the following HTML:
<html>
<head>
<title>FP: Tutorial 4</title>
</head>
<body>
<h1>A Boring test page</h1>
<h2>for tutorial 4</h2>
<a href="http://www.inf.ed.ac.uk/teaching/courses/inf1/fp/">FP Website</a>
Lecturer: <a href="mailto:[email protected]">Don Sannella</a>
TA: <a href="mailto:[email protected]">Karoliina Lehtinen</a>
</body>
</html>

We are going to extract a list of the <a> elements, which contain URLs (Uniform Resource
Locators). If a URL begins with http: it is an address of a web page; if it begins with mailto: the
rest of it is an email address. For the document above, here is the list of links (each one contains
some extra data at the end, which is an artifact of the technique we use):
["http://www.inf.ed.ac.uk/teaching/courses/inf1/fp/\">FP Website</a> Lecturer: "
,"mailto:[email protected]\">Don Sannella</a> TA: "
,"mailto:[email protected]\">Karoliina Lehtinen</a></body></html>"]

From this list, we will in turn extract a list of names and email addresses:
[("Don Sannella","[email protected]"),
("Karoliina Lehtinen","[email protected]")]
The file tutorial4.hs contains the test html-document and the lists above: testHTML, testLinks,
and testAddrBook.
Notice that the type of testLinks is [Link] and the type of testAddrBook is [(Name,Email)]. In
other words: testLinks is a list of Links, and testAddrBook is a list of tuples containing both a
Name and an Email. These appear to be new types which we have not encountered before, but
if you look in the file tutorial4.hs you will find the following type expressions:
type Link = String
type Name = String
type Email = String
type HTML = String
type URL = String

These type declarations simply define aliases for the very familiar type String. Aliases are not
strictly necessary, but they make your program more readable.
Note: If you want to know more about HTML, have a look at: http://www.w3schools.com/html/.
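To make the point about aliases concrete, here is a minimal sketch. The helpers formatEntry and nameLength are hypothetical names chosen for illustration and are not part of tutorial4.hs; only the two type declarations are taken from the text above.

```haskell
-- The aliases below mirror two of the declarations quoted from tutorial4.hs.
type Name  = String
type Email = String

-- Hypothetical helper for illustration only; it is not defined in tutorial4.hs.
-- The signature documents intent: a name and an email go in, text comes out.
formatEntry :: (Name, Email) -> String
formatEntry (name, email) = name ++ " <" ++ email ++ ">"

-- An alias is interchangeable with String, so String functions work unchanged.
nameLength :: Name -> Int
nameLength name = length name
```

Because Name and Email are both just String, the compiler will not complain if the two are swapped by mistake; the aliases help the reader, not the type checker.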

Exercises
1. Write a function sameString :: String -> String -> Bool that returns True when two
strings are the same, but ignores whether a letter is in upper- or lowercase. For example:
*Main> sameString "HeLLo" "HElLo"
True
*Main> sameString "Hello" "Hi there"
False
Warning: Unintuitively, the mapping between upper and lower case characters is not one-to-one. For example, the Greek letter μ and the micro sign µ map to the same upper case letter.
What does your code do on sameString "\181" "\956"? In this case either behaviour is
acceptable, as long as the tests don't fail on input containing these characters!
2. Write a function prefix :: String -> String -> Bool that checks whether the first string
is a prefix of the second, like the library function isPrefixOf that you used before, but this
time it should be case-insensitive.
*Main> prefix "bc" "abCDE"
False
*Main> prefix "Bc" "bCDE"
True
Check your function using the predefined test property prop_prefix.
3. (a) Write the function contains as in tutorial 2, but case-insensitive. For example:
*Main> contains "abcde" "bd"
False
*Main> contains "abCDe" "Bc"
True
(b) Write a test property prop_contains :: String -> Int -> Int -> Bool to test your
contains function. You can take inspiration from prop_prefix.
4. (a) Write a case-insensitive function takeUntil :: String -> String -> String that returns the contents of the second string before the first occurrence of the first string. If
the second string does not contain the first as a substring, return the whole string. E.g.:
*Main> takeUntil "cd" "abcdef"
"ab"
(b) Write a case-insensitive function dropUntil :: String -> String -> String that returns the contents of the second string after the first occurrence of the first string. If
the second string does not contain the first as a substring, return the empty string. E.g.:
*Main> dropUntil "cd" "abcdef"
"ef"
5. (a) Write a case-insensitive function split :: String -> String -> [String] that divides the second argument at every occurrence of the first, returning the results as a
list. The result should not include the separator. For example:
*Main> split "," "comma,separated,string"
["comma","separated","string"]
*Main> split "the" "to thE WINNER the spoils!"
["to "," WINNER "," spoils!"]
*Main> split "end" "this is not the end"
["this is not the ",""]
Your function should return an error if the first argument, the separator string, is an
empty list. You will find your functions takeUntil and dropUntil useful here.
(b) Write a function reconstruct :: String -> [String] -> String that reverses the
result of split. That is, it should take a string and a list of strings, and put the list of
strings back together into one string, with the first string everywhere in between (but
not at the start or at the end).
(c) Look at the predefined test function prop_split and explain what it does. Use it to
test your split function.
6. Use your function split to write a function linksFromHTML :: HTML -> [Link]. You can
assume that a link begins with the string <a href=". Don't include this separator in the
results, and don't include the stuff in the HTML that precedes the first link. Example:
*Main> linksFromHTML testHTML
["http://www.inf.ed.ac.uk/teaching/courses/inf1/fp/\">FP Website</a> Lecturer: ",
"mailto:[email protected]\">Don Sannella</a> TA: ",
"mailto:[email protected]\">Karoliina Lehtinen</a></body></html>"]
Note: to include the character " in a string, precede it with a backslash (\), as \".
Use testLinksFromHTML to test your function on the given sample data. Note that this test
does not require QuickCheck, since it does not depend on randomly generated input.
7. Write a function takeEmails :: [Link] -> [Link] which takes just the email addresses
from a list of links given by linksFromHTML. Example:
*Main> takeEmails testLinks
["mailto:[email protected]\">Don Sannella</a> TA: ",
"mailto:[email protected]\">Karoliina Lehtinen</a></body></html>"]
8. Write a function link2pair :: Link -> (Name, Email) which converts a mailto link into
a pair consisting of a name and the corresponding email address. The name is the part of the
link between the <a href="..."> and </a> tags; the email address is the part in the quotes
after mailto:. Add an appropriate error message if the link isn't a mailto: link. Example:
*Main> link2pair "mailto:[email protected]\">John</a>"
("John","[email protected]")
9. Combine your functions linksFromHTML, takeEmails and link2pair to write a function
emailsFromHTML :: HTML -> [(Name, Email)] that extracts all mailto links from a webpage, turns them into (Name, Email) pairs, and then removes duplicates from that list.
Example:
*Main> emailsFromHTML testHTML
[("Don Sannella","[email protected]"),
("Karoliina Lehtinen","[email protected]")]
Note: the library function nub :: [a] -> [a] removes duplicates from a list.
You can test your function with testEmailsFromHTML.
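Two library behaviours mentioned in the exercises above can be checked directly in GHCi: the case-mapping warning from exercise 1 and nub from exercise 9. The sketch below uses only Data.Char and Data.List; the names micro, mu, equalWhenUppered, equalWhenLowered and dedupedPairs are made up for illustration, and the sample pairs are placeholder data.

```haskell
import Data.Char (toLower, toUpper)
import Data.List (nub)

-- '\181' is the micro sign and '\956' is the Greek small letter mu.
micro, mu :: Char
micro = '\181'
mu    = '\956'

-- Compare the two characters after converting both to upper case.
equalWhenUppered :: Bool
equalWhenUppered = toUpper micro == toUpper mu

-- Compare the two characters after converting both to lower case.
equalWhenLowered :: Bool
equalWhenLowered = toLower micro == toLower mu

-- nub keeps the first occurrence of each element and preserves the order.
dedupedPairs :: [(String, String)]
dedupedPairs = nub [("A", "a@x"), ("B", "b@x"), ("A", "a@x")]
```

With a GHC whose Data.Char follows the Unicode simple case mappings, equalWhenUppered is True while equalWhenLowered is False, which is why a sameString built on toUpper and one built on toLower can disagree on "\181" and "\956". dedupedPairs evaluates to [("A","a@x"),("B","b@x")].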

Pulling in live URLs

In tutorial4.hs a test URL is predefined, testURL. Since it is just a string, you can ask GHCi to
display it. Do this, and copy-paste the link into your web browser to see what page it refers to. To
see the HTML of the page right-click and select view page source, or a similar option depending
on your browser.
*Main> testURL
"http://www.inf.ed.ac.uk/teaching/courses/inf1/fp/testpage.html"
The function emailsFromURL, which is already defined in tutorial4.hs, extracts email addresses
from a URL using your very own emailsFromHTML. Test your function emailsFromHTML by testing
it on real URLs of your choice.
As you will have seen, emailsFromURL sometimes produces a rather long list of names and email
addresses. Sometimes you have a vague idea of who it is you are looking for and in that case, you
do not want to go through the entire list of names one-by-one. Over the next few exercises you will
be implementing a function emailsByNameFromURL in order to find the email address of a person
whose name you know.
Exercises
10. Write a function findEmail :: Name -> [(Name,Email)] -> [(Name,Email)] which given
(part of) a name and a list of (Name,Email) pairs, returns a list of those pairs which match
the name. Example:
*Main> findEmail "Karoliina" testAddrBook
[("Karoliina Lehtinen","[email protected]")]
*Main> findEmail "San" testAddrBook
[("Don Sannella","[email protected]")]
*Main> findEmail "Fred" testAddrBook
[]
11. Define the function emailsByNameFromHTML :: HTML -> Name -> [(Name, Email)]. This
function should take an HTML string and (part of) a name, and return all (Name,Email)
pairs which match the name.
*Main> emailsByNameFromHTML testHTML "Karoliina"
[("Karoliina Lehtinen","[email protected]")]
The function emailsByNameFromURL, which is already defined in tutorial4.hs, uses your very own
emailsByNameFromHTML function to extract the email address of a certain person from a live URL.
Maybe you can try it on your own webpage, if you have one.

Optional Material
Searching for strings
In the previous section you have written functions to find email addresses which belong to people
whose name contains the input string. You will now write code to select names which match more
elaborate criteria.
Exercises
12. Write a function hasInitials :: String -> Name -> Bool which returns True if the initials of the second argument are exactly the first argument.
*Main> hasInitials "DS" "Don Sannella"
True
*Main> hasInitials "MKL" "Karoliina Lehtinen"
False
13. Write a function emailsByMatchFromHTML :: (Name -> Bool) -> HTML -> [(Name,Email)].
It should find all the emails belonging to people whose names match the criterion set out by the
first argument. Note the type of the first argument of this function (the brackets are important!).
Then write a function emailsByInitialsFromHTML :: String -> HTML -> [(Name,Email)]
which finds emails of people whose initials match the first argument.
14. Write a function myCriteria :: Name -> Bool which tests whether a name matches a criterion of your choice. If you are stuck for ideas, match names of which the initials contain a
reference string, in the right order but not necessarily in consecutive positions. For example
"Don T. Sannella" matches "DS". You may want your function to take more than one argument, in which case you can adjust its type. Use this function and the previous ones to write
emailsByMyCriteriaFromHTML :: HTML -> [(Name,Email)] which finds emails belonging
to people whose names match your criterion.
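The remark in exercise 13 about the brackets in (Name -> Bool) can be illustrated with a small sketch. It works on an address book rather than on HTML, so it is not a solution to the exercise; keepByName, hasTwoWords and sampleBook are hypothetical names, and the addresses are placeholders.

```haskell
type Name  = String
type Email = String

-- Hypothetical helper: keep the entries whose name satisfies the predicate.
-- The brackets make the first parameter a single function of type Name -> Bool.
-- Without them the type would read Name -> Bool -> ..., that is, two separate
-- parameters: a Name and a Bool.
keepByName :: (Name -> Bool) -> [(Name, Email)] -> [(Name, Email)]
keepByName matches book = [ (name, email) | (name, email) <- book, matches name ]

-- A made-up criterion: the name consists of exactly two words.
hasTwoWords :: Name -> Bool
hasTwoWords name = length (words name) == 2

-- Made-up sample data; the addresses are placeholders.
sampleBook :: [(Name, Email)]
sampleBook =
  [ ("Ada Lovelace", "ada@example.org")
  , ("Alan M. Turing", "alan@example.org")
  ]
```

Here keepByName hasTwoWords sampleBook evaluates to [("Ada Lovelace","ada@example.org")]. Any other function of type Name -> Bool, including a partially applied one, can be passed in the same position.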

Pretty printing
We often want to look at the output of a function (say emailsFromHTML) in a slightly nicer way.
This is called pretty printing. In emailsFromURL the output of emailsFromHTML is currently being
pretty printed by a function called ppAddrBook. In this exercise, you will be rewriting that function
to make emailsFromURL produce a different output.
You will need two pieces of information to complete this exercise. First of all, you may assume
that if a name has more than two words, the first name is the first word and the last name is the
remaining words [1]. Second, all of the names should line up and all of the email addresses should line
up, no matter how long the names are. For example:
Lehtinen, Karoliina   [email protected]
Sannella, Don         [email protected]

In order to print a block of text like this to the screen, we can't simply return it from a function,
because GHCi will faithfully escape all the funny characters in the string, such as newlines. The
function putStr takes a string and prints it to the screen, which involves turning newline characters
'\n' into actual new lines. For example:
*Main> putStr "First Line\nSecond Line\nThird Line\n"
First Line
Second Line
Third Line

[1] Note that this is the way the British classification system works, but that it does not provide a correct classification
for many non-English names.
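The difference between returning a string and printing it with putStr, described above, can be seen in a short sketch. The names block, asGHCiShowsIt and printBlock are made up for illustration; show, unlines and putStr are standard Prelude functions.

```haskell
-- A block of text with embedded newline characters.
block :: String
block = unlines ["First Line", "Second Line", "Third Line"]

-- GHCi displays a String result via show, which escapes each newline as \n
-- and wraps the text in double quotes.
asGHCiShowsIt :: String
asGHCiShowsIt = show block

-- putStr writes the characters themselves, so each '\n' starts a new line.
printBlock :: IO ()
printBlock = putStr block
```

Typing block at the GHCi prompt shows the escaped, quoted form on one line, whereas running printBlock produces three separate lines.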
Exercises
15. Rewrite the function ppAddrBook :: [(Name,Email)] -> String so that it lines up the
names and email addresses in two separate columns. For example:
*Main> putStr (ppAddrBook testAddrBook)
Sannella, Don       [email protected]
Lehtinen, Karoliina [email protected]
You will find, in general, that some names are listed in "surname, first name" format and
some are given in the regular "first name surname" format. Make sure your function can
cope with both formats.
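One possible building block for lining up a column is to pad every entry to the width of the longest one. The sketch below is only a starting point under that assumption, not the intended ppAddrBook; padRight and alignLeft are hypothetical helper names.

```haskell
-- Hypothetical helper: pad a string with spaces on the right to a given width.
-- Strings that are already at least that wide are returned unchanged, because
-- replicate yields an empty list for a non-positive count.
padRight :: Int -> String -> String
padRight width s = s ++ replicate (width - length s) ' '

-- Hypothetical helper: pad every string to the width of the longest one.
-- The extra 0 keeps maximum defined when the input list is empty.
alignLeft :: [String] -> [String]
alignLeft strs = map (padRight width) strs
  where
    width = maximum (0 : map length strs)
```

For example, alignLeft ["Sannella, Don", "Lehtinen, Karoliina"] pads the first entry with six spaces so that both strings are 19 characters long; appending a separator and the email address to each then gives two aligned columns.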
