Example 1: Fetching and Parsing HTML: Digital Assignment - 2 Distributed Computing
This example downloads a single web page and parses it into a structured table
using Refine’s built-in functions. A similar workflow can be applied to a list of
URLs, often generated by parsing another web page, creating a flexible web
harvesting tool.
The raw data for this example is an HTML copy of Shakespeare’s Sonnets from
Project Gutenberg. Processing a book of poems into structured data enables new
ways of reading text, allowing us to sort, manipulate, and connect with other
information.
Start OpenRefine and select Create Project. Refine can import data from a wide
variety of formats and sources, from a local Excel file to web accessible RDF. One
often overlooked method is the Clipboard, which allows entering data via copy &
paste. Under “Get Data From”, click Clipboard, and paste this URL into the text
box:
https://programminghistorian.org/assets/fetch-and-parse-data-with-openrefine/pg1105.html
After clicking Next, Refine should automatically identify the content as a line-
based text file and the default parsing options should be correct. Add the project
name “Sonnets” at the top right and click Create project. This will result in a
project with one column and one row.
Fetch HTML
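Refine retrieves URLs by creating a new column. On Column 1, click the menu
arrow > Edit column > Add column by fetching URLs.
Name the new column “fetch”. The Throttle delay option sets a pause time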
between requests to avoid being blocked by a server. The default is conservative.
After clicking “OK”, Refine will start requesting the URLs from the base column
as if you were opening the pages in your browser, and will store each response in
the cells of the new column. In this case, there is one URL in Column 1 resulting in
one cell in the fetch column containing the full HTML source for the Sonnets web
page.
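As a rough picture of what the fetch step does, the Python sketch below requests each URL in turn and pauses between requests. It is an illustration only, not how Refine is implemented. The names fetch_url and fetch_all, the delay_seconds parameter, its value of 1.0, and the injectable fetcher argument are all assumptions made for this example.

```python
import time
import urllib.request


def fetch_url(url):
    """Download one page and return its body as text (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, delay_seconds=1.0, fetcher=fetch_url):
    """Fetch each URL in order, pausing between consecutive requests.

    Returns one response per URL, much like one new cell per row.
    The default delay is an arbitrary value chosen for this sketch.
    """
    responses = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # throttle before the next request
        responses.append(fetcher(url))
    return responses
```

With a single URL in the list, fetch_all returns a list holding one string, which corresponds to the one cell in the fetch column. Passing a different fetcher makes it possible to try the loop without touching the network.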
Parse HTML
Much of the web page is not sonnet text and must be removed to create a clean
data set. First, it is necessary to identify a pattern that can isolate the desired
content. Items will often be nested in a unique container or given a meaningful
class or id.
To make examining the HTML easier, click on the URL in Column 1 to open the
link in a new tab, then right-click on the page and choose “View Page Source”. In this case
the sonnets page does not have distinctive semantic markup, but each poem is
contained inside a single <p> element. Thus, if all the paragraphs are selected, the
sonnets can be extracted from the group.
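The idea of selecting every paragraph can be sketched outside Refine with Python's standard html.parser module. This is a simplified stand-in, not the parser Refine uses: it keeps only the text inside each p element, and it assumes every <p> has an explicit closing tag. The names ParagraphCollector and select_paragraphs are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside every <p> element, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraphs
        self._inside_p = False  # True while between <p> and </p>
        self._buffer = []       # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._buffer))
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self._buffer.append(data)


def select_paragraphs(html_text):
    """Return a list with the text of each <p> element in html_text."""
    parser = ParagraphCollector()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs
```

Everything outside a p element, such as headings or the document type declaration, is ignored, which is the same filtering effect the tutorial relies on.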
On the fetch column, click on the menu arrow > Edit column > Add column based
on this column. Give the new column the name “parse”, then click in the
Expression text box.
Data in Refine can be transformed using the General Refine Expression Language
(GREL). The Expression box accepts GREL functions that will be applied to each
cell in the existing column to create values for the new one. The Preview window
below the Expression box displays the current value on the left and the value for
the new column on the right.
The default expression is value, the GREL variable representing the current
contents of a cell. This means that each cell is simply copied to the new column,
which is reflected in the Preview. GREL variables and functions are strung
together in sequence using a period, called dot notation. This allows complex
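operations to be constructed by passing the results of each function to the next.
GREL’s parseHtml() function reads HTML content, and select() returns the elements
that match a selector. Starting with value, add both functions in the Expression
box using dot notation:
value.parseHtml().select("p")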
Do not click OK at this point; simply look at the Preview to see the result of the
expression.
Notice that the output on the right no longer starts with the HTML root elements
(<!DOCTYPE html etc.) seen on the left. Instead, it starts with a square bracket [,
displaying an array of all the p elements found in the page. Refine represents an
array as a comma separated list enclosed in square brackets, for example [ "one",
"two", "three" ].
Adding an index number to the expression selects one element from the
array, for example value.parseHtml().select("p")[0]. The beginning of the
sonnets file contains many paragraphs of license information that are
unnecessary for the data set. Skipping ahead through the index numbers, the
first sonnet is found at value.parseHtml().select("p")[37].
GREL also supports using negative index numbers, thus
value.parseHtml().select("p")[-1] will return the last item in the array.
Working backwards, the last sonnet is at index [-3].
Using these index numbers, it is possible to slice the array, extracting only
the range of p elements that contain sonnets. Add the slice() function to the
expression to preview the sub-set: value.parseHtml().select("p").slice(37,-2).
The end index of slice() is exclusive, so -2 is needed to include the last
sonnet at [-3].
Clicking OK with the expression above will result in a blank column, a common
cause of confusion when working with arrays. Refine will not store an array object
as a cell value. It is necessary to use toString() or join() to convert the array into a
string variable. The join() function concatenates an array with the specified
separator. For example, the expression [ "one", "two", "three" ].join(";") will result
in the string “one;two;three”. Thus, the final expression to create the parse column
is:
value.parseHtml().select("p").slice(37,-2).join("|")
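Indexing, slicing, and joining behave much the same way on a Python list, so a small stand-in array can make the index arithmetic easier to check. The list contents and variable names below are invented for illustration. The real array is far longer, and its sonnets start at index 37.

```python
# Stand-in for the array of p elements returned by select("p").
paragraphs = ["license A", "license B", "sonnet I", "sonnet II",
              "sonnet III", "end note", "footer"]

first_item = paragraphs[0]    # like select("p")[0]
last_item = paragraphs[-1]    # a negative index counts from the end
last_sonnet = paragraphs[-3]  # third item from the end

# The slice starts at index 2 and stops before index -2,
# so the item at -3 is the last one included.
sonnets = paragraphs[2:-2]

# An array cannot be stored in a cell, so join it into one string.
cell_value = "|".join(sonnets)
```

Here the slice keeps the three sonnet items and drops the two leading and two trailing items, and cell_value is a single string with the items separated by |.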
Split Cells
The parse column now contains all the sonnets separated by “|”, but the project
still contains only one row. Individual rows for each sonnet can be created by
splitting the cell. Click the menu arrow on the parse column > Edit cells > Split
multi-valued cells. Enter the separator | that was used to join in the last step.
After this operation, the top of the project table should now read 154 rows. Below
the number is an option toggle “Show as: rows records”. Clicking on records will
group the rows based on the original table; in this case it will read 1. Keeping track
of these numbers is an important “sanity check” when transforming data in Refine.
The 154 rows make sense because the ebook contained 154 sonnets, while 1 record
represents the original table with only one row. An unexpected number would
indicate a problem with the transformation.
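The split step and its sanity check can be mirrored in a few lines of Python. The sample values are placeholders, not real cell contents. The sketch also assumes the separator never occurs inside the data, which is the condition that makes | a safe choice.

```python
# Placeholder items standing in for the sliced p elements.
sonnets = ["<p>I ...</p>", "<p>II ...</p>", "<p>III ...</p>"]

# One cell holding every item, as produced by join("|").
cell = "|".join(sonnets)

# Splitting on the same separator yields one row per item.
rows = cell.split("|")

# Sanity check: the row count must equal the number of items joined.
assert len(rows) == len(sonnets)
```

If the separator did appear inside one of the items, the split would produce extra rows and the check would fail, which is the kind of unexpected number worth investigating.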
Each cell in the parse column now contains one sonnet surrounded by a <p> tag. The
tags can be cleaned up by parsing the HTML again. Click on the parse column and
select Edit cells > Transform. This will bring up a dialog box similar to creating a
new column. Transform will overwrite the cells of the current column rather than
creating a new one.
In the expression box, type value.parseHtml(). The preview will show a complete
HTML tree starting with the <html> element. Note that parseHtml() automatically
fills in missing tags, which lets it parse these cell values even though they are
not complete HTML documents. Select the p tag, add an index number, and use the
function innerHtml() to extract the sonnet text:
value.parseHtml().select("p")[0].innerHtml()
Click OK to transform all 154 cells in the column.
Unescape
Notice that each cell has dozens of &nbsp;, an HTML entity used to represent
“no-break space” since browsers ignore extra white space in the source. These entities
are common when harvesting web pages and can be quickly replaced with the
corresponding plain text characters using the unescape() function. On the parse
column, select Edit cells > Transform and type the following in the expression box:
value.unescape('html')
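Python's standard library offers a comparable function, html.unescape, which makes it easy to see what unescaping does to a string. The sample text is invented for this sketch.

```python
import html

raw = "&nbsp;&nbsp;I &amp; II"

# Every entity is replaced by the character it stands for.
# &nbsp; becomes U+00A0 (no-break space) and &amp; becomes &.
clean = html.unescape(raw)
```

The result looks like ordinary text, but the leading characters are no-break spaces rather than regular spaces, so they remain in the value until they are trimmed or replaced.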
GREL array functions provide a powerful way to manipulate text data and can be
used to finish processing the sonnets. Any string value can be turned into an array
using the split() function by providing the character or expression that separates the
items (basically the opposite of join()).
In the sonnets each line ends with <br />, providing a convenient separator for
splitting. The expression value.split("<br />") will create an array of the lines of
each sonnet. Index numbers and slices can then be used to populate new columns.
Keep in mind that Refine will not output an array directly to a cell. Be sure to
select one element from the array using an index number or convert it back to a
string with join().
Furthermore, the sonnet text contains a large amount of unnecessary white space
that was used to lay out the poems in the ebook. This can be cut from each line
using the trim() function, which removes all leading and trailing white space
from a value, an essential step in data cleaning.
Using these concepts, a single line can be extracted and trimmed to create clean
columns representing the sonnet number and first line. Create two new columns
from the parse column, for example with these names and expressions:
“number”, value.split("<br />")[0].trim()
“first”, value.split("<br />")[1].trim()
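Picking single lines out of a cell by index can be tried in Python with a made-up cell value. The variable names number and first are hypothetical, and strip() plays the role of trim().

```python
# Hypothetical cell value; real cells hold a full sonnet with wider padding.
cell = "   I<br />   first line of verse,<br />   second line of verse<br />"

lines = cell.split("<br />")  # array of lines, like value.split("<br />")
number = lines[0].strip()     # index 0 holds the sonnet number
first = lines[1].strip()      # index 1 holds the first line of verse
```

Only the selected element is stored in each variable, which matches the rule that a cell needs a single value rather than a whole array.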
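The next column holds the full sonnet text, which spans multiple lines. Because
trim() only cleans the start and end of a value, each line must be trimmed
individually, which can be done with the GREL control forEach().
From the parse column, create a new column named “text”, and click in the
Expression box. A forEach() statement asks for an array, a variable name, and an
expression applied to the variable. Following the form forEach(array, variable,
expression), construct the loop using these parameters: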
array: value.split("<br />"), creates an array from the lines of the sonnet in
each cell.
variable: line, each item in the array is then represented as the variable (it
could be anything, v is often used).
expression: line.trim(), each item is then evaluated separately with the
specified expression. In this case, trim() cleans the white space from each
sonnet line in the array.
At this point, the statement should look like forEach(value.split("<br />"), line,
line.trim()) in the Expression box. Notice that the Preview now shows an array
where the first element is the sonnet number. Since the results of the forEach() are
returned as a new array, additional array functions can be applied, such as slice and
join. Add slice(1) to remove the sonnet number, and join("\n") to concatenate the
lines into a string value (\n is the symbol for new line in plain text). Thus, the final
expression to extract and clean the full sonnet text is:
forEach(value.split("<br />"), line, line.trim()).slice(1).join("\n")
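The three steps described here (trim each line, drop the first item, join the rest) map closely onto a Python list comprehension followed by a slice and a join. The cell value is invented for this sketch, and strip() stands in for trim().

```python
# Hypothetical cell value with layout padding before each line.
cell = "   I<br />   first line,<br />   second line,<br />   third line."

# forEach(array, variable, expression): apply the expression to every item.
trimmed = [line.strip() for line in cell.split("<br />")]

# slice(1) drops the sonnet number; join("\n") rebuilds a single string.
text = "\n".join(trimmed[1:])
```

Because the comprehension returns a new list, the slice and the join can be chained onto it in the same way the array functions are chained in dot notation.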
Click “OK” to create the column. Following the same technique, add another new
column from parse named “last” to represent the final couplet lines using:
Finally, numeric columns can be added using the length() function. Create new
columns from text with the names and expressions below:
“characters”, value.length()
“lines”, value.split(/\n/).length()
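Both counts are simple to reproduce in Python, which helps when checking a few rows by hand. The sample text is a placeholder with three short lines.

```python
text = "first line,\nsecond line,\nthird line."

characters = len(text)         # counts every character, newlines included
lines = len(text.split("\n"))  # number of newline-separated lines
```

Note that the character count includes the newline characters themselves, so it is slightly larger than the number of visible characters.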
Cleanup and Export
In this example, we used a number of operations to create new columns with clean
data. This is a typical Refine workflow, allowing each transformation to be easily
checked against the existing data. At this point the unnecessary columns can be
removed. Click on the All column > Edit columns > Re-order / remove columns.
Drag unwanted column names to the right side of the dialog box, in this case
Column 1, fetch, and parse. Drag the remaining columns into the desired order on
the left side. Click OK to remove and reorder the data set.
Use filters and facets to explore and subset the collection of sonnets. Then click the
export button to generate a version of the new sonnet table for use outside of
Refine. Only the currently selected subset will be exported.
Conclusion: The result is a clean data set that holds the necessary information
from the sonnets, with each poem subdivided into fields such as text and numeric
counts, so there is no need to search through the whole book. Processing a book of
poems into structured data enables new ways of reading text, allowing us to sort,
manipulate, and connect it with other information.