Manipulating HTML Using Nokogiri

This document discusses how to extract HTML links from a document using Nokogiri. It shows how to find all links, links within a specific element, a link associated with text, and get link text. XPath and CSS selector syntax examples are provided to extract href attributes, link elements, and text. Useful Nokogiri and HTML parsing references are also included.

Uploaded by

rdpoor

# Extracting HTML links using Nokogiri

Here are some common operations you might perform when parsing links in HTML, shown in both
`css` and `xpath` syntax.

Starting with this snippet:

    require 'rubygems'
    require 'nokogiri'

    html = <<HTML
    <div id="block1">
      <a href="http://google.com">link1</a>
    </div>
    <div id="block2">
      <a href="http://stackoverflow.com">link2</a>
      <a id="tips">just a bookmark</a>
    </div>
    HTML

    doc = Nokogiri::HTML(html)

## extracting all the links


We can use xpath or css to find all the `<a>` elements and then keep only the ones that have
an `href` attribute:

    nodeset = doc.xpath('//a')      # Get all anchors via xpath
    nodeset.map {|element| element["href"]}.compact  # => ["http://google.com", "http://stackoverflow.com"]

    nodeset = doc.css('a')          # Get all anchors via css
    nodeset.map {|element| element["href"]}.compact  # => ["http://google.com", "http://stackoverflow.com"]

In the above cases, the `.compact` is necessary because the search for `<a>` elements also
returns the "just a bookmark" anchor, which has no `href` attribute and therefore maps to `nil`.

But we can use a more refined search to find just the elements that contain an `href`
attribute:

    attrs = doc.xpath('//a/@href')  # Get anchors w href attribute via xpath
    attrs.map {|attr| attr.value}   # => ["http://google.com", "http://stackoverflow.com"]

    nodeset = doc.css('a[href]')    # Get anchors w href attribute via css
    nodeset.map {|element| element["href"]}  # => ["http://google.com", "http://stackoverflow.com"]

## finding a specific link

To find a link within the `<div id="block2">`:

    nodeset = doc.xpath('//div[@id="block2"]/a/@href')
    nodeset.first.value    # => "http://stackoverflow.com"

    nodeset = doc.css('div#block2 a[href]')
    nodeset.first['href']  # => "http://stackoverflow.com"

If you know you're searching for just one link, you can use `at_xpath` or `at_css` instead:

    attr = doc.at_xpath('//div[@id="block2"]/a/@href')
    attr.value       # => "http://stackoverflow.com"

    element = doc.at_css('div#block2 a[href]')
    element['href']  # => "http://stackoverflow.com"

## find a link from associated text

What if you know the text associated with a link and want to find its URL? A little xpath-fu (or
css-fu) comes in handy:

    element = doc.at_xpath('//a[text()="link2"]')
    element["href"]  # => "http://stackoverflow.com"

    element = doc.at_css('a:contains("link2")')
    element["href"]  # => "http://stackoverflow.com"

## find text from a link


For completeness, here's how you'd get the text associated with a particular link:

    element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
    element.text  # => "link2"

    element = doc.at_css('a[href="http://stackoverflow.com"]')
    element.text  # => "link2"

## useful references

In addition to the extensive [Nokogiri documentation][1], I came across some useful links
while writing this up:

* [a handy Nokogiri cheat sheet][2]
* [a tutorial on parsing HTML with Nokogiri][3]
* [interactively test CSS selector queries][4]

[1]: http://nokogiri.org/
[2]: https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet
[3]: http://ruby.bastardsbook.com/chapters/html-parsing/
[4]: http://try.jsoup.org/
