XML and Namespaces
You can add namespaces to properties. They are used together with the selector when querying the document, which is useful for parsing XML files. Yes, Wombat can also scrape XML files. Yay! The syntax is:
class LastFmScraper
  include Wombat::Crawler

  base_url "http://ws.audioscrobbler.com"
  path "/2.0/?method=geo.getevents&location=San%20Francisco&api_key=<YOUR_LASTFM_API_KEY>"

  document_format :xml

  locations 'xpath=//event', :iterator do
    latitude "xpath=./venue/location/geo:point/geo:lat", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos' }
    longitude "xpath=./venue/location/geo:point/geo:long", :text, { 'geo' => 'http://www.w3.org/2003/01/geo/wgs84_pos' }
  end
end
Note that the example above uses the option document_format :xml. This is another special property that tells Wombat which type of document to parse. It defaults to :html, so usually you won't need to specify it. If you want to parse an XML document, say so with document_format :xml. The only two formats supported so far are :html and :xml.
If you are going to specify a namespace, you also have to state the type of property you are requesting (:text, :html or :list) as the second argument, after the selector and before the namespace. The namespace must be a hash whose keys are the namespace names (prefixes) and whose values are the namespace URLs.