In the HTMLParser docs you will see the same example, but with no explanation.
Here is the example, which I saved in a Python file named spiderweb.py:
from HTMLParser import HTMLParser
from urllib2 import urlopen

class webancors(HTMLParser):
    def __init__(self, url):
        HTMLParser.__init__(self)
        # Download the page and hand its HTML to the parser.
        r = urlopen(url)
        self.feed(r.read())

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; report only <a> tags that have attributes.
        if tag == 'a' and attrs:
            print "Link: %s" % attrs
I import this file from the Python interpreter:
>>> import spiderweb
>>> spiderweb.webancors('http://www.yahoo.com')
Link: y-mast-sprite y-mast-txt web
Link: y-mast-link images
Link: y-mast-link video
Link: y-mast-link local
Link: y-mast-link shopping
Link: y-mast-link more
Link: p_13838465-sa-drawer
Link: y-hdr-link
HTMLParser calls the handle_starttag method with two arguments, tag and attrs. tag is the lowercased tag name, and attrs is a list of (name, value) pairs holding the attributes of that tag.
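To make the two arguments more concrete, here is a small sketch that pulls only the href value out of attrs. It is written for Python 3, where the module is called html.parser, and it parses a literal string instead of downloading a page. The class name AnchorCollector and the sample HTML are my own, made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag is the lowercased tag name; attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


# Made-up sample markup, so the sketch runs without a network connection.
sample = ('<p><a class="nav" href="http://example.com/images">images</a> '
          '<A HREF="/video">video</A></p>')
collector = AnchorCollector()
collector.feed(sample)
print(collector.links)
```

Tag and attribute names arrive in lowercase even when the page writes them in uppercase, so comparing against 'a' and 'href' is enough.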
The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.
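If you would rather keep one file that runs on both versions instead of converting it with 2to3, a common approach is to try the Python 3 imports first and fall back to the Python 2 names. This is only a sketch; the helper to_text is a name I made up, and UTF-8 is an assumed encoding, not something the page guarantees.

```python
try:
    # Python 3 module names
    from html.parser import HTMLParser
    from urllib.request import urlopen
except ImportError:
    # Python 2 module names, as used in spiderweb.py
    from HTMLParser import HTMLParser
    from urllib2 import urlopen


def to_text(raw, encoding='utf-8'):
    """Hypothetical helper: decode downloaded bytes before calling feed().

    In Python 3, urlopen(url).read() returns bytes, while feed() expects text.
    The utf-8 default is an assumption about the page's encoding.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding, 'replace')
    return raw
```

With this in place, the constructor would call self.feed(to_text(r.read())) instead of self.feed(r.read()).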
Use the full URL starting with "http://", not just "www". If you leave out "http://", you will see errors.
It seems urllib2 has trouble with such URLs:
  File "/usr/lib/python2.5/urllib2.py", line 241, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
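One way to avoid this error is to add the scheme yourself before calling urlopen. The function below is a minimal sketch of that idea; ensure_scheme is a hypothetical name of mine, not part of urllib2, and it only checks for "://" in the string, which is a simplification.

```python
def ensure_scheme(url, default='http'):
    """Hypothetical helper: prepend a scheme when the URL has none."""
    # Simplified check: anything containing "://" is treated as a full URL.
    if '://' in url:
        return url
    return '%s://%s' % (default, url)


print(ensure_scheme('www.yahoo.com'))
print(ensure_scheme('http://www.yahoo.com'))
```

Both calls print http://www.yahoo.com, so the result can be passed to urlopen either way.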
You can use all the methods of the HTMLParser class.
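For example, besides handle_starttag the class also offers handle_endtag and handle_data. The sketch below combines the three to collect the text inside an h1 tag. It is written for Python 3 (html.parser), and the class name HeadingGrabber and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class HeadingGrabber(HTMLParser):
    """Collects the text found between <h1> and </h1> (hypothetical class)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_heading = False
        self.heading = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_heading = False

    def handle_data(self, data):
        # Text may arrive in more than one piece, so append rather than assign.
        if self.in_heading:
            self.heading += data


grabber = HeadingGrabber()
grabber.feed('<html><body><h1>Spider web</h1><p>hi</p></body></html>')
grabber.close()
print(grabber.heading)
```

The flag in_heading is what ties the three handlers together: the start tag switches it on, the end tag switches it off, and handle_data only keeps text while it is on.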