Overview
The parser is a tagsoup parser by design, so that it can process
most of the "HTML" found in the wild. Well-formed, valid HTML is still
the best input, of course. Only as much HTML syntax is applied as is
necessary to parse it. You can influence these syntax definitions by
picking another lexer. You can change the semantics by picking another
DTD query class.
This parser guarantees that for each starttag event that is not
self-closing, an endtag event is generated as well (if the endtag is not
actually there, the data parameter is an empty string). This also happens
for empty tags (like br). On the other hand, there may be more endtag
events than starttag events because of unbalanced or wrongly nested tags.
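A listener can tell a generated endtag from a real one by checking the
data parameter. The following is only a minimal sketch: the class name
EndtagAudit is hypothetical, the handler signature handle_endtag(name, data)
is an assumption that is not confirmed by this page, and the events are
simulated by hand rather than produced by the parser.

```python
class EndtagAudit(object):
    """Hypothetical listener fragment that sorts endtag events."""

    def __init__(self):
        self.real = []     # endtags that were present in the input
        self.implied = []  # endtags generated by the parser

    def handle_endtag(self, name, data):
        # Assumed signature. An endtag that is missing from the input
        # arrives with an empty data string.
        if data == '':
            self.implied.append(name)
        else:
            self.real.append(name)


audit = EndtagAudit()
# Hand-simulated events, roughly what "<p>one<br><p>two</p>" might yield
audit.handle_endtag('br', '')
audit.handle_endtag('p', '')
audit.handle_endtag('p', '</p>')
```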
Special constructs (comments, PIs, marked sections and declarations)
may occur anywhere, i.e. they do not close elements implicitly.
The default lexer does not deal with NET (null end tag) tags
(<h1/Heading/). Nor does it handle starttags left unfinished under SGML
rules, like <map<area>. It does know about empty tags (<> and </>).
CDATA elements and comments are handled in a simplified way. Once
the particular state is entered, it is only left when the accompanying
end marker is found (<script>...</script>, <!-- ... -->).
Anything in between is text.
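The simplified CDATA handling can be pictured as a plain search for the
end marker. This is an illustrative sketch and not the lexer's code: the
function split_cdata is hypothetical, and the case-insensitive match is an
assumption.

```python
def split_cdata(text, name='script'):
    """Split the text following a CDATA starttag into (content, rest)."""
    marker = '</' + name
    # Assumption: the end marker is matched case-insensitively
    pos = text.lower().find(marker)
    if pos < 0:
        # End marker not seen yet, so everything so far is text
        return text, ''
    return text[:pos], text[pos:]


content, rest = split_cdata(
    'if (a < b) { document.write("<b>"); }</SCRIPT><p>'
)
```

Nothing inside the content is interpreted as markup, so neither the "<"
operator nor the "<b>" string literal produces a tag event.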
How is it used?
The parser API is "streamy" on the input side and event-based on the
output side. So what you need first is a building listener, which will
receive all generated parser events and process them. Such a listener
object is expected to implement the BuildingListenerInterface.
Now you create a SoupParser instance, passing the listener object to
the constructor, and the parser is ready to be fed. You can feed as many
chunks of input data as you like into the parser by using the feed
method. Every feed call may generate multiple events on the output side.
When you're done feeding, call the parser's finalize method in order
to clean up. This also flushes pending events to the listener.
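The call sequence described above (construct with a listener, feed chunks,
finalize) can be sketched without the library. In this sketch ToyParser is
a hypothetical stand-in for SoupParser that follows far simpler rules, and
RecordingListener with its handler names and signatures is an assumption,
not the actual BuildingListenerInterface.

```python
import re


class RecordingListener(object):
    """Hypothetical listener that records every event it receives."""

    def __init__(self):
        self.events = []

    def handle_text(self, data):
        self.events.append(('text', data))

    def handle_starttag(self, name, data):
        self.events.append(('starttag', name, data))

    def handle_endtag(self, name, data):
        self.events.append(('endtag', name, data))


class ToyParser(object):
    """Stand-in for SoupParser: same feed/finalize flow, toy syntax."""

    _tag = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>')

    def __init__(self, listener):
        self._listener = listener
        self._buffer = ''

    def feed(self, food):
        """Take one chunk of input and emit every event that is complete."""
        self._buffer += food
        pos = 0
        for match in self._tag.finditer(self._buffer):
            if match.start() > pos:
                self._listener.handle_text(self._buffer[pos:match.start()])
            if match.group(1):
                self._listener.handle_endtag(match.group(2), match.group(0))
            else:
                self._listener.handle_starttag(match.group(2), match.group(0))
            pos = match.end()
        # Keep what cannot be decided yet, e.g. a tag split across chunks
        self._buffer = self._buffer[pos:]

    def finalize(self):
        """Flush whatever is still pending to the listener."""
        if self._buffer:
            self._listener.handle_text(self._buffer)
            self._buffer = ''


listener = RecordingListener()
parser = ToyParser(listener)
for chunk in ('<p>Hel', 'lo</', 'p>bye'):
    parser.feed(chunk)
pending = list(listener.events)  # 'bye' has not been reported yet
parser.finalize()
```

The chunk boundaries deliberately cut through both the text and the endtag.
The trailing text only reaches the listener on finalize, which is why that
call should not be skipped.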
Inherited from object: __delattr__, __format__, __getattribute__,
__hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__,
__sizeof__, __str__, __subclasshook__