Package tdi :: Package markup :: Package soup :: Module parser :: Class SoupParser

Class SoupParser


object --+
         |
        SoupParser

Overview

The parser is a tagsoup parser by design, so that it can process most of the "HTML" found in the wild. Well-formed, valid HTML is of course the best input. Only as much HTML syntax is applied as is necessary to parse the input. You can influence these syntax definitions by picking another lexer, and you can change the semantics by picking another DTD query class.

This parser guarantees that for each starttag event that is not self-closing, an endtag event is generated as well (if the endtag is not actually present in the input, the data parameter is an empty string). This also happens for empty tags (like br). On the other hand, there may be more endtag events than starttag events because of unbalanced or wrongly nested tags.
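As a rough illustration of what this guarantee means for a listener, the following sketch tracks open elements and tolerates surplus endtags. The class name BalanceChecker is hypothetical, the events are simulated by hand rather than produced by the parser, and it is an assumption that the closed argument is true for self-closing starttags and that tag names can be compared as delivered.

```python
class BalanceChecker(object):
    """Hypothetical listener fragment: tracks open elements by name."""

    def __init__(self):
        self.open = []      # names of elements currently open
        self.implicit = []  # endtags that were not present in the input
        self.stray = []     # endtags without a matching open element

    def handle_starttag(self, name, attrs, closed, data):
        # Assumption: ``closed`` is true for self-closing starttags,
        # for which no endtag event is guaranteed.
        if not closed:
            self.open.append(name)

    def handle_endtag(self, name, data):
        # An empty ``data`` string marks an endtag missing from the input.
        if not data:
            self.implicit.append(name)
        if self.open and self.open[-1] == name:
            self.open.pop()
        else:
            # Surplus endtags are possible with unbalanced markup.
            self.stray.append(name)


# Hand-written event sequence for something like "<p><br></p></div>".
checker = BalanceChecker()
checker.handle_starttag('p', [], False, '<p>')
checker.handle_starttag('br', [], False, '<br>')
checker.handle_endtag('br', '')
checker.handle_endtag('p', '</p>')
checker.handle_endtag('div', '</div>')
```

After these calls, checker.open is empty, checker.implicit holds 'br' and checker.stray holds 'div'. The attrs argument is passed as an empty list only for the sake of the example; its real shape is not described on this page.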

Special constructs (comments, PIs, marked sections and declarations) may occur anywhere, i.e. they do not implicitly close elements.

The default lexer does not deal with NET tags (<h1/Heading/). Nor does it handle starttags left unfinished under SGML rules, such as <map<area>. It does know about empty tags (<> and </>).

CDATA elements and comments are handled in a simplified way. Once the particular state is entered, it is only left when the accompanying end marker is found (<script>...</script>, <!-- ... -->). Anything in between is text.
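The simplified handling can be pictured as a plain search for the end marker. The helper below is only an illustration of that idea, not the lexer's actual code; the name consume_until is hypothetical.

```python
def consume_until(buf, end_marker):
    """
    Split ``buf`` at the first occurrence of ``end_marker``.

    Returns ``(text, rest)``. ``rest`` is ``None`` while the marker has
    not been seen, meaning the state is not left yet.
    """
    pos = buf.find(end_marker)
    if pos < 0:
        return buf, None
    return buf[:pos], buf[pos + len(end_marker):]


# Inside a comment, markup-looking content is plain text until "-->".
comment_text, after_comment = consume_until(
    ' a <b>bold</b> note --> tail', '-->'
)

# Inside a script element, nothing ends the state except "</script>".
script_text, after_script = consume_until('var x = "<p>";', '</script>')
```

Here comment_text still contains the <b> markup as text, and after_script is None because no end marker was found. A streaming implementation would additionally have to cope with an end marker that is split across two chunks, which this sketch ignores.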

How is it used?

The parser API is "streamy" on the input side and event-based on the output side. So what you need first is a building listener, which receives all generated parser events and processes them. Such a listener object is expected to implement the BuildingListenerInterface.

Now create a SoupParser instance, passing the listener object to the constructor, and the parser is ready to be fed. You can feed as many chunks of input data as you like into the parser by using the feed method. Every feed call may generate multiple events on the output side. When you're done feeding, call the parser's finalize method in order to clean up. This also flushes pending events to the listener.
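The workflow described above might look roughly like the following sketch. RecordingListener and parse_chunks are hypothetical names. The listener implements only the handle_* methods listed on this page, so the real BuildingListenerInterface may require more than is shown here. The import path is taken from this page's module location, and it is imported lazily so the listener can be tried without TDI installed.

```python
class RecordingListener(object):
    """Hypothetical listener that records each event as a tuple."""

    def __init__(self):
        self.events = []

    def handle_text(self, data):
        self.events.append(('text', data))

    def handle_starttag(self, name, attrs, closed, data):
        self.events.append(('starttag', name, attrs, closed, data))

    def handle_endtag(self, name, data):
        self.events.append(('endtag', name, data))

    def handle_comment(self, data):
        self.events.append(('comment', data))

    def handle_msection(self, name, value, data):
        self.events.append(('msection', name, value, data))

    def handle_decl(self, name, value, data):
        self.events.append(('decl', name, value, data))

    def handle_pi(self, data):
        self.events.append(('pi', data))

    def handle_escape(self, escaped, data):
        self.events.append(('escape', escaped, data))


def parse_chunks(chunks, listener):
    """Feed an iterable of input chunks to a SoupParser, then finalize."""
    # Lazy import: only needed when real parsing is requested.
    from tdi.markup.soup.parser import SoupParser

    parser = SoupParser.html(listener)  # or SoupParser.xml(listener)
    for chunk in chunks:
        parser.feed(chunk)              # each call may emit several events
    parser.finalize()                   # flushes pending events
    return listener
```

With TDI available, a call such as parse_chunks(['<p>Hello', ' world</p>'], RecordingListener()) would be the intended usage. Which events end up in listener.events depends on the parser and is not asserted here.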

Instance Methods
 
__init__(self, listener, dtd, lexer=None)
Initialization

handle_text(self, data)

handle_starttag(self, name, attrs, closed, data)

handle_endtag(self, name, data)

handle_comment(self, data)

handle_msection(self, name, value, data)

handle_decl(self, name, value, data)

handle_pi(self, data)

handle_escape(self, escaped, data)

feed(self, food)

finalize(self)

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Methods
SoupParser
html(cls, listener)
Construct a parser using the HTMLDTD
SoupParser
xml(cls, listener)
Construct a parser using the XMLDTD
Class Variables
  __implements__ = [<class 'tdi.interfaces.ListenerInterface'>, ...
Instance Variables
SoupLexer lexer
The lexer instance
BuildingListenerInterface listener
The building listener to send the events to
Properties

Inherited from object: __class__

Method Details

__init__(self, listener, dtd, lexer=None)
(Constructor)

Initialization
Parameters:
  • listener (ListenerInterface) - The building listener
  • dtd (DTDInterface) - DTD query object
  • lexer (callable) - Lexer class/factory. This must be a callable taking an event listener and returning a lexer instance. If omitted or None, the default lexer (DEFAULT_LEXER) is used.
Overrides: object.__init__
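A lexer factory in the sense of the lexer parameter is any callable that accepts the listener and returns a lexer. The sketch below shows one way to build such a callable with functools.partial. StubLexer and its strict option are hypothetical stand-ins and not part of TDI.

```python
import functools


class StubLexer(object):
    """Hypothetical stand-in for a lexer class; not part of TDI."""

    def __init__(self, listener, strict=False):
        self.listener = listener
        self.strict = strict


# Bind extra options up front; the result takes just the event listener.
lexer_factory = functools.partial(StubLexer, strict=True)

# The parser would call the factory with its event listener, roughly so:
lexer = lexer_factory('the-listener')

# With TDI, the factory would be passed along as:
#     SoupParser(listener, dtd, lexer=lexer_factory)
```

The string 'the-listener' merely stands in for a listener object. Using a partial keeps the factory a plain callable, which is all the parameter description asks for.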

html(cls, listener)
Class Method

Construct a parser using the HTMLDTD
Parameters:
  • listener - The building listener
Returns: SoupParser
The new parser instance

xml(cls, listener)
Class Method

Construct a parser using the XMLDTD
Parameters:
  • listener - The building listener
Returns: SoupParser
The new parser instance

handle_text(self, data)

See Also: ListenerInterface

handle_starttag(self, name, attrs, closed, data)

See Also: ListenerInterface

handle_endtag(self, name, data)

See Also: ListenerInterface

handle_comment(self, data)

See Also: ListenerInterface

handle_msection(self, name, value, data)

See Also: ListenerInterface

handle_decl(self, name, value, data)

See Also: ListenerInterface

handle_pi(self, data)

See Also: ListenerInterface

handle_escape(self, escaped, data)

See Also: ListenerInterface

feed(self, food)

See Also: ParserInterface

finalize(self)

Raises:

See Also: ParserInterface


Class Variable Details

__implements__

Value:
[<class 'tdi.interfaces.ListenerInterface'>,
 <class 'tdi.interfaces.ParserInterface'>]

Instance Variable Details

lexer

The lexer instance
Type:
SoupLexer

listener

The building listener to send the events to
Type:
BuildingListenerInterface