Miscellaneous HTML tools

The tools described here can be found in the → tdi.tools.html module. It contains node manipulation helpers and tools for encoding, decoding and minifying HTML.

Decoding HTML Text

Text in HTML documents is encoded in two ways: First of all, the document itself with all the markup is presented in a particular character encoding, for example, UTF-8 or Windows-1252 or EUC-KR.

Second, characters that do not fit into the document encoding’s character set or that conflict with the markup itself can be encoded using character references (e.g. &#x20AC; – the € character).

To make things easier (from the user’s point of view) and more complicated (from the implementor’s point of view) at the same time, certain characters can be referenced using named character entities (e.g. &euro; – the € character again, or &amp; – the & character, which obviously conflicts with the markup and always needs to be referenced like this).

TDI encodes unicode input using only the first two options (except for the basic named references &lt;, &gt;, &amp; and &quot;). However, in order to manipulate existing HTML text, it needs to understand all three when decoding it.
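
The encoding direction can be sketched with Python 3's standard library alone. This is not TDI's implementation: encode_text is a hypothetical helper, and combining html.escape with the xmlcharrefreplace error handler is an assumption made purely for illustration. It applies the basic named references first, then the document encoding, and falls back to numeric character references for anything the encoding cannot represent:

```python
from html import escape  # Python 3 standard library, not tdi.tools.html


def encode_text(text, encoding):
    """Hypothetical helper: escape markup, then encode for the document."""
    # Basic named references for characters that conflict with the markup.
    escaped = escape(text, quote=True)
    # Characters outside the target character set become numeric references.
    return escaped.encode(encoding, 'xmlcharrefreplace')


# The euro sign exists in Windows-1252 (byte 0x80), the Greek sigma does not.
encoded = encode_text(u'5 \u20ac < 6 \u03a3', 'cp1252')
print(encoded)
```

The euro sign ends up as a plain byte of the document encoding, while the sigma is written as a numeric reference and the < as a named one.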

Named Character Entities

TDI ships with the mapping of named character entities: → tdi.tools.html.entities. The mapping is directly generated from the HTML5 specification. It’s a superset of the entities defined in HTML 4 / XHTML 1.0 and below, so it can be applied safely for such documents as well.

Python’s standard library provides a similar mapping starting with version 3.3.

The mapping is used by default for the decode function described below.
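
For comparison, the standard library mapping mentioned above can be inspected as follows. This sketch uses Python 3's html.entities module only and makes no claim about how TDI's own mapping is laid out; the variable names are illustrative:

```python
from html.entities import html5, name2codepoint

# Keys carry no leading "&"; the regular form includes the trailing ";".
euro = html5['euro;']
amp = html5['amp;']

# name2codepoint is the older HTML 4 table (name -> code point).
html5_names = set(name.rstrip(';') for name in html5)
not_in_html5 = sorted(set(name2codepoint) - html5_names)

print(repr(euro), repr(amp), not_in_html5)
```

The last list collects HTML 4 names that are absent from the HTML5 table, which is one way to check the superset relationship described above.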

decode

→ tdi.tools.html.decode takes some HTML encoded text and returns the equivalent unicode string. The input can be unicode itself or a byte string. In the latter case, the encoding should be passed, too.

Here’s a simple example:

from tdi.tools import html

result = html.decode('Andr&eacute; Malo, \x80', 'cp1252')
print result.encode('unicode_escape')
print result.encode('utf-8')

... that produces the following output (if the output medium accepts UTF-8):

Andr\xe9 Malo, \u20ac
André Malo, €

The decode function is used by TDI itself whenever it needs to interpret HTML text (usually attributes), for example when examining the parser events for tdi-attributes or with the class-functions described next.
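
The two decoding steps (document encoding first, character references second) can be reproduced with Python 3's standard library. The following is a rough sketch, not TDI's decoder: decode_html is a hypothetical name, and html.unescape may treat edge cases differently than tdi.tools.html.decode does.

```python
from html import unescape  # Python 3 standard library, not tdi.tools.html


def decode_html(value, encoding='utf-8'):
    """Hypothetical helper: bytes are decoded first, references second."""
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return unescape(value)


# One euro sign per form: raw cp1252 byte, decimal and hex reference.
result = decode_html(b'Andr&eacute; Malo, \x80 &#8364; &#x20AC;', 'cp1252')
print(result.encode('unicode_escape'))
```

All three spellings of the euro sign decode to the same character, U+20AC.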

class_add, class_del

→ tdi.tools.html.class_add and → tdi.tools.html.class_del are simple functions to modify the class attribute of a node. This has proven to be such a common task, and such an ugly one to implement, that TDI ships with these tiny helpers. Both functions take a node and a variable number of class names, either to add or to remove:

from tdi import html as html_template
from tdi.tools import html

tpl = html_template.from_string("""
<div tdi="div1" class="open">Container 1</div>
<div tdi="div2">Container 2</div>
""".lstrip())

class Model(object):
    def render_div1(self, node):
        html.class_del(node, u'open')

    def render_div2(self, node):
        html.class_add(node, u'open', u'highlight')

tpl.render(Model())

This example removes the “open” class from div1 and adds the “open” and “highlight” classes to div2:

<div>Container 1</div>
<div class="open highlight">Container 2</div>
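
To illustrate why this is tedious to write by hand, here is a minimal sketch of the underlying string handling. It works on a plain attribute value instead of a TDI node; add_class_names and remove_class_names are hypothetical names, and the real helpers may behave differently in detail (for example regarding ordering or duplicates).

```python
def add_class_names(value, *names):
    """Return the class attribute value with the given names appended."""
    classes = (value or u'').split()
    for name in names:
        if name not in classes:  # avoid duplicates
            classes.append(name)
    return u' '.join(classes) or None


def remove_class_names(value, *names):
    """Return the class attribute value without the given names."""
    classes = [name for name in (value or u'').split() if name not in names]
    # None signals "drop the attribute entirely" in this sketch.
    return u' '.join(classes) or None


print(remove_class_names(u'open', u'open'))
print(add_class_names(None, u'open', u'highlight'))
```

A missing attribute, an empty result and duplicate names all need separate handling, which is the part the shipped helpers take care of.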

Formatted Multiline Content

Formatted multiline content is text with indentations and newlines, typically displayed using a monospaced font. There’s already an HTML element for displaying such content: pre. However, pre may not work very well if the content should be part of a paragraph or is allowed to line-wrap.

The → tdi.tools.html.multiline function encodes such text to regular HTML, using <br> (or <br />) for line breaks and combinations of &nbsp; and spaces for indentations. It also expands tab characters. The input (unicode) will also be escaped and character encoded. The result can be assigned directly as raw content.

Simple code example:

from tdi.tools import html

print html.multiline(u"""
H\xe9llo World!

In December 2012 the world will go down.
\tBye world<!>
""".lstrip(), xhtml=False)

Output:

H&#233;llo World!<br>&nbsp;<br>In December 2012 the world will go down.<br>&nbsp; &nbsp; &nbsp; &nbsp; Bye world&lt;!&gt;
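
The transformation can be approximated in a few lines of Python 3. This is a simplified sketch under stated assumptions, not the library's algorithm: multiline_sketch is a hypothetical function, it only converts leading indentation (runs of spaces inside a line are left alone), and it always emits ASCII with numeric character references.

```python
from html import escape  # Python 3 standard library, not tdi.tools.html


def multiline_sketch(text, tab_width=8, xhtml=True):
    """Hypothetical helper: turn indented plain text into inline HTML."""
    br = u'<br />' if xhtml else u'<br>'
    result = []
    for line in text.expandtabs(tab_width).splitlines():
        stripped = line.lstrip(u' ')
        indent = len(line) - len(stripped)
        # Alternate &nbsp; and regular spaces so the indent cannot collapse.
        prefix = u'&nbsp; ' * (indent // 2) + u'&nbsp;' * (indent % 2)
        # Escape markup, then replace non-ASCII characters by references.
        body = escape(stripped, quote=False)
        body = body.encode('ascii', 'xmlcharrefreplace').decode('ascii')
        # Empty lines get a &nbsp; so they keep their height.
        result.append((prefix + body) or u'&nbsp;')
    return br.join(result)


print(multiline_sketch(u'H\xe9llo\n\n\tBye <!>\n', xhtml=False))
```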

Minifying HTML

Minifying reduces the size of a document by removing redundant or irrelevant content. Typically this includes whitespace and comments. Minifying HTML is hard, though: the markup is whitespace-sensitive in some places and not in others, browsers are buggy, specifications have changed over time, and so on.

TDI ships with an HTML minifier, which carefully removes spaces and comments. Additionally, in order to improve the ratio of a possible transport compression (like gzip), it sorts attributes alphabetically.
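
The whitespace problem is easy to demonstrate. The following deliberately naive sketch (naive_minify is a hypothetical function and has nothing to do with TDI's minifier) collapses every whitespace run and thereby damages the content of a pre element:

```python
import re


def naive_minify(markup):
    """Deliberately naive: collapse whitespace runs, drop space between tags."""
    collapsed = re.sub(r'\s+', u' ', markup)
    return re.sub(r'>\s+<', u'><', collapsed).strip()


source = u'<p>Hello</p>\n<pre>def f():\n    return 1</pre>\n'
print(naive_minify(source))
```

The line break and the indentation inside the pre element are gone, so the rendered page changes. A careful minifier has to know where whitespace is significant.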

There are two use cases here:

  1. Minify the HTML templates during the loading phase
  2. Minify some standalone HTML (maybe from a CMS)

The first case is handled by hooking the → tdi.tools.html.MinifyFilter into the template loader. See the filters documentation for a description of how to do that.

For the second case there’s the → tdi.tools.html.minify function.

The MinifyFilter only minifies HTML content. The minify function also minifies enclosed style and script blocks (by adding more filters from the javascript and css tools). So the following code:

from tdi.tools import html

print html.minify(u"""
<html>
<head>
    <!-- Here comes the title -->
    <title>Hello World!</title>
    <style>
        Some style.
    </style>
</head>
<body>
    <script>
        Some script.
    </script>
    <h1>Hello World!</h1>
</body>
""".lstrip())

emits:

<html><head><title>Hello World!</title><style>Some style.</style></head><body><script>Some script.</script><h1>Hello World!</h1></body>

Controlling Comment Stripping

By default both → MinifyFilter and → minify strip all comments. Sometimes certain comments have to be kept (for example for easier debugging or for marking stuff for other tools). TDI’s minifiers accept a comment_filter parameter – a function which decides whether a comment is stripped or passed through. The function can also modify the comment before passing it through. Here’s an example:

from tdi.tools import html

def keep_foo(comment):
    if comment == "<!-- foo -->":
        return comment

print html.minify(u"""
<html>
<head>
    <!-- Here comes the title -->
    <title>Hello World!</title>
    <style>
        Some style.
    </style>
</head>
<body>
    <!-- foo -->
    <script>
        Some script.
    </script>
    <h1>Hello World!</h1>
    <!-- bar -->
</body>
""".lstrip(), comment_filter=keep_foo)

Output:

<html><head><title>Hello World!</title><style>Some style.</style></head><body><!-- foo --><script>Some script.</script><h1>Hello World!</h1></body>
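
A comment filter may also rewrite the comments it keeps. The sketch below is a plain function that can be tried without TDI; the name keep_marked and the “keep:” marker convention are assumptions for illustration. As in the example above, it relies on the filter receiving the complete comment including its delimiters, and on a None result meaning “strip”:

```python
def keep_marked(comment):
    """Comment filter sketch: keep only comments that start with "keep:"."""
    # The filter receives the complete comment, delimiters included.
    inner = comment[len(u'<!--'):-len(u'-->')].strip()
    if inner.startswith(u'keep:'):
        # Return a (possibly modified) comment to pass it through.
        return u'<!-- %s -->' % inner[len(u'keep:'):].strip()
    # Returning None strips the comment.
    return None


print(keep_marked(u'<!-- keep: layout start -->'))
print(keep_marked(u'<!-- Here comes the title -->'))
```

Such a function would be passed as comment_filter=keep_marked, in the same way as keep_foo.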

If you have a factory filter setup, here’s how you pass your comment filter:

  1. Write the filter function
  2. Write a new filter factory function
  3. Pass the filter factory function to the template factory instead of tdi.tools.html.MinifyFilter

Sounds complicated. It’s not:

from tdi import html
from tdi.tools import html as html_tools

# 1. Comment filter
def keep_foo(comment):
    """ Comment filter """
    if comment == "<!-- foo -->":
        return comment

# 2. New filter factory
def html_minifyfilter(builder):
    """ HTML minifier factory """
    return html_tools.MinifyFilter(builder, comment_filter=keep_foo)

# 3. Template Factory
html = html.replace(eventfilters=[
    # ...
    html_minifyfilter, # instead of html_tools.MinifyFilter
    # ...
])

# 4. Do your thing.
tpl = html.from_string("""
<html>
<head>
    <!-- Here comes the title -->
    <title>Hello World!</title>
    <style>
        Some style.
    </style>
</head>
<body>
    <!-- foo -->
    <script>
        Some script.
    </script>
    <h1>Hello World!</h1>
    <!-- bar -->
</body>
""".lstrip())
tpl.render()