HTML Purifier
    by Edward Z. Yang

There are a number of ad hoc HTML filtering solutions out there on the web
(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that
claim to filter HTML properly, preventing malicious JavaScript and
layout-breaking HTML from getting through the parser. None of them, however,
demonstrates a thorough knowledge of either the DTD that defines HTML or the
caveats of HTML that cannot be expressed by a DTD. Configurable filters (such
as kses or PHP's built-in strip_tags() function) have trouble validating the
contents of attributes and can be subject to security attacks due to poor
configuration. Other filters take the naive approach of blacklisting known
threats and tags, failing to account for the introduction of new technologies,
new tags, new attributes or quirky browser behavior.

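The weakness of blacklisting can be shown with a small sketch. The filter
below is hypothetical (the names BLACKLIST and blacklist_filter are invented
for this illustration and do not come from any of the libraries named above):
it strips only <script> tags, so a payload carried by an event-handler
attribute passes through untouched.

```python
import re

# Hypothetical blacklist: it only knows about <script> start and end tags.
BLACKLIST = re.compile(r"<\s*/?\s*script\b[^>]*>", re.IGNORECASE)


def blacklist_filter(html):
    """Remove blacklisted tags and leave everything else as it is."""
    return BLACKLIST.sub("", html)


# The threat the author anticipated is neutralized ...
stripped = blacklist_filter("<script>alert(1)</script>")

# ... but an attribute-based payload is not on the list and survives intact.
payload = '<img src="x" onerror="alert(1)">'
survived = blacklist_filter(payload)
```

A whitelist turns the default around: anything not explicitly permitted is
removed, so an unanticipated tag or attribute is dropped rather than let
through.
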
However, HTML Purifier takes a different approach, one that doesn't use
specification-ignorant regexes or narrow blacklists. HTML Purifier decomposes
the whole document into tokens and rigorously processes them by removing
non-whitelisted elements, transforming bad-practice tags like <font> into
<span>, properly checking the nesting of tags and their children, and
validating all attributes according to their RFCs.

To my knowledge, there is nothing like this on the web yet. Not even MediaWiki,
which allows an amazingly diverse mix of HTML and wikitext in its documents,
gets all the nesting quirks right. Existing solutions hope that no JavaScript
will slip through, but either do not attempt to ensure that the resulting
output is valid XHTML or send the HTML through a draconian XML parser (and yet
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
tags from being nested within each other).

This document is no longer a detailed description of how HTML Purifier works,
as those descriptions have been moved to the appropriate code. The first
draft was drawn up after two rough code sketches and the implementation of a
forgiving lexer. You may also be interested in the unit tests located in the
tests/ folder, which provide a living document on how exactly the filter deals
with malformed input.

In summary (see corresponding classes for more details):

1. Parse document into an array of tag and text tokens (Lexer)
2. Remove all elements not on whitelist and transform certain other elements
   into acceptable forms (e.g. <font>)
3. Make document well formed while helpfully taking into account certain quirks,
   such as the fact that <p> tags traditionally are closed by other block-level
   elements.
4. Run through all nodes and check children for proper order (especially
   important for tables).
5. Validate attributes according to more restrictive definitions based on the
   RFCs.
6. Translate back into a string. (Generator)

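The steps above can be illustrated with a toy pipeline. This is a minimal
sketch in Python, not HTML Purifier's PHP implementation: the whitelist, the
<font> to <span> mapping, the http/https-only rule for href, and every function
name are assumptions made for the example. Step 4 (child order) is omitted, and
unlike a real purifier the sketch keeps the text content of elements it drops.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for this sketch: element name -> allowed attributes.
ALLOWED = {"p": set(), "b": set(), "i": set(), "span": set(), "a": {"href"}}
# Hypothetical transformation of a deprecated element.
TRANSFORM = {"font": "span"}


class Lexer(HTMLParser):
    """Step 1: turn markup into a flat list of start, end and text tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_data(self, data):
        self.tokens.append(("text", data, {}))


def filter_elements(tokens):
    """Step 2: rename transformable elements, drop tags not on the whitelist."""
    out = []
    for kind, value, attrs in tokens:
        if kind != "text":
            value = TRANSFORM.get(value, value)
            if value not in ALLOWED:
                continue  # the tag is dropped, its text content is kept
        out.append((kind, value, attrs))
    return out


def make_well_formed(tokens):
    """Step 3: drop stray end tags and close whatever is still open."""
    out, open_tags = [], []
    for kind, value, attrs in tokens:
        if kind == "start":
            open_tags.append(value)
        elif kind == "end":
            if value not in open_tags:
                continue  # end tag without a matching start tag
            # Close inner elements left open, then fall through to close this one.
            while True:
                name = open_tags.pop()
                if name == value:
                    break
                out.append(("end", name, {}))
        out.append((kind, value, attrs))
    out.extend(("end", name, {}) for name in reversed(open_tags))
    return out


def validate_attributes(tokens):
    """Step 5: keep whitelisted attributes; allow only http(s) URLs in href."""
    out = []
    for kind, value, attrs in tokens:
        if kind == "start":
            attrs = {k: v for k, v in attrs.items()
                     if k in ALLOWED[value] and v is not None}
            href = attrs.get("href", "")
            if href and not href.lower().startswith(("http://", "https://")):
                del attrs["href"]
        out.append((kind, value, attrs))
    return out


def generate(tokens):
    """Step 6: translate the token list back into a string."""
    parts = []
    for kind, value, attrs in tokens:
        if kind == "text":
            parts.append(escape(value, quote=False))
        elif kind == "start":
            rendered = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
            parts.append(f"<{value}{rendered}>")
        else:
            parts.append(f"</{value}>")
    return "".join(parts)


def purify_sketch(html):
    lexer = Lexer()
    lexer.feed(html)
    lexer.close()
    tokens = filter_elements(lexer.tokens)
    tokens = make_well_formed(tokens)
    tokens = validate_attributes(tokens)
    return generate(tokens)
```

For example, purify_sketch('<p onclick="x()"><font color="red">Hi <b>there</font></p>')
yields '<p><span>Hi <b>there</b></span></p>': the event handler and the color
attribute are removed, <font> becomes <span>, and the unclosed <b> is closed
before its parent.
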
HTML Purifier is best suited for documents that require a rich array of
HTML tags. Things like blog comments are, in all likelihood, most appropriately
written in an extremely restrictive set of markup that doesn't require
all this functionality (or not written in HTML at all), although this may
be changing in the future with the addition of levels of filtering.

    vim: et sw=4 sts=4