first commit
26
htmlpurifier-4.10.0/docs/dev-advanced-api.html
Executable file
@ -0,0 +1,26 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Specification for HTML Purifier's advanced API for defining custom filtering behavior." />
<link rel="stylesheet" type="text/css" href="style.css" />

<title>Advanced API - HTML Purifier</title>

</head><body>

<h1>Advanced API</h1>

<div id="filing">Filed under Development</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>

<p>
Please see <a href="enduser-customize.html">Customize!</a>
</p>

</body></html>

<!-- vim: et sw=4 sts=4
-->
30
htmlpurifier-4.10.0/docs/dev-code-quality.txt
Executable file
@ -0,0 +1,30 @@

Code Quality Issues

Okay, face it. Programmers can get lazy, cut corners, or make mistakes. They
also can do quick prototypes, and then forget to rewrite them later. Well,
while I can't list mistakes in here, I can list prototype-like segments
of code that should be aggressively refactored. This does not list
optimization issues; those need to be addressed after intense profiling.

docs/examples/demo.php - ad hoc HTML/PHP soup to the extreme

AttrDef - a lot of duplication, more generic classes need to be created;
a lot of strtolower() calls, no legit casing
  Class - doesn't support Unicode characters (fringe); uses regular expressions
  Lang - code duplication; premature optimization
  Length - easily mistaken for CSSLength
  URI - multiple regular expressions; missing validation for parts (?)
  CSS - parser doesn't accept advanced CSS (fringe)
  Number - constructor interface inconsistent with Integer
Strategy
  FixNesting - cannot bubble nodes out of structures, duplicated checks
    for special-case parent node
  RemoveForeignElements - should be run in parallel with MakeWellFormed
URIScheme - needs to have callable generic checks
  mailto - doesn't validate emails, doesn't validate querystring
  news - doesn't validate opaque path
  nntp - doesn't constrain path
  tel - doesn't validate phone numbers, only allows characters '+', '1-9', and 'x'

    vim: et sw=4 sts=4
79
htmlpurifier-4.10.0/docs/dev-config-bcbreaks.txt
Executable file
@ -0,0 +1,79 @@

Configuration Backwards-Compatibility Breaks

In version 4.0.0, the configuration subsystem (composed of the outwards
facing Config class, as well as the ConfigSchema and ConfigSchema_Interchange
subsystems) was significantly revamped to make use of property lists.
While most of the changes are internal, some internal APIs were changed for the
sake of clarity. HTMLPurifier_Config was kept completely backwards compatible,
although some of the functions were retrofitted with an unambiguous alternate
syntax. Both of these changes are discussed in this document.



1. Outwards Facing Changes
--------------------------------------------------------------------------------

The HTMLPurifier_Config class now takes an alternate syntax. The general rule
is:

    If you passed $namespace, $directive, pass "$namespace.$directive"
    instead.

An example:

    $config->set('HTML', 'Allowed', 'p');

becomes:

    $config->set('HTML.Allowed', 'p');

New configuration options may have more than one namespace; they might
look something like %Filter.YouTube.Blacklist. While you could technically
set it with ('Filter', 'YouTube.Blacklist'), the logical extension
('Filter', 'YouTube', 'Blacklist') does not work.

The old API will still work, but will emit E_USER_NOTICEs.
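
As an illustration of the general rule, here is a minimal sketch for code
that still receives the old two-argument form. The helper name
normalize_directive_key() is hypothetical and is not part of HTML Purifier;
it only builds the dotted key that the new syntax expects.

```php
<?php
// Hypothetical helper (not an HTML Purifier API): maps the legacy
// ($namespace, $directive) pair onto the single dotted key.
function normalize_directive_key($namespaceOrKey, $directive = null)
{
    if ($directive === null) {
        // Already in the new "Namespace.Directive" form.
        return $namespaceOrKey;
    }
    return $namespaceOrKey . '.' . $directive;
}

echo normalize_directive_key('HTML', 'Allowed'), "\n"; // HTML.Allowed
echo normalize_directive_key('HTML.Allowed'), "\n";    // HTML.Allowed
```

The result can then be passed as the first argument to $config->set(), which
avoids the E_USER_NOTICE that the old calling convention triggers.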



2. Internal API Changes
--------------------------------------------------------------------------------

Some overarching notes: we've completely eliminated the notion of namespace;
it's now an informal construct for organizing related configuration directives.

Also, the validation routines for keys (formerly "$namespace.$directive")
have been completely relaxed. I don't think strict validation should really
be necessary.

2.1 HTMLPurifier_ConfigSchema

First off, if you're interfacing with this class, you really shouldn't.
HTMLPurifier_ConfigSchema_Builder_ConfigSchema is really the only class that
should ever be creating HTMLPurifier_ConfigSchema, and HTMLPurifier_Config the
only class that should be reading it.

All namespace related methods were removed; they are completely unnecessary
now. Any $namespace, $name arguments must be replaced with $key (where
$key == "$namespace.$name"), including for addAlias().

The $info and $defaults member variables are no longer indexed as
[$namespace][$name]; they are now indexed as ["$namespace.$name"].
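
The change of index can be pictured with a small sketch. The function
flatten_schema_index() and the sample array are made up for illustration;
they show the shape of the data before and after, not the library's own
migration code.

```php
<?php
// Hypothetical illustration: turn an array indexed as
// [$namespace][$name] into one indexed as ["$namespace.$name"].
function flatten_schema_index(array $nested)
{
    $flat = array();
    foreach ($nested as $namespace => $directives) {
        foreach ($directives as $name => $value) {
            $flat["$namespace.$name"] = $value;
        }
    }
    return $flat;
}

// Sample data invented for this example.
$old = array(
    'HTML' => array('Allowed' => null),
    'Core' => array('Encoding' => 'utf-8'),
);
$new = flatten_schema_index($old);
print_r(array_keys($new)); // the keys are now 'HTML.Allowed' and 'Core.Encoding'
```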

All deprecated methods were finally removed, after having yelled at you as
an E_USER_NOTICE for a while now.

2.2 HTMLPurifier_ConfigSchema_Interchange

Member variable $namespaces was removed.

2.3 HTMLPurifier_ConfigSchema_Interchange_Id

Member variables $namespace and $directive were removed; member variable $key
was added. Any method that took $namespace, $directive now takes $key.

2.4 HTMLPurifier_ConfigSchema_Interchange_Namespace

Removed.

    vim: et sw=4 sts=4
164
htmlpurifier-4.10.0/docs/dev-config-naming.txt
Executable file
@ -0,0 +1,164 @@
Configuration naming

HTML Purifier 4.0.0 features a new configuration naming system that
allows arbitrary nesting of namespaces. While there are certain cases
in which using two namespaces is obviously better (the canonical example
is where we were using AutoFormatParam to contain directives for AutoFormat
parameters), it is unclear whether a general migration to highly
namespaced directives is a good idea.

== Case studies ==

=== Attr.* ===

We have a dead duck HTML.Attr.Name.UseCDATA which migrated before we decided
to think this out thoroughly.

We currently have a large number of directives in the Attr.* namespace.
These directives tweak the behavior of some HTML attributes. They have
the properties:

* While they apply to only one attribute at a time, the attribute can
  span over multiple elements (not necessarily all elements, either).
  The information of which elements it impacts is either omitted or
  informally stated (EnableID applies to all elements, DefaultImageAlt
  applies to <img> tags, AllowedRev doesn't say but only applies to <a> tags).

* There is a certain degree of clustering that could be applied, especially
  to the ID directives. The clustering could be done with respect to
  what element/attribute was used, i.e.

    *.id -> EnableID, IDBlacklistRegexp, IDBlacklist, IDPrefixLocal, IDPrefix
    img.src -> DefaultInvalidImage
    img.alt -> DefaultImageAlt, DefaultInvalidImageAlt
    bdo.dir -> DefaultTextDir
    a.rel -> AllowedRel
    a.rev -> AllowedRev
    a.target -> AllowedFrameTargets
    a.name -> Name.UseCDATA

* The directives often reference generic attribute types that were specified
  in the DTD/specification. However, some of the behavior specifically relies
  on the fact that other use cases of the attribute are not, at present,
  supported by HTML Purifier.

    AllowedRel, AllowedRev -> heavily <a> specific; if <link> ends up being
        allowed, we will also have to give users specificity there (we also
        want to preserve generality) DTD %Linktypes, HTML5 distinguishes
        between <link> and <a>/<area>
    AllowedFrameTargets -> heavily <a> specific, but also used by <area>
        and <form>. Transitional DTD %FrameTarget, not present in strict,
        HTML5 calls them "browsing contexts"
    Default*Image* -> as a default parameter, is almost entirely exclusive
        to <img>
    EnableID -> global attribute
    Name.UseCDATA -> heavily <a> specific, but has heavy other usage by
        many things

=== AutoFormat.* ===

These have the fairly normal pluggable architecture that lends itself to
a large number of namespaces (pluggability may be the key to figuring
out when gratuitous namespacing is good.) Properties:

* Boolean directives are fair game for being namespaced: for example,
  RemoveEmpty.RemoveNbsp triggers RemoveEmpty.RemoveNbsp.Exceptions,
  the latter of which only makes sense when RemoveEmpty.RemoveNbsp
  is set to true. (The same applies to RemoveNbsp, which only makes
  sense when RemoveEmpty is enabled.)

The AutoFormat string is a bit long, but is the only bit of repeated
context.

=== Core.* ===

Core is the potpourri of directives, mostly regarding some minor behavioral
tweaks for HTML handling abilities.

    AggressivelyFixLt
    ConvertDocumentToFragment
    DirectLexLineNumberSyncInterval
    LexerImpl
    MaintainLineNumbers
        Lexer
    CollectErrors
    Language
        Error handling (Language is ostensibly a little more general, but
        it's only used for error handling right now)
    ColorKeywords
        CSS and HTML
    Encoding
    EscapeNonASCIICharacters
        Character encoding
    EscapeInvalidChildren
    EscapeInvalidTags
    HiddenElements
    RemoveInvalidImg
        Lexing/Output
    RemoveScriptContents
        Deprecated

=== HTML.* ===

    AllowedAttributes
    AllowedElements
    AllowedModules
    Allowed
    ForbiddenAttributes
    ForbiddenElements
        Element set tuning
    BlockWrapper
        Child def advanced twiddle
    CoreModules
    CustomDoctype
        Advanced HTMLModuleManager twiddles
    DefinitionID
    DefinitionRev
        Caching
    Doctype
    Parent
    Strict
    XHTML
        Global environment
    MaxImgLength
        Attribute twiddle? (applies to two attributes)
    Proprietary
    SafeEmbed
    SafeObject
    Trusted
        Extra functionality/tagsets
    TidyAdd
    TidyLevel
    TidyRemove
        Tidy

=== Output.* ===

These directly affect the output of Generator. These are all advanced
twiddles.

=== URI.* ===

    AllowedSchemes
    OverrideAllowedSchemes
        Scheme tuning
    Base
    DefaultScheme
    Host
        Global environment
    DefinitionID
    DefinitionRev
        Caching
    DisableExternalResources
    DisableExternal
    DisableResources
    Disable
        Contextual/authority tuning
    HostBlacklist
        Authority tuning
    MakeAbsolute
    MungeResources
    MungeSecretKey
    Munge
        Transformation behavior (munge can be grouped)

412
htmlpurifier-4.10.0/docs/dev-config-schema.html
Executable file
@ -0,0 +1,412 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Describes config schema framework in HTML Purifier." />
<link rel="stylesheet" type="text/css" href="./style.css" />
<title>Config Schema - HTML Purifier</title>
</head>
<body>

<h1>Config Schema</h1>

<div id="filing">Filed under Development</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>

<p>
HTML Purifier has a fairly complex system for configuration. Users
interact with a <code>HTMLPurifier_Config</code> object to
set configuration directives. The values they set are validated according
to a configuration schema, <code>HTMLPurifier_ConfigSchema</code>.
</p>

<p>
The schema is mostly transparent to end-users, but if you're doing development
work for HTML Purifier and need to define a new configuration directive,
you'll need to interact with it. We'll also talk about how to define
userspace configuration directives at the very end.
</p>

<h2>Write a directive file</h2>

<p>
Directive files define configuration directives to be used by
HTML Purifier. They are placed in <code>library/HTMLPurifier/ConfigSchema/schema/</code>
in the form <code><em>Namespace</em>.<em>Directive</em>.txt</code> (I
couldn't think of a more descriptive file extension.)
Directive files are actually what we call <code>StringHash</code>es,
i.e. associative arrays represented in a string form reminiscent of
<a href="http://qa.php.net/write-test.php">PHPT</a> tests. Here's a
sample directive file, <code>Test.Sample.txt</code>:
</p>

<pre>Test.Sample
TYPE: string/null
DEFAULT: NULL
ALLOWED: 'foo', 'bar'
VALUE-ALIASES: 'baz' => 'bar'
VERSION: 3.1.0
--DESCRIPTION--
This is a sample configuration directive for the purposes of the
<code>dev-config-schema.html</code> documentation.
--ALIASES--
Test.Example</pre>

<p>
Each of these segments has a specific meaning:
</p>

<table class="table">
<thead>
<tr>
<th>Key</th>
<th>Example</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ID</td>
<td>Test.Sample</td>
<td>The name of the directive, in the form Namespace.Directive
(implicitly the first line)</td>
</tr>
<tr>
<td>TYPE</td>
<td>string/null</td>
<td>The type of variable this directive accepts. See below for
details. You can also add <code>/null</code> to the end of
any basic type to allow null values too.</td>
</tr>
<tr>
<td>DEFAULT</td>
<td>NULL</td>
<td>A parseable PHP expression of the default value.</td>
</tr>
<tr>
<td>DESCRIPTION</td>
<td>This is a...</td>
<td>An HTML description of what this directive does.</td>
</tr>
<tr>
<td>VERSION</td>
<td>3.1.0</td>
<td><em>Recommended</em>. The version of HTML Purifier in which this
directive was added. Directives that have been around since 1.0.0
don't have this, but any new ones should.</td>
</tr>
<tr>
<td>ALIASES</td>
<td>Test.Example</td>
<td><em>Optional</em>. A comma separated list of aliases for this directive.
This is most useful for backwards compatibility and should
not be used otherwise.</td>
</tr>
<tr>
<td>ALLOWED</td>
<td>'foo', 'bar'</td>
<td><em>Optional</em>. Set of allowed values for a directive,
a comma separated list of parseable PHP expressions. This
is only allowed for string, istring, text and itext TYPEs.</td>
</tr>
<tr>
<td>VALUE-ALIASES</td>
<td>'baz' => 'bar'</td>
<td><em>Optional</em>. Mapping of one value to another; it
should be a comma separated list of key-value pairs. This
is only allowed for string, istring, text and itext TYPEs.</td>
</tr>
<tr>
<td>DEPRECATED-VERSION</td>
<td>3.1.0</td>
<td><em>Not shown</em>. Indicates that the directive was
deprecated in this version.</td>
</tr>
<tr>
<td>DEPRECATED-USE</td>
<td>Test.NewDirective</td>
<td><em>Not shown</em>. Indicates what new directive should be
used instead. Note that the two directives may behave
differently, although the new one should cover the same use cases.
If they are identical, use an alias instead.</td>
</tr>
<tr>
<td>EXTERNAL</td>
<td>CSSTidy</td>
<td><em>Not shown</em>. Indicates if there is an external library
the user will need to download and install to use this configuration
directive. As of right now, this is merely a Google-able name; future
versions may also provide links and instructions.</td>
</tr>
</tbody>
</table>

<p>
Some notes on format and style:
</p>

<ul>
<li>
Each of these keys can be expressed in the short format
(<code>KEY: Value</code>) or the long format
(<code>--KEY--</code> with value beneath). You must use the
long format if multiple lines are needed, or if a long format
has been used already (that's why <code>ALIASES</code> in our
example is in the long format); otherwise, it's user preference.
</li>
<li>
The HTML descriptions should be wrapped at about 80 columns; do
not rely on editor word-wrapping.
</li>
</ul>
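
<p>
To make the two formats concrete, here is a minimal sketch of a reader for
them. It is an illustration only and not the actual
<code>HTMLPurifier_StringHashParser</code> implementation; the function name
<code>parse_string_hash_sketch()</code> is hypothetical, and it assumes keys
consist of uppercase letters and hyphens.
</p>

```php
<?php
// Hypothetical sketch; the real parser is HTMLPurifier_StringHashParser.
function parse_string_hash_sketch($text)
{
    $lines = preg_split('/\r\n|\r|\n/', trim($text));
    // The first line is implicitly the ID.
    $hash = array('ID' => trim(array_shift($lines)));
    $longKey = null;
    foreach ($lines as $line) {
        if (preg_match('/^--([A-Z-]+?)--$/', $line, $m)) {
            // Long format: everything up to the next --KEY-- is the value.
            $longKey = $m[1];
            $hash[$longKey] = array();
        } elseif ($longKey !== null) {
            // Once the long format has started, short-format lines are
            // no longer recognized, matching the rule described above.
            $hash[$longKey][] = $line;
        } elseif (strpos($line, ':') !== false) {
            // Short format: "KEY: Value" on a single line.
            list($key, $value) = explode(':', $line, 2);
            $hash[trim($key)] = trim($value);
        }
    }
    // Join the collected long-format lines back into strings.
    foreach ($hash as $key => $value) {
        if (is_array($value)) {
            $hash[$key] = implode("\n", $value);
        }
    }
    return $hash;
}

$sample = "Test.Sample\n"
    . "TYPE: string/null\n"
    . "DEFAULT: NULL\n"
    . "--DESCRIPTION--\n"
    . "A sample directive.\n"
    . "--ALIASES--\n"
    . "Test.Example\n";
$parsed = parse_string_hash_sketch($sample);
echo $parsed['TYPE'], "\n";
echo $parsed['ALIASES'], "\n";
```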

<p>
Also, as promised, here is the set of possible types:
</p>

<table class="table">
<thead>
<tr>
<th>Type</th>
<th>Example</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>string</td>
<td>'Foo'</td>
<td><a href="http://docs.php.net/manual/en/language.types.string.php">String</a> without newlines</td>
</tr>
<tr>
<td>istring</td>
<td>'foo'</td>
<td>Case insensitive ASCII string without newlines</td>
</tr>
<tr>
<td>text</td>
<td>"A<em>\n</em>b"</td>
<td>String with newlines</td>
</tr>
<tr>
<td>itext</td>
<td>"a<em>\n</em>b"</td>
<td>Case insensitive ASCII string with newlines</td>
</tr>
<tr>
<td>int</td>
<td>23</td>
<td>Integer</td>
</tr>
<tr>
<td>float</td>
<td>3.0</td>
<td>Floating point number</td>
</tr>
<tr>
<td>bool</td>
<td>true</td>
<td>Boolean</td>
</tr>
<tr>
<td>lookup</td>
<td>array('key' => true)</td>
<td>Lookup array, used with <code>isset($var[$key])</code></td>
</tr>
<tr>
<td>list</td>
<td>array('f', 'b')</td>
<td>List array, with ordered numerical indexes</td>
</tr>
<tr>
<td>hash</td>
<td>array('key' => 'val')</td>
<td>Associative array of keys to values</td>
</tr>
<tr>
<td>mixed</td>
<td>new stdClass</td>
<td>Any PHP variable is fine</td>
</tr>
</tbody>
</table>

<p>
The examples represent what will be returned out of the configuration
object; users have a little bit of leeway when setting configuration
values (for example, a lookup value can be specified as a list;
HTML Purifier will flip it as necessary.) These types are defined
in <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/VarParser.php">
library/HTMLPurifier/VarParser.php</a>.
</p>
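
<p>
The list-to-lookup flip mentioned above can be sketched as follows. This is
a hypothetical helper, <code>to_lookup_sketch()</code>, written for
illustration; it is not the code in <code>VarParser.php</code>, and it
assumes that any array which is not a plain list is already a lookup.
</p>

```php
<?php
// Hypothetical helper illustrating the list-to-lookup flip; not the
// library's own code (see VarParser.php for that).
function to_lookup_sketch(array $value)
{
    if ($value === array()) {
        return array();
    }
    // A list has the consecutive integer keys 0..n-1.
    $isList = array_keys($value) === range(0, count($value) - 1);
    if (!$isList) {
        return $value; // assumption: it is already a lookup
    }
    // Each list entry becomes a key mapped to true.
    return array_fill_keys($value, true);
}

$lookup = to_lookup_sketch(array('a', 'b'));
var_dump(isset($lookup['a'])); // membership test, as in the table above
var_dump(isset($lookup['c']));
```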

<p>
For more information on what values are allowed, and how they are parsed,
consult <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php">
library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php</a>, as well
as <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/Interchange/Directive.php">
library/HTMLPurifier/ConfigSchema/Interchange/Directive.php</a> for
the semantics of the parsed values.
</p>

<h2>Refreshing the cache</h2>

<p>
You may have noticed that your directive file isn't doing anything
yet. That's because it hasn't been added to the runtime
<code>HTMLPurifier_ConfigSchema</code> instance. Run
<code>maintenance/generate-schema-cache.php</code> to fix this.
If there were no errors, you're good to go! Don't forget to add
some unit tests for your functionality!
</p>

<p>
If you ever make changes to your configuration directives, you
will need to run this script again.
</p>

<h2>Adding in-house schema definitions</h2>

<p>
Placing stuff directly in HTML Purifier's source tree is generally not a
good idea, so HTML Purifier 4.0.0+ has some facilities in place to make your
life easier.
</p>

<p>
The first is to pass an extra parameter to <code>maintenance/generate-schema-cache.php</code>
with the location of your directory (a relative or absolute path will do). For example,
if I'm storing my custom definitions in <em>/var/htmlpurifier/myschema</em>, run:
<code>php maintenance/generate-schema-cache.php /var/htmlpurifier/myschema</code>.
</p>

<p>
Alternatively, you can create a small loader PHP file in the HTML Purifier base
directory named <code>config-schema.php</code> (this is the same directory
in which you would place a <code>test-settings.php</code> file). In this file, add
the following line for each directory you want to load:
</p>

<pre>$builder->buildDir($interchange, '/var/htmlpurifier/myschema');</pre>

<p>You can even load a single file using:</p>

<pre>$builder->buildFile($interchange, '/var/htmlpurifier/myschema/MyApp.Directive.txt');</pre>

<p>Storing custom definitions that you don't plan on sending back upstream in
a separate directory is <em>definitely</em> a good idea! Additionally, picking
a good namespace can go a long way to saving you grief if you want to use
someone else's change, but they picked the same name, or if HTML Purifier
decides to add support for a configuration directive that has the same name.</p>

<!-- TODO: how to name directives that rely on naming conventions -->

<h2>Errors</h2>

<p>
All directive files go through a rigorous validation process
through <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/Validator.php">
library/HTMLPurifier/ConfigSchema/Validator.php</a>, as well
as some basic checks during building. While
listing every error out here is out-of-scope for this document, we
can give some general tips for interpreting error messages.
There are two types of errors: builder errors and validation errors.
</p>

<h3>Builder errors</h3>

<blockquote>
<p>
<strong>Exception:</strong> Expected type string, got
integer in DEFAULT in directive hash 'Ns.Dir'
</p>
</blockquote>

<p>
You can identify a builder error by the keyword "directive hash."
These are the easiest to deal with, because they directly correspond
with your directive file. Find the offending directive file (which
is the directive hash plus the .txt extension), find the
offending index ("in DEFAULT" means the DEFAULT key) and fix the error.
This particular error would occur if your default value is not the same
type as TYPE.
</p>

<h3>Validation errors</h3>

<blockquote>
<p>
<strong>Exception:</strong> Alias 3 in valueAliases in directive
'Ns.Dir' must be a string
</p>
</blockquote>

<p>
These are a little trickier, because we're not actually validating
your directive file, or even the direct string hash representation.
We're validating an Interchange object, and the error messages do
not mention any string hash keys.
</p>

<p>
Nevertheless, it's not difficult to figure out what went wrong.
Read the "context" statements in reverse:
</p>

<dl>
<dt>in directive 'Ns.Dir'</dt>
<dd>This means we need to look at the directive file <code>Ns.Dir.txt</code></dd>
<dt>in valueAliases</dt>
<dd>There's no key actually called this, but there's one that's close:
VALUE-ALIASES. Indeed, that's where to look.</dd>
<dt>Alias 3</dt>
<dd>The value alias that is equal to 3 is the culprit.</dd>
</dl>

<p>
In this particular case, you're not allowed to alias integer values to
string values.
</p>

<p>
The most difficult part is translating the Interchange member variable (valueAliases)
into a directive file key (VALUE-ALIASES), but there's a one-to-one
correspondence currently. If the two formats diverge, any discrepancies
will be described in <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php">
library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php</a>.
</p>

<h2>Internals</h2>

<p>
Much of the configuration schema framework's codebase deals with
shuffling data from one format to another, and doing validation on this
data.
The keystone of all of this is the <code>HTMLPurifier_ConfigSchema_Interchange</code>
class, which represents the purest, parsed representation of the schema.
</p>

<p>
Hand-writing this data is unwieldy, however, so we write directive files.
These directive files are parsed by <code>HTMLPurifier_StringHashParser</code>
into <code>HTMLPurifier_StringHash</code>es, which then
are run through <code>HTMLPurifier_ConfigSchema_InterchangeBuilder</code>
to construct the interchange object.
</p>

<p>
From the interchange object, the data can be siphoned into other forms
using <code>HTMLPurifier_ConfigSchema_Builder</code> subclasses.
For example, <code>HTMLPurifier_ConfigSchema_Builder_ConfigSchema</code>
generates a runtime <code>HTMLPurifier_ConfigSchema</code> object,
which <code>HTMLPurifier_Config</code> uses to validate its incoming
data. There is also an XML serializer, which is used to build documentation.
</p>

</body>
</html>

<!-- vim: et sw=4 sts=4
-->
68
htmlpurifier-4.10.0/docs/dev-flush.html
Executable file
@ -0,0 +1,68 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Discusses when to flush HTML Purifier's various caches." />
<link rel="stylesheet" type="text/css" href="./style.css" />
<title>Flushing the Purifier - HTML Purifier</title>
</head>
<body>

<h1>Flushing the Purifier</h1>

<div id="filing">Filed under Development</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>

<p>
If you've been poking around the various folders in HTML Purifier,
you may have noticed the <code>maintenance</code> directory. Almost
all of these scripts are devoted to flushing out the various caches
HTML Purifier uses. Normal users don't have to worry about this:
regular library usage is transparent. However, when doing development
work on HTML Purifier, you may find you have to flush one of the
caches.
</p>

<p>
As a general rule of thumb, run <code>flush.php</code> whenever you make
any <em>major</em> changes, or when tests start mysteriously failing.
In more detail, run this script if:
</p>

<ul>
<li>
You added new source files to HTML Purifier's main library.
(see <code>generate-includes.php</code>)
</li>
<li>
You modified the configuration schema (see
<code>generate-schema-cache.php</code>). This usually means
adding or modifying files in <code>HTMLPurifier/ConfigSchema/schema/</code>,
although in rare cases modifying <code>HTMLPurifier/ConfigSchema.php</code>
will also require this.
</li>
<li>
You modified a Definition, or its subsystems. The most usual candidate
is <code>HTMLPurifier/HTMLDefinition.php</code>, which also encompasses
the files in <code>HTMLPurifier/HTMLModule/</code>. This also applies if you've
been <a href="enduser-customize.html">customizing definitions</a> without
the cache disabled. (see <code>flush-generation-cache.php</code>)
</li>
<li>
You modified source files, and have been using the standalone
version from the full installation. (see <code>generate-standalone.php</code>)
</li>
</ul>

<p>
You can check out the corresponding scripts for more information on what they
do.
</p>

</body></html>

<!-- vim: et sw=4 sts=4
-->
281
htmlpurifier-4.10.0/docs/dev-includes.txt
Executable file
@ -0,0 +1,281 @@
|
||||
|
||||
INCLUDES, AUTOLOAD, BYTECODE CACHES and OPTIMIZATION
|
||||
|
||||
The Problem
|
||||
-----------
|
||||
|
||||
HTML Purifier contains a number of extra components that are not used all
|
||||
of the time, only if the user explicitly specifies that we should use
|
||||
them.
|
||||
|
||||
Some of these optional components are optionally included (Filter,
|
||||
Language, Lexer, Printer), while others are included all the time
|
||||
(Injector, URIFilter, HTMLModule, URIScheme). We will stipulate that these
|
||||
are all developer specified: it is conceivable that certain Tokens are not
|
||||
used, but this is user-dependent and should not be trusted.
|
||||
|
||||
We should come up with a consistent way to handle these things and ensure
|
||||
that we get the maximum performance when there is bytecode caches and
|
||||
when there are not. Unfortunately, these two goals seem contrary to each
|
||||
other.
|
||||
|
||||
A peripheral issue is the performance of ConfigSchema, which has been
|
||||
shown to take a large, constant amount of initialization time, and is
|
||||
intricately linked to the issue of includes due to its pervasive use
|
||||
in our plugin architecture.
|
||||
|
||||
Pros and Cons
|
||||
-------------
|
||||
|
||||
We will assume that user-based extensions will be included by them.
|
||||
|
||||
Conditional includes:
|
||||
Pros:
|
||||
- User management is simplified; only a single directive needs to be set
|
||||
- Only necessary code is included
|
||||
Cons:
|
||||
- Doesn't play nicely with opcode caches
|
||||
- Adds complexity to standalone version
|
||||
- Optional configuration directives are not exposed without a little
|
||||
extra coaxing (not implemented yet)
|
||||
|
||||
Include it all:
|
||||
Pros:
|
||||
- User management is still simple
|
||||
- Plays nicely with opcode caches and standalone version
|
||||
- All configuration directives are present
|
||||
Cons:
|
||||
- Lots of (how much?) extra code is included
|
||||
- Classes that inherit from external libraries will cause compile
|
||||
errors
|
||||
|
||||
Build an include stub (Let's do this!):
|
||||
Pros:
|
||||
- Only necessary code is included
|
||||
- Plays nicely with opcode caches and standalone version
|
||||
- require (without once) can be used, see above
|
||||
- Could further extend as a compilation to one file
|
||||
Cons:
|
||||
- Not implemented yet
|
||||
- Requires user intervention and use of a command line script
|
||||
- Standalone script must be chained to this
|
||||
- More complex and compiled-language-like
|
||||
- Requires a whole new class of system-wide configuration directives,
|
||||
as configuration objects can be reused
|
||||
- Determining what needs to be included can be complex (see above)
|
||||
- No way of autodetecting dynamically instantiated classes
|
||||
- Might be slow
|
||||
|
||||
Include stubs
|
||||
-------------
|
||||
|
||||
This solution may be "just right" for users who are heavily oriented
|
||||
towards performance. However, there are a number of picky implementation
|
||||
details to work out beforehand.
|
||||
|
||||
The number one concern is how to make the HTML Purifier files "work
|
||||
out of the box", while still being able to easily get them into a form
|
||||
that works with this setup. As the codebase stands right now, it would
|
||||
be necessary to strip out all of the require_once calls. The only way
|
||||
we could get rid of the require_once calls is to use __autoload or
|
||||
use the stub for all cases (which might not be a bad idea).
|
||||
|
||||
Aside
|
||||
-----
|
||||
An important thing to remember, however, is that these require_once's
|
||||
are valuable data about what classes a file needs. Unfortunately, there's
|
||||
no distinction between whether or not the file is needed all the time,
|
||||
or whether or not it is one of our "optional" files. Thus, it is
|
||||
effectively useless.
|
||||
|
||||
Deprecated
|
||||
----------
|
||||
One of the things I'd like to do is have the code search for any classes
|
||||
that are explicitly mentioned in the code. If a class isn't mentioned, I
|
||||
get to assume that it is "optional," i.e. included via introspection.
|
||||
The choice is either to use PHP's tokenizer or use regexps; regexps would
|
||||
be faster but a tokenizer would be more correct. If this ends up being
|
||||
unfeasible, adding dependency comments isn't a bad idea. (This could
|
||||
even be done automatically by search/replacing require_once, although
|
||||
we'd have to manually inspect the results for the optional requires.)
|
||||
|
||||
NOTE: This ends up not being necessary, as we're going to make the user
|
||||
figure out all the extra classes they need, and only include the core
|
||||
which is predetermined.
|
||||
|
||||
Using the autoload framework with include stubs works nicely with
|
||||
introspective classes: instead of having to have require_once inside
|
||||
the function, we can let autoload do the work; we simply need to
|
||||
new $class or accept the object straight from the caller. Handling filters
|
||||
becomes a simple matter of ticking off configuration directives, and
|
||||
if ConfigSchema spits out errors, adding the necessary includes. We could
|
||||
also use the autoload framework as a fallback, in case the user forgets
|
||||
to make the include, but doesn't really care about performance.
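
As a rough illustration of the fallback idea, a handler along the
following lines could map a class name to a file using the
underscore-to-slash naming convention. This is only a sketch: the
function names and the $libraryRoot parameter are hypothetical, and it
uses spl_autoload_register() rather than __autoload; it is not the
library's actual bootstrap code.

```php
<?php
/**
 * Map a pseudo-namespaced class name to a relative file path,
 * e.g. HTMLPurifier_AttrDef -> HTMLPurifier/AttrDef.php.
 * Returns null for classes outside the HTMLPurifier prefix.
 */
function example_class_to_path($class)
{
    if ($class !== 'HTMLPurifier' && strncmp($class, 'HTMLPurifier_', 13) !== 0) {
        return null;
    }
    return str_replace('_', '/', $class) . '.php';
}

/**
 * Register a fallback autoloader rooted at $libraryRoot (hypothetical).
 */
function example_register_fallback_autoload($libraryRoot)
{
    spl_autoload_register(function ($class) use ($libraryRoot) {
        $relative = example_class_to_path($class);
        if ($relative === null) {
            return; // not one of ours; let other autoloaders try
        }
        $file = rtrim($libraryRoot, '/') . '/' . $relative;
        if (is_file($file)) {
            require $file; // plain require: the handler fires once per class
        }
    });
}
```

A user who cares about performance would still include the needed files
explicitly; the handler only catches whatever was forgotten.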
|
||||
|
||||
Insight
|
||||
-------
|
||||
All of this talk is merely a natural extension of what our current
|
||||
standalone functionality does. However, instead of having our code
|
||||
perform the includes, or attempting to inline everything that possibly
|
||||
could be used, we boot the issue to the user, making them include
|
||||
everything or set up the fallback autoload handler.
|
||||
|
||||
Configuration Schema
|
||||
--------------------
|
||||
|
||||
A common deficiency for all of the conditional include setups (including
|
||||
the dynamically built include PHP stub) is that if one of these
|
||||
conditionally included files includes a configuration directive, it
|
||||
is not accessible to configdoc. A stopgap solution for this problem is
|
||||
to have it piggy-back off of the data in the merge-library.php script
|
||||
to figure out what extra files it needs to include, but if the file also
|
||||
inherits classes that don't exist, we're in big trouble.
|
||||
|
||||
I think it's high time we centralized the configuration documentation.
|
||||
However, the type checking has been a great boon for the library, and
|
||||
I'd like to keep that. The compromise is to use some other source, and
|
||||
then parse it into the ConfigSchema internal format (sans all of those
|
||||
nasty documentation strings which we really don't need at runtime) and
|
||||
serialize that for future use.
|
||||
|
||||
The next question is that of format. XML is very verbose, and the prospect
|
||||
of setting defaults in it gives me the willies. However, this may be necessary.
|
||||
Splitting up the file into manageable chunks may alleviate this trouble,
|
||||
and we may even want to create our own format optimized for specifying
|
||||
configuration. It might look like (based off the PHPT format, which is
|
||||
nicely compact yet unambiguous and human-readable):
|
||||
|
||||
Core.HiddenElements
|
||||
TYPE: lookup
|
||||
DEFAULT: array('script', 'style') // auto-converted during processing
|
||||
--ALIASES--
|
||||
Core.InvisibleElements, Core.StupidElements
|
||||
--DESCRIPTION--
|
||||
<p>
|
||||
Blah blah
|
||||
</p>
|
||||
|
||||
The first line is the directive name, the lines after that prior to the
|
||||
first --HEADER-- block are single-line values, and then after that
|
||||
the multiline values are there. No value is restricted to a particular
|
||||
format: DEFAULT could very well be multiline if that would be easier.
|
||||
This would make it insanely easy, also, to add arbitrary extra parameters,
|
||||
like:
|
||||
|
||||
VERSION: 3.0.0
|
||||
ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array()
|
||||
EXTERNAL: CSSTidy // this would be documented somewhere else with a URL
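
To make the proposed layout concrete, here is a minimal sketch of a
parser for it. It assumes one directive per file, unindented KEY: value
lines, and --KEY-- markers on their own line; the function name is
hypothetical and no type conversion or comment stripping is attempted.

```php
<?php
/**
 * Parse one directive definition in the proposed format into an array.
 * 'ID' holds the first line; KEY: value lines become single-line
 * entries; each --KEY-- marker starts a multiline entry that runs
 * until the next marker or the end of the text.
 */
function example_parse_directive($text)
{
    $lines = preg_split('/\r\n|\r|\n/', trim($text));
    $result = array('ID' => trim(array_shift($lines)));
    $block = null; // name of the current multiline block, if any
    foreach ($lines as $line) {
        if (preg_match('/^--([A-Z0-9_-]+?)--$/', rtrim($line), $m)) {
            $block = $m[1];
            $result[$block] = '';
            continue;
        }
        if ($block !== null) {
            $result[$block] .= $line . "\n";
        } elseif (preg_match('/^([A-Z0-9_-]+):\s*(.*)$/', $line, $m)) {
            $result[$m[1]] = $m[2];
        }
    }
    foreach ($result as $key => $value) {
        $result[$key] = trim($value); // drop the trailing newline of blocks
    }
    return $result;
}
```

Because unknown keys are simply stored, extra parameters such as VERSION
or EXTERNAL need no parser changes in this sketch.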
|
||||
|
||||
The final loss would be that you wouldn't know what file the directive
|
||||
was used in; with some clever regexps it should be possible to
|
||||
figure out where $config->get($ns, $d); occurs. Reflective calls to
|
||||
the configuration object are mitigated by the fact that getBatch is
|
||||
used, so we can simply talk about that in the namespace definition page.
|
||||
This might be slow, but it would only happen when we are creating
|
||||
the documentation for consumption, and is sugar.
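
A sketch of the "clever regexps" idea, assuming calls are written with
a variable literally named $config and two quoted string arguments.
The function name is hypothetical, and reflective calls with
non-literal arguments are deliberately not matched.

```php
<?php
/**
 * Find literal $config->get('Namespace', 'Directive') calls in PHP
 * source text and return them as unique "Namespace.Directive" strings.
 */
function example_find_directive_uses($source)
{
    $pattern = '/\$config->get\(\s*([\'"])(\w+)\1\s*,\s*([\'"])(\w+)\3\s*\)/';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
    $found = array();
    foreach ($matches as $m) {
        $found[] = $m[2] . '.' . $m[4]; // groups 2 and 4 are the two names
    }
    return array_values(array_unique($found));
}
```

Running this over each library file would give a directive-to-file map
for the generated documentation only; nothing here is needed at runtime.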
|
||||
|
||||
We can put this in a schema/ directory, outside of HTML Purifier. The serialized
|
||||
data gets treated like entities.ser.
|
||||
|
||||
The final thing that needs to be handled is user defined configurations.
|
||||
They can be added at runtime using ConfigSchema::registerDirectory()
|
||||
which globs the directory and grabs all of the directives to be incorporated
|
||||
in. Then, the result is saved. We may want to take advantage of the
|
||||
DefinitionCache framework, although it is not altogether certain what
|
||||
configuration directives would be used to generate our key (meta-directives!)
|
||||
|
||||
Further thoughts
|
||||
----------------
|
||||
Our master configuration schema will only need to be updated once
|
||||
every new version, so it's easily versionable. User specified
|
||||
schema files are far more volatile, but it's far too expensive
|
||||
to check the filemtimes of all the files, so a DefinitionRev style
|
||||
mechanism works better. However, we can uniquely identify the
|
||||
schema based on the directories they loaded, so there's no need
|
||||
for a DefinitionId until we give them full programmatic control.
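
One way such a key might be derived is sketched below. The function
name, the key layout, and the choice to ignore load order are all
assumptions made for illustration; if later directories are meant to
override earlier ones, the sort should be dropped.

```php
<?php
/**
 * Build a cache key for a serialized schema from the library version,
 * a user-controlled revision number, and the directories that were
 * registered.
 */
function example_schema_cache_key($version, $revision, array $directories)
{
    $dirs = array_values(array_unique($directories));
    sort($dirs); // assumption: load order does not change the schema
    return $version . ',' . (int) $revision . ',' . md5(implode("\n", $dirs));
}
```

Bumping the revision or upgrading the library yields a new key, so the
stale cache file is simply never looked up again.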
|
||||
|
||||
These variables should be directly incorporated into ConfigSchema,
|
||||
and ConfigSchema should handle serialization. Some refactoring will be
|
||||
necessary for the DefinitionCache classes, as they are built with
|
||||
Config in mind. If the user changes something, the cache file gets
|
||||
rebuilt. If the version changes, the cache file gets rebuilt. Since
|
||||
our unit tests flush the caches before we start, and the operation is
|
||||
pretty fast, this will not negatively impact unit testing.
|
||||
|
||||
One last thing: certain configuration directives require that files
|
||||
get added. They may even be specified dynamically. It is not a good idea
|
||||
for the HTMLPurifier_Config object to be used directly for such matters.
|
||||
Instead, the userland code should explicitly perform the includes. We may
|
||||
put in something like:
|
||||
|
||||
REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks
|
||||
|
||||
To indicate that if that class doesn't exist, and the user is attempting
|
||||
to use the directive, we should fatally error out. The stub includes the core files,
|
||||
and the user includes everything else. Any reflective things like new
|
||||
$class would be required to tie in with the configuration.
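
The REQUIRES check could look roughly like this. The function names are
hypothetical, the REQUIRES value is assumed to be a comma-separated
list, and the sketch throws an exception where the text above speaks of
a fatal error, so that callers can still decide how to fail.

```php
<?php
/**
 * Return the classes named in a REQUIRES entry that are not loaded.
 * class_exists() is called with autoloading disabled, so the result
 * reflects only what userland code has actually included.
 */
function example_missing_requirements($requires)
{
    $missing = array();
    $classes = preg_split('/\s*,\s*/', trim($requires), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($classes as $class) {
        if (!class_exists($class, false)) {
            $missing[] = $class;
        }
    }
    return $missing;
}

/**
 * Refuse to continue when a directive's required classes are absent.
 */
function example_enforce_requirements($directive, $requires)
{
    $missing = example_missing_requirements($requires);
    if ($missing) {
        throw new RuntimeException(
            "Directive $directive requires classes that are not loaded: "
            . implode(', ', $missing)
        );
    }
}
```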
|
||||
|
||||
It would work very well with rarely used configuration options, but it
|
||||
wouldn't be so good for "core" parts that can be disabled. In such cases
|
||||
the core include file would need to be modified, and the only way
|
||||
to properly do this is use the configuration object. Once again, our
|
||||
ability to create cache keys saves the day again: we can create arbitrary
|
||||
stub files for arbitrary configurations and include those. They could
|
||||
even be the single file affairs. The only thing we'd need to include,
|
||||
then, would be HTMLPurifier_Config! Then, the configuration object would
|
||||
load the library.
|
||||
|
||||
An aside...
|
||||
-----------
|
||||
One questions, however, the wisdom of letting PHP files write other PHP
|
||||
files. It seems like a recipe for disaster, or at least lots of headaches
|
||||
in highly secured setups, where PHP does not have the ability to write
|
||||
to its root. In such cases, we could use sticky bits or tell the user
|
||||
to manually generate the file.
|
||||
|
||||
The other troublesome bit is actually doing the calculations necessary.
|
||||
For certain cases, it's simple (such as URIScheme), but for AttrDef
|
||||
and HTMLModule the dependency trees are very complex in relation to
|
||||
%HTML.Allowed and friends. I think that this idea should be shelved
|
||||
and revisited at a later, less insane date.
|
||||
|
||||
An interesting dilemma presents itself when a configuration form is offered
|
||||
to the user. Normally, the configuration object is not accessible without
|
||||
editing PHP code; this facility changes things. The sensible thing to do
|
||||
is stipulate that all classes required by the directives you allow must
|
||||
be included.
|
||||
|
||||
Unit testing
|
||||
------------
|
||||
|
||||
Setting up the parsing and translation into our existing format would not
|
||||
be difficult to do. It might represent a good time for us to rethink our
|
||||
tests for these facilities; as creative as they are, they are often hacky
|
||||
and require public visibility for things that ought to be protected.
|
||||
This is especially applicable for our DefinitionCache tests.
|
||||
|
||||
Migration
|
||||
---------
|
||||
|
||||
Because we are not *adding* anything essentially new, it should be trivial
|
||||
to write a script to take our existing data and dump it into the new format.
|
||||
Well, not trivial, but fairly easy to accomplish. Primary implementation
|
||||
difficulties would probably involve formatting the file nicely.
|
||||
|
||||
Backwards-compatibility
|
||||
-----------------------
|
||||
|
||||
I expect that the ConfigSchema methods should stick around for a little bit,
|
||||
but display E_USER_NOTICE warnings that they are deprecated. This will
|
||||
require documentation!
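
A deprecated method kept for compatibility might be wrapped as in the
sketch below. The class, its method signature, and the message text are
hypothetical stand-ins, not the real ConfigSchema API.

```php
<?php
class Example_LegacySchema
{
    /** Defaults keyed by "Namespace.Directive". */
    public $defaults = array();

    /**
     * Old-style programmatic definition: still works, but nags.
     * @deprecated kept only for backwards compatibility
     */
    public function define($namespace, $name, $default)
    {
        trigger_error(
            __METHOD__ . ' is deprecated; define directives in schema files instead',
            E_USER_NOTICE
        );
        $this->defaults["$namespace.$name"] = $default;
    }
}
```

The notice is raised before the work is done, so existing callers keep
functioning while their logs point at the code that needs migrating.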
|
||||
|
||||
New stuff
|
||||
---------
|
||||
|
||||
VERSION: Version number in which the directive was introduced
|
||||
DEPRECATED-VERSION: If the directive was deprecated, when was it deprecated?
|
||||
DEPRECATED-USE: If the directive was deprecated, what should the user use now?
|
||||
REQUIRES: What classes does this configuration directive require, but are
|
||||
not part of the HTML Purifier core?
|
||||
|
||||
vim: et sw=4 sts=4
|
83
htmlpurifier-4.10.0/docs/dev-naming.html
Executable file
@ -0,0 +1,83 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Defines class naming conventions in HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Naming Conventions - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1>Naming Conventions</h1>
|
||||
|
||||
<div id="filing">Filed under Development</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>The classes in this library follow a few naming conventions, which may
|
||||
help you find the correct functionality more quickly. Here they are:</p>
|
||||
|
||||
<dl>
|
||||
|
||||
<dt>All classes occupy the HTMLPurifier pseudo-namespace.</dt>
|
||||
<dd>This means that all classes are prefixed with HTMLPurifier_. As such, all
|
||||
names under HTMLPurifier_ are reserved. I recommend that you use the name
|
||||
HTMLPurifierX_YourName_ClassName, especially if you want to take advantage
|
||||
of HTMLPurifier_ConfigDef.</dd>
|
||||
|
||||
<dt>All classes correspond to their path if library/ was in the include path</dt>
|
||||
<dd>HTMLPurifier_AttrDef is located at HTMLPurifier/AttrDef.php; replace
|
||||
underscores with slashes and append .php and you'll have the location of
|
||||
the class.</dd>
|
||||
|
||||
<dt>Harness and Test are reserved class names for unit tests</dt>
|
||||
<dd>The suffix <code>Test</code> indicates that the class is a subclass of UnitTestCase
|
||||
(of the Simpletest library) and is testable. "Harness" indicates a subclass
|
||||
of UnitTestCase that is not meant to be run but to be extended into
|
||||
concrete test cases and contains custom test methods (i.e. assert*())</dd>
|
||||
|
||||
<dt>Class names do not necessarily represent inheritance hierarchies</dt>
|
||||
<dd>While we try to reflect inheritance in naming to some extent, it is not
|
||||
guaranteed (for instance, none of the classes inherit from HTMLPurifier,
|
||||
the base class). However, all class files have the require_once
|
||||
declarations to whichever classes they are tightly coupled to.</dd>
|
||||
|
||||
<dt>Strategy has a meaning different from the Gang of Four pattern</dt>
|
||||
<dd>In Design Patterns, the Gang of Four describes a Strategy object as
|
||||
encapsulating an algorithm so that they can be switched at run-time. While
|
||||
our strategies are indeed algorithms, they are not meant to be substituted:
|
||||
all must be present in order for proper functioning.</dd>
|
||||
|
||||
<dt>Abbreviations are avoided</dt>
|
||||
<dd>We try to avoid abbreviations as much as possible, but in some cases,
|
||||
the abbreviated version is more readable than the full version. Here, we
|
||||
list common abbreviations:
|
||||
<ul>
|
||||
<li>Attr to Attributes (note that it is plural, i.e. <code>$attr = array()</code>)</li>
|
||||
<li>Def to Definition</li>
|
||||
<li><code>$ret</code> is the value to be returned in a function</li>
|
||||
</ul>
|
||||
</dd>
|
||||
|
||||
<dt>Ambiguity concerning the definition of Def/Definition</dt>
|
||||
<dd>While a definition normally defines the structure/acceptable values of
|
||||
an entity, most of the definitions in this application also attempt
|
||||
to validate and fix the value. I am unsure of a better name, as
|
||||
"Validator" would exclude fixing the value, "Fixer" doesn't invoke
|
||||
the proper image of "fixing" something, and "ValidatorFixer" is too long!
|
||||
Some other suggestions were "Handler", "Reference", "Check", "Fix",
|
||||
"Repair" and "Heal".</dd>
|
||||
|
||||
<dt>Transform not Transformer</dt>
|
||||
<dd>Transform is both a noun and a verb, and thus we define a "Transform" as
|
||||
something that "transforms," leaving "Transformer" (which sounds like an
|
||||
electrical device/robot toy) unused.</dd>
|
||||
|
||||
</dl>
|
||||
|
||||
</body></html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
33
htmlpurifier-4.10.0/docs/dev-optimization.html
Executable file
@ -0,0 +1,33 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Discusses possible methods of optimizing HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Optimization - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1>Optimization</h1>
|
||||
|
||||
<div id="filing">Filed under Development</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Here are some possible optimization techniques we can apply to code sections if
|
||||
they turn out to be slow. Be sure not to prematurely optimize: if you get
|
||||
that itch, put it here!</p>
|
||||
|
||||
<ul>
|
||||
<li>Make Tokens Flyweights (may prove problematic, probably not worth it)</li>
|
||||
<li>Rewrite regexps into PHP code</li>
|
||||
<li>Batch regexp validation (do as many per function call as possible)</li>
|
||||
<li>Parallelize strategies</li>
|
||||
</ul>
|
||||
|
||||
</body></html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
309
htmlpurifier-4.10.0/docs/dev-progress.html
Executable file
@ -0,0 +1,309 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Tables detailing HTML element and CSS property implementation coverage in HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Implementation Progress - HTML Purifier</title>
|
||||
|
||||
<style type="text/css">
|
||||
|
||||
td {padding-right:1em;border-bottom:1px solid #000;padding-left:0.5em;}
|
||||
th {text-align:left;padding-top:1.4em;font-size:13pt;
|
||||
border-bottom:2px solid #000;background:#FFF;}
|
||||
thead th {text-align:left;padding:0.1em;background-color:#EEE;}
|
||||
|
||||
.impl-yes {background:#9D9;}
|
||||
.impl-partial {background:#FFA;}
|
||||
.impl-no {background:#CCC;}
|
||||
|
||||
.danger {color:#600;}
|
||||
.css1 {color:#060;}
|
||||
.required {font-weight:bold;}
|
||||
.feature {color:#999;}
|
||||
|
||||
</style>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1>Implementation Progress</h1>
|
||||
|
||||
<div id="filing">Filed under Development</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>
|
||||
<strong>Warning:</strong> This table is kept for historical purposes and
|
||||
is not being actively updated.
|
||||
</p>
|
||||
|
||||
<h2>Key</h2>
|
||||
|
||||
<table cellspacing="0"><tbody>
|
||||
<tr><td class="impl-yes">Implemented</td></tr>
|
||||
<tr><td class="impl-partial">Partially implemented</td></tr>
|
||||
<tr><td class="impl-no">Not a priority to implement</td></tr>
|
||||
<tr><td class="danger">Dangerous attribute/property</td></tr>
|
||||
<tr><td class="css1">Present in CSS1</td></tr>
|
||||
<tr><td class="feature">Feature, requires extra work</td></tr>
|
||||
</tbody></table>
|
||||
|
||||
<h2>CSS</h2>
|
||||
|
||||
<table cellspacing="0">
|
||||
|
||||
<thead>
|
||||
<tr><th>Name</th><th>Notes</th></tr>
|
||||
</thead>
|
||||
|
||||
<!--
|
||||
<tr><td>-</td><td>-</td></tr>
|
||||
-->
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="2">Standard</th></tr>
|
||||
<tr class="css1 impl-yes"><td>background-color</td><td>COMPOSITE(&lt;color&gt;, transparent)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>background</td><td>SHORTHAND, currently alias for background-color</td></tr>
|
||||
<tr class="css1 impl-yes"><td>border</td><td>SHORTHAND, MULTIPLE</td></tr>
|
||||
<tr class="css1 impl-yes"><td>border-color</td><td>MULTIPLE</td></tr>
|
||||
<tr class="css1 impl-yes"><td>border-style</td><td>MULTIPLE</td></tr>
|
||||
<tr class="css1 impl-yes"><td>border-width</td><td>MULTIPLE</td></tr>
|
||||
<tr class="css1 impl-yes"><td>border-*</td><td>SHORTHAND</td></tr>
|
||||
<tr class="impl-yes"><td>border-*-color</td><td>COMPOSITE(&lt;color&gt;, transparent)</td></tr>
|
||||
<tr class="impl-yes"><td>border-*-style</td><td>ENUM(none, hidden, dotted, dashed,
|
||||
solid, double, groove, ridge, inset, outset)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>border-*-width</td><td>COMPOSITE(&lt;length&gt;, thin, medium, thick)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>clear</td><td>ENUM(none, left, right, both)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>color</td><td>&lt;color&gt;</td></tr>
|
||||
<tr class="css1 impl-yes"><td>float</td><td>ENUM(left, right, none), May require layout
|
||||
precautions with clear</td></tr>
|
||||
<tr class="css1 impl-yes"><td>font</td><td>SHORTHAND</td></tr>
|
||||
<tr class="css1 impl-yes"><td>font-family</td><td>CSS validator may complain if fallback font
|
||||
family not specified</td></tr>
|
||||
<tr class="css1 impl-yes"><td>font-size</td><td>COMPOSITE(&lt;absolute-size&gt;,
|
||||
&lt;relative-size&gt;, &lt;length&gt;, &lt;percentage&gt;)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>font-style</td><td>ENUM(normal, italic, oblique)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>font-variant</td><td>ENUM(normal, small-caps)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>font-weight</td><td>ENUM(normal, bold, bolder, lighter,
|
||||
100, 200, 300, 400, 500, 600, 700, 800, 900), maybe special code for
|
||||
in-between integers</td></tr>
|
||||
<tr class="css1 impl-yes"><td>letter-spacing</td><td>COMPOSITE(&lt;length&gt;, normal)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>line-height</td><td>COMPOSITE(&lt;number&gt;,
|
||||
&lt;length&gt;, &lt;percentage&gt;, normal)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>list-style-position</td><td>ENUM(inside, outside),
|
||||
Strange behavior in browsers</td></tr>
|
||||
<tr class="css1 impl-yes"><td>list-style-type</td><td>ENUM(...),
|
||||
Well-supported values are: disc, circle, square,
|
||||
decimal, lower-roman, upper-roman, lower-alpha and upper-alpha. See also
|
||||
CSS 3. Mostly IE lack of support.</td></tr>
|
||||
<tr class="css1 impl-yes"><td>list-style</td><td>SHORTHAND</td></tr>
|
||||
<tr class="css1 impl-yes"><td>margin</td><td>MULTIPLE</td></tr>
|
||||
<tr class="css1 impl-yes"><td>margin-*</td><td>COMPOSITE(&lt;length&gt;,
|
||||
&lt;percentage&gt;, auto)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>padding</td><td>MULTIPLE</td></tr>
|
||||
<tr class="css1 impl-yes"><td>padding-*</td><td>COMPOSITE(&lt;length&gt;(positive),
|
||||
&lt;percentage&gt;(positive))</td></tr>
|
||||
<tr class="css1 impl-yes"><td>text-align</td><td>ENUM(left, right,
|
||||
center, justify)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>text-decoration</td><td>No blink (argh my eyes), not
|
||||
enum, can be combined (composite sorta): underline, overline,
|
||||
line-through</td></tr>
|
||||
<tr class="css1 impl-yes"><td>text-indent</td><td>COMPOSITE(&lt;length&gt;,
|
||||
&lt;percentage&gt;)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>text-transform</td><td>ENUM(capitalize, uppercase,
|
||||
lowercase, none)</td></tr>
|
||||
<tr class="css1 impl-yes"><td>width</td><td>COMPOSITE(&lt;length&gt;,
|
||||
&lt;percentage&gt;, auto), Interesting</td></tr>
|
||||
<tr class="css1 impl-yes"><td>word-spacing</td><td>COMPOSITE(&lt;length&gt;, normal),
|
||||
IE 5 no support</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="2">Table</th></tr>
|
||||
<tr class="impl-yes"><td>border-collapse</td><td>ENUM(collapse, separate)</td></tr>
|
||||
<tr class="impl-yes"><td>border-spacing</td><td>MULTIPLE</td></tr>
|
||||
<tr class="impl-yes"><td>caption-side</td><td>ENUM(top, bottom)</td></tr>
|
||||
<tr class="feature"><td>empty-cells</td><td>ENUM(show, hide), No IE support makes this useless,
|
||||
possible fix with &amp;nbsp;? Unknown release milestone.</td></tr>
|
||||
<tr class="impl-yes"><td>table-layout</td><td>ENUM(auto, fixed)</td></tr>
|
||||
<tr class="impl-yes css1"><td>vertical-align</td><td>COMPOSITE(ENUM(baseline, sub,
|
||||
super, top, text-top, middle, bottom, text-bottom), &lt;percentage&gt;,
|
||||
&lt;length&gt;) Also applies to others with explicit height</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="2">Absolute positioning, unknown release milestone</th></tr>
|
||||
<tr class="danger impl-no"><td>bottom</td><td rowspan="4">Dangerous, must be non-negative to even be considered,
|
||||
but it's still possible to arbitrarily position by running over.</td></tr>
|
||||
<tr class="danger impl-no"><td>left</td></tr>
|
||||
<tr class="danger impl-no"><td>right</td></tr>
|
||||
<tr class="danger impl-no"><td>top</td></tr>
|
||||
<tr class="impl-no"><td>clip</td><td>-</td></tr>
|
||||
<tr class="danger impl-no"><td>position</td><td>ENUM(static, relative, absolute, fixed)
|
||||
relative not absolute?</td></tr>
|
||||
<tr class="danger impl-no"><td>z-index</td><td>Dangerous</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="2">Unknown</th></tr>
|
||||
<tr class="danger css1 impl-yes"><td>background-image</td><td>Dangerous</td></tr>
|
||||
<tr class="css1 impl-yes"><td>background-attachment</td><td>ENUM(scroll, fixed),
|
||||
Depends on background-image</td></tr>
|
||||
<tr class="css1 impl-yes"><td>background-position</td><td>Depends on background-image</td></tr>
|
||||
<tr class="danger impl-no"><td>cursor</td><td>Dangerous but fluffy</td></tr>
|
||||
<tr class="danger impl-yes"><td>display</td><td>ENUM(...), Dangerous but interesting;
|
||||
will not implement list-item, run-in (Opera only) or table (no IE);
|
||||
inline-block has incomplete IE6 support and requires -moz-inline-box
|
||||
for Mozilla. Unknown target milestone.</td></tr>
|
||||
<tr class="css1 impl-yes"><td>height</td><td>Interesting, why use it? Unknown target milestone.</td></tr>
|
||||
<tr class="danger css1 impl-yes"><td>list-style-image</td><td>Dangerous?</td></tr>
|
||||
<tr class="impl-no"><td>max-height</td><td rowspan="4">No IE 5/6</td></tr>
|
||||
<tr class="impl-no"><td>min-height</td></tr>
|
||||
<tr class="impl-no"><td>max-width</td></tr>
|
||||
<tr class="impl-no"><td>min-width</td></tr>
|
||||
<tr class="impl-no"><td>orphans</td><td>No IE support</td></tr>
|
||||
<tr class="impl-no"><td>widows</td><td>No IE support</td></tr>
|
||||
<tr><td>overflow</td><td>ENUM, IE 5/6 almost (remove visible if set). Unknown target milestone.</td></tr>
|
||||
<tr><td>page-break-after</td><td>ENUM(auto, always, avoid, left, right),
|
||||
IE 5.5/6 and Opera. Unknown target milestone.</td></tr>
|
||||
<tr><td>page-break-before</td><td>ENUM(auto, always, avoid, left, right),
|
||||
Mostly supported. Unknown target milestone.</td></tr>
|
||||
<tr><td>page-break-inside</td><td>ENUM(avoid, auto), Opera only. Unknown target milestone.</td></tr>
|
||||
<tr class="impl-no"><td>quotes</td><td>May be dropped from CSS2, fairly useless for inline context</td></tr>
|
||||
<tr class="danger impl-yes"><td>visibility</td><td>ENUM(visible, hidden, collapse),
|
||||
Dangerous</td></tr>
|
||||
<tr class="css1 feature impl-partial"><td>white-space</td><td>ENUM(normal, pre, nowrap, pre-wrap,
|
||||
pre-line), Spotty implementation:
|
||||
pre (no IE 5/6), <em>nowrap</em> (no IE 5, supported),
|
||||
pre-wrap (only Opera), pre-line (no support). Fixable? Unknown target milestone.</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody class="impl-no">
|
||||
<tr><th colspan="2">Aural</th></tr>
|
||||
<tr><td>azimuth</td><td>-</td></tr>
|
||||
<tr><td>cue</td><td>-</td></tr>
|
||||
<tr><td>cue-after</td><td>-</td></tr>
|
||||
<tr><td>cue-before</td><td>-</td></tr>
|
||||
<tr><td>elevation</td><td>-</td></tr>
|
||||
<tr><td>pause-after</td><td>-</td></tr>
|
||||
<tr><td>pause-before</td><td>-</td></tr>
|
||||
<tr><td>pause</td><td>-</td></tr>
|
||||
<tr><td>pitch-range</td><td>-</td></tr>
|
||||
<tr><td>pitch</td><td>-</td></tr>
|
||||
<tr><td>play-during</td><td>-</td></tr>
|
||||
<tr><td>richness</td><td>-</td></tr>
|
||||
<tr><td>speak-header</td><td>Table related</td></tr>
|
||||
<tr><td>speak-numeral</td><td>-</td></tr>
|
||||
<tr><td>speak-punctuation</td><td>-</td></tr>
|
||||
<tr><td>speak</td><td>-</td></tr>
|
||||
<tr><td>speech-rate</td><td>-</td></tr>
|
||||
<tr><td>stress</td><td>-</td></tr>
|
||||
<tr><td>voice-family</td><td>-</td></tr>
|
||||
<tr><td>volume</td><td>-</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody class="impl-no">
|
||||
<tr><th colspan="2">Will not implement</th></tr>
|
||||
<tr><td>content</td><td>Not applicable for inline styles</td></tr>
|
||||
<tr><td>counter-increment</td><td>Needs content, Opera only</td></tr>
|
||||
<tr><td>counter-reset</td><td>Needs content, Opera only</td></tr>
|
||||
<tr><td>direction</td><td>No support</td></tr>
|
||||
<tr><td>outline-color</td><td rowspan="4">IE Mac and Opera on outside,
|
||||
Mozilla on inside and needs -moz-outline, no IE support.</td></tr>
|
||||
<tr><td>outline-style</td></tr>
|
||||
<tr><td>outline-width</td></tr>
|
||||
<tr><td>outline</td></tr>
|
||||
<tr><td>unicode-bidi</td><td>No support</td></tr>
|
||||
</tbody>
|
||||
|
||||
</table>
|
||||
|
||||
<h2>Interesting Attributes</h2>
|
||||
|
||||
<table cellspacing="0">
|
||||
|
||||
<thead>
|
||||
<tr><th>Attribute</th><th>Tags</th><th>Notes</th></tr>
|
||||
</thead>
|
||||
|
||||
<!--
|
||||
<tr><th></th></tr>
|
||||
<tbody>
|
||||
<tr><td>-</td><td>-</td><td>-</td></tr>
|
||||
</tbody>
|
||||
-->
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="3">CSS</th></tr>
|
||||
<tr class="impl-yes"><td>style</td><td>All</td><td>Parser is reasonably functional. Status here doesn't count individual properties.</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="3">Questionable</th></tr>
|
||||
<tr class="impl-no"><td>accesskey</td><td>A</td><td>May interfere with main interface</td></tr>
|
||||
<tr class="impl-no"><td>tabindex</td><td>A</td><td>May interfere with main interface</td></tr>
|
||||
<tr class="impl-yes"><td>target</td><td>A</td><td>Config enabled, only useful for frame layouts, disallowed in strict</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="3">Miscellaneous</th></tr>
|
||||
<tr><td>datetime</td><td>DEL, INS</td><td>No visible effect, ISO format</td></tr>
|
||||
<tr class="impl-yes"><td>rel</td><td>A</td><td>Largely user-defined: nofollow, tag (see microformats)</td></tr>
|
||||
<tr class="impl-yes"><td>rev</td><td>A</td><td>Largely user-defined: vote-*</td></tr>
|
||||
<tr class="feature"><td>axis</td><td>TD, TH</td><td>W3C only: No browser implementation</td></tr>
|
||||
<tr class="feature"><td>char</td><td>COL, COLGROUP, TBODY, TD, TFOOT, TH, THEAD, TR</td><td>W3C only: No browser implementation</td></tr>
|
||||
<tr class="feature"><td>headers</td><td>TD, TH</td><td>W3C only: No browser implementation</td></tr>
|
||||
<tr class="impl-yes"><td>scope</td><td>TD, TH</td><td>W3C only: No browser implementation</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody class="impl-yes">
|
||||
<tr><th colspan="3">URI</th></tr>
|
||||
<tr><td rowspan="2">cite</td><td>BLOCKQUOTE, Q</td><td>For attribution</td></tr>
|
||||
<tr><td>DEL, INS</td><td>Link to explanation why it changed</td></tr>
|
||||
<tr><td>href</td><td>A</td><td>-</td></tr>
|
||||
<tr><td>longdesc</td><td>IMG</td><td>-</td></tr>
|
||||
<tr class="required"><td>src</td><td>IMG</td><td>Required</td></tr>
|
||||
</tbody>
|
||||
|
||||
<tbody>
|
||||
<tr><th colspan="3">Transform</th></tr>
|
||||
<tr class="impl-yes"><td rowspan="5">align</td><td>CAPTION</td><td>'caption-side' for top/bottom, 'text-align' for left/right</td></tr>
|
||||
<tr class="impl-yes"><td>IMG</td><td rowspan="3">See specimens/html-align-to-css.html</td></tr>
|
||||
<tr class="impl-yes"><td>TABLE</td></tr>
|
||||
<tr class="impl-yes"><td>HR</td></tr>
|
||||
<tr class="impl-yes"><td>H1, H2, H3, H4, H5, H6, P</td><td>Equivalent style 'text-align'</td></tr>
|
||||
<tr class="required impl-yes"><td>alt</td><td>IMG</td><td>Required, insert image filename if src is present or default invalid image text</td></tr>
|
||||
<tr class="impl-yes"><td rowspan="3">bgcolor</td><td>TABLE</td><td>Superset style 'background-color'</td></tr>
|
||||
<tr class="impl-yes"><td>TR</td><td>Superset style 'background-color'</td></tr>
|
||||
<tr class="impl-yes"><td>TD, TH</td><td>Superset style 'background-color'</td></tr>
|
||||
<tr class="impl-yes"><td>border</td><td>IMG</td><td>Equivalent style <code>border:[number]px solid</code></td></tr>
|
||||
<tr class="impl-yes"><td>clear</td><td>BR</td><td>Near-equiv style 'clear', transform 'all' into 'both'</td></tr>
|
||||
<tr class="impl-no"><td>compact</td><td>DL, OL, UL</td><td>Boolean, needs custom CSS class; rarely used anyway</td></tr>
|
||||
<tr class="required impl-yes"><td>dir</td><td>BDO</td><td>Required, insert ltr (or configuration value) if none</td></tr>
|
||||
<tr class="impl-yes"><td>height</td><td>TD, TH</td><td>Near-equiv style 'height', needs px suffix if original was in pixels</td></tr>
|
||||
<tr class="impl-yes"><td>hspace</td><td>IMG</td><td>Near-equiv styles 'margin-left' and 'margin-right', needs px suffix</td></tr>
|
||||
<tr class="impl-yes"><td>lang</td><td>*</td><td>Copy value to xml:lang</td></tr>
|
||||
<tr class="impl-yes"><td rowspan="2">name</td><td>IMG</td><td>Turn into ID</td></tr>
|
||||
<tr class="impl-yes"><td>A</td><td>Turn into ID</td></tr>
|
||||
<tr class="impl-yes"><td>noshade</td><td>HR</td><td>Boolean, style 'border-style:solid;'</td></tr>
|
||||
<tr class="impl-yes"><td>nowrap</td><td>TD, TH</td><td>Boolean, style 'white-space:nowrap;' (not compat with IE5)</td></tr>
|
||||
<tr class="impl-yes"><td>size</td><td>HR</td><td>Near-equiv 'height', needs px suffix if original was pixels</td></tr>
|
||||
<tr class="required impl-yes"><td>src</td><td>IMG</td><td>Required, insert blank or default img if not set</td></tr>
|
||||
<tr class="impl-yes"><td>start</td><td>OL</td><td>Poorly supported 'counter-reset', allowed in loose, dropped in strict</td></tr>
|
||||
<tr class="impl-yes"><td rowspan="3">type</td><td>LI</td><td rowspan="3">Equivalent style 'list-style-type', different allowed values though. (needs testing)</td></tr>
|
||||
<tr class="impl-yes"><td>OL</td></tr>
|
||||
<tr class="impl-yes"><td>UL</td></tr>
|
||||
<tr class="impl-yes"><td>value</td><td>LI</td><td>Poorly supported 'counter-reset', allowed in loose, dropped in strict</td></tr>
|
||||
<tr class="impl-yes"><td>vspace</td><td>IMG</td><td>Near-equiv styles 'margin-top' and 'margin-bottom', needs px suffix, see hspace</td></tr>
|
||||
<tr class="impl-yes"><td rowspan="2">width</td><td>HR</td><td rowspan="2">Near-equiv style 'width', needs px suffix if original was pixels</td></tr>
|
||||
<tr class="impl-yes"><td>TD, TH</td></tr>
|
||||
</tbody>
|
||||
|
||||
</table>
|
||||
|
||||
</body></html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
1201
htmlpurifier-4.10.0/docs/dtd/xhtml1-transitional.dtd
Executable file
File diff suppressed because it is too large
850
htmlpurifier-4.10.0/docs/enduser-customize.html
Executable file
@ -0,0 +1,850 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Tutorial for customizing HTML Purifier's tag and attribute sets." />
|
||||
<link rel="stylesheet" type="text/css" href="style.css" />
|
||||
|
||||
<title>Customize - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1 class="subtitled">Customize!</h1>
|
||||
<div class="subtitle">HTML Purifier is a Swiss-Army Knife</div>
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>
|
||||
HTML Purifier has this quirk where if you try to allow certain elements or
|
||||
attributes, HTML Purifier will tell you that it's not supported, and that
|
||||
you should go to the forums to find out how to implement it. Well, this
|
||||
document is how to implement elements and attributes which HTML Purifier
|
||||
doesn't support out of the box.
|
||||
</p>
|
||||
|
||||
<h2>Is it necessary?</h2>
|
||||
|
||||
<p>
|
||||
Before we write any code, we should consider whether the code we're
about to write is necessary at all. HTML Purifier, by default,
|
||||
contains a large set of elements and attributes: large enough so that
|
||||
<em>any</em> element or attribute in XHTML 1.0 or 1.1 (and its HTML variants)
|
||||
that can be safely used by the general public is implemented.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
So what needs to be implemented? (Feel free to skip this section if
|
||||
you know what you want).
|
||||
</p>
|
||||
|
||||
<h3>XHTML 1.0</h3>
|
||||
|
||||
<p>
|
||||
All of the modules listed below are based off of the
|
||||
<a href="http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#sec_5.2.">modularization of
|
||||
XHTML</a>, which, while technically for XHTML 1.1, is quite a useful
|
||||
resource.
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>Structure</li>
|
||||
<li>Frames</li>
|
||||
<li>Applets (deprecated)</li>
|
||||
<li>Forms</li>
|
||||
<li>Image maps</li>
|
||||
<li>Objects</li>
|
||||
<li>Events</li>
|
||||
<li>Meta-information</li>
|
||||
<li>Style sheets</li>
|
||||
<li>Link (not hypertext)</li>
|
||||
<li>Base</li>
|
||||
<li>Name</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
If you don't recognize it, you probably don't need it. But the curious
|
||||
can look all of these modules up in the above-mentioned document. Note
|
||||
that inline scripting comes packaged with HTML Purifier (more on this
|
||||
later).
|
||||
</p>
|
||||
|
||||
<h3>XHTML 1.1</h3>
|
||||
|
||||
<p>
|
||||
As of HTMLPurifier 2.1.0, we have implemented the
|
||||
<a href="http://www.w3.org/TR/2001/REC-ruby-20010531/">Ruby module</a>,
|
||||
which defines a set of tags
|
||||
for publishing short annotations for text, used mostly in Japanese
|
||||
and Chinese school texts, but applicable for positioning any text (not
|
||||
limited to translations) above or below other corresponding text.
|
||||
</p>
|
||||
|
||||
<h3>HTML 5</h3>
|
||||
|
||||
<p>
|
||||
<a href="http://www.whatwg.org/specs/web-apps/current-work/">HTML 5</a>
|
||||
is a fork of HTML 4.01 by WHATWG, who believed that XHTML 2.0 was headed
|
||||
in the wrong direction. It is a working draft, and may change
|
||||
drastically before publication, but it should be noted that the
|
||||
<code>canvas</code> tag has been implemented by many browser vendors.
|
||||
</p>
|
||||
|
||||
<h3>Proprietary</h3>
|
||||
|
||||
<p>
|
||||
There are a number of proprietary tags still in the wild. Many of them
|
||||
have been documented in <a href="ref-proprietary-tags.txt">ref-proprietary-tags.txt</a>,
|
||||
but there is currently no implementation for any of them.
|
||||
</p>
|
||||
|
||||
<h3>Extensions</h3>
|
||||
|
||||
<p>
|
||||
There are also a number of other XML languages out there that can
|
||||
be embedded in HTML documents: two of the most popular are MathML and
|
||||
SVG, and I frequently get requests to implement these. But they are
|
||||
expansive, comprehensive specifications, and it would take far too long
|
||||
to implement them <em>correctly</em> (most systems I've seen go as far
|
||||
as whitelisting tags and no further; come on, what about nesting!)
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Word of warning: HTML Purifier is currently <em>not</em> namespace
|
||||
aware.
|
||||
</p>
|
||||
|
||||
<h2>Giving back</h2>
|
||||
|
||||
<p>
|
||||
As you may imagine from the details above (don't be abashed if you didn't
|
||||
read it all: a glance over would have done), there's quite a bit that
|
||||
HTML Purifier doesn't implement. Recent architectural changes have
|
||||
allowed HTML Purifier to implement elements and attributes that are not
|
||||
safe! Don't worry, they won't be activated unless you set %HTML.Trusted
|
||||
to true, but they certainly help out users who need to put, say, forms
|
||||
on their page and don't want to go through the trouble of reading this
|
||||
and implementing it themselves.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
So any of the above that you implement for your own application could
|
||||
help out some other poor sap on the other side of the globe. Help us
|
||||
out, and send back code so that it can be hammered into a module and
|
||||
released with the core. Any code would be greatly appreciated!
|
||||
</p>
|
||||
|
||||
<h2>And now...</h2>
|
||||
|
||||
<p>
|
||||
Enough philosophical talk, time for some code:
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
if ($def = $config->maybeGetRawHTMLDefinition()) {
    // our code will go here
}</pre>
|
||||
|
||||
<p>
|
||||
Assuming that HTML Purifier has already been properly loaded (hint:
|
||||
include <code>HTMLPurifier.auto.php</code>), this code will set up
|
||||
the environment that you need to start customizing the HTML definition.
|
||||
What's going on?
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
The first three lines are regular configuration code:
|
||||
<ul>
|
||||
<li>
|
||||
%HTML.DefinitionID is set to a unique identifier for your
|
||||
custom HTML definition. This prevents it from clobbering
|
||||
other custom definitions on the same installation.
|
||||
</li>
|
||||
<li>
|
||||
%HTML.DefinitionRev is a revision integer of your HTML
|
||||
definition. Because HTML definitions are cached, you'll need
|
||||
to increment this whenever you make a change in order to flush
|
||||
the cache.
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>
|
||||
The fourth line retrieves a raw <code>HTMLPurifier_HTMLDefinition</code>
|
||||
object that we will be tweaking. Interestingly enough, we have
|
||||
placed it in an if block: this is because
|
||||
<code>maybeGetRawHTMLDefinition</code>, as its name suggests, may
|
||||
return a NULL, in which case we should skip doing any
|
||||
initialization. This, in fact, will correspond to when our fully
|
||||
customized object is already in the cache.
|
||||
</li>
|
||||
</ul>
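<p>
  As a rough sketch of where this skeleton sits in a complete script, the
  configuration object is handed to the purifier once the customization
  block has run. The include path and the sample input below are
  assumptions made for illustration; adjust them to your installation.
</p>

```php
<?php
// Assumption: HTMLPurifier.auto.php is reachable via the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
if ($def = $config->maybeGetRawHTMLDefinition()) {
    // Customizations go here; this block is skipped when a cached
    // definition already exists.
}

// The same $config object is then passed to the purifier as usual.
$purifier = new HTMLPurifier($config);
$dirtyHtml = '<p onclick="evil()">Hello</p>';   // hypothetical input
echo $purifier->purify($dirtyHtml);
```

<p>
  Nothing about the purification call itself changes: all customization
  happens on the configuration object before the purifier is created.
</p>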
|
||||
|
||||
<h2>Turn off caching</h2>
|
||||
|
||||
<p>
|
||||
To make development easier, we're going to temporarily turn off
|
||||
definition caching:
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
<strong>$config->set('Cache.DefinitionImpl', null); // TODO: remove this later!</strong>
$def = $config->getHTMLDefinition(true);</pre>
|
||||
|
||||
<p>
|
||||
A few things should be mentioned about the caching mechanism before
|
||||
we move on. For performance reasons, HTML Purifier caches generated
|
||||
<code>HTMLPurifier_Definition</code> objects in serialized files
|
||||
stored (by default) in <code>library/HTMLPurifier/DefinitionCache/Serializer</code>.
|
||||
A lot of processing is done in order to create these objects, so it
|
||||
makes little sense to repeat the same processing over and over again
|
||||
whenever HTML Purifier is called.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
In order to identify a cache entry, HTML Purifier uses three variables:
|
||||
the library's version number, the value of %HTML.DefinitionRev and
|
||||
a serial of relevant configuration. Whenever any of these changes,
|
||||
a new HTML definition is generated. Notice that there is no way
|
||||
for the definition object to track changes to customizations: here, it
|
||||
is up to you to supply appropriate information to DefinitionID and
|
||||
DefinitionRev.
|
||||
</p>
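<p>
  To make the three-part cache identity concrete, here is a minimal sketch
  in plain PHP. The function <code>sketchCacheKey()</code> and its key
  layout are hypothetical and are not HTML Purifier's actual
  implementation; they only illustrate why bumping the revision forces a
  fresh definition.
</p>

```php
<?php
// Hypothetical helper: NOT HTML Purifier's real key format.
function sketchCacheKey(string $libraryVersion, int $definitionRev, array $relevantConfig): string
{
    // Sort so that the order in which directives were set does not matter.
    ksort($relevantConfig);
    $configSerial = sha1(serialize($relevantConfig));
    return $libraryVersion . ',' . $configSerial . ',' . $definitionRev;
}

$settings = ['HTML.DefinitionID' => 'enduser-customize.html tutorial'];
$before = sketchCacheKey('4.10.0', 1, $settings);
$after  = sketchCacheKey('4.10.0', 2, $settings);   // revision bumped

// The keys differ, so the old cache entry would no longer be found.
var_dump($before === $after);   // bool(false)
```

<p>
  Note what is missing from the key: the customizations themselves. That is
  why a change inside your customization code goes unnoticed until you
  change the revision (or the ID) yourself.
</p>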
|
||||
|
||||
<h2 id="addAttribute">Add an attribute</h2>
|
||||
|
||||
<p>
|
||||
For this example, we're going to implement the <code>target</code> attribute found
|
||||
on <code>a</code> elements. To implement an attribute, we have to
|
||||
ask a few questions:
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>What element is it found on?</li>
|
||||
<li>What is its name?</li>
|
||||
<li>Is it required or optional?</li>
|
||||
<li>What are valid values for it?</li>
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
The first three are easy: the element is <code>a</code>, the attribute
|
||||
is <code>target</code>, and it is not a required attribute. (If it
were required, we'd need to append an asterisk to the attribute name;
you'll see this in the <code>addElement()</code> example.)
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The last question is a little trickier.
|
||||
Let's allow the special values: _blank, _self, _parent and _top.
|
||||
The form of this is called an <strong>enumeration</strong>, a list of
|
||||
valid values, although only one can be used at a time. To translate
|
||||
this into code form, we write:
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // remove this later!
$def = $config->getHTMLDefinition(true);
<strong>$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');</strong></pre>
|
||||
|
||||
<p>
|
||||
The <code>Enum#_blank,_self,_parent,_top</code> does all the magic.
|
||||
The string is split into two parts, separated by a hash mark (#):
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>The first part is the name of what we call an <code>AttrDef</code></li>
|
||||
<li>The second part is the parameter of the above-mentioned <code>AttrDef</code></li>
|
||||
</ol>
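<p>
  The two-part split can be sketched in a few lines of plain PHP. The
  helper <code>splitAttrType()</code> is a hypothetical name used only for
  this illustration; it is not part of HTML Purifier's API.
</p>

```php
<?php
// Hypothetical helper mirroring the "Name#parameter" convention described above.
function splitAttrType(string $shorthand): array
{
    // Limit of 2: a '#' inside the parameter stays part of the parameter.
    $parts = explode('#', $shorthand, 2);
    return [
        'type'      => $parts[0],
        'parameter' => $parts[1] ?? '',   // some types take no parameter
    ];
}

$enum = splitAttrType('Enum#_blank,_self,_parent,_top');
$uri  = splitAttrType('URI');

print_r($enum);
print_r($uri);
```

<p>
  What the parameter means is entirely up to the type named before the
  hash mark, which is why the formats differ from type to type.
</p>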
|
||||
|
||||
<p>
|
||||
If that sounds vague and generic, it's because it is! HTML Purifier defines
|
||||
an assortment of different attribute types one can use, and each of these
|
||||
has its own specialized parameter format. Here are some of the more useful
|
||||
ones:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Type</th>
|
||||
<th>Format</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<th>Enum</th>
|
||||
<td><em>[s:]</em>value1,value2,...</td>
|
||||
<td>
|
||||
Attribute with a number of valid values, one of which may be used. When
|
||||
s: is present, the enumeration is case sensitive.
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Bool</th>
|
||||
<td>attribute_name</td>
|
||||
<td>
|
||||
Boolean attribute, with only one valid value: the name
|
||||
of the attribute.
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>CDATA</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute of arbitrary text. Can also be referred to as <strong>Text</strong>
|
||||
(the specification makes a semantic distinction between the two).
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>ID</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute that specifies a unique ID
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Pixels</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute that specifies an integer pixel length
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Length</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute that specifies a pixel or percentage length
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>NMTOKENS</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute that specifies a number of name tokens, example: the
|
||||
<code>class</code> attribute
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>URI</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute that specifies a URI, example: the <code>href</code>
|
||||
attribute
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Number</th>
|
||||
<td></td>
|
||||
<td>
|
||||
Attribute that specifies a positive integer number
|
||||
</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
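<p>
  As an illustration of the <code>Enum</code> parameter format from the
  table, the following plain PHP sketch checks a value against an
  <code>[s:]value1,value2,...</code> string. The function
  <code>enumAccepts()</code> is hypothetical, and it assumes that
  case-insensitive enumerations list their values in lowercase.
</p>

```php
<?php
// Hypothetical validator for the "[s:]value1,value2,..." format in the table.
function enumAccepts(string $parameter, string $value): bool
{
    // A leading "s:" switches the enumeration to case-sensitive matching.
    $caseSensitive = strncmp($parameter, 's:', 2) === 0;
    if ($caseSensitive) {
        $parameter = substr($parameter, 2);
    }
    $allowed = explode(',', $parameter);
    $value = trim($value);
    if (!$caseSensitive) {
        // Assumption of this sketch: the listed values are lowercase.
        $value = strtolower($value);
    }
    return in_array($value, $allowed, true);
}

var_dump(enumAccepts('get,post', 'POST'));     // bool(true)
var_dump(enumAccepts('s:get,post', 'POST'));   // bool(false)
```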
|
||||
|
||||
<p>
|
||||
For a complete list, consult
|
||||
<a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/AttrTypes.php"><code>library/HTMLPurifier/AttrTypes.php</code></a>;
|
||||
more information on attributes that accept parameters can be found on their
|
||||
respective includes in
|
||||
<a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/AttrDef"><code>library/HTMLPurifier/AttrDef</code></a>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Sometimes, the restrictive list in AttrTypes just doesn't cut it. Don't
|
||||
sweat: you can also use a fully instantiated object as the value. The
|
||||
equivalent, verbose form of the above example is:
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // remove this later!
$def = $config->getHTMLDefinition(true);
<strong>$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum(
  array('_blank','_self','_parent','_top')
));</strong></pre>
|
||||
|
||||
<p>
|
||||
Trust me, you'll learn to love the shorthand.
|
||||
</p>
|
||||
|
||||
<h2>Add an element</h2>
|
||||
|
||||
<p>
|
||||
Adding attributes is really small-fry stuff, though, and it was possible
|
||||
to add them (albeit a bit more wordy) prior to 2.0. The real gem of
|
||||
the Advanced API is adding elements. There are five questions to
|
||||
ask when adding a new element:
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>What is the element's name?</li>
|
||||
<li>What content set does this element belong to?</li>
|
||||
<li>What are the allowed children of this element?</li>
|
||||
<li>What attributes does the element allow that are general?</li>
|
||||
<li>What attributes does the element allow that are specific to this element?</li>
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
It's a mouthful, and you'll be slightly lost if you're not familiar with
|
||||
the HTML specification, so let's explain them step by step.
|
||||
</p>
|
||||
|
||||
<h3>Content set</h3>
|
||||
|
||||
<p>
|
||||
The HTML specification defines two major content sets: Inline
|
||||
and Block. Each of these
|
||||
content sets contain a list of elements: Inline contains things like
|
||||
<code>span</code> and <code>b</code> while Block contains things like
|
||||
<code>div</code> and <code>blockquote</code>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
These content sets amount to a macro mechanism for HTML definition. Most
|
||||
elements in HTML are organized into one of these two sets, and most
|
||||
elements in HTML allow elements from one of these sets. If we had
|
||||
to write each element verbatim into each other element's allowed
|
||||
children, we would have ridiculously large lists; instead we use
|
||||
content sets to compactify the declaration.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Practically speaking, there are several useful values you can use here:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Content set</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<th>Inline</th>
|
||||
<td>Character level elements, text</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Block</th>
|
||||
<td>Block-like elements, like paragraphs and lists</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th><em>false</em></th>
|
||||
<td>
|
||||
Any element that doesn't fit into the mold, for example <code>li</code>
|
||||
or <code>tr</code>
|
||||
</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
By specifying a valid value here, all other elements that use that
|
||||
content set will also allow your element, without you having to do
|
||||
anything. If you specify <em>false</em>, you'll have to register
|
||||
your element manually.
|
||||
</p>
|
||||
|
||||
<h3>Allowed children</h3>
|
||||
|
||||
<p>
|
||||
Allowed children defines the elements that this element can contain.
|
||||
The allowed values may range from none to a complex regexp depending on
|
||||
your element.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If you've ever taken a look at the HTML DTDs before, you may have
|
||||
noticed declarations like this:
|
||||
</p>
|
||||
|
||||
<pre>&lt;!ELEMENT LI - O (%flow;)*             -- list item --&gt;</pre>
|
||||
|
||||
<p>
|
||||
The <code>(%flow;)*</code> indicates the allowed children of the
|
||||
<code>li</code> tag: <code>li</code> allows any number of flow
|
||||
elements as its children. (The <code>- O</code> allows the closing tag to be
|
||||
omitted, though in XML this is not allowed.) In HTML Purifier,
|
||||
we'd write it like <code>Flow</code> (here's where the content sets
|
||||
we were discussing earlier come into play). There are three shorthand
|
||||
content models you can specify:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Content model</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<th>Empty</th>
|
||||
<td>No children allowed, like <code>br</code> or <code>hr</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Inline</th>
|
||||
<td>Any number of inline elements and text, like <code>span</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Flow</th>
|
||||
<td>Any number of inline elements, block elements and text, like <code>div</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
This covers 90% of all the cases out there, but what about elements that
|
||||
break the mold like <code>ul</code>? This guy requires at least one
|
||||
child, and the only valid children for it are <code>li</code>. The
|
||||
content model is: <code>Required: li</code>. There are two parts: the
|
||||
first type determines what <code>ChildDef</code> will be used to validate
|
||||
content models. The most common values are:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Type</th>
|
||||
<th>Description</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<th>Required</th>
|
||||
<td>Children must be one or more of the valid elements</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Optional</th>
|
||||
<td>Children can be any number of the valid elements</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Custom</th>
|
||||
<td>Children must follow the DTD-style regex</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
You can also implement your own <code>ChildDef</code>: this was done
|
||||
for a few special cases in HTML Purifier such as <code>Chameleon</code>
|
||||
(for <code>ins</code> and <code>del</code>), <code>StrictBlockquote</code>
|
||||
and <code>Table</code>.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The second part specifies either valid elements or a regular expression.
|
||||
Valid elements are separated with horizontal bars (|), i.e.
|
||||
"<code>a | b | c</code>". Use #PCDATA to represent plain text.
|
||||
Regular expressions are based off of DTD's style:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>Parentheses () are used for grouping</li>
|
||||
<li>Commas (,) separate elements that should come one after another</li>
|
||||
<li>Horizontal bars (|) indicate one or the other elements should be used</li>
|
||||
<li>Plus signs (+) are used for a one or more match</li>
|
||||
<li>Asterisks (*) are used for a zero or more match</li>
|
||||
<li>Question marks (?) are used for a zero or one match</li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
For example, "<code>a, b?, (c | d), e+, f*</code>" means "In this order,
|
||||
one <code>a</code> element, at most one <code>b</code> element,
|
||||
one <code>c</code> or <code>d</code> element (but not both), one or more
|
||||
<code>e</code> elements, and any number of <code>f</code> elements."
|
||||
Regex veterans should be able to jump right in, and those not so savvy
|
||||
can always copy-paste W3C's content model definitions into HTML Purifier
|
||||
and hope for the best.
|
||||
</p>
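<p>
  For readers who want to see the DTD-style syntax in action, here is a
  minimal sketch in plain PHP that turns such a content model into a PCRE
  pattern and tests a sequence of child element names against it. Both
  function names are hypothetical, and this is not how HTML Purifier
  compiles its own content models; it only demonstrates the semantics of
  the operators listed above.
</p>

```php
<?php
// Hypothetical converter from a DTD-style content model to a PCRE pattern.
function dtdModelToPcre(string $model): string
{
    $body = preg_replace_callback(
        '/[#\w.-]+|[,\s]+/',
        function (array $m): string {
            // Commas and whitespace vanish: sequencing is implicit in PCRE.
            if (preg_match('/^[,\s]+$/', $m[0])) {
                return '';
            }
            // An element name becomes a group that consumes "name;".
            return '(?:' . preg_quote($m[0], '/') . ';)';
        },
        $model
    );
    // Parentheses, |, +, * and ? carry over unchanged.
    return '/^' . $body . '$/';
}

// Hypothetical check: do the child names satisfy the model, in order?
function childrenMatch(string $model, array $children): bool
{
    $subject = $children ? implode(';', $children) . ';' : '';
    return preg_match(dtdModelToPcre($model), $subject) === 1;
}

$model = 'a, b?, (c | d), e+, f*';
var_dump(childrenMatch($model, ['a', 'c', 'e']));        // bool(true)
var_dump(childrenMatch($model, ['a', 'c', 'd', 'e']));   // bool(false)
```

<p>
  The second call fails because the model permits <code>c</code> or
  <code>d</code> at that position, but not both.
</p>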
|
||||
|
||||
<p>
|
||||
A word of warning: while the regex format is extremely flexible on
|
||||
the developer's side, it is
|
||||
quite unforgiving on the user's side. If the user input does not <em>exactly</em>
|
||||
match the specification, the entire contents of the element will
|
||||
be nuked. This is why there are specific content model types like
Optional and Required: while they could be implemented as <code>Custom:
(valid | elements)*</code>, the dedicated classes contain special recovery
measures that make sure as much of the user's original content as possible
gets through. HTML Purifier's core, as a rule, does not use Custom.
|
||||
</p>
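<p>
  The difference between the two behaviours can be sketched in plain PHP.
  Both functions below are hypothetical stand-ins written for this
  illustration; they are not HTML Purifier's classes, and the real recovery
  logic is more involved.
</p>

```php
<?php
// Hypothetical all-or-nothing check, standing in for a Custom content model.
function strictChildren(array $children, array $allowed): array
{
    foreach ($children as $child) {
        if (!in_array($child, $allowed, true)) {
            return [];   // one bad child: every child is discarded
        }
    }
    return $children;
}

// Hypothetical recovering check, standing in for Optional or Required.
function recoveringChildren(array $children, array $allowed): array
{
    $kept = array_filter($children, function ($child) use ($allowed) {
        return in_array($child, $allowed, true);
    });
    return array_values($kept);   // reindex after dropping invalid children
}

$input   = ['li', 'li', 'div', 'li'];   // 'div' is not allowed here
$strict  = strictChildren($input, ['li']);
$lenient = recoveringChildren($input, ['li']);

print_r($strict);    // empty: nothing survives
print_r($lenient);   // the three 'li' children survive
```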
|
||||
|
||||
<p>
|
||||
One final note: you can also use Content Sets inside your valid elements
|
||||
lists or regular expressions. In fact, the Inline and Flow shorthand
content models mentioned above are just abbreviations:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Content model</th>
|
||||
<th>Implementation</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<th>Inline</th>
|
||||
<td>Optional: Inline | #PCDATA</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Flow</th>
|
||||
<td>Optional: Flow | #PCDATA</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
When the definition is compiled, Inline will be replaced with a
|
||||
horizontal-bar separated list of inline elements. Also, notice that
|
||||
it does not contain text: you have to specify that yourself.
|
||||
</p>
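<p>
  The substitution step can be pictured with a small plain PHP sketch. The
  function <code>expandContentSets()</code> and the shortened element list
  are assumptions made for illustration, not HTML Purifier's actual code or
  its full Inline set.
</p>

```php
<?php
// Hypothetical expansion of content set names inside a content model string.
function expandContentSets(string $model, array $contentSets): string
{
    if (!$contentSets) {
        return $model;
    }
    $names = implode('|', array_map('preg_quote', array_keys($contentSets)));
    return preg_replace_callback(
        '/\b(?:' . $names . ')\b/',
        function (array $match) use ($contentSets): string {
            // Replace the set name with its bar-separated member elements.
            return implode(' | ', $contentSets[$match[0]]);
        },
        $model
    );
}

$sets = ['Inline' => ['span', 'b', 'em']];   // shortened, hypothetical list
$expanded = expandContentSets('Optional: Inline | #PCDATA', $sets);
echo $expanded, "\n";   // Optional: span | b | em | #PCDATA
```

<p>
  <code>#PCDATA</code> survives untouched because it was written out
  explicitly; the content set itself contributes only element names.
</p>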
|
||||
|
||||
<h3>Common attributes</h3>
|
||||
|
||||
<p>
|
||||
Congratulations: you have just gotten over the proverbial hump (Allowed
|
||||
children). Common attributes is much simpler, and boils down to
|
||||
one question: does your element have the <code>id</code>, <code>style</code>,
|
||||
<code>class</code>, <code>title</code> and <code>lang</code> attributes?
|
||||
If so, you'll want to specify the <code>Common</code> attribute collection,
|
||||
which contains these five attributes that are found on almost every
|
||||
HTML element in the specification.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
There are a few more collections, but they're really edge cases:
|
||||
</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Collection</th>
|
||||
<th>Attributes</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<th>I18N</th>
|
||||
<td><code>lang</code>, possibly <code>xml:lang</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Core</th>
|
||||
<td><code>style</code>, <code>class</code>, <code>id</code> and <code>title</code></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
Common is a combination of the above-mentioned collections.
|
||||
</p>
|
||||
|
||||
<p class="aside">
|
||||
Readers familiar with the modularization may have noticed that the Core
|
||||
attribute collection differs from that specified by the <a
|
||||
href="http://www.w3.org/TR/xhtml-modularization/abstract_modules.html#s_commonatts">abstract
|
||||
modules of the XHTML Modularization 1.1</a>. We believe this section
|
||||
to be in error, as <code>br</code> permits the use of the <code>style</code>
|
||||
attribute even though it uses the <code>Core</code> collection, and
|
||||
the DTD and XML Schemas supplied by W3C support our interpretation.
|
||||
</p>
|
||||
|
||||
<h3>Attributes</h3>
|
||||
|
||||
<p>
|
||||
If you didn't read the <a href="#addAttribute">earlier section on
|
||||
adding attributes</a>, read it now. The last parameter is simply
|
||||
an array of attribute names to attribute implementations, in the exact
|
||||
same format as <code>addAttribute()</code>.
|
||||
</p>
|
||||
|
||||
<h3>Putting it all together</h3>
|
||||
|
||||
<p>
|
||||
We're going to implement <code>form</code>. Before we embark, let's
|
||||
grab a reference implementation from over at the
|
||||
<a href="http://www.w3.org/TR/html4/sgml/loosedtd.html">transitional DTD</a>:
|
||||
</p>
|
||||
|
||||
<pre>&lt;!ELEMENT FORM - - (%flow;)* -(FORM)   -- interactive form --&gt;
&lt;!ATTLIST FORM
  %attrs;                              -- %coreattrs, %i18n, %events --
  action      %URI;          #REQUIRED -- server-side form handler --
  method      (GET|POST)     GET       -- HTTP method used to submit the form--
  enctype     %ContentType;  "application/x-www-form-urlencoded"
  accept      %ContentTypes; #IMPLIED  -- list of MIME types for file upload --
  name        CDATA          #IMPLIED  -- name of form for scripting --
  onsubmit    %Script;       #IMPLIED  -- the form was submitted --
  onreset     %Script;       #IMPLIED  -- the form was reset --
  target      %FrameTarget;  #IMPLIED  -- render in this frame --
  accept-charset %Charsets;  #IMPLIED  -- list of supported charsets --
  &gt;</pre>
|
||||
|
||||
<p>
|
||||
Juicy! With just this, we can answer four of our five questions:
|
||||
</p>
|
||||
|
||||
<ol>
|
||||
<li>What is the element's name? <strong>form</strong></li>
|
||||
<li>What content set does this element belong to? <strong>Block</strong>
|
||||
(this needs a little sleuthing, I find the easiest way is to search
|
||||
the DTD for <code>FORM</code> and determine which set it is in.)</li>
|
||||
<li>What are the allowed children of this element? <strong>Zero
|
||||
or more flow elements, but no nested <code>form</code>s</strong></li>
|
||||
<li>What attributes does the element allow that are general? <strong>Common</strong></li>
|
||||
<li>What attributes does the element allow that are specific to this element? <strong>A whole bunch, see ATTLIST;
|
||||
we're going to do the vital ones: <code>action</code>, <code>method</code> and <code>name</code></strong></li>
|
||||
</ol>
|
||||
|
||||
<p>
|
||||
Time for some code:
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // remove this later!
$def = $config->getHTMLDefinition(true);
$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum(
    array('_blank','_self','_parent','_top')
));
<strong>$form = $def->addElement(
    'form',   // name
    'Block',  // content set
    'Flow',   // allowed children
    'Common', // attribute collection
    array(    // attributes
        'action*' => 'URI',
        'method'  => 'Enum#get,post',
        'name'    => 'ID'
    )
);
$form->excludes = array('form' => true);</strong></pre>
|
||||
|
||||
<p>
|
||||
Each of the parameters corresponds to one of the questions we asked.
|
||||
Notice that we added an asterisk to the end of the <code>action</code>
|
||||
attribute to indicate that it is required. If someone specifies a
|
||||
<code>form</code> without that attribute, the tag will be axed.
|
||||
Also, the extra line at the end is a special extra declaration that
|
||||
prevents forms from being nested within each other.
|
||||
</p>
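<p>
As a rough sketch of how such a definition might be exercised, the snippet
below registers <code>form</code> and then purifies two inputs, one without
the required <code>action</code> attribute and one with it. The definition ID
string, the extra <code>enctype</code> entry and the sample markup are
assumptions made for illustration; the comma-separated <code>Enum#</code>
shorthand is the one used for <code>addAttribute()</code> elsewhere in this
tutorial.
</p>

<pre>require_once '/path/to/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'form sketch'); // hypothetical ID
$config->set('HTML.DefinitionRev', 1);
if ($def = $config->maybeGetRawHTMLDefinition()) {
    $form = $def->addElement(
        'form', 'Block', 'Flow', 'Common',
        array(
            'action*' => 'URI',           // asterisk marks it as required
            'method'  => 'Enum#get,post',
            'name'    => 'ID',
            // assumption: only the two common encodings are allowed
            'enctype' => 'Enum#application/x-www-form-urlencoded,multipart/form-data',
        )
    );
    $form->excludes = array('form' => true); // no nested forms
}
$purifier = new HTMLPurifier($config);

// The first form lacks the required action attribute; the second has it.
echo $purifier->purify('<form method="post"><p>No action</p></form>');
echo $purifier->purify('<form action="submit.php" method="post"><p>Has action</p></form>');</pre>

<p>
Comparing the two outputs is a quick way to confirm that the required
attribute is being enforced in your installation.
</p>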
|
||||
|
||||
<p>
|
||||
And that's all there is to it! Implementing the rest of the form
|
||||
module is left as an exercise to the user; to see more examples
|
||||
check the <a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/HTMLModule"><code>library/HTMLPurifier/HTMLModule/</code></a> directory
|
||||
in your local HTML Purifier installation.
|
||||
</p>
|
||||
|
||||
<h2>And beyond...</h2>
|
||||
|
||||
<p>
|
||||
Perceptive users may have realized that, to a certain extent, we
|
||||
have simply re-implemented the facilities of XML Schema or the
|
||||
Document Type Definition. What you are seeing here, however, is
|
||||
not just an XML Schema or Document Type Definition: it is a fully
|
||||
expressive method of specifying the definition of HTML that is
|
||||
a portable superset of the capabilities of the two above-mentioned schema
|
||||
languages. What makes HTMLDefinition so powerful is the fact that
|
||||
if we don't have an implementation for a content model or an attribute
|
||||
definition, you can supply it yourself by writing a PHP class.
|
||||
</p>
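<p>
To make the idea of supplying your own attribute definition a little more
concrete, here is a minimal sketch of a custom class. It assumes that
subclasses of <code>HTMLPurifier_AttrDef</code> implement
<code>validate($string, $config, $context)</code> and return the cleaned
value or <code>false</code>; the class name <code>MyAttrDef_TabIndex</code>,
the 0 to 32767 range and the choice of the <code>tabindex</code> attribute
are all hypothetical.
</p>

<pre>require_once '/path/to/library/HTMLPurifier.auto.php';

// Hypothetical attribute definition: accepts integers from 0 to 32767.
class MyAttrDef_TabIndex extends HTMLPurifier_AttrDef
{
    public function validate($string, $config, $context)
    {
        $string = trim($string);
        if ($string === '' || !ctype_digit($string)) {
            return false; // not a plain non-negative integer
        }
        $value = (int) $string;
        return ($value <= 32767) ? (string) $value : false;
    }
}

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'custom attrdef sketch'); // hypothetical ID
$config->set('HTML.DefinitionRev', 1);
if ($def = $config->maybeGetRawHTMLDefinition()) {
    // Attach the custom validator to an attribute on the a element.
    $def->addAttribute('a', 'tabindex', new MyAttrDef_TabIndex());
}</pre>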
|
||||
|
||||
<p>
|
||||
There are many facets of HTMLDefinition beyond the Advanced API I have
|
||||
walked you through today. To find out more about these, you can
|
||||
check out these source files:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li><a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/HTMLModule.php"><code>library/HTMLPurifier/HTMLModule.php</code></a></li>
|
||||
<li><a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ElementDef.php"><code>library/HTMLPurifier/ElementDef.php</code></a></li>
|
||||
</ul>
|
||||
|
||||
<h2 id="optimized">Notes for HTML Purifier 4.2.0 and earlier</h2>
|
||||
|
||||
<p>
|
||||
Previously, this tutorial gave some incorrect template code for
|
||||
editing raw definitions, and that template code will now produce the
|
||||
error <q>Due to a documentation error in previous version of HTML
|
||||
Purifier...</q> Here is how to mechanically transform old-style
|
||||
code into new-style code.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
First, identify all code that edits the raw definition object, and
|
||||
put it together. Ensure none of this code must be run on every
|
||||
request; if some sub-part needs to always be run, move it outside
|
||||
this block. Here is an example below, with the raw definition
|
||||
object code bolded.
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
|
||||
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
|
||||
$config->set('HTML.DefinitionRev', 1);
|
||||
$def = $config->getHTMLDefinition(true);
|
||||
<strong>$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');</strong>
|
||||
$purifier = new HTMLPurifier($config);</pre>
|
||||
|
||||
<p>
|
||||
Next, replace the raw definition retrieval with a
|
||||
maybeGetRawHTMLDefinition method call inside an if conditional, and
|
||||
place the editing code inside that if block.
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
|
||||
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
|
||||
$config->set('HTML.DefinitionRev', 1);
|
||||
<strong>if ($def = $config->maybeGetRawHTMLDefinition()) {
|
||||
    $def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');
|
||||
}</strong>
|
||||
$purifier = new HTMLPurifier($config);</pre>
|
||||
|
||||
<p>
|
||||
And you're done! Alternatively, if you're OK with not ever caching
|
||||
your code, the following will still work and not emit warnings.
|
||||
</p>
|
||||
|
||||
<pre>$config = HTMLPurifier_Config::createDefault();
|
||||
$def = $config->getHTMLDefinition(true);
|
||||
$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');
|
||||
$purifier = new HTMLPurifier($config);</pre>
|
||||
|
||||
<p>
|
||||
A slightly less efficient version of this was what was going on with
|
||||
old versions of HTML Purifier.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<em>Technical notes:</em> ajh pointed out <a
href="http://htmlpurifier.org/phorum/read.php?5,5164,5169#msg-5169">in a forum topic</a> that
HTML Purifier appeared to be repeatedly writing to the cache even
when a cache entry already existed. Investigation led to the
|
||||
discovery of the following infelicity: caching of customized
|
||||
definitions didn't actually work! The problem was that even though
|
||||
a cache file would be written out at the end of the process, there
|
||||
was no way for HTML Purifier to say, <q>Actually, I've already got a
|
||||
copy of your work, no need to reconfigure your
|
||||
customizations</q>. This required the API to change: placing
|
||||
all of the customizations to the raw definition object in a
|
||||
conditional which could be skipped.
|
||||
</p>
|
||||
|
||||
</body></html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
148
htmlpurifier-4.10.0/docs/enduser-id.html
Executable file
@ -0,0 +1,148 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Explains various methods for allowing IDs in documents safely in HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>IDs - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1 class="subtitled">IDs</h1>
|
||||
<div class="subtitle">What they are, why you should(n't) wear them, and how to deal with it</div>
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Prior to HTML Purifier 1.2.0, this library blithely accepted user input that
|
||||
looked like this:</p>
|
||||
|
||||
<pre><a id="fragment">Anchor</a></pre>
|
||||
|
||||
<p>...presenting an attractive vector for those that would destroy standards
|
||||
compliance: simply set the ID to one that is already used elsewhere in the
|
||||
document and voila: validation breaks. There was a half-hearted attempt to
|
||||
prevent this by allowing users to blacklist IDs, but I suspect that no one
|
||||
really bothered, and thus, with the release of 1.2.0, IDs are now <em>removed</em>
|
||||
by default.</p>
|
||||
|
||||
<p>IDs, however, are quite useful functionality to have, so if users start
|
||||
complaining about broken anchors you'll probably want to turn them back on
|
||||
with %Attr.EnableID. But before you go mucking around with the config
|
||||
object, it's probably worth taking some precautions to keep your page
|
||||
validating. Why?</p>
|
||||
|
||||
<ol>
|
||||
<li>Standards-compliant pages are good</li>
|
||||
<li>Duplicated IDs interfere with anchors. If there are two id="foobar"s in a
|
||||
document, which spot does a browser presented with the fragment #foobar go
|
||||
to? Most browsers opt for the first appearing ID, making it impossible
|
||||
to reference the second section. Similarly, duplicated IDs can hijack
|
||||
client-side scripting that relies on the IDs of elements.</li>
|
||||
</ol>
|
||||
|
||||
<p>You have (currently) four ways of dealing with the problem.</p>
|
||||
|
||||
|
||||
|
||||
<h2 class="subtitled">Blacklisting IDs</h2>
|
||||
<div class="subsubtitle">Good for pages with single content source and stable templates</div>
|
||||
|
||||
<p>In keeping with the
|
||||
<acronym title="Keep It Simple, Stupid">KISS</acronym> principle, let us
|
||||
deal with the most obvious solution: preventing users from using any IDs that
|
||||
appear elsewhere on the document. The method is simple:</p>
|
||||
|
||||
<pre>$config->set('Attr.EnableID', true);
|
||||
$config->set('Attr.IDBlacklist', array(
|
||||
'list', 'of', 'attribute', 'values', 'that', 'are', 'forbidden'
|
||||
));</pre>
|
||||
|
||||
<p>That being said, there are some notable drawbacks. First of all, you have to
|
||||
know precisely which IDs are being used by the HTML surrounding the user code.
|
||||
This is easier said than done: quite often the page designer and the system
|
||||
coder work separately, so the designer has to constantly be talking with the
|
||||
coder whenever he decides to add a new anchor. Miss one and you open yourself
|
||||
to possible standards-compliance issues.</p>
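<p>One way to reduce the risk of missing an ID is to collect the IDs from the
template automatically. The following is only a sketch: the helper name
<code>collect_template_ids()</code> and the sample template are hypothetical,
it relies on PHP's DOM extension rather than on HTML Purifier, and it only
helps if the template markup is available as a string.</p>

<pre><?php
// Hypothetical helper, not part of HTML Purifier: gather every id used
// by the page template so the list can be fed to Attr.IDBlacklist.
function collect_template_ids($templateHtml)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($templateHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $ids = array();
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@id]') as $element) {
        $ids[$element->getAttribute('id')] = true; // keys de-duplicate
    }
    return array_keys($ids);
}

$template = '<div id="header"><a id="top">Top</a></div><div id="content"></div>';
$blacklist = collect_template_ids($template);

// With a config object at hand, the result would then be passed on:
// $config->set('Attr.EnableID', true);
// $config->set('Attr.IDBlacklist', $blacklist);</pre>

<p>This keeps the blacklist in step with the template, but it does nothing for
the multiple-submission problem described next.</p>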
|
||||
|
||||
<p>Furthermore, this position becomes untenable when a single web page must hold
|
||||
multiple portions of user-submitted content. Since there's obviously no way
|
||||
to find out before-hand what IDs users will use, the blacklist is helpless.
|
||||
And since HTML Purifier validates each segment separately, perhaps doing
|
||||
so at different times, it would be extremely difficult to dynamically update
|
||||
the blacklist in between runs.</p>
|
||||
|
||||
<p>Finally, simply destroying the ID is extremely un-userfriendly behavior: after
|
||||
all, they might have simply specified a duplicate ID by accident.</p>
|
||||
|
||||
<p>Thus, we get to our second method.</p>
|
||||
|
||||
|
||||
|
||||
<h2 class="subtitled">Namespacing IDs</h2>
|
||||
<div class="subsubtitle">Lazy developer's way, but needs user education</div>
|
||||
|
||||
<p>This method, too, is quite simple: add a prefix to all user IDs. With this
|
||||
code:</p>
|
||||
|
||||
<pre>$config->set('Attr.EnableID', true);
|
||||
$config->set('Attr.IDPrefix', 'user_');</pre>
|
||||
|
||||
<p>...this:</p>
|
||||
|
||||
<pre><a id="foobar">Anchor!</a></pre>
|
||||
|
||||
<p>...turns into:</p>
|
||||
|
||||
<pre><a id="user_foobar">Anchor!</a></pre>
|
||||
|
||||
<p>As long as you don't have any IDs that start with user_, collisions are
|
||||
guaranteed not to happen. The drawback is obvious: if a user submits
|
||||
id="foobar", they probably expect to be able to reference their page with
|
||||
#foobar. You'll have to tell them, "No, that doesn't work, you have to add
|
||||
user_ to the beginning."</p>
|
||||
|
||||
<p>And yes, things get hairier. Even with a nice prefix, we still have done
|
||||
nothing about multiple HTML Purifier outputs on one page. Thus, we have
|
||||
a second configuration value to piggy-back off of: %Attr.IDPrefixLocal:</p>
|
||||
|
||||
<pre>$config->set('Attr.IDPrefixLocal', 'comment' . $id . '_');</pre>
|
||||
|
||||
<p>This new directive does nothing but append to the regular IDPrefix, but it is
special in that it is volatile: its value is determined at run-time and
cannot possibly be cordoned into, say, a .ini config file. What you
put into the directive is up to you, but I would recommend the ID number
the text has been assigned in the database. Whatever you pick, however, it
has to be unique and stable for the text you are validating. Note, however,
that we require that %Attr.IDPrefix be set before you use this directive.</p>
|
||||
|
||||
<p>And also remember: the user has to know what this prefix is too!</p>
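<p>If you want to show users the fragment they should link to, a small helper
can compose it for them. This sketch only mirrors the composition described
above (the prefix, then the local prefix, then the user's ID); it does not
call HTML Purifier, and the function name <code>user_fragment()</code> is
hypothetical.</p>

<pre><?php
// Illustration only: builds the fragment a user would have to link to,
// assuming IDs come out as IDPrefix + IDPrefixLocal + original ID.
function user_fragment($userId, $idPrefix, $idPrefixLocal = '')
{
    return '#' . $idPrefix . $idPrefixLocal . $userId;
}

$commentId = 42; // e.g. the comment's database ID
echo user_fragment('foobar', 'user_', 'comment' . $commentId . '_');</pre>

<p>Printing this next to the submission form is one low-effort way to handle
the user education this method requires.</p>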
|
||||
|
||||
|
||||
|
||||
<h2>Abstinence</h2>
|
||||
|
||||
<p>You may not want to bother. That's okay too, just don't enable IDs.</p>
|
||||
|
||||
<p>Personally, I would take this road whenever user-submitted content could
|
||||
possibly be shown together on one page. Why a blog comment would need to use
|
||||
anchors is beyond me.</p>
|
||||
|
||||
|
||||
|
||||
<h2>Denial</h2>
|
||||
|
||||
<p>To revert back to pre-1.2.0 behavior, simply:</p>
|
||||
|
||||
<pre>$config->set('Attr.EnableID', true);</pre>
|
||||
|
||||
<p>Don't come crying to me when your page mysteriously stops validating, though.</p>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
59
htmlpurifier-4.10.0/docs/enduser-overview.txt
Executable file
@ -0,0 +1,59 @@
|
||||
|
||||
HTML Purifier
|
||||
by Edward Z. Yang
|
||||
|
||||
There are a number of ad hoc HTML filtering solutions out there on the web
|
||||
(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that
|
||||
claim to filter HTML properly, preventing malicious JavaScript and
layout-breaking HTML from getting through the parser. None of them, however,
demonstrates a thorough knowledge of either the DTD that defines the HTML
or the caveats of HTML that cannot be expressed by a DTD. Configurable
filters (such as kses or PHP's built-in strip_tags() function) have trouble
|
||||
validating the contents of attributes and can be subject to security attacks
|
||||
due to poor configuration. Other filters take the naive approach of
|
||||
blacklisting known threats and tags, failing to account for the introduction
|
||||
of new technologies, new tags, new attributes or quirky browser behavior.
|
||||
|
||||
However, HTML Purifier takes a different approach, one that doesn't use
|
||||
specification-ignorant regexes or narrow blacklists. HTML Purifier will
|
||||
decompose the whole document into tokens, and rigorously process the tokens by:
|
||||
removing non-whitelisted elements, transforming bad practice tags like <font>
|
||||
into <span>, properly checking the nesting of tags and their children and
|
||||
validating all attributes according to their RFCs.
|
||||
|
||||
To my knowledge, there is nothing like this on the web yet. Not even MediaWiki,
|
||||
which allows an amazingly diverse mix of HTML and wikitext in its documents,
|
||||
gets all the nesting quirks right. Existing solutions hope that no JavaScript
|
||||
will slip through, but either do not attempt to ensure that the resulting
|
||||
output is valid XHTML or send the HTML through a draconian XML parser (and yet
|
||||
still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a>
|
||||
tags from being nested within each other).
|
||||
|
||||
This document no longer is a detailed description of how HTMLPurifier works,
|
||||
as those descriptions have been moved to the appropriate code. The first
|
||||
draft was drawn up after two rough code sketches and the implementation of a
|
||||
forgiving lexer. You may also be interested in the unit tests located in the
|
||||
tests/ folder, which provide a living document on how exactly the filter deals
|
||||
with malformed input.
|
||||
|
||||
In summary (see corresponding classes for more details):
|
||||
|
||||
1. Parse document into an array of tag and text tokens (Lexer)
|
||||
2. Remove all elements not on whitelist and transform certain other elements
|
||||
into acceptable forms (e.g. <font>)
|
||||
3. Make document well formed while helpfully taking into account certain quirks,
|
||||
such as the fact that <p> tags traditionally are closed by other block-level
|
||||
elements.
|
||||
4. Run through all nodes and check children for proper order (especially
|
||||
important for tables).
|
||||
5. Validate attributes according to more restrictive definitions based on the
|
||||
RFCs.
|
||||
6. Translate back into a string. (Generator)
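
From the caller's side, all six steps sit behind a single method call. The
following is a minimal sketch, assuming the library lives under
/path/to/library and that the default configuration is acceptable; the sample
input string is made up for illustration.

    require_once '/path/to/library/HTMLPurifier.auto.php';

    // Default configuration; steps 1 to 6 all run inside purify().
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);

    $dirty_html = '<font color="red">Hi<p>unclosed paragraph';
    $clean_html = $purifier->purify($dirty_html);
    echo $clean_html;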
|
||||
|
||||
HTML Purifier is best suited for documents that require a rich array of
|
||||
HTML tags. Things like blog comments are, in all likelihood, most appropriately
|
||||
written in an extremely restrictive set of markup that doesn't require
|
||||
all this functionality (or not written in HTML at all), although this may
|
||||
be changing in the future with the addition of levels of filtering.
|
||||
|
||||
vim: et sw=4 sts=4
|
18
htmlpurifier-4.10.0/docs/enduser-security.txt
Executable file
@ -0,0 +1,18 @@
|
||||
|
||||
Security
|
||||
|
||||
Like anything that claims to afford security, HTML Purifier can be circumvented
through human negligence. This class will do its job: no more, no less,
|
||||
and it's up to you to provide it the proper information and proper context
|
||||
to be effective. Things to remember:
|
||||
|
||||
1. Character Encoding: see enduser-utf8.html for more info.
|
||||
|
||||
2. IDs: see enduser-id.html for more info
|
||||
|
||||
3. URIs: see enduser-uri-filter.html
|
||||
|
||||
4. CSS: document pending
|
||||
Explain which CSS styles we blocked and why.
|
||||
|
||||
vim: et sw=4 sts=4
|
120
htmlpurifier-4.10.0/docs/enduser-slow.html
Executable file
@ -0,0 +1,120 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Explains how to speed up HTML Purifier through caching or inbound filtering." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Speeding up HTML Purifier - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1 class="subtitled">Speeding up HTML Purifier</h1>
|
||||
<div class="subtitle">...also known as the HELP ME LIBRARY IS TOO SLOW MY PAGE TAKE TOO LONG page</div>
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>HTML Purifier is a very powerful library. But with power comes great
|
||||
responsibility, in the form of longer execution times. Remember, this
|
||||
library isn't lightly grazing over submitted HTML: it's deconstructing
|
||||
the whole thing, rigorously checking the parts, and then putting it back
|
||||
together. </p>
|
||||
|
||||
<p>So, if it so turns out that HTML Purifier is kinda too slow for outbound
|
||||
filtering, you've got a few options: </p>
|
||||
|
||||
<h2>Inbound filtering</h2>
|
||||
|
||||
<p>Perform filtering of HTML when it's submitted by the user. Since the
|
||||
user is already submitting something, an extra half a second tacked on
|
||||
to the load time probably isn't going to be that huge of a problem.
|
||||
Then, displaying the content is a simple matter of outputting it
|
||||
directly from your database/filesystem. The trouble with this method is
|
||||
that your user loses the original text, and when doing edits, will be
|
||||
handling the filtered text. While this may be a good thing, especially
|
||||
if you're using a WYSIWYG editor, it can also result in data-loss if a
|
||||
user makes a typo. </p>
|
||||
|
||||
<p>Example (non-functional):</p>
|
||||
|
||||
<pre><?php
|
||||
/**
|
||||
* FORM SUBMISSION PAGE
|
||||
* display_error($message) : displays nice error page with message
|
||||
* display_success() : displays a nice success page
|
||||
* display_form() : displays the HTML submission form
|
||||
* database_insert($html) : inserts data into database as new row
|
||||
*/
|
||||
if (!empty($_POST)) {
    require_once '/path/to/library/HTMLPurifier.auto.php';
    require_once 'HTMLPurifier.func.php';
    $dirty_html = isset($_POST['html']) ? $_POST['html'] : false;
    if (!$dirty_html) {
        display_error('You must write some HTML!');
        exit; // stop here so nothing is inserted
    }
    $html = HTMLPurifier($dirty_html);
    database_insert($html);
    display_success();
    // notice that $dirty_html is *not* saved
} else {
    display_form();
}
|
||||
?></pre>
|
||||
|
||||
<h2>Caching the filtered output</h2>
|
||||
|
||||
<p>Accept the submitted text and put it unaltered into the database, but
|
||||
then also generate a filtered version and stash that in the database.
|
||||
Serve the filtered version to readers, and the unaltered version to
|
||||
editors. If need be, you can invalidate the cache and have the cached
|
||||
filtered version be regenerated on the first page view. Pros? Full data
|
||||
retention. Cons? It's more complicated, and opens other editors up to
|
||||
XSS if they are using a WYSIWYG editor (to fix that, they'd have to be
|
||||
able to get their hands on the *really* original text served in
|
||||
plaintext mode). </p>
|
||||
|
||||
<p>Example (non-functional):</p>
|
||||
|
||||
<pre><?php
|
||||
/**
|
||||
* VIEW PAGE
|
||||
* display_error($message) : displays nice error page with message
|
||||
* cache_get($id) : retrieves HTML from fast cache (db or file)
|
||||
* cache_insert($id, $html) : inserts good HTML into cache system
|
||||
* database_get($id) : retrieves raw HTML from database
|
||||
*/
|
||||
$id = isset($_GET['id']) ? (int) $_GET['id'] : false;
if (!$id) {
    display_error('Must specify ID.');
    exit;
}
$html = cache_get($id); // filesystem or database
if ($html === false) {
    // cache didn't have the HTML, generate it
    $raw_html = database_get($id);
    require_once '/path/to/library/HTMLPurifier.auto.php';
    require_once 'HTMLPurifier.func.php';
    $html = HTMLPurifier($raw_html);
    cache_insert($id, $html);
}
echo $html;
|
||||
?></pre>
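
<p>The view page is only half of the picture: when an editor saves a change,
the stale filtered copy has to be thrown away so it is regenerated on the next
view. Below is a small in-memory sketch of that invalidation step. The arrays
standing in for the database and cache, the function names
<code>save_edit()</code> and <code>view()</code>, and the placeholder filter
are all assumptions; in real use the callable would wrap HTML Purifier.</p>

<pre><?php
// In-memory stand-ins for the database and the cache (assumptions).
$database = array();
$cache = array();

function save_edit($id, $rawHtml, array &$database, array &$cache)
{
    $database[$id] = $rawHtml; // keep the unaltered text for editors
    unset($cache[$id]);        // invalidate the stale filtered copy
}

function view($id, array $database, array &$cache, $purify)
{
    if (!isset($cache[$id])) {
        // cache miss: filter once and remember the result
        $cache[$id] = call_user_func($purify, $database[$id]);
    }
    return $cache[$id];
}

// Placeholder filter that merely escapes; it stands in for a call to
// HTML Purifier and counts how often it runs.
$calls = 0;
$purify = function ($html) use (&$calls) {
    $calls++;
    return htmlspecialchars($html);
};

save_edit(1, '<b>Hello</b>', $database, $cache);
view(1, $database, $cache, $purify); // filters and caches
view(1, $database, $cache, $purify); // served from the cache
save_edit(1, '<b>Hello again</b>', $database, $cache);
echo view(1, $database, $cache, $purify); // filtered again after the edit</pre>

<p>The point of the sketch is the <code>unset()</code> in
<code>save_edit()</code>: the filter runs once per edit rather than once per
page view.</p>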
|
||||
|
||||
<h2>Summary</h2>
|
||||
|
||||
<p>In short, inbound filtering is the simple option and caching is the
|
||||
robust option (albeit with bigger storage requirements). </p>
|
||||
|
||||
<p>There is a third option, independent of the two we've discussed: profile
|
||||
and optimize HTMLPurifier yourself. Be sure to report back your results
|
||||
if you decide to do that! Especially if you port HTML Purifier to C++.
|
||||
<tt>;-)</tt></p>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
231
htmlpurifier-4.10.0/docs/enduser-tidy.html
Executable file
@ -0,0 +1,231 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Tutorial for tweaking HTML Purifier's Tidy-like behavior." />
|
||||
<link rel="stylesheet" type="text/css" href="style.css" />
|
||||
|
||||
<title>Tidy - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1>Tidy</h1>
|
||||
|
||||
<div id="filing">Filed under Development</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>You've probably heard of HTML Tidy, Dave Raggett's little piece
|
||||
of software that cleans up poorly written HTML. Let me say it straight
|
||||
out:</p>
|
||||
|
||||
<p class="emphasis">This ain't HTML Tidy!</p>
|
||||
|
||||
<p>Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier
|
||||
that allows users to submit deprecated elements and attributes and get
|
||||
valid strict markup back. For example:</p>
|
||||
|
||||
<pre><center>Centered</center></pre>
|
||||
|
||||
<p>...becomes:</p>
|
||||
|
||||
<pre><div style="text-align:center;">Centered</div></pre>
|
||||
|
||||
<p>...when this particular fix is run on the HTML. This tutorial will give
|
||||
you the lowdown on what exactly HTML Purifier will do when Tidy
|
||||
is on, and how to fine-tune this behavior. Once again, <strong>you do
|
||||
not need Tidy installed on your PHP to use these features!</strong></p>
|
||||
|
||||
<h2>What does it do?</h2>
|
||||
|
||||
<p>Tidy will do several things to your HTML:</p>
|
||||
|
||||
<ul>
|
||||
<li>Convert deprecated elements and attributes to standards-compliant
|
||||
alternatives</li>
|
||||
<li>Enforce XHTML compatibility guidelines and other best practices</li>
|
||||
<li>Preserve data that would normally be removed as per W3C</li>
|
||||
</ul>
|
||||
|
||||
<h2>What are levels?</h2>
|
||||
|
||||
<p>Levels describe how aggressive the Tidy module should be when
|
||||
cleaning up HTML. There are four levels to pick: none, light, medium
|
||||
and heavy. Each of these levels has a well-defined set of behavior
|
||||
associated with it, although it may change depending on your doctype.</p>
|
||||
|
||||
<dl>
|
||||
<dt>light</dt>
|
||||
<dd>This is the <strong>lenient</strong> level. If a tag or attribute
|
||||
is about to be removed because it isn't supported by the
|
||||
doctype, Tidy will step in and change it into an alternative that
|
||||
is supported.</dd>
|
||||
<dt>medium</dt>
|
||||
<dd>This is the <strong>correctional</strong> level. At this level,
|
||||
all the functions of light are performed, as well as some extra,
|
||||
non-essential best practices enforcement. Changes made on this
|
||||
level are very benign and are unlikely to cause problems.</dd>
|
||||
<dt>heavy</dt>
|
||||
<dd>This is the <strong>aggressive</strong> level. If a tag or
|
||||
attribute is deprecated, it will be converted into a non-deprecated
|
||||
version, no ifs ands or buts.</dd>
|
||||
</dl>
|
||||
|
||||
<p>By default, Tidy operates on the <strong>medium</strong> level. You can
|
||||
change the level of cleaning by setting the %HTML.TidyLevel configuration
|
||||
directive:</p>
|
||||
|
||||
<pre>$config->set('HTML.TidyLevel', 'heavy'); // burn baby burn!</pre>
|
||||
|
||||
<h2>Is the light level really light?</h2>
|
||||
|
||||
<p>It depends on what doctype you're using. If your documents are HTML
|
||||
4.01 <em>Transitional</em>, HTML Purifier will be lazy
|
||||
and won't clean up your <code>center</code>
|
||||
or <code>font</code> tags. But if you're using HTML 4.01 <em>Strict</em>,
|
||||
HTML Purifier has no choice: it has to convert them, or they will
|
||||
be nuked out of existence. So while light on Transitional will result
|
||||
in little to no changes, light on Strict will still result in quite
|
||||
a lot of fixes.</p>
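
<p>If you want to see the difference for yourself, a short experiment along
these lines may help. It is only a sketch: it assumes the library is installed
under <code>/path/to/library</code>, and the sample markup is made up. Swap
the doctype string between the two values and compare what comes out.</p>

<pre>require_once '/path/to/library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Try 'HTML 4.01 Transitional' here as well and compare the results.
$config->set('HTML.Doctype', 'HTML 4.01 Strict');
$config->set('HTML.TidyLevel', 'light');

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<center>Centered</center>');</pre>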
|
||||
|
||||
<p>This is different behavior from 1.6 or before, where deprecated
|
||||
tags in transitional documents would
|
||||
always be cleaned up regardless. This is also better behavior.</p>
|
||||
|
||||
<h2>My pages look different!</h2>
|
||||
|
||||
<p>HTML Purifier is tasked with converting deprecated tags and
|
||||
attributes to standards-compliant alternatives, which usually
|
||||
need copious amounts of CSS. It's also not foolproof: sometimes
|
||||
things do get lost in the translation. This is why when HTML Purifier
|
||||
can get away with not doing cleaning, it won't; this is why
|
||||
the default value is <strong>medium</strong> and not heavy.</p>
|
||||
|
||||
<p>Fortunately, only a few attributes have problems with the switch
|
||||
over. They are described below:</p>
|
||||
|
||||
<table class="table">
|
||||
<thead><tr>
|
||||
<th>Element@Attr</th>
|
||||
<th>Changes</th>
|
||||
</tr></thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>caption@align</td>
|
||||
<td>Firefox supports stuffing the caption on the
|
||||
left and right side of the table, a feature that
|
||||
Internet Explorer, understandably, does not have.
|
||||
When align equals right or left, the text will simply
|
||||
be aligned on the left or right side.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>img@align</td>
|
||||
<td>The implementation for align bottom is good, but not
|
||||
perfect. There are a few pixel differences.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>br@clear</td>
|
||||
<td>Clear both gets a little wonky in Internet Explorer. Haven't
|
||||
really been able to figure out why.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>hr@noshade</td>
|
||||
<td>All browsers implement this slightly differently: we've
|
||||
chosen to make noshade horizontal rules gray.</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<p>There are a few more minor, although irritating, bugs.
|
||||
Some older browsers support deprecated attributes,
|
||||
but not CSS. Transformed elements and attributes will look unstyled
|
||||
to said browsers. Also, CSS precedence is slightly different for
|
||||
inline styles versus presentational markup. In increasing precedence:</p>
|
||||
|
||||
<ol>
|
||||
<li>Presentational attributes</li>
|
||||
<li>External style sheets</li>
|
||||
<li>Inline styling</li>
|
||||
</ol>
|
||||
|
||||
<p>This means that styling that may have been masked by external CSS
|
||||
declarations will start showing up (a good thing, perhaps). Finally,
|
||||
if you've turned off the style attribute, almost all of
|
||||
these transformations will not work. Sorry mates.</p>
|
||||
|
||||
<p>You can review the rendering before and after of these transformations
|
||||
by consulting the <a
|
||||
href="http://htmlpurifier.org/live/smoketests/attrTransform.php">attrTransform.php
|
||||
smoketest</a>.</p>
|
||||
|
||||
<h2>I like the general idea, but the specifics bug me!</h2>
|
||||
|
||||
<p>So you want HTML Purifier to clean up your HTML, but you're not
|
||||
so happy about the br@clear implementation. That's perfectly fine!
|
||||
HTML Purifier will make accommodations:</p>
|
||||
|
||||
<pre>$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
|
||||
$config->set('HTML.TidyLevel', 'heavy'); // all changes, minus...
|
||||
<strong>$config->set('HTML.TidyRemove', 'br@clear');</strong></pre>
|
||||
|
||||
<p>That third line does the magic, removing the br@clear fix
|
||||
from the module, ensuring that <code><br clear="all" /></code>
|
||||
will pass through unharmed. The reverse is possible too:</p>
|
||||
|
||||
<pre>$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
|
||||
$config->set('HTML.TidyLevel', 'none'); // no changes, plus...
|
||||
<strong>$config->set('HTML.TidyAdd', 'p@align');</strong></pre>
|
||||
|
||||
<p>In this case, all transformations are shut off, except for the p@align
|
||||
one, which you found handy.</p>
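<p>For context, here is a minimal sketch of how these directives fit into
a complete run. The include path and the sample input are assumptions made
for illustration, so adjust them to your installation:</p>

<pre><?php
// Assumption: the library's auto-include stub lives at this path.
require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.TidyLevel', 'none');   // start with no fixes...
$config->set('HTML.TidyAdd', 'p@align');  // ...then switch on just this one

$purifier = new HTMLPurifier($config);

// Hypothetical input: print the result to see what the fix did to it.
echo $purifier->purify('<p align="center">Centered text</p>');</pre>

<p>Comparing the printed output with the input is the quickest way to
check that a fix name you picked is the one you meant.</p>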
|
||||
|
||||
<p>To find out the names of the fixes you want to turn on or off,
|
||||
you'll have to consult the source code, specifically the files in
|
||||
<code>HTMLPurifier/HTMLModule/Tidy/</code>. There is, however, a
|
||||
general syntax:</p>
|
||||
|
||||
<table class="table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Name</th>
|
||||
<th>Example</th>
|
||||
<th>Interpretation</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>element</td>
|
||||
<td>font</td>
|
||||
<td>Tag transform for <em>element</em></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>element@attr</td>
|
||||
<td>br@clear</td>
|
||||
<td>Attribute transform for <em>attr</em> on <em>element</em></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>@attr</td>
|
||||
<td>@lang</td>
|
||||
<td>Global attribute transform for <em>attr</em></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>e#content_model_type</td>
|
||||
<td>blockquote#content_model_type</td>
|
||||
<td>Change of child processing implementation for <em>e</em></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<h2>So... what's the lowdown?</h2>
|
||||
|
||||
<p>The lowdown is, quite frankly, HTML Purifier's default settings are
|
||||
probably good enough. The next step is to bump the level up to heavy,
|
||||
and if that still doesn't satisfy your appetite, do some fine-tuning.
|
||||
Other than that, don't worry about it: this all works silently and
|
||||
effectively in the background.</p>
|
||||
|
||||
</body></html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
204
htmlpurifier-4.10.0/docs/enduser-uri-filter.html
Executable file
@ -0,0 +1,204 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Tutorial for creating custom URI filters." />
|
||||
<link rel="stylesheet" type="text/css" href="style.css" />
|
||||
|
||||
<title>URI Filters - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1>URI Filters</h1>
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>
|
||||
This is a quick and dirty document to get you on your way to writing
|
||||
custom URI filters for your own URL filtering needs. Why would you
|
||||
want to write a URI filter? If you need URIs your users put into
|
||||
HTML to magically change into a different URI, this is
|
||||
exactly what you need!
|
||||
</p>
|
||||
|
||||
<h2>Creating the class</h2>
|
||||
|
||||
<p>
|
||||
Any URI filter you make will be a subclass of <code>HTMLPurifier_URIFilter</code>.
|
||||
The scaffolding is thus:
|
||||
</p>
|
||||
|
||||
<pre>class HTMLPurifier_URIFilter_<strong>NameOfFilter</strong> extends HTMLPurifier_URIFilter
|
||||
{
|
||||
public $name = '<strong>NameOfFilter</strong>';
|
||||
public function prepare($config) {}
|
||||
public function filter(&$uri, $config, $context) {}
|
||||
}</pre>
|
||||
|
||||
<p>
|
||||
Fill in the variable <code>$name</code> with the name of your filter, and
|
||||
take a look at the two methods. <code>prepare()</code> is an initialization
|
||||
method that is called only once, before any of the HTML has been
|
||||
filtered. Use it to perform any costly setup work that only needs to be done
|
||||
once. <code>filter()</code> is the guts and innards of our filter:
|
||||
it takes the URI and does whatever needs to be done to it.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
If you've worked with HTML Purifier, you'll recognize the <code>$config</code>
|
||||
and <code>$context</code> parameters. On the other hand, <code>$uri</code>
|
||||
is something unique to this section of the application: it's a
|
||||
<code>HTMLPurifier_URI</code> object. The interface is thus:
|
||||
</p>
|
||||
|
||||
<pre>class HTMLPurifier_URI
|
||||
{
|
||||
public $scheme, $userinfo, $host, $port, $path, $query, $fragment;
|
||||
public function __construct($scheme, $userinfo, $host, $port, $path, $query, $fragment);
|
||||
public function toString();
|
||||
public function copy();
|
||||
public function getSchemeObj($config, $context);
|
||||
public function validate($config, $context);
|
||||
}</pre>
|
||||
|
||||
<p>
|
||||
The first three methods are fairly self-explanatory: you have a constructor,
|
||||
a serializer, and a cloner. Generally, you won't be using them when
|
||||
you are manipulating the URI objects themselves.
|
||||
<code>getSchemeObj()</code> is a special purpose method that returns
|
||||
a <code>HTMLPurifier_URIScheme</code> object corresponding to the specific
|
||||
URI at hand. <code>validate()</code> performs general-purpose validation
|
||||
on the internal components of a URI. Once again, you don't need to
|
||||
worry about these: they've already been handled for you.
|
||||
</p>
|
||||
|
||||
<h2>URI format</h2>
|
||||
|
||||
<p>
|
||||
As a URIFilter, we're interested in the member variables of the URI object.
|
||||
</p>
|
||||
|
||||
<table class="quick"><tbody>
|
||||
<tr><th>Scheme</th> <td>The protocol for identifying (and possibly locating) a resource (http, ftp, https)</td></tr>
|
||||
<tr><th>Userinfo</th> <td>User information such as a username (bob)</td></tr>
|
||||
<tr><th>Host</th> <td>Domain name or IP address of the server (example.com, 127.0.0.1)</td></tr>
|
||||
<tr><th>Port</th> <td>Network port number for the server (80, 12345)</td></tr>
|
||||
<tr><th>Path</th> <td>Data that identifies the resource, possibly hierarchical (/path/to, ed@example.com)</td></tr>
|
||||
<tr><th>Query</th> <td>String of information to be interpreted by the resource (?q=search-term)</td></tr>
|
||||
<tr><th>Fragment</th> <td>Additional information for the resource after retrieval (#bookmark)</td></tr>
|
||||
</tbody></table>
|
||||
|
||||
<p>
|
||||
Because the URI is presented to us in this form, and not
|
||||
<code>http://bob@example.com:8080/foo.php?q=string#hash</code>, it saves us
|
||||
a lot of trouble in having to parse the URI every time we want to filter
|
||||
it. For the record, the above URI has the following components:
|
||||
</p>
|
||||
|
||||
<table class="quick"><tbody>
|
||||
<tr><th>Scheme</th> <td>http</td></tr>
|
||||
<tr><th>Userinfo</th> <td>bob</td></tr>
|
||||
<tr><th>Host</th> <td>example.com</td></tr>
|
||||
<tr><th>Port</th> <td>8080</td></tr>
|
||||
<tr><th>Path</th> <td>/foo.php</td></tr>
|
||||
<tr><th>Query</th> <td>q=string</td></tr>
|
||||
<tr><th>Fragment</th> <td>hash</td></tr>
|
||||
</tbody></table>
|
||||
|
||||
<p>
|
||||
Note that there is no question mark or octothorpe in the query or
|
||||
fragment: these get removed during parsing.
|
||||
</p>
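<p>
If you want to see a similar breakdown for yourself, PHP's built-in
<code>parse_url()</code> splits the sample URI in much the same way. This is
only an analogy: HTML Purifier uses its own parser, and
<code>parse_url()</code> calls the userinfo component <code>user</code>.
</p>

<pre><?php
// Split the sample URI with PHP's built-in parse_url().
// Assumption: this only mirrors the table above; it is not HTML Purifier's parser.
$parts = parse_url('http://bob@example.com:8080/foo.php?q=string#hash');

// Print each component on its own line.
foreach (array('scheme', 'user', 'host', 'port', 'path', 'query', 'fragment') as $key) {
    echo $key, ': ', $parts[$key], "\n";
}</pre>

<p>
As in the table, the query and fragment come back without their leading
question mark and octothorpe.
</p>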
|
||||
|
||||
<p>
|
||||
With this information, you can get straight to implementing your
|
||||
<code>filter()</code> method. But one more thing...
|
||||
</p>
|
||||
|
||||
<h2>Return value: Boolean, not URI</h2>
|
||||
|
||||
<p>
|
||||
You may have noticed that the URI is being passed in by reference.
|
||||
This means that whatever changes you make to it, those changes will
|
||||
be reflected in the URI object the caller had. <strong>Do not
|
||||
return the URI object: it is unnecessary and will cause bugs.</strong>
|
||||
Instead, return a boolean value, true if the filtering was successful,
|
||||
or false if the URI is beyond repair and needs to be axed.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Let's suppose I wanted to write a filter that converted links with a
|
||||
custom <code>image</code> scheme to their corresponding real paths on
|
||||
our website:
|
||||
</p>
|
||||
|
||||
<pre>class HTMLPurifier_URIFilter_TransformImageScheme extends HTMLPurifier_URIFilter
|
||||
{
|
||||
public $name = 'TransformImageScheme';
|
||||
public function filter(&$uri, $config, $context) {
|
||||
if ($uri->scheme !== 'image') return true;
|
||||
$img_name = $uri->path;
|
||||
// Overwrite the previous URI object
|
||||
$uri = new HTMLPurifier_URI('http', null, null, null, '/img/' . $img_name . '.png', null, null);
|
||||
return true;
|
||||
}
|
||||
}</pre>
|
||||
|
||||
<p>
|
||||
Notice I did not <code>return $uri;</code>. This filter would turn
|
||||
<code>image:Foo</code> into <code>/img/Foo.png</code>.
|
||||
</p>
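<p>
The by-reference contract is easy to try out without the library. The sketch
below uses a hypothetical <code>DemoUri</code> class as a stand-in for
<code>HTMLPurifier_URI</code> and a hypothetical <code>demo_filter()</code>
function with the same shape as a <code>filter()</code> body. Neither name is
part of HTML Purifier.
</p>

<pre><?php
// Hypothetical stand-in for HTMLPurifier_URI: just two public properties.
class DemoUri
{
    public $scheme;
    public $path;

    public function __construct($scheme, $path)
    {
        $this->scheme = $scheme;
        $this->path = $path;
    }
}

// Same shape as a filter() body: change $uri in place, return a boolean.
function demo_filter(&$uri)
{
    if ($uri->scheme !== 'image') {
        return true;  // not ours: leave it untouched, still a success
    }
    if ($uri->path === null || $uri->path === '') {
        return false; // nothing to map: ask the caller to drop the URI
    }
    // Overwriting $uri only reaches the caller because of the & above.
    $uri = new DemoUri(null, '/img/' . $uri->path . '.png');
    return true;
}

$uri = new DemoUri('image', 'Foo');
$ok = demo_filter($uri);
echo $ok ? $uri->path : '(dropped)', "\n";</pre>

<p>
Without the <code>&</code> in the signature, the reassignment would stay
local to the function and the caller would still hold the old object.
</p>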
|
||||
|
||||
<h2>Activating your filter</h2>
|
||||
|
||||
<p>
|
||||
Having a filter is all well and good, but you need to tell HTML Purifier
|
||||
to use it. Fortunately, this part's simple:
|
||||
</p>
|
||||
|
||||
<pre>$uri = $config->getDefinition('URI');
|
||||
$uri->addFilter(new HTMLPurifier_URIFilter_<strong>NameOfFilter</strong>(), $config);</pre>
|
||||
|
||||
<p>
|
||||
After adding a filter, you won't be able to set configuration directives.
|
||||
Structure your code accordingly.
|
||||
</p>
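<p>
One way to structure the code is sketched below. It assumes the library and
the <code>TransformImageScheme</code> filter class from earlier are already
loaded, and the doctype directive is only a placeholder for whatever settings
you need.
</p>

<pre>$config = HTMLPurifier_Config::createDefault();

// 1. Set every configuration directive first.
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');

// 2. Only then fetch the URI definition and attach the filter.
$uri = $config->getDefinition('URI');
$uri->addFilter(new HTMLPurifier_URIFilter_TransformImageScheme(), $config);

// 3. Build the purifier from the finished configuration.
$purifier = new HTMLPurifier($config);</pre>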
|
||||
|
||||
<!-- XXX: link to new documentation system -->
|
||||
|
||||
<h2>Post-filter</h2>
|
||||
|
||||
<p>
|
||||
Remember our TransformImageScheme filter? That filter acted before we had
|
||||
performed scheme validation; otherwise, the URI would have been filtered
|
||||
out when it was discovered that there was no image scheme. Well, a post-filter
|
||||
is run after scheme specific validation, so it's ideal for bulk
|
||||
post-processing of URIs, including munging. To mark a URI filter as a post-filter,
|
||||
set the <code>$post</code> member variable to TRUE.
|
||||
</p>
|
||||
|
||||
<pre>class HTMLPurifier_URIFilter_MyPostFilter extends HTMLPurifier_URIFilter
|
||||
{
|
||||
public $name = 'MyPostFilter';
|
||||
public $post = true;
|
||||
// ... extra code here
|
||||
}
|
||||
</pre>
|
||||
|
||||
<h2>Examples</h2>
|
||||
|
||||
<p>
|
||||
Check the
|
||||
<a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/URIFilter">URIFilter</a>
|
||||
directory for more implementation examples, and see <a href="proposal-new-directives.txt">the
|
||||
new directives proposal document</a> for ideas on what could be implemented
|
||||
as a filter.
|
||||
</p>
|
||||
|
||||
</body></html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
1060
htmlpurifier-4.10.0/docs/enduser-utf8.html
Executable file
File diff suppressed because it is too large
153
htmlpurifier-4.10.0/docs/enduser-youtube.html
Executable file
@ -0,0 +1,153 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Explains how to safely allow the embedding of flash from trusted sites in HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Embedding YouTube Videos - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1 class="subtitled">Embedding YouTube Videos</h1>
|
||||
<div class="subtitle">...as well as other dangerous active content</div>
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Clients like their YouTube videos. It gives them a warm fuzzy feeling when
|
||||
they see a neat little embedded video player on their websites that can play
|
||||
the latest clips from their documentary "Fido and the Bones of Spring".
|
||||
All joking aside, the ability to embed YouTube videos or other active
|
||||
content in their pages is something that a lot of people like.</p>
|
||||
|
||||
<p>This is a <em>bad</em> idea. The moment you embed anything untrusted,
|
||||
you will definitely be slammed by all manner of nasties that can be
|
||||
embedded in things from your run of the mill Flash movie to
|
||||
<a href="http://blog.spywareguide.com/2006/12/myspace_phish_attack_leads_use.html">Quicktime movies</a>.
|
||||
Even <code>img</code> tags, which HTML Purifier allows by default, can be
|
||||
dangerous. Be distrustful of anything that tells a browser to load content
|
||||
from another website automatically.</p>
|
||||
|
||||
<p>Luckily for us, however, whitelisting saves the day. Sure, letting users
|
||||
include any old random flash file could be dangerous, but if it's
|
||||
from a specific website, it probably is okay. If no amount of pleading will
|
||||
convince the people upstairs that they should settle for just linking
|
||||
to their movies, you may find this technique very useful.</p>
|
||||
|
||||
<h2>Looking in</h2>
|
||||
|
||||
<p>Below is custom code that allows users to embed
|
||||
YouTube videos. This is not favoritism: this trick can easily be adapted for
|
||||
other forms of embeddable content.</p>
|
||||
|
||||
<p>Usually, websites like YouTube give us boilerplate code that you can insert
|
||||
into your documents. YouTube's code goes like this:</p>
|
||||
|
||||
<pre>
|
||||
<object width="425" height="350">
|
||||
<param name="movie" value="http://www.youtube.com/v/AyPzM5WK8ys" />
|
||||
<param name="wmode" value="transparent" />
|
||||
<embed src="http://www.youtube.com/v/AyPzM5WK8ys"
|
||||
type="application/x-shockwave-flash"
|
||||
wmode="transparent" width="425" height="350" />
|
||||
</object>
|
||||
</pre>
|
||||
|
||||
<p>There are two things to note about this code:</p>
|
||||
|
||||
<ol>
|
||||
<li><code><embed></code> is not recognized by W3C, so if you want
|
||||
standards-compliant code, you'll have to get rid of it.</li>
|
||||
<li>The code is exactly the same for all instances, except for the
|
||||
identifier <tt>AyPzM5WK8ys</tt> which tells us which movie file
|
||||
to retrieve.</li>
|
||||
</ol>
|
||||
|
||||
<p>What point 2 means is that if we have code like <code><span
|
||||
class="youtube-embed">AyPzM5WK8ys</span></code> your
|
||||
application can reconstruct the full object from this small snippet that
|
||||
passes through HTML Purifier <em>unharmed</em>.
|
||||
<a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/Filter/YouTube.php">Show me the code!</a></p>
|
||||
|
||||
<p>And the corresponding usage:</p>
|
||||
|
||||
<pre><?php
|
||||
$config->set('Filter.YouTube', true);
|
||||
?></pre>
|
||||
|
||||
<p>There is a bit going on in the two code snippets, so let's explain.</p>
|
||||
|
||||
<ol>
|
||||
<li>This is a Filter object, which intercepts the HTML that is
|
||||
coming into and out of the purifier. You can add as many
|
||||
filter objects as you like. <code>preFilter()</code>
|
||||
processes the code before it gets purified, and <code>postFilter()</code>
|
||||
processes the code afterwards. So, we'll use <code>preFilter()</code> to
|
||||
replace the object tag with a <code>span</code>, and <code>postFilter()</code>
|
||||
to restore it.</li>
|
||||
<li>The first preg_replace call replaces any YouTube code users may have
|
||||
embedded into the benign span tag. Span is used because it is inline,
|
||||
and objects are inline too. We are very careful to be extremely
|
||||
restrictive about what goes inside the span tag: if errant code
|
||||
gets in there it could get messy.</li>
|
||||
<li>The HTML is then purified as usual.</li>
|
||||
<li>Then, another preg_replace replaces the span tag with a fully fledged
|
||||
object. Note that the embed is removed, and, in its place, a data
|
||||
attribute was added to the object. This makes the tag standards
|
||||
compliant! It also breaks Internet Explorer, so we add in a bit of
|
||||
conditional comments with the old embed code to make it work again.
|
||||
It's all quite convoluted but works.</li>
|
||||
</ol>
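<p>The round trip described in the steps above can be sketched with two
plain functions. These are simplified stand-ins with hypothetical names and
patterns, not the regexes shipped in the library's YouTube filter, and they
leave out the Internet Explorer workaround:</p>

<pre><?php
// Hypothetical stand-in for preFilter(): collapse a YouTube object block
// into a span that carries nothing but the video ID.
function demo_pre_filter($html)
{
    $pattern = '#<object[^>]*>.+?https?://www\.youtube\.com/v/([A-Za-z0-9_-]+).+?</object>#s';
    return preg_replace($pattern, '<span class="youtube-embed">\1</span>', $html);
}

// Hypothetical stand-in for postFilter(): rebuild a fixed-size object.
// Only ID characters may sit inside the span, or nothing is replaced.
function demo_post_filter($html)
{
    $pattern = '#<span class="youtube-embed">([A-Za-z0-9_-]+)</span>#';
    $replace = '<object width="425" height="350" type="application/x-shockwave-flash" '
        . 'data="http://www.youtube.com/v/\1">'
        . '<param name="movie" value="http://www.youtube.com/v/\1" /></object>';
    return preg_replace($pattern, $replace, $html);
}

$input = '<object width="425" height="350">'
    . '<param name="movie" value="http://www.youtube.com/v/AyPzM5WK8ys" />'
    . '</object>';

$span = demo_pre_filter($input);   // purification would happen between the two calls
$output = demo_post_filter($span);
echo $span, "\n", $output, "\n";</pre>

<p>The strict character class in the second pattern is the important part:
a span whose contents contain anything but ID characters is left as it is
rather than expanded into an object.</p>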
|
||||
|
||||
<h2>Warning</h2>
|
||||
|
||||
<p>There are a number of possible problems with the code above, depending
|
||||
on how you look at it.</p>
|
||||
|
||||
<h3>Cannot change width and height</h3>
|
||||
|
||||
<p>The width and height of the final YouTube movie cannot be adjusted. This
|
||||
is because I am lazy. If you really insist on letting users change the size
|
||||
of the movie, what you need to do is package up the attributes inside the
|
||||
span tag (along with the movie ID). It gets complicated though: a malicious
|
||||
user can specify an outrageously large height and width and attempt to crash
|
||||
the user's operating system/browser. You need to either cap it by limiting
|
||||
the number of digits allowed in the regex or using a callback to check the
|
||||
number.</p>
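<p>Here is a rough sketch that combines both caps. The span format (ID,
width and height joined by colons), the function names and the size limits
are all assumptions made for illustration, and the closure needs PHP 5.3 or
later:</p>

<pre><?php
// Hypothetical helper: clamp a user-supplied dimension to a sane range.
function demo_clamp_dimension($value, $min = 50, $max = 1200)
{
    return max($min, min($max, (int) $value));
}

// Assumed span contents (not the shipped format): ID:width:height.
function demo_restore_sized($html)
{
    return preg_replace_callback(
        // At most four digits per dimension get past the regex at all.
        '#<span class="youtube-embed">([A-Za-z0-9_-]+):(\d{1,4}):(\d{1,4})</span>#',
        function ($m) {
            // The callback then clamps whatever number did get through.
            $w = demo_clamp_dimension($m[2]);
            $h = demo_clamp_dimension($m[3]);
            return '<object width="' . $w . '" height="' . $h . '" '
                . 'data="http://www.youtube.com/v/' . $m[1] . '"></object>';
        },
        $html
    );
}

echo demo_restore_sized('<span class="youtube-embed">AyPzM5WK8ys:9999:10</span>'), "\n";</pre>

<p>A dimension with five or more digits never matches, so that span is left
alone instead of being turned into an object.</p>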
|
||||
|
||||
<h3>Trusts media's host's security</h3>
|
||||
|
||||
<p>By allowing this code onto our website, we are trusting that YouTube has
|
||||
tech-savvy enough people not to allow their users to inject malicious
|
||||
code into the Flash files. An exploit on YouTube means an exploit on your
|
||||
site. Even though YouTube is run by the reputable Google, it
|
||||
<a href="http://ha.ckers.org/blog/20061213/google-xss-vuln/">doesn't</a>
|
||||
mean they are
|
||||
<a href="http://ha.ckers.org/blog/20061208/xss-in-googles-orkut/">invulnerable.</a>
|
||||
You're putting a certain measure of the job on an external provider (just as
|
||||
you have by entrusting your user input to HTML Purifier), and
|
||||
it is important that you are cognizant of the risk.</p>
|
||||
|
||||
<h3>Poorly written adaptations compromise security</h3>
|
||||
|
||||
<p>This should go without saying, but if you're going to adapt this code
|
||||
for Google Video or the like, make sure you do it <em>right</em>. It's
|
||||
extremely easy to allow a character too many in <code>postFilter()</code> and
|
||||
suddenly you're introducing XSS into HTML Purifier's XSS-free output. HTML
|
||||
Purifier may be well written, but it cannot guard against vulnerabilities
|
||||
introduced after it has finished.</p>
|
||||
|
||||
<h2>Help out!</h2>
|
||||
|
||||
<p>If you write a filter for your favorite video destination (or anything
|
||||
like that, for that matter), send it over and it might get included
|
||||
with the core!</p>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
196
htmlpurifier-4.10.0/docs/entities/xhtml-lat1.ent
Executable file
@ -0,0 +1,196 @@
|
||||
<!-- Portions (C) International Organization for Standardization 1986
|
||||
Permission to copy in any form is granted for use with
|
||||
conforming SGML systems and applications as defined in
|
||||
ISO 8879, provided this notice is included in all copies.
|
||||
-->
|
||||
<!-- Character entity set. Typical invocation:
|
||||
<!ENTITY % HTMLlat1 PUBLIC
|
||||
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
|
||||
%HTMLlat1;
|
||||
-->
|
||||
|
||||
<!ENTITY nbsp " "> <!-- no-break space = non-breaking space,
|
||||
U+00A0 ISOnum -->
|
||||
<!ENTITY iexcl "¡"> <!-- inverted exclamation mark, U+00A1 ISOnum -->
|
||||
<!ENTITY cent "¢"> <!-- cent sign, U+00A2 ISOnum -->
|
||||
<!ENTITY pound "£"> <!-- pound sign, U+00A3 ISOnum -->
|
||||
<!ENTITY curren "¤"> <!-- currency sign, U+00A4 ISOnum -->
|
||||
<!ENTITY yen "¥"> <!-- yen sign = yuan sign, U+00A5 ISOnum -->
|
||||
<!ENTITY brvbar "¦"> <!-- broken bar = broken vertical bar,
|
||||
U+00A6 ISOnum -->
|
||||
<!ENTITY sect "§"> <!-- section sign, U+00A7 ISOnum -->
|
||||
<!ENTITY uml "¨"> <!-- diaeresis = spacing diaeresis,
|
||||
U+00A8 ISOdia -->
|
||||
<!ENTITY copy "©"> <!-- copyright sign, U+00A9 ISOnum -->
|
||||
<!ENTITY ordf "ª"> <!-- feminine ordinal indicator, U+00AA ISOnum -->
|
||||
<!ENTITY laquo "«"> <!-- left-pointing double angle quotation mark
|
||||
= left pointing guillemet, U+00AB ISOnum -->
|
||||
<!ENTITY not "¬"> <!-- not sign = angled dash,
|
||||
U+00AC ISOnum -->
|
||||
<!ENTITY shy "­"> <!-- soft hyphen = discretionary hyphen,
|
||||
U+00AD ISOnum -->
|
||||
<!ENTITY reg "®"> <!-- registered sign = registered trade mark sign,
|
||||
U+00AE ISOnum -->
|
||||
<!ENTITY macr "¯"> <!-- macron = spacing macron = overline
|
||||
= APL overbar, U+00AF ISOdia -->
|
||||
<!ENTITY deg "°"> <!-- degree sign, U+00B0 ISOnum -->
|
||||
<!ENTITY plusmn "±"> <!-- plus-minus sign = plus-or-minus sign,
|
||||
U+00B1 ISOnum -->
|
||||
<!ENTITY sup2 "²"> <!-- superscript two = superscript digit two
|
||||
= squared, U+00B2 ISOnum -->
|
||||
<!ENTITY sup3 "³"> <!-- superscript three = superscript digit three
|
||||
= cubed, U+00B3 ISOnum -->
|
||||
<!ENTITY acute "´"> <!-- acute accent = spacing acute,
|
||||
U+00B4 ISOdia -->
|
||||
<!ENTITY micro "µ"> <!-- micro sign, U+00B5 ISOnum -->
|
||||
<!ENTITY para "¶"> <!-- pilcrow sign = paragraph sign,
|
||||
U+00B6 ISOnum -->
|
||||
<!ENTITY middot "·"> <!-- middle dot = Georgian comma
|
||||
= Greek middle dot, U+00B7 ISOnum -->
|
||||
<!ENTITY cedil "¸"> <!-- cedilla = spacing cedilla, U+00B8 ISOdia -->
|
||||
<!ENTITY sup1 "¹"> <!-- superscript one = superscript digit one,
|
||||
U+00B9 ISOnum -->
|
||||
<!ENTITY ordm "º"> <!-- masculine ordinal indicator,
|
||||
U+00BA ISOnum -->
|
||||
<!ENTITY raquo "»"> <!-- right-pointing double angle quotation mark
|
||||
= right pointing guillemet, U+00BB ISOnum -->
|
||||
<!ENTITY frac14 "¼"> <!-- vulgar fraction one quarter
|
||||
= fraction one quarter, U+00BC ISOnum -->
|
||||
<!ENTITY frac12 "½"> <!-- vulgar fraction one half
|
||||
= fraction one half, U+00BD ISOnum -->
|
||||
<!ENTITY frac34 "¾"> <!-- vulgar fraction three quarters
|
||||
= fraction three quarters, U+00BE ISOnum -->
|
||||
<!ENTITY iquest "¿"> <!-- inverted question mark
|
||||
= turned question mark, U+00BF ISOnum -->
|
||||
<!ENTITY Agrave "À"> <!-- latin capital letter A with grave
|
||||
= latin capital letter A grave,
|
||||
U+00C0 ISOlat1 -->
|
||||
<!ENTITY Aacute "Á"> <!-- latin capital letter A with acute,
|
||||
U+00C1 ISOlat1 -->
|
||||
<!ENTITY Acirc "Â"> <!-- latin capital letter A with circumflex,
|
||||
U+00C2 ISOlat1 -->
|
||||
<!ENTITY Atilde "Ã"> <!-- latin capital letter A with tilde,
|
||||
U+00C3 ISOlat1 -->
|
||||
<!ENTITY Auml "Ä"> <!-- latin capital letter A with diaeresis,
|
||||
U+00C4 ISOlat1 -->
|
||||
<!ENTITY Aring "Å"> <!-- latin capital letter A with ring above
|
||||
= latin capital letter A ring,
|
||||
U+00C5 ISOlat1 -->
|
||||
<!ENTITY AElig "Æ"> <!-- latin capital letter AE
|
||||
= latin capital ligature AE,
|
||||
U+00C6 ISOlat1 -->
|
||||
<!ENTITY Ccedil "Ç"> <!-- latin capital letter C with cedilla,
|
||||
U+00C7 ISOlat1 -->
|
||||
<!ENTITY Egrave "È"> <!-- latin capital letter E with grave,
|
||||
U+00C8 ISOlat1 -->
|
||||
<!ENTITY Eacute "É"> <!-- latin capital letter E with acute,
|
||||
U+00C9 ISOlat1 -->
|
||||
<!ENTITY Ecirc "Ê"> <!-- latin capital letter E with circumflex,
|
||||
U+00CA ISOlat1 -->
|
||||
<!ENTITY Euml "Ë"> <!-- latin capital letter E with diaeresis,
|
||||
U+00CB ISOlat1 -->
|
||||
<!ENTITY Igrave "Ì"> <!-- latin capital letter I with grave,
|
||||
U+00CC ISOlat1 -->
|
||||
<!ENTITY Iacute "Í"> <!-- latin capital letter I with acute,
|
||||
U+00CD ISOlat1 -->
|
||||
<!ENTITY Icirc "Î"> <!-- latin capital letter I with circumflex,
|
||||
U+00CE ISOlat1 -->
|
||||
<!ENTITY Iuml "Ï"> <!-- latin capital letter I with diaeresis,
|
||||
U+00CF ISOlat1 -->
|
||||
<!ENTITY ETH "Ð"> <!-- latin capital letter ETH, U+00D0 ISOlat1 -->
|
||||
<!ENTITY Ntilde "Ñ"> <!-- latin capital letter N with tilde,
|
||||
U+00D1 ISOlat1 -->
|
||||
<!ENTITY Ograve "Ò"> <!-- latin capital letter O with grave,
|
||||
U+00D2 ISOlat1 -->
|
||||
<!ENTITY Oacute "Ó"> <!-- latin capital letter O with acute,
|
||||
U+00D3 ISOlat1 -->
|
||||
<!ENTITY Ocirc "Ô"> <!-- latin capital letter O with circumflex,
|
||||
U+00D4 ISOlat1 -->
|
||||
<!ENTITY Otilde "Õ"> <!-- latin capital letter O with tilde,
|
||||
U+00D5 ISOlat1 -->
|
||||
<!ENTITY Ouml "Ö"> <!-- latin capital letter O with diaeresis,
|
||||
U+00D6 ISOlat1 -->
|
||||
<!ENTITY times "×"> <!-- multiplication sign, U+00D7 ISOnum -->
|
||||
<!ENTITY Oslash "Ø"> <!-- latin capital letter O with stroke
|
||||
= latin capital letter O slash,
|
||||
U+00D8 ISOlat1 -->
|
||||
<!ENTITY Ugrave "Ù"> <!-- latin capital letter U with grave,
|
||||
U+00D9 ISOlat1 -->
|
||||
<!ENTITY Uacute "Ú"> <!-- latin capital letter U with acute,
|
||||
U+00DA ISOlat1 -->
|
||||
<!ENTITY Ucirc "Û"> <!-- latin capital letter U with circumflex,
|
||||
U+00DB ISOlat1 -->
|
||||
<!ENTITY Uuml "Ü"> <!-- latin capital letter U with diaeresis,
|
||||
U+00DC ISOlat1 -->
|
||||
<!ENTITY Yacute "Ý"> <!-- latin capital letter Y with acute,
|
||||
U+00DD ISOlat1 -->
|
||||
<!ENTITY THORN "Þ"> <!-- latin capital letter THORN,
|
||||
U+00DE ISOlat1 -->
|
||||
<!ENTITY szlig "ß"> <!-- latin small letter sharp s = ess-zed,
|
||||
U+00DF ISOlat1 -->
|
||||
<!ENTITY agrave "à"> <!-- latin small letter a with grave
|
||||
= latin small letter a grave,
|
||||
U+00E0 ISOlat1 -->
|
||||
<!ENTITY aacute "á"> <!-- latin small letter a with acute,
|
||||
U+00E1 ISOlat1 -->
|
||||
<!ENTITY acirc "â"> <!-- latin small letter a with circumflex,
|
||||
U+00E2 ISOlat1 -->
|
||||
<!ENTITY atilde "ã"> <!-- latin small letter a with tilde,
|
||||
U+00E3 ISOlat1 -->
|
||||
<!ENTITY auml "ä"> <!-- latin small letter a with diaeresis,
|
||||
U+00E4 ISOlat1 -->
|
||||
<!ENTITY aring "å"> <!-- latin small letter a with ring above
|
||||
= latin small letter a ring,
|
||||
U+00E5 ISOlat1 -->
|
||||
<!ENTITY aelig "æ"> <!-- latin small letter ae
|
||||
= latin small ligature ae, U+00E6 ISOlat1 -->
|
||||
<!ENTITY ccedil "ç"> <!-- latin small letter c with cedilla,
|
||||
U+00E7 ISOlat1 -->
|
||||
<!ENTITY egrave "è"> <!-- latin small letter e with grave,
|
||||
U+00E8 ISOlat1 -->
|
||||
<!ENTITY eacute "é"> <!-- latin small letter e with acute,
|
||||
U+00E9 ISOlat1 -->
|
||||
<!ENTITY ecirc "ê"> <!-- latin small letter e with circumflex,
|
||||
U+00EA ISOlat1 -->
|
||||
<!ENTITY euml "ë"> <!-- latin small letter e with diaeresis,
|
||||
U+00EB ISOlat1 -->
|
||||
<!ENTITY igrave "ì"> <!-- latin small letter i with grave,
|
||||
U+00EC ISOlat1 -->
|
||||
<!ENTITY iacute "í"> <!-- latin small letter i with acute,
|
||||
U+00ED ISOlat1 -->
|
||||
<!ENTITY icirc "î"> <!-- latin small letter i with circumflex,
|
||||
U+00EE ISOlat1 -->
|
||||
<!ENTITY iuml "ï"> <!-- latin small letter i with diaeresis,
|
||||
U+00EF ISOlat1 -->
|
||||
<!ENTITY eth "ð"> <!-- latin small letter eth, U+00F0 ISOlat1 -->
|
||||
<!ENTITY ntilde "ñ"> <!-- latin small letter n with tilde,
|
||||
U+00F1 ISOlat1 -->
|
||||
<!ENTITY ograve "ò"> <!-- latin small letter o with grave,
|
||||
U+00F2 ISOlat1 -->
|
||||
<!ENTITY oacute "ó"> <!-- latin small letter o with acute,
|
||||
U+00F3 ISOlat1 -->
|
||||
<!ENTITY ocirc "ô"> <!-- latin small letter o with circumflex,
|
||||
U+00F4 ISOlat1 -->
|
||||
<!ENTITY otilde "õ"> <!-- latin small letter o with tilde,
|
||||
U+00F5 ISOlat1 -->
|
||||
<!ENTITY ouml "ö"> <!-- latin small letter o with diaeresis,
|
||||
U+00F6 ISOlat1 -->
|
||||
<!ENTITY divide "÷"> <!-- division sign, U+00F7 ISOnum -->
|
||||
<!ENTITY oslash "ø"> <!-- latin small letter o with stroke,
|
||||
= latin small letter o slash,
|
||||
U+00F8 ISOlat1 -->
|
||||
<!ENTITY ugrave "ù"> <!-- latin small letter u with grave,
|
||||
U+00F9 ISOlat1 -->
|
||||
<!ENTITY uacute "ú"> <!-- latin small letter u with acute,
|
||||
U+00FA ISOlat1 -->
|
||||
<!ENTITY ucirc "û"> <!-- latin small letter u with circumflex,
|
||||
U+00FB ISOlat1 -->
|
||||
<!ENTITY uuml "ü"> <!-- latin small letter u with diaeresis,
|
||||
U+00FC ISOlat1 -->
|
||||
<!ENTITY yacute "ý"> <!-- latin small letter y with acute,
|
||||
U+00FD ISOlat1 -->
|
||||
<!ENTITY thorn "þ"> <!-- latin small letter thorn,
|
||||
U+00FE ISOlat1 -->
|
||||
<!ENTITY yuml "ÿ"> <!-- latin small letter y with diaeresis,
|
||||
U+00FF ISOlat1 -->
|
80
htmlpurifier-4.10.0/docs/entities/xhtml-special.ent
Executable file
@ -0,0 +1,80 @@
|
||||
<!-- Special characters for XHTML -->
|
||||
|
||||
<!-- Character entity set. Typical invocation:
|
||||
<!ENTITY % HTMLspecial PUBLIC
|
||||
"-//W3C//ENTITIES Special for XHTML//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
|
||||
%HTMLspecial;
|
||||
-->
|
||||
|
||||
<!-- Portions (C) International Organization for Standardization 1986:
|
||||
Permission to copy in any form is granted for use with
|
||||
conforming SGML systems and applications as defined in
|
||||
ISO 8879, provided this notice is included in all copies.
|
||||
-->
|
||||
|
||||
<!-- Relevant ISO entity set is given unless names are newly introduced.
|
||||
New names (i.e., not in ISO 8879 list) do not clash with any
|
||||
existing ISO 8879 entity names. ISO 10646 character numbers
|
||||
are given for each character, in hex. values are decimal
|
||||
conversions of the ISO 10646 values and refer to the document
|
||||
character set. Names are Unicode names.
|
||||
-->
|
||||
|
||||
<!-- C0 Controls and Basic Latin -->
|
||||
<!ENTITY quot "&#34;"> <!-- quotation mark, U+0022 ISOnum -->
|
||||
<!ENTITY amp "&#38;#38;"> <!-- ampersand, U+0026 ISOnum -->
|
||||
<!ENTITY lt "&#38;#60;"> <!-- less-than sign, U+003C ISOnum -->
|
||||
<!ENTITY gt ">"> <!-- greater-than sign, U+003E ISOnum -->
|
||||
<!ENTITY apos "'"> <!-- apostrophe = APL quote, U+0027 ISOnum -->
|
||||
|
||||
<!-- Latin Extended-A -->
|
||||
<!ENTITY OElig "Œ"> <!-- latin capital ligature OE,
|
||||
U+0152 ISOlat2 -->
|
||||
<!ENTITY oelig "œ"> <!-- latin small ligature oe, U+0153 ISOlat2 -->
|
||||
<!-- ligature is a misnomer, this is a separate character in some languages -->
|
||||
<!ENTITY Scaron "Š"> <!-- latin capital letter S with caron,
|
||||
U+0160 ISOlat2 -->
|
||||
<!ENTITY scaron "š"> <!-- latin small letter s with caron,
|
||||
U+0161 ISOlat2 -->
|
||||
<!ENTITY Yuml "Ÿ"> <!-- latin capital letter Y with diaeresis,
|
||||
U+0178 ISOlat2 -->
|
||||
|
||||
<!-- Spacing Modifier Letters -->
|
||||
<!ENTITY circ "ˆ"> <!-- modifier letter circumflex accent,
|
||||
U+02C6 ISOpub -->
|
||||
<!ENTITY tilde "˜"> <!-- small tilde, U+02DC ISOdia -->
|
||||
|
||||
<!-- General Punctuation -->
|
||||
<!ENTITY ensp " "> <!-- en space, U+2002 ISOpub -->
|
||||
<!ENTITY emsp " "> <!-- em space, U+2003 ISOpub -->
|
||||
<!ENTITY thinsp " "> <!-- thin space, U+2009 ISOpub -->
|
||||
<!ENTITY zwnj "‌"> <!-- zero width non-joiner,
|
||||
U+200C NEW RFC 2070 -->
|
||||
<!ENTITY zwj "‍"> <!-- zero width joiner, U+200D NEW RFC 2070 -->
|
||||
<!ENTITY lrm "‎"> <!-- left-to-right mark, U+200E NEW RFC 2070 -->
|
||||
<!ENTITY rlm "‏"> <!-- right-to-left mark, U+200F NEW RFC 2070 -->
|
||||
<!ENTITY ndash "–"> <!-- en dash, U+2013 ISOpub -->
|
||||
<!ENTITY mdash "—"> <!-- em dash, U+2014 ISOpub -->
|
||||
<!ENTITY lsquo "‘"> <!-- left single quotation mark,
|
||||
U+2018 ISOnum -->
|
||||
<!ENTITY rsquo "’"> <!-- right single quotation mark,
|
||||
U+2019 ISOnum -->
|
||||
<!ENTITY sbquo "‚"> <!-- single low-9 quotation mark, U+201A NEW -->
|
||||
<!ENTITY ldquo "“"> <!-- left double quotation mark,
|
||||
U+201C ISOnum -->
|
||||
<!ENTITY rdquo "”"> <!-- right double quotation mark,
|
||||
U+201D ISOnum -->
|
||||
<!ENTITY bdquo "„"> <!-- double low-9 quotation mark, U+201E NEW -->
|
||||
<!ENTITY dagger "†"> <!-- dagger, U+2020 ISOpub -->
|
||||
<!ENTITY Dagger "‡"> <!-- double dagger, U+2021 ISOpub -->
|
||||
<!ENTITY permil "‰"> <!-- per mille sign, U+2030 ISOtech -->
|
||||
<!ENTITY lsaquo "‹"> <!-- single left-pointing angle quotation mark,
|
||||
U+2039 ISO proposed -->
|
||||
<!-- lsaquo is proposed but not yet ISO standardized -->
|
||||
<!ENTITY rsaquo "›"> <!-- single right-pointing angle quotation mark,
|
||||
U+203A ISO proposed -->
|
||||
<!-- rsaquo is proposed but not yet ISO standardized -->
|
||||
|
||||
<!-- Currency Symbols -->
|
||||
<!ENTITY euro "€"> <!-- euro sign, U+20AC NEW -->
|
237
htmlpurifier-4.10.0/docs/entities/xhtml-symbol.ent
Executable file
@ -0,0 +1,237 @@
|
||||
<!-- Mathematical, Greek and Symbolic characters for XHTML -->
|
||||
|
||||
<!-- Character entity set. Typical invocation:
|
||||
<!ENTITY % HTMLsymbol PUBLIC
|
||||
"-//W3C//ENTITIES Symbols for XHTML//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
|
||||
%HTMLsymbol;
|
||||
-->
|
||||
|
||||
<!-- Portions (C) International Organization for Standardization 1986:
|
||||
Permission to copy in any form is granted for use with
|
||||
conforming SGML systems and applications as defined in
|
||||
ISO 8879, provided this notice is included in all copies.
|
||||
-->
|
||||
|
||||
<!-- Relevant ISO entity set is given unless names are newly introduced.
|
||||
New names (i.e., not in ISO 8879 list) do not clash with any
|
||||
existing ISO 8879 entity names. ISO 10646 character numbers
|
||||
are given for each character, in hex. values are decimal
|
||||
conversions of the ISO 10646 values and refer to the document
|
||||
character set. Names are Unicode names.
|
||||
-->
|
||||
|
||||
<!-- Latin Extended-B -->
|
||||
<!ENTITY fnof "ƒ"> <!-- latin small letter f with hook = function
|
||||
= florin, U+0192 ISOtech -->
|
||||
|
||||
<!-- Greek -->
|
||||
<!ENTITY Alpha "Α"> <!-- greek capital letter alpha, U+0391 -->
|
||||
<!ENTITY Beta "Β"> <!-- greek capital letter beta, U+0392 -->
|
||||
<!ENTITY Gamma "Γ"> <!-- greek capital letter gamma,
|
||||
U+0393 ISOgrk3 -->
|
||||
<!ENTITY Delta "Δ"> <!-- greek capital letter delta,
|
||||
U+0394 ISOgrk3 -->
|
||||
<!ENTITY Epsilon "Ε"> <!-- greek capital letter epsilon, U+0395 -->
|
||||
<!ENTITY Zeta "Ζ"> <!-- greek capital letter zeta, U+0396 -->
|
||||
<!ENTITY Eta "Η"> <!-- greek capital letter eta, U+0397 -->
|
||||
<!ENTITY Theta "Θ"> <!-- greek capital letter theta,
|
||||
U+0398 ISOgrk3 -->
|
||||
<!ENTITY Iota "Ι"> <!-- greek capital letter iota, U+0399 -->
|
||||
<!ENTITY Kappa "Κ"> <!-- greek capital letter kappa, U+039A -->
|
||||
<!ENTITY Lambda "Λ"> <!-- greek capital letter lamda,
|
||||
U+039B ISOgrk3 -->
|
||||
<!ENTITY Mu "Μ"> <!-- greek capital letter mu, U+039C -->
|
||||
<!ENTITY Nu "Ν"> <!-- greek capital letter nu, U+039D -->
|
||||
<!ENTITY Xi "Ξ"> <!-- greek capital letter xi, U+039E ISOgrk3 -->
|
||||
<!ENTITY Omicron "Ο"> <!-- greek capital letter omicron, U+039F -->
|
||||
<!ENTITY Pi "Π"> <!-- greek capital letter pi, U+03A0 ISOgrk3 -->
|
||||
<!ENTITY Rho "Ρ"> <!-- greek capital letter rho, U+03A1 -->
|
||||
<!-- there is no Sigmaf, and no U+03A2 character either -->
|
||||
<!ENTITY Sigma "Σ"> <!-- greek capital letter sigma,
|
||||
U+03A3 ISOgrk3 -->
|
||||
<!ENTITY Tau "Τ"> <!-- greek capital letter tau, U+03A4 -->
|
||||
<!ENTITY Upsilon "Υ"> <!-- greek capital letter upsilon,
|
||||
U+03A5 ISOgrk3 -->
|
||||
<!ENTITY Phi "Φ"> <!-- greek capital letter phi,
|
||||
U+03A6 ISOgrk3 -->
|
||||
<!ENTITY Chi "Χ"> <!-- greek capital letter chi, U+03A7 -->
|
||||
<!ENTITY Psi "Ψ"> <!-- greek capital letter psi,
|
||||
U+03A8 ISOgrk3 -->
|
||||
<!ENTITY Omega "Ω"> <!-- greek capital letter omega,
|
||||
U+03A9 ISOgrk3 -->
|
||||
|
||||
<!ENTITY alpha "α"> <!-- greek small letter alpha,
|
||||
U+03B1 ISOgrk3 -->
|
||||
<!ENTITY beta "β"> <!-- greek small letter beta, U+03B2 ISOgrk3 -->
|
||||
<!ENTITY gamma "γ"> <!-- greek small letter gamma,
|
||||
U+03B3 ISOgrk3 -->
|
||||
<!ENTITY delta "δ"> <!-- greek small letter delta,
|
||||
U+03B4 ISOgrk3 -->
|
||||
<!ENTITY epsilon "ε"> <!-- greek small letter epsilon,
|
||||
U+03B5 ISOgrk3 -->
|
||||
<!ENTITY zeta "ζ"> <!-- greek small letter zeta, U+03B6 ISOgrk3 -->
|
||||
<!ENTITY eta "η"> <!-- greek small letter eta, U+03B7 ISOgrk3 -->
|
||||
<!ENTITY theta "θ"> <!-- greek small letter theta,
|
||||
U+03B8 ISOgrk3 -->
|
||||
<!ENTITY iota "ι"> <!-- greek small letter iota, U+03B9 ISOgrk3 -->
|
||||
<!ENTITY kappa "κ"> <!-- greek small letter kappa,
|
||||
U+03BA ISOgrk3 -->
|
||||
<!ENTITY lambda "λ"> <!-- greek small letter lamda,
|
||||
U+03BB ISOgrk3 -->
|
||||
<!ENTITY mu "μ"> <!-- greek small letter mu, U+03BC ISOgrk3 -->
|
||||
<!ENTITY nu "ν"> <!-- greek small letter nu, U+03BD ISOgrk3 -->
|
||||
<!ENTITY xi "ξ"> <!-- greek small letter xi, U+03BE ISOgrk3 -->
|
||||
<!ENTITY omicron "ο"> <!-- greek small letter omicron, U+03BF NEW -->
|
||||
<!ENTITY pi "π"> <!-- greek small letter pi, U+03C0 ISOgrk3 -->
|
||||
<!ENTITY rho "ρ"> <!-- greek small letter rho, U+03C1 ISOgrk3 -->
|
||||
<!ENTITY sigmaf "ς"> <!-- greek small letter final sigma,
|
||||
U+03C2 ISOgrk3 -->
|
||||
<!ENTITY sigma "σ"> <!-- greek small letter sigma,
|
||||
U+03C3 ISOgrk3 -->
|
||||
<!ENTITY tau "τ"> <!-- greek small letter tau, U+03C4 ISOgrk3 -->
|
||||
<!ENTITY upsilon "υ"> <!-- greek small letter upsilon,
|
||||
U+03C5 ISOgrk3 -->
|
||||
<!ENTITY phi "φ"> <!-- greek small letter phi, U+03C6 ISOgrk3 -->
|
||||
<!ENTITY chi "χ"> <!-- greek small letter chi, U+03C7 ISOgrk3 -->
|
||||
<!ENTITY psi "ψ"> <!-- greek small letter psi, U+03C8 ISOgrk3 -->
|
||||
<!ENTITY omega "ω"> <!-- greek small letter omega,
|
||||
U+03C9 ISOgrk3 -->
|
||||
<!ENTITY thetasym "ϑ"> <!-- greek theta symbol,
|
||||
U+03D1 NEW -->
|
||||
<!ENTITY upsih "ϒ"> <!-- greek upsilon with hook symbol,
|
||||
U+03D2 NEW -->
|
||||
<!ENTITY piv "ϖ"> <!-- greek pi symbol, U+03D6 ISOgrk3 -->
|
||||
|
||||
<!-- General Punctuation -->
|
||||
<!ENTITY bull "•"> <!-- bullet = black small circle,
|
||||
U+2022 ISOpub -->
|
||||
<!-- bullet is NOT the same as bullet operator, U+2219 -->
|
||||
<!ENTITY hellip "…"> <!-- horizontal ellipsis = three dot leader,
|
||||
U+2026 ISOpub -->
|
||||
<!ENTITY prime "′"> <!-- prime = minutes = feet, U+2032 ISOtech -->
|
||||
<!ENTITY Prime "″"> <!-- double prime = seconds = inches,
|
||||
U+2033 ISOtech -->
|
||||
<!ENTITY oline "‾"> <!-- overline = spacing overscore,
|
||||
U+203E NEW -->
|
||||
<!ENTITY frasl "⁄"> <!-- fraction slash, U+2044 NEW -->
|
||||
|
||||
<!-- Letterlike Symbols -->
|
||||
<!ENTITY weierp "℘"> <!-- script capital P = power set
|
||||
= Weierstrass p, U+2118 ISOamso -->
|
||||
<!ENTITY image "ℑ"> <!-- black-letter capital I = imaginary part,
|
||||
U+2111 ISOamso -->
|
||||
<!ENTITY real "ℜ"> <!-- black-letter capital R = real part symbol,
|
||||
U+211C ISOamso -->
|
||||
<!ENTITY trade "™"> <!-- trade mark sign, U+2122 ISOnum -->
|
||||
<!ENTITY alefsym "ℵ"> <!-- alef symbol = first transfinite cardinal,
|
||||
U+2135 NEW -->
|
||||
<!-- alef symbol is NOT the same as hebrew letter alef,
|
||||
U+05D0 although the same glyph could be used to depict both characters -->
|
||||
|
||||
<!-- Arrows -->
|
||||
<!ENTITY larr "←"> <!-- leftwards arrow, U+2190 ISOnum -->
|
||||
<!ENTITY uarr "↑"> <!-- upwards arrow, U+2191 ISOnum-->
|
||||
<!ENTITY rarr "→"> <!-- rightwards arrow, U+2192 ISOnum -->
|
||||
<!ENTITY darr "↓"> <!-- downwards arrow, U+2193 ISOnum -->
|
||||
<!ENTITY harr "↔"> <!-- left right arrow, U+2194 ISOamsa -->
|
||||
<!ENTITY crarr "↵"> <!-- downwards arrow with corner leftwards
|
||||
= carriage return, U+21B5 NEW -->
|
||||
<!ENTITY lArr "⇐"> <!-- leftwards double arrow, U+21D0 ISOtech -->
|
||||
<!-- Unicode does not say that lArr is the same as the 'is implied by' arrow
|
||||
but also does not have any other character for that function. So lArr can
|
||||
be used for 'is implied by' as ISOtech suggests -->
|
||||
<!ENTITY uArr "⇑"> <!-- upwards double arrow, U+21D1 ISOamsa -->
|
||||
<!ENTITY rArr "⇒"> <!-- rightwards double arrow,
|
||||
U+21D2 ISOtech -->
|
||||
<!-- Unicode does not say this is the 'implies' character but does not have
|
||||
another character with this function so rArr can be used for 'implies'
|
||||
as ISOtech suggests -->
|
||||
<!ENTITY dArr "⇓"> <!-- downwards double arrow, U+21D3 ISOamsa -->
|
||||
<!ENTITY hArr "⇔"> <!-- left right double arrow,
|
||||
U+21D4 ISOamsa -->
|
||||
|
||||
<!-- Mathematical Operators -->
|
||||
<!ENTITY forall "∀"> <!-- for all, U+2200 ISOtech -->
|
||||
<!ENTITY part "∂"> <!-- partial differential, U+2202 ISOtech -->
|
||||
<!ENTITY exist "∃"> <!-- there exists, U+2203 ISOtech -->
|
||||
<!ENTITY empty "∅"> <!-- empty set = null set, U+2205 ISOamso -->
|
||||
<!ENTITY nabla "∇"> <!-- nabla = backward difference,
|
||||
U+2207 ISOtech -->
|
||||
<!ENTITY isin "∈"> <!-- element of, U+2208 ISOtech -->
|
||||
<!ENTITY notin "∉"> <!-- not an element of, U+2209 ISOtech -->
|
||||
<!ENTITY ni "∋"> <!-- contains as member, U+220B ISOtech -->
|
||||
<!ENTITY prod "∏"> <!-- n-ary product = product sign,
|
||||
U+220F ISOamsb -->
|
||||
<!-- prod is NOT the same character as U+03A0 'greek capital letter pi' though
|
||||
the same glyph might be used for both -->
|
||||
<!ENTITY sum "∑"> <!-- n-ary summation, U+2211 ISOamsb -->
|
||||
<!-- sum is NOT the same character as U+03A3 'greek capital letter sigma'
|
||||
though the same glyph might be used for both -->
|
||||
<!ENTITY minus "−"> <!-- minus sign, U+2212 ISOtech -->
|
||||
<!ENTITY lowast "∗"> <!-- asterisk operator, U+2217 ISOtech -->
|
||||
<!ENTITY radic "√"> <!-- square root = radical sign,
|
||||
U+221A ISOtech -->
|
||||
<!ENTITY prop "∝"> <!-- proportional to, U+221D ISOtech -->
|
||||
<!ENTITY infin "∞"> <!-- infinity, U+221E ISOtech -->
|
||||
<!ENTITY ang "∠"> <!-- angle, U+2220 ISOamso -->
|
||||
<!ENTITY and "∧"> <!-- logical and = wedge, U+2227 ISOtech -->
|
||||
<!ENTITY or "∨"> <!-- logical or = vee, U+2228 ISOtech -->
|
||||
<!ENTITY cap "∩"> <!-- intersection = cap, U+2229 ISOtech -->
|
||||
<!ENTITY cup "∪"> <!-- union = cup, U+222A ISOtech -->
|
||||
<!ENTITY int "∫"> <!-- integral, U+222B ISOtech -->
|
||||
<!ENTITY there4 "∴"> <!-- therefore, U+2234 ISOtech -->
|
||||
<!ENTITY sim "∼"> <!-- tilde operator = varies with = similar to,
|
||||
U+223C ISOtech -->
|
||||
<!-- tilde operator is NOT the same character as the tilde, U+007E,
|
||||
although the same glyph might be used to represent both -->
|
||||
<!ENTITY cong "≅"> <!-- approximately equal to, U+2245 ISOtech -->
|
||||
<!ENTITY asymp "≈"> <!-- almost equal to = asymptotic to,
|
||||
U+2248 ISOamsr -->
|
||||
<!ENTITY ne "≠"> <!-- not equal to, U+2260 ISOtech -->
|
||||
<!ENTITY equiv "≡"> <!-- identical to, U+2261 ISOtech -->
|
||||
<!ENTITY le "≤"> <!-- less-than or equal to, U+2264 ISOtech -->
|
||||
<!ENTITY ge "≥"> <!-- greater-than or equal to,
|
||||
U+2265 ISOtech -->
|
||||
<!ENTITY sub "⊂"> <!-- subset of, U+2282 ISOtech -->
|
||||
<!ENTITY sup "⊃"> <!-- superset of, U+2283 ISOtech -->
|
||||
<!ENTITY nsub "⊄"> <!-- not a subset of, U+2284 ISOamsn -->
|
||||
<!ENTITY sube "⊆"> <!-- subset of or equal to, U+2286 ISOtech -->
|
||||
<!ENTITY supe "⊇"> <!-- superset of or equal to,
|
||||
U+2287 ISOtech -->
|
||||
<!ENTITY oplus "⊕"> <!-- circled plus = direct sum,
|
||||
U+2295 ISOamsb -->
|
||||
<!ENTITY otimes "⊗"> <!-- circled times = vector product,
|
||||
U+2297 ISOamsb -->
|
||||
<!ENTITY perp "⊥"> <!-- up tack = orthogonal to = perpendicular,
|
||||
U+22A5 ISOtech -->
|
||||
<!ENTITY sdot "⋅"> <!-- dot operator, U+22C5 ISOamsb -->
|
||||
<!-- dot operator is NOT the same character as U+00B7 middle dot -->
|
||||
|
||||
<!-- Miscellaneous Technical -->
|
||||
<!ENTITY lceil "⌈"> <!-- left ceiling = APL upstile,
|
||||
U+2308 ISOamsc -->
|
||||
<!ENTITY rceil "⌉"> <!-- right ceiling, U+2309 ISOamsc -->
|
||||
<!ENTITY lfloor "⌊"> <!-- left floor = APL downstile,
|
||||
U+230A ISOamsc -->
|
||||
<!ENTITY rfloor "⌋"> <!-- right floor, U+230B ISOamsc -->
|
||||
<!ENTITY lang "〈"> <!-- left-pointing angle bracket = bra,
|
||||
U+2329 ISOtech -->
|
||||
<!-- lang is NOT the same character as U+003C 'less than sign'
|
||||
or U+2039 'single left-pointing angle quotation mark' -->
|
||||
<!ENTITY rang "〉"> <!-- right-pointing angle bracket = ket,
|
||||
U+232A ISOtech -->
|
||||
<!-- rang is NOT the same character as U+003E 'greater than sign'
|
||||
or U+203A 'single right-pointing angle quotation mark' -->
|
||||
|
||||
<!-- Geometric Shapes -->
|
||||
<!ENTITY loz "◊"> <!-- lozenge, U+25CA ISOpub -->
|
||||
|
||||
<!-- Miscellaneous Symbols -->
|
||||
<!ENTITY spades "♠"> <!-- black spade suit, U+2660 ISOpub -->
|
||||
<!-- black here seems to mean filled as opposed to hollow -->
|
||||
<!ENTITY clubs "♣"> <!-- black club suit = shamrock,
|
||||
U+2663 ISOpub -->
|
||||
<!ENTITY hearts "♥"> <!-- black heart suit = valentine,
|
||||
U+2665 ISOpub -->
|
||||
<!ENTITY diams "♦"> <!-- black diamond suit, U+2666 ISOpub -->
|
23
htmlpurifier-4.10.0/docs/examples/basic.php
Executable file
@ -0,0 +1,23 @@
|
||||
<?php
|
||||
|
||||
// This file demonstrates basic usage of HTMLPurifier.
|
||||
|
||||
// replace this with the path to the HTML Purifier library
|
||||
require_once '../../library/HTMLPurifier.auto.php';
|
||||
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
|
||||
// configuration goes here:
|
||||
$config->set('Core.Encoding', 'UTF-8'); // replace with your encoding
|
||||
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); // replace with your doctype
|
||||
|
||||
$purifier = new HTMLPurifier($config);
|
||||
|
||||
// untrusted input HTML
|
||||
$html = '<b>Simple and short';
|
||||
|
||||
$pure_html = $purifier->purify($html);
|
||||
|
||||
echo '<pre>' . htmlspecialchars($pure_html) . '</pre>';
|
||||
|
||||
// vim: et sw=4 sts=4
|
9
htmlpurifier-4.10.0/docs/fixquotes.htc
Executable file
@ -0,0 +1,9 @@
|
||||
<public:attach event="oncontentready" onevent="init();" />
|
||||
<script>
|
||||
function init() {
|
||||
element.innerHTML = '“'+element.innerHTML+'”';
|
||||
}
|
||||
</script>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
188
htmlpurifier-4.10.0/docs/index.html
Executable file
@ -0,0 +1,188 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Index to all HTML Purifier documentation." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Documentation - HTML Purifier</title>
|
||||
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<h1>Documentation</h1>
|
||||
|
||||
<p><strong><a href="http://htmlpurifier.org/">HTML Purifier</a></strong> has documentation for all types of people.
|
||||
Here is an index of all of them.</p>
|
||||
|
||||
<h2>End-user</h2>
|
||||
<p>End-user documentation that contains articles, tutorials and useful
|
||||
information for casual developers using HTML Purifier.</p>
|
||||
|
||||
<dl>
|
||||
|
||||
<dt><a href="enduser-id.html">IDs</a></dt>
|
||||
<dd>Explains various methods for allowing IDs in documents safely.</dd>
|
||||
|
||||
<dt><a href="enduser-youtube.html">Embedding YouTube videos</a></dt>
|
||||
<dd>Explains how to safely allow the embedding of Flash from trusted sites.</dd>
|
||||
|
||||
<dt><a href="enduser-slow.html">Speeding up HTML Purifier</a></dt>
|
||||
<dd>Explains how to speed up HTML Purifier through caching or inbound filtering.</dd>
|
||||
|
||||
<dt><a href="enduser-utf8.html">UTF-8: The Secret of Character Encoding</a></dt>
|
||||
<dd>Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch.</dd>
|
||||
|
||||
<dt><a href="enduser-tidy.html">Tidy</a></dt>
|
||||
<dd>Tutorial for tweaking HTML Purifier's Tidy-like behavior.</dd>
|
||||
|
||||
<dt><a href="enduser-customize.html">Customize</a></dt>
|
||||
<dd>Tutorial for customizing HTML Purifier's tag and attribute sets.</dd>
|
||||
|
||||
<dt><a href="enduser-uri-filter.html">URI Filters</a></dt>
|
||||
<dd>Tutorial for creating custom URI filters.</dd>
|
||||
|
||||
</dl>
|
||||
|
||||
<h2>Development</h2>
|
||||
<p>Developer documentation detailing code issues, roadmaps and project
|
||||
conventions.</p>
|
||||
|
||||
<dl>
|
||||
|
||||
<dt><a href="dev-progress.html">Implementation Progress</a></dt>
|
||||
<dd>Tables detailing HTML element and CSS property implementation coverage.</dd>
|
||||
|
||||
<dt><a href="dev-naming.html">Naming Conventions</a></dt>
|
||||
<dd>Defines class naming conventions.</dd>
|
||||
|
||||
<dt><a href="dev-optimization.html">Optimization</a></dt>
|
||||
<dd>Discusses possible methods of optimizing HTML Purifier.</dd>
|
||||
|
||||
<dt><a href="dev-flush.html">Flushing the Purifier</a></dt>
|
||||
<dd>Discusses when to flush HTML Purifier's various caches.</dd>
|
||||
|
||||
<dt><a href="dev-advanced-api.html">Advanced API</a></dt>
|
||||
<dd>Specification for HTML Purifier's advanced API for defining
|
||||
custom filtering behavior.</dd>
|
||||
|
||||
<dt><a href="dev-config-schema.html">Config Schema</a></dt>
|
||||
<dd>Describes config schema framework in HTML Purifier.</dd>
|
||||
|
||||
</dl>
|
||||
|
||||
<h2>Proposals</h2>
|
||||
<p>Proposed features, as well as the associated rambling to get a clear
|
||||
objective in place before attempted implementation.</p>
|
||||
|
||||
<dl>
|
||||
<dt><a href="proposal-colors.html">Colors</a></dt>
|
||||
<dd>Proposal to allow for color constraints.</dd>
|
||||
</dl>
|
||||
|
||||
<h2>Reference</h2>
|
||||
<p>Miscellaneous essays, research pieces and other reference type material
|
||||
that may not directly discuss HTML Purifier.</p>
|
||||
|
||||
<dl>
|
||||
<dt><a href="ref-devnetwork.html">DevNetwork Credits</a></dt>
|
||||
<dd>Credits and links to DevNetwork forum topics.</dd>
|
||||
</dl>
|
||||
|
||||
<h2>Internal memos</h2>
|
||||
|
||||
<p>Plaintext documents that are more for use by active developers of
|
||||
the code. They may be upgraded to HTML files or stay as TXT scratchpads.</p>
|
||||
|
||||
<table class="table">
|
||||
|
||||
<thead><tr>
|
||||
<th style="width:10%">Type</th>
|
||||
<th style="width:20%">Name</th>
|
||||
<th>Description</th>
|
||||
</tr></thead>
|
||||
|
||||
<tbody>
|
||||
|
||||
<tr>
|
||||
<td>End-user</td>
|
||||
<td><a href="enduser-overview.txt">Overview</a></td>
|
||||
<td>High level overview of the general control flow (mostly obsolete).</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>End-user</td>
|
||||
<td><a href="enduser-security.txt">Security</a></td>
|
||||
<td>Common security issues that may still arise (half-baked).</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Development</td>
|
||||
<td><a href="dev-config-bcbreaks.txt">Config BC Breaks</a></td>
|
||||
<td>Backwards-incompatible changes in HTML Purifier 4.0.0</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Development</td>
|
||||
<td><a href="dev-code-quality.txt">Code Quality Issues</a></td>
|
||||
<td>Enumerates code quality issues and places that need to be refactored.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Proposal</td>
|
||||
<td><a href="proposal-filter-levels.txt">Filter levels</a></td>
|
||||
<td>Outlines details of projected configurable level of filtering.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Proposal</td>
|
||||
<td><a href="proposal-language.txt">Language</a></td>
|
||||
<td>Specification of I18N for error messages derived from MediaWiki (half-baked).</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Proposal</td>
|
||||
<td><a href="proposal-new-directives.txt">New directives</a></td>
|
||||
<td>Assorted configuration options that could be implemented.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Proposal</td>
|
||||
<td><a href="proposal-css-extraction.txt">CSS extraction</a></td>
|
||||
<td>Taking the inline CSS out of documents and into <code>style</code>.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Reference</td>
|
||||
<td><a href="ref-content-models.txt">Handling Content Model Changes</a></td>
|
||||
<td>Discusses how to tidy up content model changes using custom ChildDef classes.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Reference</td>
|
||||
<td><a href="ref-proprietary-tags.txt">Proprietary tags</a></td>
|
||||
<td>List of vendor-specific tags we may want to transform to W3C compliant markup.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Reference</td>
|
||||
<td><a href="ref-html-modularization.txt">Modularization of HTMLDefinition</a></td>
|
||||
<td>Provides a high-level overview of the concepts behind HTMLModules.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>Reference</td>
|
||||
<td><a href="ref-whatwg.txt">WHATWG</a></td>
|
||||
<td>How WHATWG plays into what we need to do.</td>
|
||||
</tr>
|
||||
|
||||
</tbody>
|
||||
|
||||
</table>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
49
htmlpurifier-4.10.0/docs/proposal-colors.html
Executable file
@ -0,0 +1,49 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Proposal to allow for color constraints in HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>Proposal: Colors - HTML Purifier</title>
|
||||
|
||||
</head><body>
|
||||
|
||||
<h1 class="subtitled">Colors</h1>
|
||||
<div class="subtitle">Hammering some sense into those color-blind newbies</div>
|
||||
|
||||
<div id="filing">Filed under Proposals</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Your website probably has a color-scheme.
|
||||
<span style="color:#090; background:#FFF;">Green on white</span>,
|
||||
<span style="color:#A0F; background:#FF0;">purple on yellow</span>,
|
||||
whatever. When you give users the ability to style their content, you may
|
||||
want them to keep in line with your styling. If your website is all
|
||||
about light colors, you don't want a user to come in and vandalize your
|
||||
page with a deep maroon.</p>
|
||||
|
||||
<p>This is an extremely silly feature proposal, but I'm writing it down anyway.</p>
|
||||
|
||||
<p>What if the user could constrain the colors specified in inline styles? You
|
||||
are only allowed to use these shades of dark green for text and these shades
|
||||
of light yellow for the background. At the very least, you could ensure
|
||||
that we did not have pale yellow on white text.</p>
|
||||
|
||||
<h2>Implementation issues</h2>
|
||||
|
||||
<ol>
|
||||
<li>Requires the color attribute definition to know, currently, what the text
|
||||
and background colors are. This becomes difficult when classes are thrown
|
||||
into the mix.</li>
|
||||
<li>The user still has to define the permissible colors, how does one do
|
||||
something like that?</li>
|
||||
</ol>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
23
htmlpurifier-4.10.0/docs/proposal-config.txt
Executable file
@ -0,0 +1,23 @@
|
||||
|
||||
Configuration
|
||||
|
||||
Configuration is documented on a per-use case: if a class uses a certain
|
||||
value from the configuration object, it has to define its name and what the
|
||||
value is used for. This means decentralized configuration declarations that
|
||||
are nevertheless error-checked, together with a centralized configuration object.
|
||||
|
||||
Directives are divided into namespaces, indicating the major portion of
|
||||
functionality they cover (although there may be overlaps). Please consult
|
||||
the documentation in ConfigDef for more information on these namespaces.
|
||||
|
||||
Since configuration is dependent on context, internal classes require a
|
||||
configuration object to be passed as a parameter. (They also require a
|
||||
Context object). A majority of classes do not need the config object,
|
||||
but for those who do, it is a lifesaver.
|
||||
|
||||
Definition objects are complex datatypes influenced by their respective
|
||||
directive namespaces (HTMLDefinition with HTML and CSSDefinition with CSS).
|
||||
If any of these directives is updated, HTML Purifier forces the definition
|
||||
to be regenerated.
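
The regeneration rule above can be illustrated with a minimal sketch. The
class SketchConfig and its methods are hypothetical and do not mirror HTML
Purifier's real configuration class; the sketch only assumes that a directive
key has the form Namespace.Directive and that a definition is cached per
namespace until one of that namespace's directives changes.

```php
<?php
// Hypothetical sketch, not HTML Purifier's real config class: a definition
// is cached per namespace and dropped when a directive in that namespace
// is updated, so the next request for it rebuilds it.
final class SketchConfig
{
    private $values = [];
    private $definitions = [];
    public $builds = 0; // counts how often a definition was (re)generated

    public function set(string $key, $value): void
    {
        // 'HTML.Doctype' belongs to the 'HTML' namespace.
        [$namespace] = explode('.', $key, 2);
        $this->values[$key] = $value;
        // Invalidate only the definition influenced by this namespace.
        unset($this->definitions[$namespace]);
    }

    public function getDefinition(string $namespace): array
    {
        if (!isset($this->definitions[$namespace])) {
            $this->builds++;
            $this->definitions[$namespace] = $this->build($namespace);
        }
        return $this->definitions[$namespace];
    }

    // Stand-in for the expensive setup of a real definition object:
    // here it just collects the directives of one namespace.
    private function build(string $namespace): array
    {
        $prefix = $namespace . '.';
        $def = [];
        foreach ($this->values as $key => $value) {
            if (strncmp($key, $prefix, strlen($prefix)) === 0) {
                $def[$key] = $value;
            }
        }
        return $def;
    }
}
```

In this sketch, setting Core.Encoding leaves a cached HTML definition alone,
while setting HTML.Doctype forces the next getDefinition('HTML') call to
rebuild it.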
|
||||
|
||||
vim: et sw=4 sts=4
|
34
htmlpurifier-4.10.0/docs/proposal-css-extraction.txt
Executable file
@ -0,0 +1,34 @@
|
||||
|
||||
Extracting inline CSS from HTML Purifier
|
||||
voodoofied: Assigning semantics to elements
|
||||
|
||||
Sander Tekelenburg brought to my attention the poor programming style of
|
||||
inline CSS in HTML documents. In an ideal world, we wouldn't be using inline
|
||||
CSS at all: everything would be assigned using semantic class attributes
|
||||
from an external stylesheet.
|
||||
|
||||
With ExtractStyleBlocks and CSSTidy, this is now possible (when allowed, users
|
||||
can specify a style element which gets extracted from the user-submitted HTML, which
|
||||
the application can place in the head of the HTML document). But there still
|
||||
is the issue of inline CSS that refuses to go away.
|
||||
|
||||
The basic idea behind this feature is to assign every element a unique identifier,
|
||||
and then move all of the CSS data to a style-sheet. This HTML:
|
||||
|
||||
<div style="text-align:center">Big <span style="color:red;">things</span>!</div>
|
||||
|
||||
would be turned into
|
||||
|
||||
<div id="hp-12345">Big <span id="hp-12346">things</span>!</div>
|
||||
|
||||
and a stylesheet that is:
|
||||
|
||||
#hp-12345 {text-align:center;}
|
||||
#hp-12346 {color:red;}
|
||||
|
||||
Beyond that, HTML Purifier can magically merge common CSS values together,
|
||||
and all manner of other heuristic things. HTML Purifier should also
|
||||
make it easy for an admin to re-style the HTML semantically. Speed is not
|
||||
an issue. Also, better WYSIWYG editors are needed.
|
||||
|
||||
vim: et sw=4 sts=4
|
211
htmlpurifier-4.10.0/docs/proposal-errors.txt
Executable file
@ -0,0 +1,211 @@
|
||||
Considerations for ErrorCollection
|
||||
|
||||
Presently, HTML Purifier takes a code-execution centric approach to handling
|
||||
errors. Errors are organized and grouped according to which segment of the
|
||||
code triggers them, not necessarily the portion of the input document that
|
||||
triggered the error. This means that errors are pseudo-sorted by category,
|
||||
rather than location in the document.
|
||||
|
||||
One easy way to "fix" this problem would be to re-sort according to line number.
|
||||
However, the "category" style information we derive from naively following
|
||||
program execution is still useful. After all, each of the strategies which
|
||||
can report errors still process the document mostly linearly. Furthermore,
|
||||
not only do they process linearly, but the way they pass off operations to
|
||||
sub-systems mirrors that of the document. For example, AttrValidator will
|
||||
linearly proceed through elements, and on each element will use AttrDef to
|
||||
validate those contents. From there, the attribute might have more
|
||||
sub-components, which have execution passed off accordingly.
|
||||
|
||||
In fact, each strategy handles a very specific class of "error."
|
||||
|
||||
RemoveForeignElements - element tokens
|
||||
MakeWellFormed - element token ordering
|
||||
FixNesting - element token ordering
|
||||
ValidateAttributes - attributes of elements
|
||||
|
||||
The crucial point is that while we care about the hierarchy governing these
|
||||
different errors, we *don't* care about any other information about what actually
|
||||
happens to the elements. This brings up another point: if HTML Purifier fixes
|
||||
something, this is not really a notice/warning/error; it's really a suggestion
|
||||
of a way to fix the aforementioned defects.
|
||||
|
||||
In short, the refactoring to take this into account kinda sucks.
|
||||
|
||||
Errors should not be recorded in the order in which they are reported. Instead, they
|
||||
should be bound to the line (and preferably element) in which they were found.
|
||||
This means we need some way to uniquely identify every element in the document,
|
||||
which doesn't presently exist. An easy way of adding this would be to track
|
||||
line columns. An important ramification of this is that we *must* use the
|
||||
DirectLex implementation.
|
||||
|
||||
1. Implement column numbers for DirectLex [DONE!]
|
||||
2. Disable error collection when not using DirectLex [DONE!]
|
||||
|
||||
Next, we need to re-orient all of the error declarations to place CurrentToken
|
||||
at utmost importance. Since this is passed via Context, it's not always clear
|
||||
if that's available. ErrorCollector should complain HARD if it isn't available.
|
||||
There are some locations where we don't have a token available. These include:
|
||||
|
||||
* Lexing - this can actually have a row and column, but NOT correspond to
|
||||
a token
|
||||
* End of document errors - bump this to the end
|
||||
|
||||
Actually, we *don't* have to complain if CurrentToken isn't available; we just
|
||||
set it as a document-wide error. And actually, nothing needs to be done here.
|
||||
|
||||
Something interesting to consider is whether or not we care about the locations
|
||||
of attributes and CSS properties, i.e. the sub-objects that compose these things.
|
||||
In terms of consistency, at the very least attributes should have column/line
|
||||
numbers attached to them. However, this may be overkill, as attributes are
|
||||
uniquely identifiable. You could go even further, with CSS, but they are also
|
||||
uniquely identifiable.
|
||||
|
||||
The bottom line, however, is that this information must be available in the form of the
|
||||
CurrentAttribute and CurrentCssProperty (theoretical) context variables, and
|
||||
it must be used to organize the errors that the sub-processes may throw.
|
||||
There is also a hierarchy of sorts that may make merging this into one context
|
||||
variable sensible, were it not for HTML's reasonably rigid structure.
|
||||
A CSS property will never contain an HTML attribute. So we won't ever get
|
||||
recursive relations, and having multiple depths won't ever make sense. Leave
|
||||
this be.
|
||||
|
||||
We already have this information, and consequently, using start and end is
|
||||
*unnecessary*, so long as the context variables are set appropriately. We don't
|
||||
care if an error was thrown by an attribute transform or an attribute definition;
|
||||
to the end user these are the same (for a developer, they are different, but
|
||||
they're better off with a stack trace (which we should add support for) in such
|
||||
cases).
|
||||
|
||||
3. Remove start()/end() code. Don't get rid of recursion, though [DONE]
|
||||
4. Set up ErrorCollector to use context information to set up hierarchies.
|
||||
This may require a different internal format. Use objects if it gets
|
||||
complex. [DONE]
|
||||
|
||||
ASIDE
|
||||
More on this topic: since we are now binding errors to lines
|
||||
and columns, a particular error can have three relationships to that
|
||||
specific location:
|
||||
|
||||
1. The token at that location directly
|
||||
RemoveForeignElements
|
||||
AttrValidator (transforms)
|
||||
MakeWellFormed
|
||||
2. A "component" of that token (i.e. attribute)
|
||||
AttrValidator (removals)
|
||||
3. A modification to that node (i.e. contents from start to end
|
||||
token) as a whole
|
||||
FixNesting
|
||||
|
||||
This needs to be marked accordingly. In the presentation, it might
|
||||
make sense to keep (3) separate, and have (2) as a sublist of (1). (1) can
|
||||
be a closing tag, in which case (3) makes no sense at all, OR it
|
||||
should be related with its opening tag (this may not necessarily
|
||||
be possible before MakeWellFormed is run).
|
||||
|
||||
So the line and column together serve as our identifier:
|
||||
|
||||
$errors[$line][$col] = ...
|
||||
|
||||
Then, we need to identify case 1, 2 or 3. They are identified as
|
||||
such:
|
||||
|
||||
1. Need some sort of semaphore in RemoveForeignElements, etc.
|
||||
2. If CurrentAttr/CurrentCssProperty is non-null
|
||||
3. Default (FixNesting, MakeWellFormed)
|
||||
|
||||
One consideration about (1) is that it usually is actually a
|
||||
(3) modification, but we have no way of knowing about that because
|
||||
of various optimizations. However, they can probably be treated
|
||||
the same. The other difficulty is that (3) is never a line and
|
||||
column; rather, it is a range (i.e. a duple) and telling the user
|
||||
the very start of the range may confuse them. For example,
|
||||
|
||||
<b>Foo<div>bar</div></b>
|
||||
^ ^
|
||||
|
||||
The node being operated on is <b>, so the error would be assigned
|
||||
to the first caret, with a "node reorganized" error. Then, the
|
||||
ChildDef would have submitted its own suggestions and errors with
|
||||
regard to what's going on in the internals. So I suppose this is
|
||||
ok. :-)
|
||||
|
||||
Now, the structure of the earlier mentioned ... would be something
|
||||
like this:
|
||||
|
||||
object {
|
||||
type = (token|attr|property),
|
||||
value, // appropriate for type
|
||||
errors = array(),
|
||||
sub-errors = [recursive],
|
||||
}
|
||||
|
||||
This helps us keep things agnostic. It is also complex
|
||||
enough to warrant an object.
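The structure above could be sketched in PHP roughly as follows. This is only an illustration under stated assumptions: the class name ErrorNode, its members, and the sample messages are made up for this note and are not claimed to be the library's actual API.

```php
<?php
// Hypothetical sketch: ErrorNode and its members are illustrative names,
// not a confirmed HTML Purifier API.
class ErrorNode
{
    public $type;               // 'token', 'attr' or 'property'
    public $value;              // whatever is appropriate for the type
    public $errors = array();   // list of array(severity, message)
    public $children = array(); // sub-errors, keyed by type and then by id

    public function __construct($type, $value = null)
    {
        $this->type = $type;
        $this->value = $value;
    }

    public function addError($severity, $message)
    {
        $this->errors[] = array($severity, $message);
    }

    // Returns the sub-node for a component, creating it on first use.
    public function child($type, $id)
    {
        if (!isset($this->children[$type][$id])) {
            $this->children[$type][$id] = new ErrorNode($type, $id);
        }
        return $this->children[$type][$id];
    }
}

// Errors are keyed by the line and column of the token they belong to.
$errors = array();
$line = 3;
$col = 14;
$errors[$line][$col] = new ErrorNode('token', 'img');
$errors[$line][$col]->addError(E_WARNING, 'Element transformed');
$errors[$line][$col]
    ->child('attr', 'style')
    ->child('property', 'color')
    ->addError(E_ERROR, 'Invalid value removed');
```

Because every level uses the same class, the token -> attr -> css property chain needs no special casing; a node simply has no children when the relation does not apply.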
|
||||
|
||||
So, more wanking about the object format is in order. The way HTML Purifier is
|
||||
currently set up, the only possible hierarchy is:
|
||||
|
||||
token -> attr -> css property
|
||||
|
||||
These relations do not exist all of the time; a comment or end token would not
|
||||
ever have any attributes, and non-style attributes would never have CSS properties
|
||||
associated with them.
|
||||
|
||||
I believe that it is worth supporting multiple paths. At some point, we might
|
||||
have a hierarchy like:
|
||||
|
||||
* -> syntax
|
||||
-> token -> attr -> css property
|
||||
-> url
|
||||
-> css stylesheet <style>
|
||||
|
||||
et cetera. Now, one of the practical implications of this is that every "node"
|
||||
on our tree is well-defined, so in theory it should be possible to either 1.
|
||||
create a separate class for each error struct, or 2. embed this information
|
||||
directly into HTML Purifier's token stream. Embedding the information in the
|
||||
token stream is not a terribly good idea, since tokens can be removed, etc.
|
||||
So that leaves us with 1... and if we use a generic interface we can cut down
|
||||
on a lot of code we might need. So let's leave it like this.
|
||||
|
||||
~~~~
|
||||
|
||||
Then we set up suggestions.
|
||||
|
||||
5. Set up a separate error class which tells the user any modifications
|
||||
HTML Purifier made.
|
||||
|
||||
Some information about this:
|
||||
|
||||
Our current paradigm is to tell the user what HTML Purifier did to the HTML.
|
||||
This is the most natural mode of operation, since that's what HTML Purifier
|
||||
is all about; it was not meant to be a validator.
|
||||
|
||||
However, most other people have experience dealing with a validator. In cases
|
||||
where HTML Purifier unambiguously does the right thing, simply giving the user
|
||||
the correct version isn't a bad idea, but problems arise when:
|
||||
|
||||
- The user has such bad HTML we do something odd, when we should have just
|
||||
flagged the HTML as an error. Such examples are when we do things like
|
||||
remove text from directly inside a <table> tag. It was probably meant to
|
||||
be in a <td> tag or be outside the table, but we're not smart enough to
|
||||
realize this so we just remove it. In such a case, we should tell the user
|
||||
that there was foreign data in the table, but then we shouldn't "demand"
|
||||
the user remove the data; it's more of a "here's a possible way of
|
||||
rectifying the problem"
|
||||
|
||||
- Giving line context for input is hard enough, but feasible; giving output
|
||||
line context will be extremely difficult due to shifting lines; we'd probably
|
||||
have to track what the tokens are and then find the appropriate output context
|
||||
and it's not guaranteed to work etc etc etc.
|
||||
|
||||
````````````
|
||||
|
||||
Don't forget to spruce up output.
|
||||
|
||||
6. Output needs to automatically give line and column numbers, basically
|
||||
"at line" on steroids. Look at W3C's output; it's ok. [PARTIALLY DONE]
|
||||
|
||||
- We need a standard CSS to apply (check demo.css for some starting
|
||||
styling; some buttons would also be hip)
|
||||
|
||||
vim: et sw=4 sts=4
|
137
htmlpurifier-4.10.0/docs/proposal-filter-levels.txt
Executable file
@ -0,0 +1,137 @@
|
||||
|
||||
Filter Levels
|
||||
When one size *does not* fit all
|
||||
|
||||
It makes little sense to constrain users to one set of HTML elements and
|
||||
attributes and tell them that they are not allowed to mold this in
|
||||
any fashion. Many users demand to be able to custom-select which elements
|
||||
and attributes they want. This is fine: because HTML Purifier keeps close
|
||||
track of what elements are safe to use, there is no way for them to
|
||||
accidentally allow an XSS-able tag.
|
||||
|
||||
However, combing through the HTML spec to make your own whitelist can
|
||||
be a daunting task. HTML Purifier ought to offer pre-canned filter levels
|
||||
that amateur users can select based on what they think is their use-case.
|
||||
|
||||
Here are some fuzzy levels you could set:
|
||||
|
||||
1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
|
||||
code, em, i, strike, strong; however, you could get away with only a, em and
|
||||
p; also having blockquote and pre tags would be helpful.
|
||||
2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
|
||||
pre, div, span and h[2-6] (the last three are for specially formatted
|
||||
posts, div and span require associated classes or inline styling enabled
|
||||
to be useful)
|
||||
3. Pages - As permissive as possible without allowing XSS. No protection
|
||||
against bad design sense, unfortunately. Suitable for wiki and page
|
||||
environments. (probably what we have now)
|
||||
4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
|
||||
get implemented as it would require routines for things like <object>
|
||||
and friends to be implemented, which is a lot of work for not a lot of
|
||||
benefit)
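As a rough sketch of how the first two levels might be pre-canned: the function name filter_level_elements and the level keys are assumptions made up for this note. The element lists are the ones given above, and the return value is a plain comma-separated string of the kind an allowed-elements setting could take.

```php
<?php
// Hypothetical sketch: the function name and the level keys are illustrative.
function filter_level_elements($level)
{
    $levels = array(
        // The WordPress-recommended comment tagset listed above.
        'comments' => array(
            'a', 'abbr', 'acronym', 'b', 'blockquote', 'cite',
            'code', 'em', 'i', 'strike', 'strong',
        ),
        // The usual forum tagset listed above.
        'bbcode' => array(
            'b', 'i', 'img', 'a', 'blockquote', 'pre', 'div', 'span',
            'h2', 'h3', 'h4', 'h5', 'h6',
        ),
    );
    if (!isset($levels[$level])) {
        throw new InvalidArgumentException("Unknown filter level: $level");
    }
    return implode(',', $levels[$level]);
}
```

The "Pages" and "Lint" levels are left out on purpose, since the notes above do not enumerate their elements.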
|
||||
|
||||
One final note: when you start axing tags that are more commonly used, you
|
||||
run the risk of accidentally destroying user data, especially if the data
|
||||
is incoming from a WYSIWYG editor that hasn't been synced accordingly. This may
|
||||
make forbidden element to text transformations desirable (for example, images).
|
||||
|
||||
|
||||
|
||||
== Element Risk Analysis ==
|
||||
|
||||
Although none of the currently supported elements presents a security
|
||||
threat per se, some can cause problems for page layouts or be
|
||||
extremely complicated.
|
||||
|
||||
Legend:
|
||||
[danger level] - regular tags / uncommon tags ~ deprecated tags
|
||||
[danger level]* - rare tags
|
||||
|
||||
1 - blockquote, code, em, i, p, tt / strong, sub, sup
|
||||
1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
|
||||
2 - b, br, del, div, pre, span / ins, s, strike ~ u
|
||||
3 - h2, h3, h4, h5, h6 ~ center
|
||||
4 - h1, big ~ font
|
||||
5 - a
|
||||
7 - area, map
|
||||
|
||||
These are special-use tags; they should be enabled on a blanket basis.
|
||||
|
||||
Lists - dd, dl, dt, li, ol, ul ~ menu, dir
|
||||
Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
|
||||
|
||||
Forms - fieldset, form, input, label, legend, optgroup, option, select, textarea
|
||||
XSS - noscript, object, script ~ applet
|
||||
Meta - base, basefont, body, head, html, link, meta, style, title
|
||||
Frames - frame, frameset, iframe
|
||||
|
||||
And tag specific notes:
|
||||
|
||||
a - general problems involving linkspam
|
||||
b - too much bold is bad, typographically speaking bold is discouraged
|
||||
br - often misused
|
||||
center - CSS, usually no legit use
|
||||
del - only useful in editing context
|
||||
div - little meaning in certain contexts i.e. blog comment
|
||||
h1 - usually no legit use, as header is already set by application
|
||||
h* - not needed in blog comments
|
||||
hr - usually not necessary in blog comments
|
||||
img - could be extremely undesirable if linking to external pics (CSRF, goatse)
|
||||
pre - could use formatting, only useful in code contexts
|
||||
q - very little support
|
||||
s - transform into span with styling or del?
|
||||
small - technically presentational
|
||||
span - depends on attribute allowances
|
||||
sub, sup - specialized
|
||||
u - little legit use, prefer class with text-decoration
|
||||
|
||||
Based on the riskiness of the items, we may want to offer %HTML.DisableImages
|
||||
directive and put URI filtering higher up on the priority list.
|
||||
|
||||
|
||||
== Attribute Risk Analysis ==
|
||||
|
||||
We actually have a surprisingly small assortment of allowed attributes (the
|
||||
rest are deprecated in strict, and thus we opted not to allow them, even
|
||||
though our output is XHTML Transitional by default.)
|
||||
|
||||
Required URI - img.alt, img.src, a.href
|
||||
Medium risk - *.class, *.dir
|
||||
High risk - img.height, img.width, *.id, *.style
|
||||
|
||||
Table - colgroup/col.span, td/th.rowspan, td/th.colspan
|
||||
Uncommon - *.title, *.lang, *.xml:lang
|
||||
Rare - td/th.abbr, table.summary, {table}.charoff
|
||||
Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc
|
||||
Presentational - {table}.align, {table}.valign, table.frame, table.rules,
|
||||
table.border
|
||||
Partially presentational - table.cellpadding, table.cellspacing,
|
||||
table.width, col.width, colgroup.width
|
||||
|
||||
|
||||
== CSS Risk Analysis ==
|
||||
|
||||
Currently, there is no support for fine-grained "allowed CSS" specification,
|
||||
mainly because I'm lazy, partially because no one has asked for it. However,
|
||||
this will be added eventually.
|
||||
|
||||
There are certain CSS properties that are extremely useful inline, but then
|
||||
as you get to more presentation oriented styling it may not always be
|
||||
appropriate to inline them.
|
||||
|
||||
Useful - clear, float, border-collapse, caption-side
|
||||
|
||||
These CSS properties can break layouts if used improperly. We have excluded
|
||||
any CSS properties that are not currently implemented (such as position).
|
||||
|
||||
Dangerous, can go outside container - float
|
||||
Easy to abuse - font-size, font-family (font), width
|
||||
Colored - background-color (background), border-color (border), color
|
||||
(see proposal-colors.html)
|
||||
Dramatic - border, list-style-position (list-style), margin, padding,
|
||||
text-align, text-indent, text-transform, vertical-align, line-height
|
||||
|
||||
Dramatic properties substantially change the look of text in ways that should
|
||||
probably have been reserved to other areas.
|
||||
|
||||
vim: et sw=4 sts=4
|
64
htmlpurifier-4.10.0/docs/proposal-language.txt
Executable file
@ -0,0 +1,64 @@
|
||||
We are going to model our I18N/L10N off of MediaWiki's system. Theirs is
|
||||
obviously quite complicated, so we're going to simplify it a bit for our needs.
|
||||
|
||||
== Caching ==
|
||||
|
||||
MediaWiki has lots of caching mechanisms built in, which make the code somewhat
|
||||
more difficult to understand. Before doing any loading, MediaWiki will check
|
||||
the following places to see if we can be lazy:
|
||||
|
||||
1. $mLocalisationCache[$code] - just a variable where it may have been stashed
|
||||
2. serialized/$code.ser - compiled serialized language file
|
||||
3. Memcached version of file (with expiration checking)
|
||||
|
||||
Expiration checking consists of ensuring that all dependencies have filemtimes
|
||||
that match the ones bundled with the cached copy. Similar checking could be
|
||||
implemented for serialized versions, as it seems that they are not updated
|
||||
until manually recompiled.
|
||||
|
||||
== Behavior ==
|
||||
|
||||
Things that are localizable:
|
||||
|
||||
- Weekdays (and abbrev)
|
||||
- Months (and abbrev)
|
||||
- Bookstores
|
||||
- Skin names
|
||||
- Date preferences / Custom date format
|
||||
- Default date format
|
||||
- Default user option overrides
|
||||
-+ Language names
|
||||
- Timezones
|
||||
-+ Character encoding conversion via iconv
|
||||
- UpperLowerCase first (needs casemaps for some)
|
||||
- UpperLowerCase
|
||||
- Uppercase words
|
||||
- Uppercase word breaks
|
||||
- Case folding
|
||||
- Strip punctuation for MySQL search
|
||||
- Get first character
|
||||
-+ Alternate encoding
|
||||
-+ Recoding for edit (and then recode input)
|
||||
-+ RTL
|
||||
-+ Direction mark character depending on RTL
|
||||
-? Arrow depending on RTL
|
||||
- Languages where italics cannot be used
|
||||
-+ Number formatting (commafy, transform digits, transform separators)
|
||||
- Truncate (multibyte)
|
||||
- Grammar conversions for inflected languages
|
||||
- Plural transformations
|
||||
- Formatting expiry times
|
||||
- Segmenting for diffs (Chinese)
|
||||
- Convert to variants of language
|
||||
- Language specific user preference options
|
||||
- Link trails [[foo]]bar
|
||||
-+ Language code (RFC 3066)
|
||||
|
||||
Neat functionality:
|
||||
|
||||
- I18N sprintfDate
|
||||
- Roman numeral formatting
|
||||
|
||||
Items marked with a + likely need to be addressed by HTML Purifier
|
||||
|
||||
vim: et sw=4 sts=4
|
44
htmlpurifier-4.10.0/docs/proposal-new-directives.txt
Executable file
@ -0,0 +1,44 @@
|
||||
|
||||
Configuration Ideas
|
||||
|
||||
Here are some theoretical configuration ideas that we could implement some
|
||||
time. Note the naming convention: %Namespace.Directive. If you want one
|
||||
implemented, give us a ring, and we'll move it up the priority chain.
|
||||
|
||||
%Attr.RewriteFragments - if there's %Attr.IDPrefix we may want to transparently
|
||||
rewrite the URLs we parse too. However, we can only do it when it's a pure
|
||||
anchor link, so it's not foolproof
|
||||
|
||||
%Attr.ClassBlacklist,
|
||||
%Attr.ClassWhitelist,
|
||||
%Attr.ClassPolicy - determines what classes are allowed. When
|
||||
%Attr.ClassPolicy is set to Blacklist, only allow those not in
|
||||
%Attr.ClassBlacklist. When it's Whitelist, only allow those in
|
||||
%Attr.ClassWhitelist.
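A minimal sketch of the class policy described above, assuming the three directives arrive as plain PHP values. The function name filter_classes and its parameter order are made up for illustration.

```php
<?php
// Hypothetical sketch: filter_classes() is not an existing HTML Purifier function.
function filter_classes($value, $policy, array $blacklist, array $whitelist)
{
    // A class attribute is a whitespace-separated list of names.
    $classes = preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array();
    foreach ($classes as $class) {
        if ($policy === 'Whitelist') {
            // Only allow classes that are explicitly listed.
            $allowed = in_array($class, $whitelist, true);
        } else {
            // 'Blacklist': allow everything that is not explicitly listed.
            $allowed = !in_array($class, $blacklist, true);
        }
        if ($allowed) {
            $kept[] = $class;
        }
    }
    return implode(' ', $kept);
}
```

Any policy value other than 'Whitelist' falls through to blacklist behavior here; a real directive would presumably validate the value instead.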
|
||||
|
||||
%Attr.MaxWidth,
|
||||
%Attr.MaxHeight - caps for width and height related checks.
|
||||
(the hack in Pixels for an image crashing attack could be replaced by this)
|
||||
|
||||
%URI.AddRelNofollow - will add rel="nofollow" to all links, preventing the
|
||||
spread of ill-gotten pagerank
|
||||
|
||||
%URI.HostBlacklistRegex - regexes that if matching the host are disallowed
|
||||
%URI.HostWhitelist - domain names that are excluded from the host blacklist
|
||||
%URI.HostPolicy - determines whether or not it's reject all and then whitelist
|
||||
or allow all in then do specific blacklists with whitelist intervening.
|
||||
'DenyAll' or 'AllowAll' (default)
|
||||
|
||||
%URI.DisableIPHosts - URIs that have IP addresses for hosts are disallowed.
|
||||
Be sure to also grab unusual encodings (dword, hex and octal), which may
|
||||
currently be caught by regular DNS
|
||||
%URI.DisableIDN - Disallow raw internationalized domain names. Punycode
|
||||
will still be permitted.
|
||||
|
||||
%URI.ConvertUnusualIPHosts - transform dword/hex/octal IP addresses to the
|
||||
regular form
|
||||
%URI.ConvertAbsoluteDNS - Remove extra dots after host names that trigger
|
||||
absolute DNS. While this is actually the preferred method according to
|
||||
the RFC, most people opt to use a domain name relative to . (root).
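The two conversion directives at the end of this list could be sketched together like this. The function name normalize_host is made up, only single-number hosts (dword, hex, octal) are handled, and a 64-bit PHP build is assumed so that the largest 32-bit address fits in an integer.

```php
<?php
// Hypothetical sketch: normalize_host() is illustrative and assumes 64-bit PHP.
function normalize_host($host)
{
    // ConvertAbsoluteDNS idea: drop the trailing dots that mark absolute names.
    $host = rtrim($host, '.');

    if (preg_match('/^0x[0-9a-f]{1,8}$/i', $host)) {
        $number = hexdec(substr($host, 2));   // hex form, e.g. 0xC0A80101
    } elseif (preg_match('/^0[0-7]+$/', $host)) {
        $number = octdec($host);              // octal form, leading zero
    } elseif (preg_match('/^[1-9][0-9]*$/', $host)) {
        $number = (int) $host;                // dword form, plain decimal
    } else {
        return $host;                         // an ordinary host name
    }

    if ($number > 0xFFFFFFFF) {
        return $host;                         // too large for an IPv4 address
    }
    // ConvertUnusualIPHosts idea: rewrite to the regular dotted form.
    return long2ip((int) $number);
}
```

Dotted hosts with hex or octal parts (such as 0xC0.0xA8.1.1) are not covered; they would need the same treatment applied to each part.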
|
||||
|
||||
vim: et sw=4 sts=4
|
218
htmlpurifier-4.10.0/docs/proposal-plists.txt
Executable file
@ -0,0 +1,218 @@
|
||||
THE UNIVERSAL DESIGN PATTERN: PROPERTIES
|
||||
Steve Yegge
|
||||
|
||||
Implementation:
|
||||
get(name)
|
||||
put(name, value)
|
||||
has(name)
|
||||
remove(name)
|
||||
iteration, with filtering [this will be our namespaces]
|
||||
parent
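The operations listed above could be sketched as a small PHP class. This is an assumption-laden illustration, not the eventual implementation: the class name PlistSketch and the method names squash() and filter() are chosen for this note, remove() is left out because deletion interacts badly with inheritance, and the directive names are only sample keys.

```php
<?php
// Hypothetical sketch of a property list with a parent pointer.
class PlistSketch
{
    private $data = array();
    private $parent;

    public function __construct($parent = null)
    {
        $this->parent = $parent;
    }

    public function put($name, $value)
    {
        $this->data[$name] = $value;
    }

    // True if the key is set here or anywhere up the parent chain.
    public function has($name)
    {
        if (array_key_exists($name, $this->data)) {
            return true;
        }
        return $this->parent !== null && $this->parent->has($name);
    }

    // Local values win; otherwise defer to the parent.
    public function get($name)
    {
        if (array_key_exists($name, $this->data)) {
            return $this->data[$name];
        }
        if ($this->parent !== null) {
            return $this->parent->get($name);
        }
        throw new OutOfBoundsException("Key '$name' not found");
    }

    // Flattens the parent chain into one array; children override parents.
    public function squash()
    {
        $base = $this->parent !== null ? $this->parent->squash() : array();
        return array_replace($base, $this->data);
    }

    // Iteration with filtering: every key in one namespace, prefix removed.
    public function filter($namespace)
    {
        $prefix = $namespace . '.';
        $length = strlen($prefix);
        $result = array();
        foreach ($this->squash() as $key => $value) {
            if (strncmp($key, $prefix, $length) === 0) {
                $result[substr($key, $length)] = $value;
            }
        }
        return $result;
    }
}

$defaults = new PlistSketch();
$defaults->put('Core.Encoding', 'utf-8');
$defaults->put('HTML.TidyLevel', 'medium');

$config = new PlistSketch($defaults);
$config->put('HTML.TidyLevel', 'heavy');
$config->put('HTML.Doctype', 'XHTML 1.0 Strict');
```

Here $config->get('Core.Encoding') falls through to the parent, while $config->filter('HTML') returns only the two HTML entries, with the child's TidyLevel taking precedence.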
|
||||
|
||||
Representations:
|
||||
- Keys are strings
|
||||
- It's nice to not need to quote keys (if we formulate our own language,
|
||||
consider this)
|
||||
- Property not present representation (key missing)
|
||||
- With frequent removal/re-add, setting the value to null may help. If null is valid, use
|
||||
another value. (PHP semantics are weird here)
|
||||
|
||||
Data structures:
|
||||
- LinkedHashMap is wonderful (O(1) access and maintains order)
|
||||
- Using a special property that points to the parent is usual
|
||||
- Multiple inheritance possible, need rules for which to lookup first
|
||||
- Iterative inheritance is best
|
||||
- Consider performance!
|
||||
|
||||
Deletion
|
||||
- Tricky problem with inheritance
|
||||
- Distinguish between "not found" and "look in my parent for the property"
|
||||
[Maybe HTML Purifier won't allow deletion]
|
||||
|
||||
Read/write asymmetry (it's correct!)
|
||||
|
||||
Read-only plists
|
||||
- Allow ability to freeze [this is what we have already]
|
||||
- Don't overuse it
|
||||
|
||||
Performance:
|
||||
- Intern strings (PHP does this already)
|
||||
- Don't be case-insensitive
|
||||
- If all properties in a plist are known a-priori, you can use a "perfect"
|
||||
hash function. Often overkill.
|
||||
- Copy-on-read caching "plundering" reduces lookup, but uses memory and can
|
||||
grow stale. Use as last resort.
|
||||
- Refactoring to fields. Watch for API compatibility, system complexity,
|
||||
and lack of flexibility.
|
||||
- Refrigerator: external data-structure to hold plists
|
||||
|
||||
Transient properties:
|
||||
[Don't need to worry about this]
|
||||
- Use a separate plist for transient properties
|
||||
- Non-numeric override; numeric should ADD
|
||||
- Deletion: removeTransientProperty() and transientlyRemoveProperty()
|
||||
|
||||
Persistence:
|
||||
- XML/JSON are good
|
||||
- Text-based is good for readability, maintainability and bootstrapping
|
||||
- Compressed binary format for network transport [not necessary]
|
||||
- RDBMS or XML database
|
||||
|
||||
Querying: [not relevant]
|
||||
- XML database is nice for XPath/XQuery
|
||||
- jQuery for JSON
|
||||
- Just load it all into a program
|
||||
|
||||
Backfills/Data integrity:
|
||||
- Use usual methods
|
||||
- Lazy backfill is a nice hack
|
||||
|
||||
Type systems:
|
||||
- Flags: ReadOnly, Permanent, DontEnum
|
||||
- Typed properties aren't that useful [It's also Not-PHP]
|
||||
- Separate meta-list of directive properties IS useful
|
||||
- Duck typing is useful for systems designed fully around properties pattern
|
||||
|
||||
Trade-off:
|
||||
+ Flexibility
|
||||
+ Extensibility
|
||||
+ Unit-testing/prototype-speed
|
||||
- Performance
|
||||
- Data integrity
|
||||
- Navigability/Query-ability
|
||||
- Reversibility (hard to go back)
|
||||
|
||||
HTML Purifier
|
||||
|
||||
We are not happy with our current system of defining configuration directives,
|
||||
because it has become clear that things will get a lot nicer if we allow
|
||||
multiple namespaces, and there are some features that naturally lend themselves
|
||||
to inheritance, which we do not really support well.
|
||||
|
||||
One of the considered implementation changes would be to go from a structure
|
||||
like:
|
||||
|
||||
array(
|
||||
'Namespace' => array(
|
||||
'Directive' => 'val1',
|
||||
'Directive2' => 'val2',
|
||||
)
|
||||
)
|
||||
|
||||
to:
|
||||
|
||||
array(
|
||||
'Namespace.Directive' => 'val1',
|
||||
'Namespace.Directive2' => 'val2',
|
||||
)
|
||||
|
||||
The latter implementation takes more memory, however, and it makes it a bit
|
||||
complicated to grab all values from a namespace.
|
||||
|
||||
The alternate implementation choice is to allow nested plists. This keeps
|
||||
iteration easy, but is problematic for inheritance (it would be difficult
|
||||
to distinguish a plist from an array) and retrieval (when specifying multiple
|
||||
namespaces we would need some multiple de-referencing).
|
||||
|
||||
----
|
||||
|
||||
We can bite the performance hit, and just do iteration with filter
|
||||
(the strncmp call should be relatively cheap). Then, users should be able
|
||||
to optimize doing something like:
|
||||
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
if (!file_exists('config.php')) {
|
||||
// set up $config
|
||||
$config->save('config.php');
|
||||
} else {
|
||||
$config->load('config.php');
|
||||
}
|
||||
|
||||
Or maybe memcache, or something. This means that "// set up $config" must
|
||||
not have any dynamic parts, or the user has to invalidate the cache when
|
||||
they do update it. We have to think about this a little more carefully; the
|
||||
file call might be more expensive.
|
||||
|
||||
----
|
||||
|
||||
This might get expensive, however, when we actually care about iterating
|
||||
over the configuration and want the actual values. So what about nesting the
|
||||
lists?
|
||||
|
||||
"ns.sub.directive" => values['ns']['sub']['directive']
|
||||
|
||||
We can distinguish between plists and arrays by using ArrayObjects for the
|
||||
plists, and regular arrays for the arrays? Alternatively, use ArrayObjects
|
||||
for the arrays, and regular arrays for the plists.
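For reference, the dotted-key lookup into nested lists might look like the sketch below. It uses plain nested arrays and sidesteps the plist-versus-array question raised above; the function name nested_get and the sample keys are assumptions.

```php
<?php
// Hypothetical sketch: resolves "ns.sub.directive" against nested arrays.
function nested_get(array $values, $key)
{
    $node = $values;
    foreach (explode('.', $key) as $part) {
        // Each part of the key descends one level.
        if (!is_array($node) || !array_key_exists($part, $node)) {
            throw new OutOfBoundsException("No such directive: $key");
        }
        $node = $node[$part];
    }
    return $node;
}

$values = array(
    'ns' => array(
        'sub' => array('directive' => 'val1'),
        'other' => 'val2',
    ),
);
```

Note that a directive whose value is itself an array is indistinguishable from a namespace here, which is exactly the ambiguity the ArrayObject idea is meant to resolve.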
|
||||
|
||||
----
|
||||
|
||||
Implementation demands, and what has caused them:
|
||||
|
||||
1. DefinitionCache, the HTML, CSS and URI namespaces have caches attached to them
|
||||
Results:
|
||||
- getBatchSerial()
|
||||
- getBatch() : in general, the ability to traverse just a namespace
|
||||
|
||||
2. AutoFormat/Filter, this is a plugin architecture, directives not hard-coded
|
||||
- getBatch()
|
||||
|
||||
3. Configuration form
|
||||
- Namespaces used to organize directives
|
||||
|
||||
Other than that, we have a pure plist. PERHAPS we should maintain separate things
|
||||
for these different demands.
|
||||
|
||||
Issue 2: Directives for configuring the plugins are regular plists, but
|
||||
when enabling them, while it's "plist-ish", what you're really doing is adding
|
||||
them to an array of "autoformatters"/"filters" to enable. We can set up
|
||||
magic BC as well as in the new interface, but there should also be an
|
||||
add('AutoFormat', 'AutoParagraph'); which does the right thing.
|
||||
|
||||
One thing to consider is whether or not inheritance rules will apply to these.
|
||||
I'd say yes. That means that they're still plisty, in fact, the underlying
|
||||
implementation will probably be a plist. However, they will get their OWN
|
||||
plists, and will NOT support nesting.
|
||||
|
||||
Issue 1: Our current implementation is generally not efficient; md5(serialize($foo))
|
||||
is pretty expensive. So, I don't think there will be any problems if it
|
||||
gets "less" efficient, as long as we give users a properly fast alternative;
|
||||
DefinitionRev gives us a way to do this, by simply telling the user they must
|
||||
update it whenever they update Configuration directives as well. (There are
|
||||
obvious BC concerns here).
|
||||
|
||||
In such a case, we simply iterate over our plist (performing full retrievals
|
||||
for each value), grab the entries we care about, and then serialize and hash.
|
||||
It's going to be slow either way, due to the ability of plists to inherit.
|
||||
If we ksort(), we don't have to traverse the entire array, however, the
|
||||
cost of a ksort() call may not be worth it.
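A sketch of the serialize-and-hash step over a flat list, assuming the inherited values have already been squashed into one array. The function name batch_serial is made up. Note that ksort() is used here for a different reason than the one above: it makes the hash independent of the order in which directives were set.

```php
<?php
// Hypothetical sketch: hashes every directive in one namespace of a flat list.
function batch_serial(array $directives, $namespace)
{
    $prefix = $namespace . '.';
    $length = strlen($prefix);
    $batch = array();
    foreach ($directives as $key => $value) {
        // Cheap prefix test to pick out the namespace we care about.
        if (strncmp($key, $prefix, $length) === 0) {
            $batch[$key] = $value;
        }
    }
    // Sort so that insertion order does not change the serial.
    ksort($batch);
    return md5(serialize($batch));
}

$first = array(
    'HTML.Doctype' => 'XHTML 1.0 Strict',
    'HTML.TidyLevel' => 'heavy',
    'URI.Disable' => false,
);
$second = array(
    'URI.Disable' => true,
    'HTML.TidyLevel' => 'heavy',
    'HTML.Doctype' => 'XHTML 1.0 Strict',
);
```

The two arrays produce the same serial for the HTML namespace and different serials for the URI namespace, which is the property a definition cache key needs.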
|
||||
|
||||
At this point, last time, I started worrying about the performance implications
|
||||
of allowing inheritance, and wondering whether or not I wanted to squash
|
||||
the plist. At first blush, our code might be under the assumption that
|
||||
accessing properties is cheap; but actually we prefer to copy out the value
|
||||
into a member variable if it's going to be used many times. With this in mind
|
||||
I don't think CPU consumption from a few nested function calls is going to
|
||||
be a problem. We *are* going to enforce a function-only interface.
|
||||
|
||||
The next issue at hand is how we're going to manage the "special" plists,
|
||||
which should still be able to be inherited. Basically, it means that multiple
|
||||
plists would be attached to the configuration object, which is not the
|
||||
best for memory performance. The alternative is to keep them all in one
|
||||
big plist, and then eat the one-time cost of traversing the entire plist
|
||||
to grab the appropriate values.
|
||||
|
||||
I think at this point we can write the generic interface, and then set up separate
|
||||
plists if that ends up being necessary for performance (it probably won't.) Now
|
||||
let's code our generic plist implementation.
|
||||
|
||||
----
|
||||
|
||||
Iterating over the plist presents some problems. The way we've chosen to solve
|
||||
this is to squash all of the parents.
|
||||
|
||||
----
|
||||
|
||||
But I don't need iteration.
|
||||
|
||||
vim: et sw=4 sts=4
|
50
htmlpurifier-4.10.0/docs/ref-content-models.txt
Executable file
@ -0,0 +1,50 @@
|
||||
|
||||
Handling Content Model Changes
|
||||
|
||||
|
||||
1. Context
|
||||
|
||||
The distinction between Transitional and Strict document types is somewhat
|
||||
of an anomaly in the lineage of XHTML document types (following 1.0,
|
||||
doctypes do not have flavors: instead, modularization is used to let
|
||||
document authors vary their elements). This transition is usually quite
|
||||
straight-forward, as W3C usually deprecates attributes or elements, which
|
||||
are quite easily handled using tag and attribute transforms.
|
||||
|
||||
However, for three elements, <blockquote>, <body> and <address>, W3C elected
|
||||
to also change the content model. <blockquote> and <body> originally
|
||||
accepted both inline and block elements, but in the strict doctype they
|
||||
only allow block elements. With <address>, the situation is inverted:
|
||||
<p> tags were now forbidden from appearing within this tag.
|
||||
|
||||
|
||||
2. Current situation
|
||||
|
||||
Currently, HTML Purifier treats <blockquote> specially during Tidy mode
|
||||
using a custom ChildDef class StrictBlockquote. StrictBlockquote
|
||||
operates similarly to Required, except that when it encounters an inline
|
||||
element, it will wrap it in a block tag (as specified by
|
||||
%HTML.BlockWrapper, the default is <p>). The naming suggests it can
|
||||
only be used for <blockquote>s, although it may be possible to
|
||||
genericize it to work on other cases of this nature (this would be of
|
||||
little practical application, as no other element in XHTML 1.1 or earlier
|
||||
has a block-only content model).
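The wrapping behavior described above can be illustrated on a toy node list. This is not the StrictBlockquote code: the node format (strings for text, arrays with 'name' and 'children' for elements), the function name wrap_inline_runs, and the choice to group consecutive inline nodes into a single wrapper are all assumptions made for the sketch.

```php
<?php
// Hypothetical sketch: wraps runs of inline children in a block wrapper.
function wrap_inline_runs(array $children, array $block_names, $wrapper = 'p')
{
    $result = array();
    $run = array(); // inline nodes collected since the last block element
    foreach ($children as $child) {
        $is_block = is_array($child) && in_array($child['name'], $block_names, true);
        if ($is_block) {
            if ($run) {
                // Close the pending inline run before the block element.
                $result[] = array('name' => $wrapper, 'children' => $run);
                $run = array();
            }
            $result[] = $child;
        } else {
            $run[] = $child;
        }
    }
    if ($run) {
        $result[] = array('name' => $wrapper, 'children' => $run);
    }
    return $result;
}

$blockquote_children = array(
    'Hello ',
    array('name' => 'em', 'children' => array('world')),
    array('name' => 'p', 'children' => array('A proper paragraph.')),
    'Trailing text',
);
$fixed = wrap_inline_runs($blockquote_children, array('p', 'div'));
```

The wrapper defaults to 'p' to mirror the default of %HTML.BlockWrapper; the existing block child passes through untouched.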
|
||||
|
||||
Tidy currently contains no custom, lenient implementation for <address>.
|
||||
If one were to be written, it would likely operate on the principle that,
|
||||
when a <p> tag were to be encountered, it would be replaced with a
|
||||
leading and trailing <br /> tag (the contents of <p>, being inline, are
|
||||
not an issue). There is no prior work with this sort of operation.
|
||||
|
||||
|
||||
3. Outside applicability
|
||||
|
||||
There are a number of other elements that contain restrictive content
|
||||
models, such as <ul> or <span> (the latter is restrictive in that it
|
||||
does not allow block elements). In the former case, an errant node
|
||||
is eliminated completely, in the latter case, the text of the node
|
||||
is preserved (as the parent node does allow PCDATA). Custom
|
||||
content model implementations probably are not the best way of handling
|
||||
these cases; node bubbling should be implemented instead.
|
||||
|
||||
vim: et sw=4 sts=4
|
30
htmlpurifier-4.10.0/docs/ref-css-length.txt
Executable file
@ -0,0 +1,30 @@
|
||||
|
||||
CSS Length Reference
|
||||
To bound, or not to bound, that is the question
|
||||
|
||||
It's quite a reasonable request, really, and it's already been implemented
|
||||
for HTML. That is, length bounding. It makes little sense to let users
|
||||
define text blocks that have a font-size of 63,360 inches (that's a mile,
|
||||
by the way) or a width of forty-fold the parent container.
|
||||
|
||||
But it's a little more complicated than that. There are multiple units
|
||||
one can use, and we have to do a little unit conversion to get things working.
|
||||
Here's what we have:
|
||||
|
||||
Absolute:
|
||||
1 in = 2.54 cm
|
||||
1 cm = 10 mm
|
||||
1 pt = 1/72 in
|
||||
1 pc = 12 pt
|
||||
|
||||
Relative:
|
||||
1 em ~= 10.0667 px
|
||||
1 ex ~= 0.5 em, though Mozilla Firefox says 1 ex = 6px
|
||||
1 px ~= 1 pt
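The absolute units above are enough for a first pass at bounding. The sketch below converts a length to points so that one maximum can be compared against any unit; the function name css_length_to_pt is made up, and the relative units (em, ex, px) are deliberately rejected because they depend on context.

```php
<?php
// Hypothetical sketch: converts an absolute CSS length to points.
function css_length_to_pt($length)
{
    if (!preg_match('/^([0-9]*\.?[0-9]+)(in|cm|mm|pt|pc)$/', $length, $match)) {
        return null; // relative, unitless or malformed lengths are not handled
    }
    // Points per unit, derived from the table above.
    $points_per_unit = array(
        'in' => 72.0,
        'cm' => 72.0 / 2.54,
        'mm' => 72.0 / 25.4,
        'pt' => 1.0,
        'pc' => 12.0,
    );
    return (float) $match[1] * $points_per_unit[$match[2]];
}

// A mile-high font size is easy to spot once everything is in points.
$too_big = css_length_to_pt('63360in') > css_length_to_pt('100pc');
```

Bounding em and ex would additionally require tracking the context font size, as the note that follows points out.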
|
||||
|
||||
Watch out: font-sizes can also be nested to get successively larger
|
||||
(although I do not relish having to keep track of context font-sizes,
|
||||
this may be necessary, especially for some of the more advanced features
|
||||
for preventing things like white on white).
|
||||
|
||||
vim: et sw=4 sts=4
|
47
htmlpurifier-4.10.0/docs/ref-devnetwork.html
Executable file
@ -0,0 +1,47 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Credits and links to DevNetwork forum topics on HTML Purifier." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
|
||||
<title>DevNetwork Credits - HTML Purifier</title>
|
||||
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<h1>DevNetwork Credits</h1>
|
||||
|
||||
<div id="filing">Filed under Reference</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Many thanks to the DevNetwork community for answering questions,
|
||||
theorizing about design, and offering encouragement during
|
||||
the development of this library in these forum threads:</p>
|
||||
|
||||
<ul>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=52905">HTMLPurifier PHP Library hompeage</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53056">How much of CSS to implement?</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53083">Parsing URL only according to URI : Security Risk?</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53096">Gimme a name : URI and friends</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53415">How to document configuration directives</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53479">IPv6</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53539">http and ftp versus news and mailto</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53579">HTMLPurifier - Take your best shot</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53664">Need help optimizing a block of code</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=53861">Non-SGML characters</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=54283">Wordpress makes me cry</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=54478">Parameter Object vs. Parameter Array vs. Parameter Functions</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=54521">Convert encoding where output cannot represent characters</a></li>
|
||||
<li><a href="http://forums.devnetwork.net/viewtopic.php?t=56411">Reporting errors in a document without line numbers</a></li>
|
||||
</ul>
|
||||
|
||||
<p>...as well as any I may have forgotten.</p>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
|
||||
<!-- vim: et sw=4 sts=4
|
||||
-->
|
166
htmlpurifier-4.10.0/docs/ref-html-modularization.txt
Executable file
@ -0,0 +1,166 @@
|
||||
|
||||
The Modularization of HTMLDefinition in HTML Purifier
|
||||
|
||||
WARNING: This document was drafted before the implementation of this
|
||||
system, and some implementation details may have evolved over time.
|
||||
|
||||
HTML Purifier uses the modularization of XHTML
|
||||
<http://www.w3.org/TR/xhtml-modularization/> to organize the internals
|
||||
of HTMLDefinition into a more manageable and extensible fashion. Rather
|
||||
than have one super-object, HTMLDefinition is split into HTMLModules,
|
||||
each of which is responsible for defining elements, their attributes,
|
||||
and other properties (for more in-depth coverage, see
|
||||
/library/HTMLPurifier/HTMLModule.php's docblock comments). These modules
|
||||
are managed by HTMLModuleManager.
|
||||
|
||||
Modules that we don't support but could support are:
|
||||
|
||||
* 5.6. Table Modules
|
||||
o 5.6.1. Basic Tables Module [?]
|
||||
* 5.8. Client-side Image Map Module [?]
|
||||
* 5.9. Server-side Image Map Module [?]
|
||||
* 5.12. Target Module [?]
|
||||
* 5.21. Name Identification Module [deprecated]
|
||||
|
||||
These modules would be implemented as "unsafe":
|
||||
|
||||
* 5.2. Core Modules
|
||||
o 5.2.1. Structure Module
|
||||
* 5.3. Applet Module
|
||||
* 5.5. Forms Modules
|
||||
o 5.5.1. Basic Forms Module
|
||||
o 5.5.2. Forms Module
|
||||
* 5.10. Object Module
|
||||
* 5.11. Frames Module
|
||||
* 5.13. Iframe Module
|
||||
* 5.14. Intrinsic Events Module
|
||||
* 5.15. Metainformation Module
|
||||
* 5.16. Scripting Module
|
||||
* 5.17. Style Sheet Module
|
||||
* 5.19. Link Module
|
||||
* 5.20. Base Module
|
||||
|
||||
We will not be using W3C's XML Schemas or DTDs directly due to the lack
|
||||
of robust tools for handling them (the main problem is that all the
|
||||
current parsers are usually PHP 5 only and solely-validating, not
|
||||
correcting).
|
||||
|
||||
This system may be generalized and ported over for CSS.
|
||||
|
||||
== General Use-Case ==
|
||||
|
||||
The outward API of HTMLDefinition has been largely preserved, not
|
||||
only for backwards compatibility but also by design. For customization,
|
||||
HTMLDefinition can be retrieved "raw", in which case it loads a structure
|
||||
that closely resembles the modules of XHTML 1.1. This structure is very
|
||||
dynamic, making it easy to make cascading changes to global content
|
||||
sets or remove elements in bulk.
|
||||
|
||||
However, once HTML Purifier needs the actual definition, it retrieves
|
||||
a finalized version of HTMLDefinition. The finalized definition involves
|
||||
processing the modules into a form that is optimized for multiple
|
||||
calls. This final version is immutable and, even if editable, would
|
||||
be extremely hard to change.
|
||||
|
||||
So, some code taking advantage of the XHTML modularization may look
|
||||
like this:
|
||||
|
||||
<?php
|
||||
$config = HTMLPurifier_Config::createDefault();
|
||||
$def = $config->getHTMLDefinition(true); // raw definition (no =& needed: objects are handles in PHP 5+)
|
||||
$def->addElement('marquee', 'Block', 'Flow', 'Common');
|
||||
$purifier = new HTMLPurifier($config);
|
||||
$purifier->purify($html); // now the definition is finalized
|
||||
?>
|
||||
|
||||
== Inclusions ==
|
||||
|
||||
One of the nice features of HTMLDefinition is that piggy-backing off
|
||||
of global attribute and content sets is extremely easy to do.
|
||||
|
||||
=== Attributes ===
|
||||
|
||||
HTMLModule->elements[$element]->attr stores attribute information for the
|
||||
specific attributes of $element. This is quite close to the final
|
||||
API that HTML Purifier interfaces with, but there's an important
|
||||
extra feature: attr may also contain an array at index zero.
|
||||
|
||||
<?php
|
||||
$module->elements[$element]->attr[0] = array('AttrSet'); // $module: an HTMLModule instance
|
||||
?>
|
||||
|
||||
Rather than map the attribute key 0 to an array (which should be
|
||||
an AttrDef), it defines a number of attribute collections that should
|
||||
be merged into this element's attribute array.
|
||||
|
||||
Furthermore, the value of an attribute key, attribute value pair need
|
||||
not be a fully fledged AttrDef object. It can also be a string, which
|
||||
signifies an AttrDef that is looked up from a centralized registry,
|
||||
AttrTypes. This allows more concise attribute definitions that look
|
||||
more like W3C's declarations, as well as offering a centralized point
|
||||
for modifying the behavior of one attribute type. And, of course, the
|
||||
old method of manually instantiating an AttrDef still works.
|
||||
|
||||
=== Attribute Collections ===
|
||||
|
||||
Attribute collections are stored and processed in the AttrCollections
|
||||
object, which is responsible for performing the inclusions signified
|
||||
by the 0 index. These attribute collections, too, are mutable, by
|
||||
using HTMLModule->attr_collections. You may add new attributes
|
||||
to a collection or define an entirely new collection for your module's
|
||||
use. Inclusions can also be cumulative.
|
||||
|
||||
Attribute collections allow us to get rid of so-called "global attributes"
|
||||
(which actually aren't so global).
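
The index-zero inclusion mechanism described above can be sketched with
plain arrays. This is only an illustration under stated assumptions:
expand_attr_inclusions() is a hypothetical function, the collections shown
are trimmed-down stand-ins, and the real AttrCollections object may well
work differently.

```php
<?php
// Illustrative only: plain arrays stand in for HTML Purifier's objects.
// Merges the collections named under key 0 into an attribute array.
function expand_attr_inclusions(array $attr, array $collections)
{
    if (!isset($attr[0])) {
        return $attr; // nothing to include
    }
    $includes = $attr[0];
    unset($attr[0]);
    foreach ($includes as $name) {
        if (!isset($collections[$name])) {
            continue; // unknown collection: skip it
        }
        // A collection may itself include other collections (cumulative).
        $merged = expand_attr_inclusions($collections[$name], $collections);
        // Array union: attributes already present are not overwritten.
        $attr = $attr + $merged;
    }
    return $attr;
}

// Hypothetical, simplified collections; the values are attribute type names.
$collections = array(
    'Core'   => array('class' => 'NMTOKENS', 'id' => 'ID'),
    'I18N'   => array('lang' => 'LanguageCode'),
    'Common' => array(0 => array('Core', 'I18N')),
);
$attr = array(0 => array('Common'), 'href' => 'URI');
$result = expand_attr_inclusions($attr, $collections);
```

Here $result ends up holding href, class, id and lang, with the 0 key gone.
Note that this sketch does not guard against collections that include each
other in a cycle.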
|
||||
|
||||
=== Content Models and ChildDef ===
|
||||
|
||||
An implementation of the above-mentioned attributes and attribute
|
||||
collections was applied to the ChildDef system. HTML Purifier uses
|
||||
a proprietary system called ChildDef for performance and flexibility
|
||||
reasons, but this does not line up very well with W3C's notion of
|
||||
regexps for defining the allowed children of an element.
|
||||
|
||||
HTMLPurifier->elements[$element]->content_model and
|
||||
HTMLPurifier->elements[$element]->content_model_type store information
|
||||
about the final ChildDef that will be stored in
|
||||
HTMLPurifier->elements[$element]->child (we use a different variable
|
||||
because the two forms are sufficiently different).
|
||||
|
||||
$content_model is an abstract, string representation of the internal
|
||||
state of ChildDef, while $content_model_type is a string identifier
|
||||
of which ChildDef subclass to instantiate. $content_model is processed
|
||||
by substituting all content set identifiers (capitalized element names)
|
||||
with their contents. It is then parsed and passed into the appropriate
|
||||
ChildDef class, as defined by the ContentSets->getChildDef() or the
|
||||
custom fallback HTMLModule->getChildDef() for custom child definitions
|
||||
not in the core.
|
||||
|
||||
You'll need to use these facilities if you plan on referencing a content
|
||||
set like "Inline" or "Block", and using them is recommended even if you're
|
||||
not due to their conciseness.
|
||||
|
||||
A few notes on $content_model: its structure can be as complicated
|
||||
as you want, but the pipe symbol (|) is reserved for defining possible
|
||||
choices, due to the content sets implementation. For example, a content
|
||||
model that looks like:
|
||||
|
||||
"Inline -> Block -> a"
|
||||
|
||||
...when the Inline content set is defined as "span | b" and the Block
|
||||
content set is defined as "div | blockquote", will expand into:
|
||||
|
||||
"span | b -> div | blockquote -> a"
|
||||
|
||||
The custom HTMLModule->getChildDef() function will need to be able to
|
||||
then feed this information to ChildDef in a usable manner.
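
The substitution step in the example above can be sketched in a few lines.
This is a standalone illustration, not the actual ContentSets code:
expand_content_model() is a hypothetical name, and it assumes content set
identifiers never appear as part of a longer word in the model string.

```php
<?php
// Illustrative only: a standalone sketch, not HTML Purifier's ContentSets.
// Replaces content set identifiers in a content model with their contents.
function expand_content_model($content_model, array $content_sets)
{
    // strtr() with an array tries the longest keys first and never rescans
    // text it has already replaced, so an expansion is not expanded again.
    return strtr($content_model, $content_sets);
}

$content_sets = array(
    'Inline' => 'span | b',
    'Block'  => 'div | blockquote',
);
$expanded = expand_content_model('Inline -> Block -> a', $content_sets);
// $expanded is "span | b -> div | blockquote -> a"
```

The "->" separators pass through untouched, which is why a custom
getChildDef() still has to make sense of the expanded string itself.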
|
||||
|
||||
=== Content Sets ===
|
||||
|
||||
Content sets can be altered using HTMLModule->content_sets, an associative
|
||||
array of content set names to content set contents. If the content set
|
||||
already exists, your values are appended on to it (great for, say,
|
||||
registering the font tag as an inline element), otherwise it is
|
||||
created. They are substituted into content_model.
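
The append-or-create behaviour can be sketched as follows. Again this is
only an assumption-laden illustration: add_content_sets() and the "Media"
set are made up for the example, and content sets are modelled as
pipe-separated strings as in the section above.

```php
<?php
// Illustrative only: models content sets as pipe-separated strings.
// Appends to an existing content set, or creates it if it is missing.
function add_content_sets(array $existing, array $additions)
{
    foreach ($additions as $name => $elements) {
        if (isset($existing[$name]) && $existing[$name] !== '') {
            $existing[$name] .= ' | ' . $elements; // append to the set
        } else {
            $existing[$name] = $elements; // create a new set
        }
    }
    return $existing;
}

$sets = add_content_sets(
    array('Inline' => 'span | b'),
    array('Inline' => 'font', 'Media' => 'img') // 'Media' is hypothetical
);
// $sets['Inline'] is "span | b | font"; $sets['Media'] is "img"
```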
|
||||
|
||||
vim: et sw=4 sts=4
|
26
htmlpurifier-4.10.0/docs/ref-proprietary-tags.txt
Executable file
@ -0,0 +1,26 @@
|
||||
|
||||
Proprietary Tags
|
||||
<nobr> and friends
|
||||
|
||||
Here are some proprietary tags that W3C does not define but that occasionally show
|
||||
up in the wild. We have only included tags that would make sense in an
|
||||
HTML Purifier context.
|
||||
|
||||
<align>, block element that aligns (extremely rare)
|
||||
<blackface>, inline that double-bolds text (extremely rare)
|
||||
<comment>, hidden comment for IE and WebTV
|
||||
<multicol cols=number gutter=pixels width=pixels>, multiple columns
|
||||
<nobr>, no linebreaks
|
||||
<spacer align=* type="vertical|horizontal|block">, whitespace in doc,
|
||||
use width/height for block and size for vertical/horizontal (attributes)
|
||||
(extremely rare)
|
||||
<wbr>, potential word break point: allows linebreaks. Only works in <nobr>
|
||||
|
||||
<listing>, monospace pre-variant (extremely rare)
|
||||
<plaintext>, escapes all tags to the end of document
|
||||
<xmp>, monospace, replace with pre
|
||||
|
||||
These should be put into their own Tidy module, not loaded by default(?). These
|
||||
all qualify as "lenient" transforms.
|
||||
|
||||
vim: et sw=4 sts=4
|
26
htmlpurifier-4.10.0/docs/ref-whatwg.txt
Executable file
@ -0,0 +1,26 @@
|
||||
|
||||
Web Hypertext Application Technology Working Group
|
||||
WHATWG
|
||||
|
||||
== HTML 5 ==
|
||||
|
||||
URL: http://www.whatwg.org/specs/web-apps/current-work/
|
||||
|
||||
HTML 5 defines a kaboodle of new elements and attributes, as well as
|
||||
some well-defined, "quirks mode" HTML parsing. Although WHATWG professes
|
||||
to be targeted towards web applications, many of their semantic additions
|
||||
would be quite useful in regular documents. Eventually, HTML
|
||||
Purifier will need to audit their lists and figure out what changes need
|
||||
to be made. This process is complicated by the fact that the WHATWG
|
||||
doesn't buy into W3C's modularization of XHTML 1.1: we may need
|
||||
to remodularize HTML 5 (probably done by section name). No sense in
|
||||
committing ourselves till the spec stabilizes, though.
|
||||
|
||||
Of more immediate interest, however, is the well-defined parsing
|
||||
behavior that HTML 5 adds. While I have little interest in writing
|
||||
another DirectLex parser, other parsers like ph5p
|
||||
<http://jero.net/lab/ph5p/> can be adapted to DOMLex to support much more
|
||||
flexible HTML parsing (a cool feature I've seen is how they resolve
|
||||
<b>bold<i>both</b>italic</i>).
|
||||
|
||||
vim: et sw=4 sts=4
|
10
htmlpurifier-4.10.0/docs/specimens/LICENSE
Executable file
@ -0,0 +1,10 @@
|
||||
Licensing of Specimens
|
||||
|
||||
Some files in this directory have different licenses:
|
||||
|
||||
windows-live-mail-desktop-beta.html - donated by laacz, public domain
|
||||
img.png - LGPL, from <http://commons.wikimedia.org/wiki/Image:Pastille_chrome.png>
|
||||
|
||||
All other files are by me, and are licensed under LGPL.
|
||||
|
||||
vim: et sw=4 sts=4
|
165
htmlpurifier-4.10.0/docs/specimens/html-align-to-css.html
Executable file
@ -0,0 +1,165 @@
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
||||
"http://www.w3.org/TR/html4/loose.dtd">
|
||||
<html>
|
||||
<head>
|
||||
<title>HTML align attribute to CSS - HTML Purifier Specimen</title>
|
||||
<style type="text/css">
|
||||
div.container {position:relative;height:110px;}
|
||||
div.container.legend .test {text-align:center;line-height:100px;}
|
||||
div.test {width:100px;height:100px;border:1px solid black;
|
||||
position:absolute;top:10px;}
|
||||
div.test.html {left:10px;}
|
||||
div.test.css {left:140px;}
|
||||
table {background:#F00;}
|
||||
img {border:1px solid #000;}
|
||||
hr {width:50px;}
|
||||
div.segment {width:250px; float:left; margin-top:1em;}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<h1>HTML align attribute to CSS</h1>
|
||||
|
||||
<p>Inspect source for methodology.</p>
|
||||
|
||||
<div class="container legend">
|
||||
<div class="test html">
|
||||
HTML
|
||||
</div>
|
||||
<div class="test css">
|
||||
CSS
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="segment">
|
||||
|
||||
<h2>table.align</h2>
|
||||
|
||||
<h3>left</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<table align="left"><tr><td>O</td></tr></table>a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<table style="float:left;"><tr><td>O</td></tr></table>a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>center</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<table align="center"><tr><td>O</td></tr></table>a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<table style="margin-left:auto; margin-right:auto;"><tr><td>O</td></tr></table>a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>right</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<table align="right"><tr><td>O</td></tr></table>a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<table style="float:right;"><tr><td>O</td></tr></table>a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ################################################################## -->
|
||||
|
||||
<div class="segment">
|
||||
<h2>img.align</h2>
|
||||
<h3>left</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<img src="img.png" align="left">a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<img src="img.png" style="float:left;">a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>right</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<img src="img.png" align="right">a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<img src="img.png" style="float:right;">a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>bottom</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<img src="img.png" align="bottom">a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<img src="img.png" style="vertical-align:baseline;">a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>middle</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<img src="img.png" align="middle">a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<img src="img.png" style="vertical-align:middle;">a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>top</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
a<img src="img.png" align="top">a
|
||||
</div>
|
||||
<div class="test css">
|
||||
a<img src="img.png" style="vertical-align:top;">a
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- ################################################################## -->
|
||||
|
||||
<div class="segment">
|
||||
|
||||
<h2>hr.align</h2>
|
||||
|
||||
<h3>left</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
<hr align="left" />
|
||||
</div>
|
||||
<div class="test css">
|
||||
<hr style="margin-right:auto; margin-left:0; text-align:left;" />
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>center</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
<hr align="center" />
|
||||
</div>
|
||||
<div class="test css">
|
||||
<hr style="margin-right:auto; margin-left:auto; text-align:center;" />
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<h3>right</h3>
|
||||
<div class="container">
|
||||
<div class="test html">
|
||||
<hr align="right" />
|
||||
</div>
|
||||
<div class="test css">
|
||||
<hr style="margin-right:0; margin-left:auto; text-align:right;" />
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
</body>
|
||||
</html>
|
BIN
htmlpurifier-4.10.0/docs/specimens/img.png
Executable file
Binary file not shown.
After Width: | Height: | Size: 2.1 KiB |
129
htmlpurifier-4.10.0/docs/specimens/jochem-blok-word.html
Executable file
@ -0,0 +1,129 @@
|
||||
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
|
||||
|
||||
<head>
|
||||
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
|
||||
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
|
||||
<!--[if !mso]>
|
||||
<style>
|
||||
v\:* {behavior:url(#default#VML);}
|
||||
o\:* {behavior:url(#default#VML);}
|
||||
w\:* {behavior:url(#default#VML);}
|
||||
..shape {behavior:url(#default#VML);}
|
||||
</style>
|
||||
<![endif]-->
|
||||
<style>
|
||||
<!--
|
||||
/* Font Definitions */
|
||||
@font-face
|
||||
{font-family:"Cambria Math";
|
||||
panose-1:2 4 5 3 5 4 6 3 2 4;}
|
||||
@font-face
|
||||
{font-family:Calibri;
|
||||
panose-1:2 15 5 2 2 2 4 3 2 4;}
|
||||
@font-face
|
||||
{font-family:Tahoma;
|
||||
panose-1:2 11 6 4 3 5 4 4 2 4;}
|
||||
@font-face
|
||||
{font-family:Verdana;
|
||||
panose-1:2 11 6 4 3 5 4 4 2 4;}
|
||||
/* Style Definitions */
|
||||
p.MsoNormal, li.MsoNormal, div.MsoNormal
|
||||
{margin:0cm;
|
||||
margin-bottom:.0001pt;
|
||||
font-size:10.0pt;
|
||||
font-family:"Verdana","sans-serif";}
|
||||
a:link, span.MsoHyperlink
|
||||
{mso-style-priority:99;
|
||||
color:blue;
|
||||
text-decoration:underline;}
|
||||
a:visited, span.MsoHyperlinkFollowed
|
||||
{mso-style-priority:99;
|
||||
color:purple;
|
||||
text-decoration:underline;}
|
||||
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
|
||||
{mso-style-priority:99;
|
||||
mso-style-link:"Balloon Text Char";
|
||||
margin:0cm;
|
||||
margin-bottom:.0001pt;
|
||||
font-size:8.0pt;
|
||||
font-family:"Tahoma","sans-serif";}
|
||||
span.EmailStyle17
|
||||
{mso-style-type:personal-compose;
|
||||
font-family:"Verdana","sans-serif";
|
||||
color:windowtext;}
|
||||
span.BalloonTextChar
|
||||
{mso-style-name:"Balloon Text Char";
|
||||
mso-style-priority:99;
|
||||
mso-style-link:"Balloon Text";
|
||||
font-family:"Tahoma","sans-serif";}
|
||||
..MsoChpDefault
|
||||
{mso-style-type:export-only;}
|
||||
@page Section1
|
||||
{size:612.0pt 792.0pt;
|
||||
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
|
||||
div.Section1
|
||||
{page:Section1;}
|
||||
-->
|
||||
</style>
|
||||
<!--[if gte mso 9]><xml>
|
||||
<o:shapedefaults v:ext="edit" spidmax="2050" />
|
||||
</xml><![endif]--><!--[if gte mso 9]><xml>
|
||||
<o:shapelayout v:ext="edit">
|
||||
<o:idmap v:ext="edit" data="1" />
|
||||
</o:shapelayout></xml><![endif]-->
|
||||
</head>
|
||||
|
||||
<body lang=NL link=blue vlink=purple>
|
||||
|
||||
<div class=Section1>
|
||||
|
||||
<p class=MsoNormal><img width=1277 height=994 id="Picture_x0020_1"
|
||||
src="cid:image001.png@01C8CBDF.5D1BAEE0"><o:p></o:p></p>
|
||||
|
||||
<p class=MsoNormal><o:p> </o:p></p>
|
||||
|
||||
<p class=MsoNormal><b>Name<o:p></o:p></b></p>
|
||||
|
||||
<p class=MsoNormal>E-mail : <a href="mailto:mail@example.com"><span
|
||||
style='color:windowtext'>mail@example.com</span></a><o:p></o:p></p>
|
||||
|
||||
<p class=MsoNormal><o:p> </o:p></p>
|
||||
|
||||
<p class=MsoNormal><b>Company<o:p></o:p></b></p>
|
||||
|
||||
<p class=MsoNormal>Address 1<o:p></o:p></p>
|
||||
|
||||
<p class=MsoNormal>Address 2<o:p></o:p></p>
|
||||
|
||||
<p class=MsoNormal><o:p> </o:p></p>
|
||||
|
||||
<p class=MsoNormal>Telefoon : +xx xx xxx xxx xx <span style='color:black'><o:p></o:p></span></p>
|
||||
|
||||
<p class=MsoNormal><span lang=EN-US style='color:black'>Fax : +xx xx xxx xx xx<o:p></o:p></span></p>
|
||||
|
||||
<p class=MsoNormal><span lang=EN-US style='color:black'>Internet : </span><span
|
||||
style='color:black'><a href="http://www.example.com/"><span lang=EN-US
|
||||
style='color:black'>http://www.example.com</span></a></span><span
|
||||
lang=EN-US style='color:black'><o:p></o:p></span></p>
|
||||
|
||||
<p class=MsoNormal><span lang=EN-US style='color:black'>Kamer van koophandel
|
||||
xxxxxxxxx<o:p></o:p></span></p>
|
||||
|
||||
<p class=MsoNormal><span lang=EN-US style='color:black'><o:p> </o:p></span></p>
|
||||
|
||||
<p class=MsoNormal><span lang=EN-US style='font-size:7.5pt;color:black'>Op deze
|
||||
e-mail is een disclaimer van toepassing, ga naar </span><span lang=EN-US
|
||||
style='font-size:7.5pt'><a
|
||||
href="http://www.example.com/disclaimer"><span
|
||||
style='color:black'>www.example.com/disclaimer</span></a><br>
|
||||
<span style='color:black'>A disclaimer is applicable to this email, please
|
||||
refer to </span><a href="http://www.example.com/disclaimer"><span
|
||||
style='color:black'>www.example.com/disclaimer</span></a><o:p></o:p></span></p>
|
||||
|
||||
<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p>
|
||||
|
||||
</div>
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
74
htmlpurifier-4.10.0/docs/specimens/windows-live-mail-desktop-beta.html
Executable file
@ -0,0 +1,74 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
||||
<HTML ChildAreas="4" xmlns:canvas><HEAD>
|
||||
<META http-equiv=Content-Type content=text/html;charset=windows-1257>
|
||||
<STYLE></STYLE>
|
||||
|
||||
<META content="MSHTML 6.00.6000.16414" name=GENERATOR></HEAD>
|
||||
<BODY id=MailContainerBody
|
||||
style="PADDING-RIGHT: 10px; PADDING-LEFT: 10px; FONT-SIZE: 10pt; COLOR: #000000; PADDING-TOP: 15px; FONT-FAMILY: Arial"
|
||||
bgColor=#ff6600 leftMargin=0 background="" topMargin=0
|
||||
name="Compose message area" acc_role="text" CanvasTabStop="false">
|
||||
<DIV
|
||||
style="BORDER-TOP: #dddddd 1px solid; FONT-SIZE: 10pt; WIDTH: 100%; MARGIN-RIGHT: 10px; PADDING-TOP: 5px; BORDER-BOTTOM: #dddddd 1px solid; FONT-FAMILY: Verdana; HEIGHT: 25px; BACKGROUND-COLOR: #ffffff"><NOBR><SPAN
|
||||
title="View a slideshow of the pictures in this e-mail message."
|
||||
style="PADDING-RIGHT: 20px"><A style="COLOR: #0088e4"
|
||||
href="http://g.msn.com/5meen_us/171?path=/photomail/{6fc0065f-ffdd-4ca6-9a4c-cc5a93dc122f}&image=47D7B182CFEFB10!127&imagehi=47D7B182CFEFB10!125&CID=323550092004883216">Play
|
||||
slideshow </A></SPAN><SPAN style="COLOR: #909090"><SPAN>|</SPAN><SPAN
|
||||
style="PADDING-LEFT: 20px"> Download the highest quality version of a picture by
|
||||
clicking the + above it </SPAN></SPAN></NOBR></DIV>
|
||||
<DIV
|
||||
style="PADDING-RIGHT: 5px; PADDING-LEFT: 7px; PADDING-BOTTOM: 2px; WIDTH: 100%; PADDING-TOP: 2px">
|
||||
<OL>
|
||||
<LI><IMG title="Angry smile emoticon"
|
||||
style="FLOAT: none; MARGIN: 0px; POSITION: static" tabIndex=-1
|
||||
alt="Angry smile emoticon" src="cid:49F0C856199E4D688D2D740680733D74@wc"
|
||||
MSNNonUserImageOrEmoticon="true">Un ka <FONT style="BACKGROUND-COLOR: #800000"
|
||||
color=#cc99ff><STRONG>Tev</STRONG></FONT> iet, un ko tu dari?
|
||||
<LI>Aha!</LI></OL>
|
||||
|
||||
<UL>
|
||||
<LI>Buletets
|
||||
<LI>
|
||||
<DIV align=justify><A title=http://laacz.lv/blog/
|
||||
href="http://laacz.lv/blog/">http://laacz.lv/blog/</A> un <A
|
||||
title=http://google.com/ href="http://google.com/">gugle</A></DIV>
|
||||
<LI>Sarakstucitis</LI></UL></DIV><SPAN><SPAN xmlns:canvas="canvas-namespace-id"
|
||||
layoutEmptyTextWellFont="Tahoma"><SPAN
|
||||
style="MARGIN-BOTTOM: 15px; OVERFLOW: visible; HEIGHT: 16px"></SPAN><SPAN
|
||||
style="MARGIN-BOTTOM: 25px; VERTICAL-ALIGN: top; OVERFLOW: visible; MARGIN-RIGHT: 25px; HEIGHT: 234px">
|
||||
<TABLE style="DISPLAY: inline">
|
||||
<TBODY>
|
||||
<TR>
|
||||
|
||||
<TD>
|
||||
<DIV
|
||||
style="FONT-WEIGHT: bold; FONT-SIZE: 12pt; FONT-FAMILY: arial; TEXT-ALIGN: center"><A
|
||||
id=HiresARef
|
||||
title="Click here to view or download a high resolution version of this picture"
|
||||
style="COLOR: #0088e4; TEXT-DECORATION: none"
|
||||
href="http://byfiles.storage.msn.com/x1pMvt0I80jTgT6DuaCpEMbprX3nk3jNv_vjigxV_EYVSMyM_PKgEvDEUtuNhQC-F-23mTTcKyqx6eGaeK2e_wMJ0ikwpDdFntk4SY7pfJUv2g2Ck6R2S2vAA?download">+</A></DIV>
|
||||
<DIV
|
||||
title="Click here to view the full image using the online photo viewer."
|
||||
style="DISPLAY: inline; OVERFLOW: hidden; WIDTH: 140px; HEIGHT: 140px"><A
|
||||
href="http://g.msn.com/5meen_us/171?path=/photomail/{6fc0065f-ffdd-4ca6-9a4c-cc5a93dc122f}&image=47D7B182CFEFB10!127&imagehi=47D7B182CFEFB10!125&CID=323550092004883216"
|
||||
border="0"><IMG
|
||||
style="MARGIN-TOP: 15px; DISPLAY: inline-block; MARGIN-LEFT: 0px"
|
||||
height=109 src="cid:006A71303B80404E9FB6184E55D6A446@wc" width=140
|
||||
border=0></A></DIV></TD></TR>
|
||||
<TR>
|
||||
<TD>
|
||||
<DIV
|
||||
style="FONT-SIZE: 10pt; WIDTH: 140px; FONT-FAMILY: verdana; TEXT-ALIGN: center"><EM><STRONG>This
|
||||
<U>is </U></STRONG><U>tit</U>le</EM> fo<STRONG>r <FONT
|
||||
face="Arial Black">t<FONT color=#800000 size=7>h<U>i</U></FONT>s
|
||||
</FONT>picture</STRONG></DIV></TD></TR></TBODY></TABLE></SPAN></SPAN></SPAN>
|
||||
|
||||
<DIV
|
||||
style="PADDING-RIGHT: 5px; PADDING-LEFT: 7px; PADDING-BOTTOM: 2px; WIDTH: 100%; PADDING-TOP: 2px; HEIGHT: 50px">
|
||||
<DIV> </DIV></DIV>
|
||||
<DIV
|
||||
style="BORDER-TOP: #dddddd 1px solid; FONT-SIZE: 10pt; MARGIN-BOTTOM: 10px; WIDTH: 100%; COLOR: #909090; MARGIN-RIGHT: 10px; PADDING-TOP: 9px; FONT-FAMILY: Verdana; HEIGHT: 42px; BACKGROUND-COLOR: #ffffff"><NOBR><SPAN
|
||||
title="Join Windows Live to share photos using Windows Live Photo E-mail.">Online
|
||||
pictures are available for 30 days. <A style="COLOR: #0088e4"
|
||||
href="http://g.msn.com/5meen_us/175">Get Windows Live Mail desktop to create
|
||||
your own photo e-mails. </A></SPAN></NOBR></DIV></BODY></HTML>
|
76
htmlpurifier-4.10.0/docs/style.css
Executable file
@ -0,0 +1,76 @@
|
||||
html {font-size:1em; font-family:serif; }
|
||||
body {margin-left:4em; margin-right:4em; }
|
||||
|
||||
dt {font-weight:bold; }
|
||||
pre {margin-left:2em; }
|
||||
pre, code, tt {font-family:monospace; font-size:1em; }
|
||||
|
||||
h1 {text-align:center; font-family:Garamond, serif;
|
||||
font-variant:small-caps;}
|
||||
h2 {border-bottom:1px solid #CCC; font-family:sans-serif; font-weight:normal;
|
||||
font-size:1.3em;}
|
||||
h3 {font-family:sans-serif; font-size:1.1em; font-weight:bold; }
|
||||
h4 {font-family:sans-serif; font-size:0.9em; font-weight:bold; }
|
||||
|
||||
/* For witty quips */
|
||||
.subtitled {margin-bottom:0em;}
|
||||
.subtitle , .subsubtitle {font-size:.8em; margin-bottom:1em;
|
||||
font-style:italic; margin-top:-.2em;text-align:center;}
|
||||
.subsubtitle {text-align:left;margin-left:2em;}
|
||||
|
||||
/* Used for special "See also" links. */
|
||||
.reference {font-style:italic;margin-left:2em;}
|
||||
|
||||
/* Marks off asides, discussions on why something is the way it is */
|
||||
.aside {margin-left:2em; font-family:sans-serif; font-size:0.9em; }
|
||||
blockquote .label {font-weight:bold; font-size:1em; margin:0 0 .1em;
|
||||
border-bottom:1px solid #CCC;}
|
||||
.emphasis {font-weight:bold; text-align:center; font-size:1.3em;}
|
||||
|
||||
/* A regular table */
|
||||
.table {border-collapse:collapse; border-bottom:2px solid #888; margin-left:2em; }
|
||||
.table thead th {margin:0; background:#888; color:#FFF; }
|
||||
.table thead th:first-child {-moz-border-radius-topleft:1em;}
|
||||
.table tbody td {border-bottom:1px solid #CCC; padding-right:0.6em;padding-left:0.6em;}
|
||||
|
||||
/* A quick table*/
|
||||
table.quick tbody th {text-align:right; padding-right:1em;}
|
||||
|
||||
/* Category of the file */
|
||||
#filing {font-weight:bold; font-size:smaller; }
|
||||
|
||||
/* Contains, without exception, Return to index. */
|
||||
#index {font-size:smaller; }
|
||||
|
||||
#home {font-size:smaller;}
|
||||
|
||||
/* Contains, without exception, $Id$, for SVN version info. */
|
||||
#version {text-align:right; font-style:italic; margin:2em 0;}
|
||||
|
||||
#toc ol ol {list-style-type:lower-roman;}
|
||||
#toc ol {list-style-type:decimal;}
|
||||
#toc {list-style-type:upper-alpha;}
|
||||
|
||||
q {
|
||||
behavior: url(fixquotes.htc); /* IE fix */
|
||||
quotes: '\201C' '\201D' '\2018' '\2019';
|
||||
}
|
||||
q:before {
|
||||
content: open-quote;
|
||||
}
|
||||
q:after {
|
||||
content: close-quote;
|
||||
}
|
||||
|
||||
/* Marks off implementation details interesting only to the person writing
|
||||
the class described in the spec. */
|
||||
.technical {margin-left:2em; }
|
||||
.technical:before {content:"Technical note: "; font-weight:bold; color:#061; }
|
||||
|
||||
/* Marks off sections that are lacking. */
|
||||
.fixme {margin-left:2em; }
|
||||
.fixme:before {content:"Fix me: "; font-weight:bold; color:#C00; }
|
||||
|
||||
#applicability {margin: 1em 5%; font-style:italic;}
|
||||
|
||||
/* vim: et sw=4 sts=4 */
|