From 10831dd274ff65d4852b47dbc398adae61845206 Mon Sep 17 00:00:00 2001 From: Bernhard Posselt Date: Sat, 4 May 2013 00:15:41 +0200 Subject: use html purifier for sanitation --- .../htmlpurifier/docs/ref-html-modularization.txt | 166 +++++++++++++++++++++ 1 file changed, 166 insertions(+) create mode 100644 3rdparty/htmlpurifier/docs/ref-html-modularization.txt (limited to '3rdparty/htmlpurifier/docs/ref-html-modularization.txt') diff --git a/3rdparty/htmlpurifier/docs/ref-html-modularization.txt b/3rdparty/htmlpurifier/docs/ref-html-modularization.txt new file mode 100644 index 000000000..0fd88c839 --- /dev/null +++ b/3rdparty/htmlpurifier/docs/ref-html-modularization.txt @@ -0,0 +1,166 @@ + +The Modularization of HTMLDefinition in HTML Purifier + +WARNING: This document was drafted before the implementation of this + system, and some implementation details may have evolved over time. + +HTML Purifier uses the modularization of XHTML + to organize the internals +of HTMLDefinition into a more manageable and extensible fashion. Rather +than have one super-object, HTMLDefinition is split into HTMLModules, +each of which are responsible for defining elements, their attributes, +and other properties (for a more indepth coverage, see +/library/HTMLPurifier/HTMLModule.php's docblock comments). These modules +are managed by HTMLModuleManager. + +Modules that we don't support but could support are: + + * 5.6. Table Modules + o 5.6.1. Basic Tables Module [?] + * 5.8. Client-side Image Map Module [?] + * 5.9. Server-side Image Map Module [?] + * 5.12. Target Module [?] + * 5.21. Name Identification Module [deprecated] + +These modules would be implemented as "unsafe": + + * 5.2. Core Modules + o 5.2.1. Structure Module + * 5.3. Applet Module + * 5.5. Forms Modules + o 5.5.1. Basic Forms Module + o 5.5.2. Forms Module + * 5.10. Object Module + * 5.11. Frames Module + * 5.13. Iframe Module + * 5.14. Intrinsic Events Module + * 5.15. Metainformation Module + * 5.16. Scripting Module + * 5.17. Style Sheet Module + * 5.19. Link Module + * 5.20. Base Module + +We will not be using W3C's XML Schemas or DTDs directly due to the lack +of robust tools for handling them (the main problem is that all the +current parsers are usually PHP 5 only and solely-validating, not +correcting). + +This system may be generalized and ported over for CSS. + +== General Use-Case == + +The outwards API of HTMLDefinition has been largely preserved, not +only for backwards-compatibility but also by design. Instead, +HTMLDefinition can be retrieved "raw", in which it loads a structure +that closely resembles the modules of XHTML 1.1. This structure is very +dynamic, making it easy to make cascading changes to global content +sets or remove elements in bulk. + +However, once HTML Purifier needs the actual definition, it retrieves +a finalized version of HTMLDefinition. The finalized definition involves +processing the modules into a form that it is optimized for multiple +calls. This final version is immutable and, even if editable, would +be extremely hard to change. + +So, some code taking advantage of the XHTML modularization may look +like this: + +getHTMLDefinition(true); // reference to raw + $def->addElement('marquee', 'Block', 'Flow', 'Common'); + $purifier = new HTMLPurifier($config); + $purifier->purify($html); // now the definition is finalized +?> + +== Inclusions == + +One of the nice features of HTMLDefinition is that piggy-backing off +of global attribute and content sets is extremely easy to do. + +=== Attributes === + +HTMLModule->elements[$element]->attr stores attribute information for the +specific attributes of $element. This is quite close to the final +API that HTML Purifier interfaces with, but there's an important +extra feature: attr may also contain a array with a member index zero. + +elements[$element]->attr[0] = array('AttrSet'); +?> + +Rather than map the attribute key 0 to an array (which should be +an AttrDef), it defines a number of attribute collections that should +be merged into this elements attribute array. + +Furthermore, the value of an attribute key, attribute value pair need +not be a fully fledged AttrDef object. They can also be a string, which +signifies a AttrDef that is looked up from a centralized registry +AttrTypes. This allows more concise attribute definitions that look +more like W3C's declarations, as well as offering a centralized point +for modifying the behavior of one attribute type. And, of course, the +old method of manually instantiating an AttrDef still works. + +=== Attribute Collections === + +Attribute collections are stored and processed in the AttrCollections +object, which is responsible for performing the inclusions signified +by the 0 index. These attribute collections, too, are mutable, by +using HTMLModule->attr_collections. You may add new attributes +to a collection or define an entirely new collection for your module's +use. Inclusions can also be cumulative. + +Attribute collections allow us to get rid of so called "global attributes" +(which actually aren't so global). + +=== Content Models and ChildDef === + +An implementation of the above-mentioned attributes and attribute +collections was applied to the ChildDef system. HTML Purifier uses +a proprietary system called ChildDef for performance and flexibility +reasons, but this does not line up very well with W3C's notion of +regexps for defining the allowed children of an element. + +HTMLPurifier->elements[$element]->content_model and +HTMLPurifier->elements[$element]->content_model_type store information +about the final ChildDef that will be stored in +HTMLPurifier->elements[$element]->child (we use a different variable +because the two forms are sufficiently different). + +$content_model is an abstract, string representation of the internal +state of ChildDef, while $content_model_type is a string identifier +of which ChildDef subclass to instantiate. $content_model is processed +by substituting all content set identifiers (capitalized element names) +with their contents. It is then parsed and passed into the appropriate +ChildDef class, as defined by the ContentSets->getChildDef() or the +custom fallback HTMLModule->getChildDef() for custom child definitions +not in the core. + +You'll need to use these facilities if you plan on referencing a content +set like "Inline" or "Block", and using them is recommended even if you're +not due to their conciseness. + +A few notes on $content_model: it's structure can be as complicated +as you want, but the pipe symbol (|) is reserved for defining possible +choices, due to the content sets implementation. For example, a content +model that looks like: + +"Inline -> Block -> a" + +...when the Inline content set is defined as "span | b" and the Block +content set is defined as "div | blockquote", will expand into: + +"span | b -> div | blockquote -> a" + +The custom HTMLModule->getChildDef() function will need to be able to +then feed this information to ChildDef in a usable manner. + +=== Content Sets === + +Content sets can be altered using HTMLModule->content_sets, an associative +array of content set names to content set contents. If the content set +already exists, your values are appended on to it (great for, say, +registering the font tag as an inline element), otherwise it is +created. They are substituted into content_model. + + vim: et sw=4 sts=4 -- cgit v1.2.3