Server geändert

This commit is contained in:
aschwarz
2023-04-25 13:17:57 +02:00
parent f1b11efb5d
commit f93fa77db7
603 changed files with 2951 additions and 2951 deletions

View File

@ -1,7 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head>
"https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch." />
<link rel="stylesheet" type="text/css" href="./style.css" />
@ -22,7 +22,7 @@ own advice for sake of portability. -->
<div id="filing">Filed under End-User</div>
<div id="index">Return to the <a href="index.html">index</a>.</div>
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
<div id="home"><a href="https://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
<p>Character encoding and character sets are not that
difficult to understand, but so many people blithely stumble
@ -217,7 +217,7 @@ if your <code>META</code> tag claims that either:</p>
<p class="aside">The advice given here is for pages being served as
vanilla <code>text/html</code>. Different practices must be used
for <code>application/xml</code> or <code>application/xml+xhtml</code>, see
<a href="http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
<a href="https://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
document on XHTML media types</a> for more information.</p>
<p>If your <code>META</code> encoding and your real encoding match,
@ -237,7 +237,7 @@ of your real encoding.</p>
has to guess: and sometimes the guess is wrong. Hackers can manipulate
this guess in order to slip XSS past filters and then fool the
browser into executing it as active code. A great example of this
is the <a href="http://shiflett.org/archive/177">Google UTF-7
is the <a href="https://shiflett.org/archive/177">Google UTF-7
exploit</a>.</p>
<p>You might be able to get away with not specifying a character
encoding with the <code>META</code> tag as long as your webserver
@ -299,10 +299,10 @@ is slightly more difficult.</p>
yourself, via your programming language. Since you're using HTML
Purifier, I'll assume PHP, although it's not too difficult to do
similar things in
<a href="http://www.w3.org/International/O-HTTP-charset#scripting">other
<a href="https://www.w3.org/International/O-HTTP-charset#scripting">other
languages</a>. The appropriate code is:</p>
<pre><a href="http://php.net/function.header">header</a>('Content-Type:text/html; charset=UTF-8');</pre>
<pre><a href="https://php.net/function.header">header</a>('Content-Type:text/html; charset=UTF-8');</pre>
<p>...replacing UTF-8 with whatever your embedded encoding is.
This code must come before any output, so be careful about
@ -312,16 +312,16 @@ output excluding whitespace within &lt;?php ?&gt; tags).</p>
<h4 id="fixcharset-server-phpini">PHP ini directive</h4>
<p>PHP also has a neat little ini directive that can save you a
header call: <code><a href="http://php.net/ini.core#ini.default-charset">default_charset</a></code>. Using this code:</p>
header call: <code><a href="https://php.net/ini.core#ini.default-charset">default_charset</a></code>. Using this code:</p>
<pre><a href="http://php.net/function.ini_set">ini_set</a>('default_charset', 'UTF-8');</pre>
<pre><a href="https://php.net/function.ini_set">ini_set</a>('default_charset', 'UTF-8');</pre>
<p>...will also do the trick. If PHP is running as an Apache module (and
not as FastCGI, consult
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
<a href="https://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
across many PHP files:</p>
<pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset &quot;UTF-8&quot;</pre>
<pre><a href="https://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset &quot;UTF-8&quot;</pre>
<blockquote class="aside"><p>As with all INI directives, this can
also go in your php.ini file. Some hosting providers allow you to customize
@ -340,11 +340,11 @@ techniques may work, or may not work.</p>
<p>On Apache, you can use an .htaccess file to change the character
encoding. I'll defer to
<a href="http://www.w3.org/International/questions/qa-htaccess-charset">W3C</a>
<a href="https://www.w3.org/International/questions/qa-htaccess-charset">W3C</a>
for the in-depth explanation, but it boils down to creating a file
named .htaccess with the contents:</p>
<pre><a href="http://httpd.apache.org/docs/1.3/mod/mod_mime.html#addcharset">AddCharset</a> UTF-8 .html</pre>
<pre><a href="https://httpd.apache.org/docs/1.3/mod/mod_mime.html#addcharset">AddCharset</a> UTF-8 .html</pre>
<p>Where UTF-8 is replaced with the character encoding you want to
use and .html is a file extension that this will be applied to. This
@ -353,7 +353,7 @@ or in the subdirectories of directory you place this file in.</p>
<p>If you're feeling particularly courageous, you can use:</p>
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> UTF-8</pre>
<pre><a href="https://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> UTF-8</pre>
<p>...which changes the character set Apache adds to any document that
doesn't have any Content-Type parameters. This directive, which the
@ -363,7 +363,7 @@ with the <code>META</code> tag. If you would prefer Apache not to be
butting in on your character encodings, you can tell it not
to send anything at all:</p>
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
<pre><a href="https://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
<p>...making your internal charset declaration (usually the <code>META</code> tags)
the sole source of character encoding
@ -445,7 +445,7 @@ overrides the <code>META</code> tag. In reality, this happens only when the
XHTML is actually served as legit XML and not HTML, which is almost always
never due to Internet Explorer's lack of support for
<code>application/xhtml+xml</code> (even though doing so is often
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good
argued to be <a href="https://www.hixie.ch/advocacy/xhtml">good
practice</a> and is required by the XHTML 1.1 specification).</p>
<p>For XML, however, this XML Declaration is extremely important.
@ -554,7 +554,7 @@ when it became far to cumbersome to support foreign languages. Bots
will now actually go through articles and convert character entities
to their corresponding real characters for the sake of user-friendliness
and searchability. See
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
<a href="https://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
page on special characters</a> for more details.
</p></blockquote>
@ -575,7 +575,7 @@ which may be used by POST, and is required when you want to upload
files.</p>
<p>The following is a summarization of notes from
<a href="http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html">
<a href="https://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html">
<code>FORM</code> submission and i18n</a>. That document contains lots
of useful information, but is written in a rambly manner, so
here I try to get right to the point. (Note: the original has
@ -589,7 +589,7 @@ looks something like: <code>%C3%86</code>. There is no official way of
determining the character encoding of such a request, since the percent
encoding operates on a byte level, so it is usually assumed that it
is the same as the encoding the page containing the form was submitted
in. (<a href="http://tools.ietf.org/html/rfc3986#section-2.5">RFC 3986</a>
in. (<a href="https://tools.ietf.org/html/rfc3986#section-2.5">RFC 3986</a>
recommends that textual identifiers be translated to UTF-8; however, browser
compliance is spotty.) You'll run into very few problems
if you only use characters in the character encoding you chose.</p>
@ -762,7 +762,7 @@ knows about the change too. There are some caveats though:</p>
encodings is notoriously spotty. Refer to your respective database's
documentation on how to do this properly.</p>
<p>For <a href="http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the
<p>For <a href="https://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the
character encoding conversion for you. However, you have
to make sure that the text inside the column is what is says it is:
if you had put Shift-JIS in an ISO 8859-1 column, MySQL will irreversibly mangle
@ -772,7 +772,7 @@ and then finally to UTF-8. Many a website had pages irreversibly mangled
because they didn't realize that they'd been deluding themselves about
the character encoding all along; don't become the next victim.</p>
<p>For <a href="http://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the
<p>For <a href="https://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the
encoding of a database (as of 8.2). You will have to dump the data, and then reimport
it into a new table. Make sure that your client encoding is set properly:
this is how PostgreSQL knows to perform an encoding conversion.</p>
@ -832,15 +832,15 @@ converting reams of existing text and HTML files into UTF-8, as well as
making sure that all new files uploaded are properly encoded. Once again,
I can only point vaguely in the right direction for converting your
existing files: make sure you backup, make sure you use
<a href="http://php.net/ref.iconv">iconv</a>(), and
<a href="https://php.net/ref.iconv">iconv</a>(), and
make sure you know what the original character encoding of the files
is (or are, depending on the tidiness of your system).</p>
<p>However, I can proffer more specific advice on the subject of
text editors. Many text editors have notoriously spotty Unicode support.
To find out how your editor is doing, you can check out <a
href="http://www.alanwood.net/unicode/utilities_editors.html">this list</a>
or <a href="http://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list.</a>
href="https://www.alanwood.net/unicode/utilities_editors.html">this list</a>
or <a href="https://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list.</a>
I personally use Notepad++, which works like a charm when it comes to UTF-8.
Usually, you will have to <strong>explicitly</strong> tell the editor through some dialogue
(usually Save as or Format) what encoding you want it to use. An editor
@ -859,7 +859,7 @@ BOM below.</p>
<h3 id="migrate-bom">Byte Order Mark (headers already sent!)</h3>
<p>The BOM, or <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark">Byte
<p>The BOM, or <a href="https://en.wikipedia.org/wiki/Byte_Order_Mark">Byte
Order Mark</a>, is a magical, invisible character placed at
the beginning of UTF-8 files to tell people what the encoding is and
what the endianness of the text is. It is also unnecessary.</p>
@ -917,7 +917,7 @@ anyway. So we'll deal with the other two edge cases.</p>
would like to read your website but get heaps of question marks or
other meaningless characters. Fixing this problem requires the
installation of a font or language pack which is often highly
dependent on what the language is. <a href="http://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_and_input_help">Here is an example</a>
dependent on what the language is. <a href="https://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_and_input_help">Here is an example</a>
of such a help file for the Bengali language; I am sure there are
others out there too. You just have to point users to the appropriate
help file.</p>
@ -927,7 +927,7 @@ help file.</p>
<p>A prime example of when you'll see some very obscure Unicode
characters embedded in what otherwise would be very bland ASCII are
letters of the
<a href="http://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International
<a href="https://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International
Phonetic Alphabet (IPA)</a>, use to designate pronunciations in a very standard
manner (you probably see them all the time in your dictionary). Your
average font probably won't have support for all of the IPA characters
@ -947,10 +947,10 @@ to known good Unicode fonts.</p>
<p>Fortunately, the folks over at Wikipedia have already done all the
heavy lifting for you. Get the CSS from the horses mouth here:
<a href="http://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>,
<a href="https://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>,
and search for &quot;.IPA&quot; There are also a smattering of
other classes you can use for other purposes, check out
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a>
<a href="https://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a>
for more details. For you lazy ones, this should work:</p>
<pre>.Unicode {
@ -964,7 +964,7 @@ for more details. For you lazy ones, this should work:</p>
<p>The standard usage goes along the lines of <code>&lt;span class=&quot;Unicode&quot;&gt;Crazy
Unicode stuff here&lt;/span&gt;</code>. Characters in the
<a href="http://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a>
<a href="https://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a>
usually don't need to be fixed, but for anything else you probably
want to play it safe. Unless, of course, you don't care about IE6
users.</p>
@ -994,10 +994,10 @@ and yes, it is variable width. Other traits:</p>
<p>Each of these traits affect different domains of text processing
in different ways. It is beyond the scope of this document to explain
what precisely these implications are. PHPWact provides
a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
a very good <a href="https://www.phpwact.org/php/i18n/utf-8">reference document</a>
on what to expect from each function, although coverage is spotty in
some areas. Their more general notes on
<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
<a href="https://www.phpwact.org/php/i18n/charsets">character sets</a>
are also worth looking at for information on UTF-8. Some rules of thumb
when dealing with Unicode text:</p>
@ -1024,7 +1024,7 @@ usually won't matter since substr() also operates with byte indices!</p>
<p>You'll also need to make sure your UTF-8 is well-formed and will
probably need replacements for some of these functions. I recommend
using Harry Fuecks' <a href="http://phputf8.sourceforge.net/">PHP
using Harry Fuecks' <a href="https://phputf8.sourceforge.net/">PHP
UTF-8</a> library, rather than use mb_string directly. HTML Purifier
also defines a few useful UTF-8 compatible functions: check out
<code>Encoder.php</code> in the <code>/library/HTMLPurifier/</code>
@ -1042,12 +1042,12 @@ UTF-8 and internationalization, and I would like to defer to them for
a more in-depth look into character sets and encodings.</p>
<ul>
<li><a href="http://www.joelonsoftware.com/articles/Unicode.html">
<li><a href="https://www.joelonsoftware.com/articles/Unicode.html">
The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets
(No Excuses!)</a> by Joel Spolsky, provides a <em>very</em>
good high-level look at Unicode and character sets in general.</li>
<li><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 on Wikipedia</a>,
<li><a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8 on Wikipedia</a>,
provides a lot of useful details into the innards of UTF-8, although
it may be a little off-putting to people who don't know much
about Unicode to begin with.</li>