Server changed
@ -1,7 +1,7 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head>
|
||||
"https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="en"><head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||||
<meta name="description" content="Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch." />
|
||||
<link rel="stylesheet" type="text/css" href="./style.css" />
|
||||
@ -22,7 +22,7 @@ own advice for sake of portability. -->
|
||||
|
||||
<div id="filing">Filed under End-User</div>
|
||||
<div id="index">Return to the <a href="index.html">index</a>.</div>
|
||||
<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
<div id="home"><a href="https://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div>
|
||||
|
||||
<p>Character encoding and character sets are not that
|
||||
difficult to understand, but so many people blithely stumble
|
||||
@ -217,7 +217,7 @@ if your <code>META</code> tag claims that either:</p>
|
||||
<p class="aside">The advice given here is for pages being served as
|
||||
vanilla <code>text/html</code>. Different practices must be used
|
||||
for <code>application/xml</code> or <code>application/xhtml+xml</code>; see
|
||||
<a href="http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
|
||||
<a href="https://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's
|
||||
document on XHTML media types</a> for more information.</p>
|
||||
|
||||
<p>If your <code>META</code> encoding and your real encoding match,
|
||||
@ -237,7 +237,7 @@ of your real encoding.</p>
|
||||
has to guess: and sometimes the guess is wrong. Hackers can manipulate
|
||||
this guess in order to slip XSS past filters and then fool the
|
||||
browser into executing it as active code. A great example of this
|
||||
is the <a href="http://shiflett.org/archive/177">Google UTF-7
|
||||
is the <a href="https://shiflett.org/archive/177">Google UTF-7
|
||||
exploit</a>.</p>
|
||||
<p>You might be able to get away with not specifying a character
|
||||
encoding with the <code>META</code> tag as long as your webserver
|
||||
@ -299,10 +299,10 @@ is slightly more difficult.</p>
|
||||
yourself, via your programming language. Since you're using HTML
|
||||
Purifier, I'll assume PHP, although it's not too difficult to do
|
||||
similar things in
|
||||
<a href="http://www.w3.org/International/O-HTTP-charset#scripting">other
|
||||
<a href="https://www.w3.org/International/O-HTTP-charset#scripting">other
|
||||
languages</a>. The appropriate code is:</p>
|
||||
|
||||
<pre><a href="http://php.net/function.header">header</a>('Content-Type:text/html; charset=UTF-8');</pre>
|
||||
<pre><a href="https://php.net/function.header">header</a>('Content-Type:text/html; charset=UTF-8');</pre>
|
||||
|
||||
<p>...replacing UTF-8 with whatever your embedded encoding is.
|
||||
This code must come before any output, so be careful about
|
||||
@ -312,16 +312,16 @@ output excluding whitespace within <?php ?> tags).</p>
|
||||
<h4 id="fixcharset-server-phpini">PHP ini directive</h4>
|
||||
|
||||
<p>PHP also has a neat little ini directive that can save you a
|
||||
header call: <code><a href="http://php.net/ini.core#ini.default-charset">default_charset</a></code>. Using this code:</p>
|
||||
header call: <code><a href="https://php.net/ini.core#ini.default-charset">default_charset</a></code>. Using this code:</p>
|
||||
|
||||
<pre><a href="http://php.net/function.ini_set">ini_set</a>('default_charset', 'UTF-8');</pre>
|
||||
<pre><a href="https://php.net/function.ini_set">ini_set</a>('default_charset', 'UTF-8');</pre>
|
||||
|
||||
<p>...will also do the trick. If PHP is running as an Apache module (and
|
||||
not as FastCGI, consult
|
||||
<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
|
||||
<a href="https://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property
|
||||
across many PHP files:</p>
|
||||
|
||||
<pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset "UTF-8"</pre>
|
||||
<pre><a href="https://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset "UTF-8"</pre>
|
||||
|
||||
<blockquote class="aside"><p>As with all INI directives, this can
|
||||
also go in your php.ini file. Some hosting providers allow you to customize
|
||||
@ -340,11 +340,11 @@ techniques may work, or may not work.</p>
|
||||
|
||||
<p>On Apache, you can use an .htaccess file to change the character
|
||||
encoding. I'll defer to
|
||||
<a href="http://www.w3.org/International/questions/qa-htaccess-charset">W3C</a>
|
||||
<a href="https://www.w3.org/International/questions/qa-htaccess-charset">W3C</a>
|
||||
for the in-depth explanation, but it boils down to creating a file
|
||||
named .htaccess with the contents:</p>
|
||||
|
||||
<pre><a href="http://httpd.apache.org/docs/1.3/mod/mod_mime.html#addcharset">AddCharset</a> UTF-8 .html</pre>
|
||||
<pre><a href="https://httpd.apache.org/docs/1.3/mod/mod_mime.html#addcharset">AddCharset</a> UTF-8 .html</pre>
|
||||
|
||||
<p>Where UTF-8 is replaced with the character encoding you want to
|
||||
use and .html is a file extension that this will be applied to. This
|
||||
@ -353,7 +353,7 @@ or in the subdirectories of directory you place this file in.</p>
|
||||
|
||||
<p>If you're feeling particularly courageous, you can use:</p>
|
||||
|
||||
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> UTF-8</pre>
|
||||
<pre><a href="https://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> UTF-8</pre>
|
||||
|
||||
<p>...which changes the character set Apache adds to any document that
|
||||
doesn't have any Content-Type parameters. This directive, which the
|
||||
@ -363,7 +363,7 @@ with the <code>META</code> tag. If you would prefer Apache not to be
|
||||
butting in on your character encodings, you can tell it not
|
||||
to send anything at all:</p>
|
||||
|
||||
<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
|
||||
<pre><a href="https://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre>
|
||||
|
||||
<p>...making your internal charset declaration (usually the <code>META</code> tags)
|
||||
the sole source of character encoding
|
||||
@ -445,7 +445,7 @@ overrides the <code>META</code> tag. In reality, this happens only when the
|
||||
XHTML is actually served as legit XML and not HTML, which is almost
|
||||
never due to Internet Explorer's lack of support for
|
||||
<code>application/xhtml+xml</code> (even though doing so is often
|
||||
argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good
|
||||
argued to be <a href="https://www.hixie.ch/advocacy/xhtml">good
|
||||
practice</a> and is required by the XHTML 1.1 specification).</p>
|
||||
|
||||
<p>For XML, however, this XML Declaration is extremely important.
|
||||
@ -554,7 +554,7 @@ when it became far to cumbersome to support foreign languages. Bots
|
||||
will now actually go through articles and convert character entities
|
||||
to their corresponding real characters for the sake of user-friendliness
|
||||
and searchability. See
|
||||
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
|
||||
<a href="https://meta.wikimedia.org/wiki/Help:Special_characters">Meta's
|
||||
page on special characters</a> for more details.
|
||||
</p></blockquote>
|
||||
|
||||
@ -575,7 +575,7 @@ which may be used by POST, and is required when you want to upload
|
||||
files.</p>
|
||||
|
||||
<p>The following is a summary of notes from
|
||||
<a href="http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html">
|
||||
<a href="https://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html">
|
||||
<code>FORM</code> submission and i18n</a>. That document contains lots
|
||||
of useful information, but is written in a rambly manner, so
|
||||
here I try to get right to the point. (Note: the original has
|
||||
@ -589,7 +589,7 @@ looks something like: <code>%C3%86</code>. There is no official way of
|
||||
determining the character encoding of such a request, since the percent
|
||||
encoding operates on a byte level, so it is usually assumed that it
|
||||
is the same as the encoding the page containing the form was submitted
|
||||
in. (<a href="http://tools.ietf.org/html/rfc3986#section-2.5">RFC 3986</a>
|
||||
in. (<a href="https://tools.ietf.org/html/rfc3986#section-2.5">RFC 3986</a>
|
||||
recommends that textual identifiers be translated to UTF-8; however, browser
|
||||
compliance is spotty.) You'll run into very few problems
|
||||
if you only use characters in the character encoding you chose.</p>
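<p>To see why the byte level matters, here is a minimal sketch of what
percent encoding does to the same character in two encodings. The
variable names are mine and purely illustrative; the character is
written as raw bytes so the result doesn't depend on how the script
itself is saved.</p>

```php
<?php
// 'Æ' (U+00C6) as raw bytes in two different encodings.
$utf8   = "\xC3\x86"; // UTF-8: two bytes
$latin1 = "\xC6";     // ISO 8859-1: one byte

// rawurlencode() escapes byte by byte; it has no idea which
// character encoding the bytes are supposed to be in.
$utf8Encoded   = rawurlencode($utf8);   // "%C3%86"
$latin1Encoded = rawurlencode($latin1); // "%C6"

echo $utf8Encoded, "\n", $latin1Encoded, "\n";
```

<p>Same character, different request. Nothing in either string says
which encoding produced it, which is why the receiving end has to
assume.</p>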
|
||||
@ -762,7 +762,7 @@ knows about the change too. There are some caveats though:</p>
|
||||
encodings is notoriously spotty. Refer to your respective database's
|
||||
documentation on how to do this properly.</p>
|
||||
|
||||
<p>For <a href="http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the
|
||||
<p>For <a href="https://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the
|
||||
character encoding conversion for you. However, you have
|
||||
to make sure that the text inside the column is what it says it is:
|
||||
if you had put Shift-JIS in an ISO 8859-1 column, MySQL will irreversibly mangle
|
||||
@ -772,7 +772,7 @@ and then finally to UTF-8. Many a website had pages irreversibly mangled
|
||||
because they didn't realize that they'd been deluding themselves about
|
||||
the character encoding all along; don't become the next victim.</p>
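<p>Here is a rough byte-level illustration of what a conversion does
when the label is wrong. This is not what MySQL runs internally; it
is only a sketch using iconv(), and it assumes your iconv build knows
the encoding name <code>SHIFT_JIS</code> (names vary between
implementations).</p>

```php
<?php
// The hiragana letter U+3042 stored as Shift-JIS: two bytes.
$stored = "\x82\xA0";

// Conversion that trusts a wrong ISO 8859-1 label: each byte is
// treated as its own character and blown up to two UTF-8 bytes.
$wrong = iconv('ISO-8859-1', 'UTF-8', $stored);

// Conversion that uses the real encoding of the data.
$right = iconv('SHIFT_JIS', 'UTF-8', $stored);

echo bin2hex($wrong), "\n"; // c282c2a0 (two control/space characters)
echo bin2hex($right), "\n"; // e38182   (the intended letter)
```

<p>The converter cannot tell that anything went wrong in the first
case, since every byte is a valid ISO 8859-1 character. Only you know
what the data really is.</p>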
|
||||
|
||||
<p>For <a href="http://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the
|
||||
<p>For <a href="https://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the
|
||||
encoding of a database (as of 8.2). You will have to dump the data, and then reimport
|
||||
it into a new table. Make sure that your client encoding is set properly:
|
||||
this is how PostgreSQL knows to perform an encoding conversion.</p>
|
||||
@ -832,15 +832,15 @@ converting reams of existing text and HTML files into UTF-8, as well as
|
||||
making sure that all new files uploaded are properly encoded. Once again,
|
||||
I can only point vaguely in the right direction for converting your
|
||||
existing files: make sure you back up, make sure you use
|
||||
<a href="http://php.net/ref.iconv">iconv</a>(), and
|
||||
<a href="https://php.net/ref.iconv">iconv</a>(), and
|
||||
make sure you know what the original character encoding of the files
|
||||
is (or are, depending on the tidiness of your system).</p>
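<p>A minimal sketch of those three pieces of advice for a single file
follows. The function <code>convert_file_to_utf8()</code> and the
<code>.bak</code> suffix are hypothetical names of my own, not part of
PHP or HTML Purifier, and you still have to supply the original
encoding yourself.</p>

```php
<?php
/**
 * Hypothetical helper: converts one file to UTF-8 in place and keeps
 * the untouched original next to it as "<path>.bak".
 */
function convert_file_to_utf8(string $path, string $fromEncoding): bool
{
    $original = file_get_contents($path);
    if ($original === false) {
        return false;
    }

    // Back up before touching anything.
    if (!copy($path, $path . '.bak')) {
        return false;
    }

    // iconv() returns false when the bytes are not valid in the
    // encoding you claimed; in that case the file is left alone.
    $converted = iconv($fromEncoding, 'UTF-8', $original);
    if ($converted === false) {
        return false;
    }

    return file_put_contents($path, $converted) !== false;
}
```

<p>You would call it as something like
<code>convert_file_to_utf8('page.html', 'ISO-8859-1')</code>, where
the file name and source encoding are placeholders for your own.</p>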
|
||||
|
||||
<p>However, I can proffer more specific advice on the subject of
|
||||
text editors. Many text editors have notoriously spotty Unicode support.
|
||||
To find out how your editor is doing, you can check out <a
|
||||
href="http://www.alanwood.net/unicode/utilities_editors.html">this list</a>
|
||||
or <a href="http://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list.</a>
|
||||
href="https://www.alanwood.net/unicode/utilities_editors.html">this list</a>
|
||||
or <a href="https://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list.</a>
|
||||
I personally use Notepad++, which works like a charm when it comes to UTF-8.
|
||||
Usually, you will have to <strong>explicitly</strong> tell the editor through some dialogue
|
||||
(usually Save as or Format) what encoding you want it to use. An editor
|
||||
@ -859,7 +859,7 @@ BOM below.</p>
|
||||
|
||||
<h3 id="migrate-bom">Byte Order Mark (headers already sent!)</h3>
|
||||
|
||||
<p>The BOM, or <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark">Byte
|
||||
<p>The BOM, or <a href="https://en.wikipedia.org/wiki/Byte_Order_Mark">Byte
|
||||
Order Mark</a>, is a magical, invisible character placed at
|
||||
the beginning of UTF-8 files to tell people what the encoding is and
|
||||
what the endianness of the text is. It is also unnecessary.</p>
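<p>In UTF-8 the BOM is the three bytes <code>EF BB BF</code>. If you
need to get rid of one in text you have already read into a string,
something along these lines should do; <code>strip_utf8_bom()</code>
is a hypothetical helper name of mine, not a built-in.</p>

```php
<?php
/**
 * Hypothetical helper: removes a leading UTF-8 BOM, if there is one.
 */
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";

    // Compare only the first three bytes against the BOM.
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }

    return $text;
}
```

<p>Note that this only helps for data you load yourself. A BOM sitting
in front of the opening PHP tag of a script is sent as output before
any of your code runs, so that one has to be removed in your editor.</p>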
|
||||
@ -917,7 +917,7 @@ anyway. So we'll deal with the other two edge cases.</p>
|
||||
would like to read your website but get heaps of question marks or
|
||||
other meaningless characters. Fixing this problem requires the
|
||||
installation of a font or language pack which is often highly
|
||||
dependent on what the language is. <a href="http://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_and_input_help">Here is an example</a>
|
||||
dependent on what the language is. <a href="https://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_and_input_help">Here is an example</a>
|
||||
of such a help file for the Bengali language; I am sure there are
|
||||
others out there too. You just have to point users to the appropriate
|
||||
help file.</p>
|
||||
@ -927,7 +927,7 @@ help file.</p>
|
||||
<p>A prime example of when you'll see some very obscure Unicode
|
||||
characters embedded in what otherwise would be very bland ASCII are
|
||||
letters of the
|
||||
<a href="http://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International
|
||||
<a href="https://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International
|
||||
Phonetic Alphabet (IPA)</a>, used to designate pronunciations in a very standard
|
||||
manner (you probably see them all the time in your dictionary). Your
|
||||
average font probably won't have support for all of the IPA characters
|
||||
@ -947,10 +947,10 @@ to known good Unicode fonts.</p>
|
||||
|
||||
<p>Fortunately, the folks over at Wikipedia have already done all the
|
||||
heavy lifting for you. Get the CSS from the horse's mouth here:
|
||||
<a href="http://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>,
|
||||
<a href="https://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>,
|
||||
and search for ".IPA". There is also a smattering of
|
||||
other classes you can use for other purposes; check out
|
||||
<a href="http://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a>
|
||||
<a href="https://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a>
|
||||
for more details. For you lazy ones, this should work:</p>
|
||||
|
||||
<pre>.Unicode {
|
||||
@ -964,7 +964,7 @@ for more details. For you lazy ones, this should work:</p>
|
||||
|
||||
<p>The standard usage goes along the lines of <code><span class="Unicode">Crazy
|
||||
Unicode stuff here</span></code>. Characters in the
|
||||
<a href="http://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a>
|
||||
<a href="https://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a>
|
||||
usually don't need to be fixed, but for anything else you probably
|
||||
want to play it safe. Unless, of course, you don't care about IE6
|
||||
users.</p>
|
||||
@ -994,10 +994,10 @@ and yes, it is variable width. Other traits:</p>
|
||||
<p>Each of these traits affects different domains of text processing
|
||||
in different ways. It is beyond the scope of this document to explain
|
||||
what precisely these implications are. PHPWact provides
|
||||
a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a>
|
||||
a very good <a href="https://www.phpwact.org/php/i18n/utf-8">reference document</a>
|
||||
on what to expect from each function, although coverage is spotty in
|
||||
some areas. Their more general notes on
|
||||
<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a>
|
||||
<a href="https://www.phpwact.org/php/i18n/charsets">character sets</a>
|
||||
are also worth looking at for information on UTF-8. Some rules of thumb
|
||||
when dealing with Unicode text:</p>
|
||||
|
||||
@ -1024,7 +1024,7 @@ usually won't matter since substr() also operates with byte indices!</p>
|
||||
|
||||
<p>You'll also need to make sure your UTF-8 is well-formed and will
|
||||
probably need replacements for some of these functions. I recommend
|
||||
using Harry Fuecks' <a href="http://phputf8.sourceforge.net/">PHP
|
||||
using Harry Fuecks' <a href="https://phputf8.sourceforge.net/">PHP
|
||||
UTF-8</a> library, rather than using mbstring directly. HTML Purifier
|
||||
also defines a few useful UTF-8 compatible functions: check out
|
||||
<code>Encoder.php</code> in the <code>/library/HTMLPurifier/</code>
|
||||
@ -1042,12 +1042,12 @@ UTF-8 and internationalization, and I would like to defer to them for
|
||||
a more in-depth look into character sets and encodings.</p>
|
||||
|
||||
<ul>
|
||||
<li><a href="http://www.joelonsoftware.com/articles/Unicode.html">
|
||||
<li><a href="https://www.joelonsoftware.com/articles/Unicode.html">
|
||||
The Absolute Minimum Every Software Developer Absolutely,
|
||||
Positively Must Know About Unicode and Character Sets
|
||||
(No Excuses!)</a> by Joel Spolsky, provides a <em>very</em>
|
||||
good high-level look at Unicode and character sets in general.</li>
|
||||
<li><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 on Wikipedia</a>,
|
||||
<li><a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8 on Wikipedia</a>,
|
||||
provides a lot of useful details into the innards of UTF-8, although
|
||||
it may be a little off-putting to people who don't know much
|
||||
about Unicode to begin with.</li>
|
||||
|