
This is how the HTML standard puts it.

Updated: 2005-11-19
HTML Document Character Set
Contents

The Document Character Set
Character entities

Human languages define a large number of text characters and human beings have invented a wide variety of systems for representing these characters in a computer. Unless proper precautions are taken, differing character representations may not be understood by user agents in all parts of the world.

The Document Character Set
To promote interoperability, SGML requires that each application (including HTML), as part of its definition, define its document character set. A document character set is a set of abstract characters (such as the Cyrillic letter "I", the Chinese character meaning "water", etc.) and a corresponding set of integer references to those characters. SGML considers a document to be a sequence of references in the document character set.
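As a rough illustration of "a sequence of references", the sketch below uses Python's built-in `ord` and `chr`, which work with Unicode code points. The helper name `to_references` is hypothetical and only serves this example.

```python
def to_references(text):
    """Return the integer reference (Unicode code point) of each character."""
    return [ord(ch) for ch in text]

# The Cyrillic capital letter "I" and the Chinese character for "water".
for ch in ["И", "水"]:
    print(ch, ord(ch), hex(ord(ch)))

# Going the other way: an integer reference identifies one abstract character.
print(chr(0x6C34))
```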

The document character set for HTML is the Universal Character Set (UCS) of [ISO10646]. This set is character-by-character equivalent to Unicode 2.0 ([UNICODE]). Both of these standards are updated from time to time with new characters and the amendments should be consulted at the respective Web sites.

In the current specification, references to ISO/IEC-10646 or Unicode imply the same document character set. However, the current document also refers to the Unicode specification for other issues such as the bidirectional text algorithm.

Conforming HTML user agents may receive or output a document, or represent a document internally, using any character encoding. A character encoding represents some subset of the document character set. Character encodings such as ISO-8859-1 (commonly referred to as "Latin-1" since it encodes most Western European languages), ISO-8859-5 (which supports Cyrillic), SHIFT_JIS (a Japanese encoding), and euc-jp (another Japanese encoding) save bandwidth by representing only slices of the document character set.
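The idea that each encoding covers only a slice of the document character set can be checked with Python's standard codecs. This is a minimal sketch; the helper `can_encode` is a hypothetical name, and the codec names are the ones Python's standard library accepts.

```python
def can_encode(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for encoding in ["iso-8859-1", "iso-8859-5", "shift_jis", "euc-jp", "utf-8"]:
    covered = [ch for ch in "åИ水" if can_encode(ch, encoding)]
    print(encoding, covered)
```

Only UTF-8 covers all three sample characters; the others each handle the script they were designed for.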

Thus, character encodings allow authors to work with a convenient subset of the document character set. Authors should not have to know anything about the underlying character encoding of the document or tool they are using: writing Japanese in a UTF-8 editor is as easy as writing Japanese in a JIS or SHIFT_JIS editor.

Character encodings also mean that authors are not required to enter a document's text in the form of references to the document character set. Requiring authors to work with such a large character encoding would be cumbersome and wasteful (although encodings such as UTF-8 that cover all of Unicode do exist).

To allow this convenience, conforming user agents must correctly map to [UNICODE] all characters in any character encodings ("charsets") they recognize (or behave as if they did). A list of recommended character encodings for various scripts and languages will be provided in a separate document.
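What "mapping to Unicode" means in practice can be sketched as decoding bytes and inspecting the resulting code points. The function `to_code_points` is a hypothetical helper, and Python's codecs stand in here for whatever tables a real user agent would use.

```python
def to_code_points(raw, charset):
    """Decode bytes in the named charset and return Unicode code points."""
    return [ord(ch) for ch in raw.decode(charset)]

# One byte in ISO-8859-1, one byte in ISO-8859-5, two bytes in euc-jp:
print(to_code_points(b"\xe5", "iso-8859-1"))
print(to_code_points(b"\xb8", "iso-8859-5"))
print(to_code_points("水".encode("euc-jp"), "euc-jp"))
```

Whatever the byte representation, the result is the same kind of value: an integer reference into the document character set.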

How does a user agent know which character encoding has been used to encode a given document?

In many cases, before a Web server sends an HTML document over the Web, it tries to figure out the character encoding (by a variety of techniques such as examining the first few bytes of the file, checking its encoding against a database of known files and encodings, etc.). The server transmits the document and the name of the character encoding to the receiving user agent by way of the charset parameter of the HTTP "Content-Type" field. For example, the following HTTP header announces that the character encoding is "euc-jp".

Content-Type: text/HTML; charset=euc-jp

The value of the "charset" parameter must be the name of a "charset" as defined in [RFC2045].
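On the receiving side, the charset parameter has to be pulled out of the header value. One possible way, assuming Python's standard `email.message.Message` class is an acceptable parser for this header, is sketched below; `charset_from_content_type` is a hypothetical helper name.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

print(charset_from_content_type("text/HTML; charset=euc-jp"))
print(charset_from_content_type("text/html"))
```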

Unfortunately, not all servers send information about the character encoding (even when the character encoding is different from the widely used ISO-8859-1 encoding). HTML therefore allows authors a way to tell user agents which character encoding has been used by specifying it explicitly in the document header with the META element. For example, to specify that the character encoding of the current document is "euc-jp", include the following META declaration:

<META http-equiv="Content-Type" Content="text/HTML; charset=euc-jp">

This mechanism has a notable limit: the user agent cannot interpret the META element to determine the character encoding if it doesn't already know the character encoding of the document. The META declaration must only be used when the character encoding is organized such that ASCII characters stand for themselves at least until the META element is parsed. In this case, conforming user agents must correctly interpret the META element.
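Because ASCII characters stand for themselves up to the META element, a user agent can scan the raw bytes before choosing a decoder. The sketch below is a simplified illustration, not a conforming parser: the regular expression, the 1024-byte limit, and the name `sniff_meta_charset` are all assumptions made for this example.

```python
import re

# Matches <META http-equiv="Content-Type" ... charset=NAME ...> on raw bytes.
META_CHARSET = re.compile(
    rb"""<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]*?charset\s*=\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(raw, limit=1024):
    """Look for a META charset declaration in the first `limit` bytes."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

head = b'<HTML><HEAD><META http-equiv="Content-Type" Content="text/HTML; charset=euc-jp"></HEAD>'
document = head + "<BODY>水</BODY></HTML>".encode("euc-jp")

charset = sniff_meta_charset(document)
print(charset)
print(document.decode(charset))
```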

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

Explicit user action to override erroneous behavior.
An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
The "charset" attribute set for the A and LINK elements.
User agent heuristics and user settings. For example, user agents typically assume that in the absence of other indicators, the character encoding is ISO-8859-1. This assumption may lead to an unreadable presentation of certain documents.
In all cases, the value of the "charset" attribute or parameter must be the name of a "charset" as defined in [RFC2045].
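The priority order above can be sketched as a simple fall-through. The function and parameter names are hypothetical; each argument stands for the charset name obtained from one of the listed sources, or `None` when that source gave nothing.

```python
def resolve_encoding(user_override=None, http_charset=None,
                     meta_charset=None, link_charset=None,
                     default="iso-8859-1"):
    """Pick the first available charset, from highest priority to lowest."""
    for candidate in (user_override, http_charset, meta_charset, link_charset):
        if candidate:
            return candidate.lower()
    # Nothing was declared: fall back to a heuristic default, which may be wrong.
    return default

print(resolve_encoding(http_charset="euc-jp", meta_charset="shift_jis"))
print(resolve_encoding(meta_charset="Shift_JIS"))
print(resolve_encoding())
```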

If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.

Note: Modern web servers can be configured with information about which document is using which character encoding. Webmasters should use these facilities but should take pains to configure the server properly.

Character entities
Your hardware and software configuration probably won't allow you to refer to all Unicode characters through simple input mechanisms, so SGML offers character encoding-independent mechanisms for specifying any character from the document character set:

Numeric character references (either decimal or hexadecimal form).
Named character references.
Numeric character references specify the integer reference of a Unicode character. A numeric character reference with the syntax &#D; refers to Unicode decimal character number D. A numeric character reference with the syntax &#xH; refers to Unicode hexadecimal character number H. The hexadecimal representation is a new SGML convention and is particularly useful since character standards use hexadecimal representations.

Here are some examples:


The reference &#229; refers to the letter "a" with a small circle above it (used, for example, in Norwegian).
The reference &#xE5; refers to the same character with the hexadecimal representation.
The reference &#1048; refers to the Cyrillic capital letter "I".
The reference &#x6C34; refers to the Chinese character for water with the hexadecimal representation.
To give authors a more intuitive way to refer to characters in the document character set, HTML offers a set of named character entities. Named character references replace integer references with symbolic names. The named entity &aring; refers to the same Unicode character as &#229;. There is no named entity for the Cyrillic capital letter "I". The full list of named character entities is included in this specification.
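Numeric and named references can be tried out with Python's standard `html.unescape`, which resolves both forms. The helper `numeric_references` is a hypothetical name introduced only to show how the decimal and hexadecimal forms are built from a code point.

```python
import html

def numeric_references(ch):
    """Return the decimal and hexadecimal numeric references for a character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"

for ch in "åИ水":
    print(ch, numeric_references(ch))

# Decimal, hexadecimal, and named forms all resolve to the same character.
for ref in ["&#229;", "&#xE5;", "&aring;"]:
    print(ref, html.unescape(ref))
```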

Four named character entities deserve special mention since they are frequently used to "escape" special characters. For text appearing as part of the content of an element, you should escape < as &lt; to avoid possible confusion with the beginning of a tag. The & character should be escaped as &amp; to avoid confusion with the beginning of an entity reference.

You should also escape & within attribute values since entity references are allowed within CDATA attribute values. In addition, you should escape > as &gt; to avoid problems with older user agents that incorrectly perceive this as the end of a tag when coming across this character in quoted attribute values.

Rather than worry about rules for quoting attribute values, it's often easier to encode any instance of " as &quot; and to always use " for quoting attribute values. Many people find it simpler to always escape these four characters in element content and attribute values:

"&amp;" to represent the & sign.
"&lt;" to represent the < sign.
"&gt;" to represent the > sign.
"&quot;" to represent the " mark.
Names of named character entities are case-sensitive. Thus, &Aring; refers to a different character (upper case A, ring) than &aring; (lower case a, ring).
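The "always escape these four characters" approach is roughly what Python's standard `html.escape` does; note that with its default `quote=True` it also escapes the apostrophe, which goes slightly beyond the four entities discussed here. The sample string is made up for illustration.

```python
import html

text = 'a < b & "c" > d'
escaped = html.escape(text)
print(escaped)

# Escaped text is safe both as element content and inside a quoted attribute.
print(f'<A title="{escaped}">{escaped}</A>')

# Entity names are case-sensitive: these are two different characters.
print(html.unescape("&Aring;"), html.unescape("&aring;"))
```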

Note: In SGML, it is possible to eliminate the final ";" after a numeric or named character reference in some cases (e.g., at a line break or directly before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
