
Character encoding - UNICODE

Mandatory requirement

The correct character encoding on EUROPA websites is UTF-8.

 


Unicode is a standardised encoding system that provides a unique number for every character, regardless of platform, program or language, without any risk of corruption. Before Unicode, no single encoding contained enough characters to cover all the languages used by the European Union. This is no longer the case: Unicode is a superset of all other character set standards.
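As an illustration of the "unique number for every character" idea, the short Python sketch below prints the code point and the official Unicode name of a few characters. The helper name describe is hypothetical and only serves this example.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation followed by the Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Latin A, e with acute accent, Greek alpha and the euro sign,
# written as escapes so the example does not depend on the file encoding.
for ch in "A\u00e9\u03b1\u20ac":
    print(describe(ch))
```

Each character maps to exactly one number, whatever the platform or program that displays it.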

 

Description

The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others.

Unicode is the official way to implement ISO/IEC 10646 (Universal Multiple-Octet Coded Character Set, or UCS).

The Unicode Consortium (non-profit organization founded to develop, extend and promote use of the Unicode Standard) cooperates with the W3C and ISO.

Members of the Consortium include major computer corporations, software producers, database vendors, research institutions, international agencies, various user groups, and interested individuals.

Unicode encoding forms

The Unicode encoding forms specify how each character is expressed as a sequence of one or more code units (a code unit is 8, 16 or 32 bits wide, depending on the form). The Unicode Standard provides three distinct encoding forms:

UTF-8: a variable-width encoding form, using one to four 8-bit code units per character. It is typically the preferred encoding form for HTML and for the Internet in general. UTF-8 is reasonably compact in terms of the number of bytes used (except for East Asian text, where most characters require a sequence of at least three 8-bit code units). All the US7ASCII characters are represented in UTF-8 by a single 8-bit code unit.

UTF-16: most characters are represented by a single 16-bit code unit, except supplementary characters (not used in the European Union environment), which are represented as pairs of 16-bit code units (known as "surrogate pairs"). It is the encoding form used by Java.

UTF-16 may be a preferred encoding form for applications that need to balance efficient access to characters with economical use of storage. It is the historical descendant of the earliest form of Unicode.

UTF-16 is the Unicode form used by Microsoft programs (such as MS Word, Visual Basic, MS Access, …) and operating systems.

UTF-32: each character is represented by a single 32-bit code unit. It is a fixed-width character encoding form.

UTF-32 is a preferred encoding form for processing characters on most Unix platforms.
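The differences between the three encoding forms can be checked with a minimal Python sketch that counts the code units needed for one character in each form. The function name code_units and the sample characters are assumptions chosen for illustration. The little-endian codec variants are used so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

samples = {
    "ASCII letter A": "A",
    "e with acute accent": "\u00e9",
    "euro sign": "\u20ac",
    "CJK ideograph": "\u4e2d",
    "supplementary character": "\U0001d11e",  # musical symbol G clef
}

for label, ch in samples.items():
    print(label, code_units(ch))
```

Only the supplementary character needs a surrogate pair in UTF-16 and four code units in UTF-8, while every character takes exactly one code unit in UTF-32.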

Byte Order Mark (BOM)

The BOM is a signature placed at the beginning of a data stream (a file is considered a data stream) that identifies the encoding form used.

The values of the BOM on PC (little-endian byte order) are:

X'FF FE 00 00'     UTF-32

X'FF FE'           UTF-16

X'EF BB BF'        UTF-8

The BOM is sometimes used by the ColdFusion application server (see below).
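The following Python sketch shows one way to recognise a BOM at the start of a byte stream. It is only an illustration: the function name detect_bom is hypothetical, and the big-endian signatures (X'FE FF' and X'00 00 FE FF') are added here for completeness although the list above gives only the little-endian values.

```python
import codecs

# Order matters: the little-endian UTF-32 BOM begins with the same two
# bytes as the little-endian UTF-16 BOM, so UTF-32 is tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def detect_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xef\xbb\xbf<html>"))  # UTF-8 signature
print(detect_bom(b"<html>"))              # no signature
```

A UTF-16 little-endian stream whose first character after the BOM is U+0000 would be reported as UTF-32 by this sketch, so the result should be treated as a hint rather than a proof.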

Unicode and the Oracle Data Bases

From Oracle 9i onwards, it is possible to define two Unicode encoding forms in the same database:

- Through the Database Character Set (defined at database creation time). The possible values are UTF8 or AL32UTF8 (corresponding respectively to the UTF-8 encoding form of Unicode version 3.0 and the UTF-8 encoding form of Unicode version 3.1).

- Through the National Character Set (defined at database creation time). The possible values are UTF8 or AL16UTF16 (corresponding respectively to the UTF-8 encoding form of Unicode version 3.0 and the UTF-16 encoding form of Unicode version 3.1). The national character set is used only for the SQL data types NCHAR, NVARCHAR2 and NCLOB.

How to define an HTML page as Unicode?

The following META tag has to be added in the HEAD section of the HTML page:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

The browser uses this tag to define the encoding of the page.

This tag also tells web editing tools (such as Dreamweaver or FrontPage) in which encoding the page is to be saved.
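To make the role of the META tag concrete, the Python sketch below extracts the declared charset from a page in roughly the way a tool might read it. It is a simplified illustration, not a description of how any particular browser or editor works. The names MetaCharsetFinder and meta_charset are hypothetical.

```python
from html.parser import HTMLParser

class MetaCharsetFinder(HTMLParser):
    """Remember the charset declared in the first content-type META tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # attribute names arrive in lower case
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like "text/html;charset=utf-8".
        for part in (attributes.get("content") or "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset" and value:
                self.charset = value.strip().lower()

def meta_charset(html: str):
    """Return the charset declared by the META tag, or None if absent."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset

page = '<html><head><meta http-equiv="content-type" content="text/html;charset=utf-8" /></head></html>'
print(meta_charset(page))
```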

How to define a ColdFusion page as Unicode?

When a request for a ColdFusion page occurs, ColdFusion opens the page, processes the content, and returns the results back to the browser of the requestor. In order to process the ColdFusion page, ColdFusion has to interpret the page content. One piece of information used by ColdFusion is the Byte Order Mark (BOM) in a ColdFusion page (see above).

If your page does not contain a BOM, you can use the cfprocessingdirective tag to set the character encoding of the page. If you insert the cfprocessingdirective tag on a page that has a BOM, the encoding specified by the tag must be the same as the one indicated by the BOM; otherwise, ColdFusion issues an error.

If the page contains neither a BOM nor a cfprocessingdirective tag, ColdFusion treats the page as iso-8859-1 (even if the META tag specifies utf-8 or iso-8859-7).
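The effect of reading a UTF-8 page as iso-8859-1 can be reproduced outside ColdFusion. The Python sketch below is only an analogy under that assumption: it encodes a short text as UTF-8 and then decodes the same bytes with the wrong encoding. The function name read_page_as is hypothetical.

```python
def read_page_as(page_bytes: bytes, encoding: str) -> str:
    """Decode the raw bytes of a page with the given encoding."""
    return page_bytes.decode(encoding)

original = "\u00e9t\u00e9"             # the French word for "summer"
page_bytes = original.encode("utf-8")  # what is stored on disk: 5 bytes

correct = read_page_as(page_bytes, "utf-8")
misread = read_page_as(page_bytes, "iso-8859-1")

print(correct)  # the original three characters
print(misread)  # five characters: each accented letter becomes two
```

No error is raised in the second case, because every byte is a valid iso-8859-1 character. The corruption is silent, which is why the page encoding has to be declared explicitly.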

Before ColdFusion can return a response to the client, it must determine the encoding to use for the data in the response. ColdFusion pages (.cfm pages) default to using the Unicode UTF-8 format for the response, even if you include the HTML meta tag in the page:

<meta … content="text/html; charset=iso-8859-1">

However, within a ColdFusion page you can use the cfcontent tag to override the default character encoding of the response. Use the type attribute of the cfcontent tag to specify the MIME type of the page output, including the character set, as follows:

<cfcontent type="text/html; charset=utf-8">

The encoding form of the response is sent to the browser through the HTTP header. In this case, the browser does not take the META tag into account.
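The precedence of the HTTP header over the META tag can be sketched in Python as follows. This is a simplified model of the rule stated above, not a full implementation of browser behaviour. The function names charset_from_header and effective_charset are hypothetical.

```python
from email.message import Message

def charset_from_header(content_type: str):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent

def effective_charset(content_type_header, meta_charset):
    """Apply the rule: the HTTP header wins, the META tag is only a fallback."""
    header_charset = None
    if content_type_header:
        header_charset = charset_from_header(content_type_header)
    return header_charset or meta_charset

# The header announces utf-8, so the iso-8859-1 META declaration is ignored.
print(effective_charset("text/html; charset=utf-8", "iso-8859-1"))
# Without a charset in the header, the META declaration is used.
print(effective_charset("text/html", "iso-8859-1"))
```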

Recommendations for using Unicode in ColdFusion pages:

1.      Include the BOM when saving the page in Dreamweaver

2.      Use the “cfcontent” tag:
<cfcontent type="text/html; charset=utf-8">

3.      Use the “cfprocessingdirective” tag:
<cfprocessingdirective pageEncoding="utf-8">

4.      Use the META tag, even if the browser does not take it into account:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />

Example: the following figure shows how UTF-8 data are transferred (and converted) from an Oracle database to the browser through a ColdFusion server.

  


 
 

Use on EUROPA websites

This rule applies to all multilingual pages.

 
 

Guidelines and references

Unicode, Inc (official site of the Unicode Consortium)