2. Character Encoding
|
Although the Scholarly Societies Project is presented in English,
a large number of the societies covered have
names that require either diacritics or non-Latin
scripts if they are to be displayed properly.
The sub-sections below discuss issues concerning the encoding of
non-English characters.
|
2.1 Encoding Standards
|
Character Set
|
Encoding Standard
|
Examples
|
Western European characters with diacritics
|
These are encoded using the standard HTML entity references.
A fairly complete list of these is found at
ISO 8859-1 (Latin-1) Characters List (maintained at the University of
Toronto).
See also the
HTML 4.01 Entities Reference (maintained at
W3Schools).
|
&eacute; = é
&uuml; = ü
|
All other character sets available in the Arial Unicode font
(which, as of 12 September 2001, covers all characters in
Unicode Standard 2.1)
|
These are encoded using character references of the form
&#dddd; where dddd is the decimal equivalent of the
hexadecimal code point given in the
Unicode Standard 3.0 charts published by the
Unicode Consortium.
Note: Although it is possible to encode the hexadecimal value directly
(using the &#xhhhh; format, where hhhh is the hex value),
this is not currently recommended,
since some browsers do not support it.
|
&#1103; = я (Cyrillic)
&#2349; = भ (Devanagari)
&#4314; = ლ (Georgian)
&#29702; = 理 (CJK Ideographs)
&#54620; = 한 (Korean)
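The two conventions in 2.1 can be sketched in Python as follows. This is only an illustration, not the tool used by the Project: the function names encode_char and encode_name are hypothetical, and the sketch assumes that named entities are wanted only for the Latin-1 range (code points 160 to 255) and that plain ASCII characters are left untouched.

```python
from html.entities import codepoint2name  # standard HTML 4 entity names


def encode_char(ch):
    """Encode one character following the conventions of section 2.1.

    Assumptions (not stated in the source): ASCII passes through unchanged,
    and named entities are used only for code points 160-255 (Latin-1).
    """
    code_point = ord(ch)
    if code_point < 0x80:
        return ch  # plain ASCII, no encoding applied in this sketch
    if 0xA0 <= code_point <= 0xFF and code_point in codepoint2name:
        return "&" + codepoint2name[code_point] + ";"  # e.g. &eacute;
    return "&#" + str(code_point) + ";"  # decimal character reference


def encode_name(text):
    """Encode every character of a society name."""
    return "".join(encode_char(ch) for ch in text)


print(encode_name("Société"))  # Soci&eacute;t&eacute;
print(encode_name("я"))        # &#1103;
```

The decimal form is used deliberately here rather than the hexadecimal &#xhhhh; form, in line with the recommendation above.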
|
2.2 Encoding Techniques
|
Situation at the Society Website
|
Encoding Technique
|
The society name is encoded in the original script as text
somewhere at the website.
|
The text string is copied to the
Macchiato Unicode UTF Converter, and the
decimal code for each character in the string is retrieved, with each code
preceded by &# and followed by ;.
The result is HTML code that will display a Unicode-compliant
representation of the character string.
|
The only occurrence of the society's name in the original
script is as part of a graphic (in which case the individual
characters cannot be copied).
|
The characters in the graphic are matched one by one against the
Unicode Charts published by
the Unicode Consortium to identify
the Unicode hexadecimal value for each character.
Each value is then converted to decimal and encoded as in 2.1
above.
[Human-based pattern recognition of this sort can be rather
time-consuming, especially when a large set like the
CJK Unified Ideographs set (20,000+ characters) must be scanned.]
|
There is no occurrence of the society's name in the original
script anywhere at the society website.
|
Other sources must be consulted in order to determine the society name in
the original script.
Priority is given to web resources that appear to be authoritative.
If the search is successful, then one of the two above-mentioned
techniques may then be employed in the encoding.
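The hexadecimal-to-decimal step used when characters are identified from the Unicode Charts can be sketched as below. The function name reference_from_chart_hex is hypothetical, and accepting an optional "U+" prefix is an assumption made for convenience; this is not the Project's own tooling.

```python
def reference_from_chart_hex(hex_value):
    """Turn a hexadecimal chart value into a decimal character reference.

    Accepts forms such as "7406" or "U+7406" (the prefix handling is an
    assumption of this sketch).
    """
    cleaned = hex_value.strip()
    if cleaned.upper().startswith("U+"):
        cleaned = cleaned[2:]
    code_point = int(cleaned, 16)  # hexadecimal -> integer
    return "&#" + str(code_point) + ";"


# A name read off a graphic, one chart value per character.
chart_values = ["7406", "U+D55C"]
print("".join(reference_from_chart_hex(v) for v in chart_values))
# &#29702;&#54620;
```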
|
2.3 Verification of the Encodings
|
Once a first draft of an encoding of a society name has been completed,
the resulting character string is tested to verify that:
- the string represents the society name and nothing more, and
- the society name is correctly rendered.
The preferred tools for verification are given below in order of priority.
|
Type of Verification Tool
|
Specific Tools
|
An online translation facility
|
Altavista's Babelfish Translator
|
An online dictionary, used word by word and, where necessary, in
conjunction with a grammar of the language.
|
Specific online dictionaries are located using
Your Dictionary.com's Language Dictionaries
(which links to 1800+ dictionaries covering 250+ languages)
|
A print dictionary, used word by word and, where necessary, in
conjunction with a grammar of the language.
|
This is the last (but frequent) resort, since it relies on exact pattern
recognition by a human, rather than by a machine.
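As a mechanical complement to the tools above (and not a replacement for checking the meaning of the name), an encoded string can be decoded again and each resulting character listed by its Unicode name. The sketch below is an assumption about how such a check might look; the function names describe_encoding and matches_expected are hypothetical.

```python
import html
import unicodedata


def describe_encoding(encoded):
    """Decode entity and character references, then name each character."""
    decoded = html.unescape(encoded)
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in decoded]


def matches_expected(encoded, expected):
    """True if the encoding decodes to the expected name and nothing more."""
    return html.unescape(encoded) == expected


for ch, name in describe_encoding("&#1103;&#29702;"):
    print(ch, name)
# я CYRILLIC SMALL LETTER YA
# 理 CJK UNIFIED IDEOGRAPH-7406
```

A stray or mistyped code shows up immediately as an unexpected character name, which helps with the "nothing more" part of the check.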
|
2.4 Proper Viewing of the Encodings
|
Specific Problem
|
Solutions
|
Scripts Affected
|
The script is displaying as ????? (question marks),
||||| (vertical lines) or
□□□□□ (square boxes).
|
You need to verify that your computer has a Unicode font for the
script in question.
At the moment, the most comprehensive Unicode font is the
Arial Unicode font, which includes all character sets in the
Unicode Standard 2.1.
There exist, however, numerous Unicode fonts for particular scripts;
see, for example, Allan Wood's
Unicode fonts for Windows computers.
Arial Unicode font is available with Microsoft Office XP
and Microsoft Publisher 2002.
|
Any script for which your computer lacks a Unicode font.
|
Conjunct glyphs are displaying as their separate components
|
If the problem is with Arabic conjunct glyphs,
you may be able to solve the problem by switching to either
(a.) Netscape 6.0 or higher, or
(b.) Internet Explorer 5.0 or higher.
If the problem is with conjunct glyphs in Devanagari and other
Indic scripts,
you may be able to solve the problem by switching to the
Microsoft Windows XP operating system.
[For example, the Microsoft Windows 98 operating system definitely
does not handle Devanagari conjunct glyphs properly.]
|
Arabic
Devanagari and other Indic Scripts
|
Contextual glyphs are displaying as their isolated forms, rather
than changing as a function of their position in a word.
|
If the problem is with Arabic glyphs,
you can probably solve the problem by switching to either
(a.) Netscape 6.0 or higher, or
(b.) Internet Explorer 5.0 or higher.
|
Arabic
|
Right-to-left scripts, like Hebrew and Arabic,
are displaying backwards.
|
If the problem is with Arabic or Hebrew,
you can probably solve the problem by switching to either
(a.) Netscape 6.0 or higher, or
(b.) Internet Explorer 5.0 or higher.
|
Hebrew
Arabic
|
2.5 Outstanding Issues
|
Character Set
|
Issues
|
Certain character sets, or portions of a character set
|
The most comprehensive Unicode font is the
Arial Unicode font. At the present time, it does not cover
the characters that were added to Unicode Standard 2.1 to
create Unicode Standard 3.0, much less those added in later versions.
|
|