Archive for August, 2011


Computers store characters as numeric codes. Those characters include letters and digits, punctuation marks, symbols such as the dollar or pound sign, and special non-printing control characters such as the one sent by the ENTER key.
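To make the idea of "numeric codes" concrete, here is a small Python sketch. The sample characters are my own picks for illustration; the built-in functions ord() and chr() convert between a character and its code.

```python
# A few sample characters: a letter, a digit, a symbol, and the
# carriage return, which is a non-printing control character.
samples = ["A", "7", "$", "\r"]

# ord() gives the numeric code behind each character.
codes = {ch: ord(ch) for ch in samples}
for ch, code in codes.items():
    print(repr(ch), "->", code)

# chr() goes the other way, from numeric code back to character.
round_trip = [chr(code) for code in codes.values()]
```

Running it shows, for example, that "A" is stored as 65 and "$" as 36.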

The ASCII (American Standard Code for Information Interchange) code was created in the early 1960s to represent the characters used in English text; such a mapping is referred to as a character set. ASCII uses 7-bit codes and represents 128 characters, from code 0 up to code 127. Later 8-bit extensions of ASCII can represent up to 256 characters, from code 0 up to code 255. The extended characters include letters that are not part of the English alphabet, like ñ.
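The difference between 7-bit and 8-bit codes can be seen with the letter ñ. This is only a sketch: I am using ISO 8859-1 (Latin-1) as the example of an 8-bit extended set, which is one of several.

```python
text = "ñ"

# 7-bit ASCII only covers codes 0 to 127, so "ñ" has no code there.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# An 8-bit set such as ISO 8859-1 has a slot for it above 127.
latin1_bytes = text.encode("latin-1")

print(fits_in_ascii)      # False
print(latin1_bytes[0])    # 241
```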

Clearly, ASCII faces severe limitations when languages other than English have to be represented, such as Chinese, Japanese, Arabic, Russian, and others whose writing systems have many more characters, very different from those in the English alphabet.


A number of standards and character sets have been created to overcome the limitations of single-octet coded sets like ASCII. The Unicode standard, introduced in the early 1990s by the Unicode Consortium, is a fairly recent character set designed to support not only the languages mentioned above but many other languages of the world. The Unicode standard, developed in parallel with the International Standard ISO/IEC 10646 (also known as the Universal Character Set), identifies each character in the set by an unambiguous name and a non-negative integer called its code point. Instead of mapping characters directly into single octets, Unicode defines separately which characters are available, which unique code point each one maps to, and how those code points are encoded as octets.
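Here is a short Python sketch of that separation between character, code point, and encoding. The letter Ж is just an example I picked; the unicodedata module ships with Python and exposes the names defined by the standard.

```python
import unicodedata

ch = "Ж"  # a Cyrillic letter, chosen only as an example

code_point = ord(ch)           # the integer that identifies the character
name = unicodedata.name(ch)    # its unambiguous name in the standard

# The same code point written out as octets by two different encodings.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")

print(f"U+{code_point:04X} {name}")
print(utf8.hex(), utf16.hex())
```

The code point stays the same (U+0416) while the octets differ from one encoding to the other.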

The Unicode standard can potentially support over 1 million characters, each mapped to a code point between 0 and 1,114,111 (1,114,112 code points in total). This allows computers and electronic communication devices to represent and store many alphabets, both ancient and modern, such as Latin, Greek, and Hebrew. Released in October 2010, Unicode 6.0 is the most current version of the standard at the time of writing.
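The size of the code point range can be checked from Python (version 3.3 or later is assumed here), where sys.maxunicode holds the highest code point the language accepts.

```python
import sys

highest = sys.maxunicode    # highest valid code point (hex 10FFFF)
total = highest + 1         # code points are numbered starting at 0

last_char = chr(highest)    # still inside the range, so this works

# One step past the end of the range is rejected.
try:
    chr(highest + 1)
    beyond_is_valid = True
except ValueError:
    beyond_is_valid = False

print(highest, total, beyond_is_valid)
```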

One advantage of the Unicode standard is that its first 256 code points correspond to those of ISO 8859-1, the most popular 8-bit character encoding for Western European languages. Because the first 128 codes of ISO 8859-1 match ASCII, the first 128 Unicode code points are also identical to ASCII.
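That compatibility is easy to verify. The sketch below relies on Python's built-in "latin-1" and "ascii" codecs and simply compares every byte value with the code point it decodes to.

```python
# Decode every possible octet as ISO 8859-1 and compare code points.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# Decode the 128 ASCII codes and compare with the first half of the above.
ascii_text = bytes(range(128)).decode("ascii")
ascii_matches = ascii_text == latin1_text[:128]

print(latin1_matches, ascii_matches)
```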

So Where’s the Catch?

Software that runs on computers has to “marry” one character set but also support others. You can see that in operating systems, communications software, software tools, and applications.

Software runs on computers in many different languages around the world, and electronic documents like web pages or PDF documents are written in many different languages too. If you want to read documents written in a language other than, say, English, German, or French, your computer system has to support the character encodings they use so that the content is presented correctly.

Добро пожаловать, Мария Шарапова любит теннис

The previous headline says ‘Welcome, Maria Sharapova likes tennis’ in Russian, which uses the Cyrillic alphabet. If the default character set of your computer system or web browser supports the Cyrillic alphabet, you should see the Cyrillic letters displayed correctly.

If the default character set of your system or web browser does not include support for the Cyrillic alphabet, you will see garbled text; check the text encoding used by your browser or your system (go to the View menu).
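The garbled text can be reproduced on purpose. In this sketch I assume the Russian greeting was saved as UTF-8 and then read back as ISO 8859-1; that pairing is only one example of a mismatch.

```python
original = "Добро пожаловать"

# The text is stored as octets using UTF-8.
raw = original.encode("utf-8")

# Reading those octets with the wrong character set gives garbage.
garbled = raw.decode("latin-1")

# Reading them with the right one gives the text back.
recovered = raw.decode("utf-8")

print(repr(garbled))
print(recovered)
```

Nothing was lost in the stored octets; only the interpretation was wrong, which is why switching the encoding in the browser fixes the display.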

Typically, different computer operating systems use different default character sets and have different ways of specifying which one to use. When documents are created on systems that use different character sets, a document created on one system may display text incorrectly on another, or not at all. For instance, if you want to read the Adobe Reader (PDF) version of the manual for your Japanese-brand digital camera, your system needs to support Japanese characters so that the text displays correctly.

When computer systems use different but compatible character sets, only some characters may display incorrectly. This usually happens when a character exists in one character set but not in the other, or when the character exists in both sets but its numeric code is different or it is represented by a different number of octets. This could be the case when visiting a website written in German, where only some characters are not found in the English alphabet.
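The three situations just described can each be shown with one character. The helper try_encode below is a name I made up for this sketch, and the character sets (ISO 8859-5, ISO 8859-1, the old IBM PC code page 437, UTF-8) are only examples.

```python
def try_encode(ch, encoding):
    """Return the octets for ch in the given set, or None if it has no code there."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None

# 1. The character exists in one set but not in the other.
in_cyrillic_set = try_encode("Я", "iso8859-5")
in_latin1 = try_encode("Я", "latin-1")

# 2. The character exists in both sets, with a different numeric code.
u_latin1 = try_encode("ü", "latin-1")
u_cp437 = try_encode("ü", "cp437")

# 3. The character is represented by a different number of octets.
u_utf8 = try_encode("ü", "utf-8")

print(in_cyrillic_set, in_latin1)
print(u_latin1, u_cp437, u_utf8)
```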

Fortunately, many programs automatically translate characters behind the scenes when computer systems use different character encodings; we don’t even notice it happening, and we can happily read those documents.
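That behind-the-scenes translation boils down to decoding the octets with the source character set and encoding them again with the target one. The function name transcode is hypothetical, and KOI8-R is just one older Russian encoding used here as the starting point.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Reinterpret octets from one character encoding into another."""
    return data.decode(source).encode(target)

# A Russian word stored in KOI8-R, one octet per letter.
word = "теннис"
koi8_bytes = word.encode("koi8-r")

# The same word converted for a system that expects UTF-8.
utf8_bytes = transcode(koi8_bytes, "koi8-r", "utf-8")

print(len(koi8_bytes), len(utf8_bytes))
print(utf8_bytes.decode("utf-8"))
```

The octets change completely (and their count doubles in this case), but the word the reader sees stays the same.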

You must be kidding! How did you type those letters!? We don’t have keyboards with the Russian alphabet (here in the U.S.). I don’t know about that, but I only went to Settings, General, Keyboard, International Keyboards on my iPad (get one if you don’t have one) and added the Russian keyboard.

Questo, que lotro, sănătate!


ASCII http://en.wikipedia.org/wiki/ASCII
Character Encoding http://en.wikipedia.org/wiki/Character_encoding
Unicode Consortium http://www.unicode.org/
Universal Character Set http://en.wikipedia.org/wiki/Universal_Character_Set
Another early character encoding http://en.wikipedia.org/wiki/EBCDIC


