Codepages

Recently, I encountered the issue of codepages again. The problem was that the name of a Turkish city (İstanbul) was not represented correctly. Again a codepage issue. The character İ was not part of the application.

How did these codepages come around?

In the early days of Windows 95, we had ASCII. This had 127 different characters that were represented with 7 bit. For some network reason an additional bit was necessary which led to the situation that 8 bits (a full byte) was necessary to represent an ASCII character. The set of ASCII characters contains the normal a,b, c etc, both in lower case and upper case. This is added with numerics (0,1,2,3, ..) and signs like the question mark, comma, dot etc. We miss in the ASCII the language specific characters like È, ç characters.
To cover for the Western European languages an additional series of 128 characters was created. The original ASCII plus the West European special characters ( È, ç) could be represented in 8 bits. Hence one byte was sufficient to display all characters in the Western European languages. This list is indicated by extended ASCII, Codepage 1252 or ISO -8859-1 or Latin1. Roughly these codepages allow us to represent West European texts like French, English, Spanish, German and Dutch.

However if one would like to write İstanbul, one encounters the limitations of these codepages. Hence other codepages are created to represent Baltic, Turkish or Czech. In most cases the alphabets that are derived from Latin contain the original ASCII list plus a series of characters (İ, Ď ) that are specific to these languages. The idea is that one byte represents one character.
Likewise, other languages like Russian, Hebrew and Arabic have created a list of characters whereby one byte is used to represent a character.

However, one might run into problems if an attribute should contain both Turkish and French names. What codepage to use? To solve this a new codepage is created: the unicode. Here, roughly two bytes are used which allows to store a nearly infinite set of characters. Standard characters are represented with one byte; more exotic characters use 2, 3 or 4 characters. One might even store Japanese, Korean in unicode.

Unicode certainly has the future. As we speak, one sees many migration documents where a ASCII codepage is migrated to unicode. I am also informed that more and more dbms recommend unicode as their default codepage.

Breaking

Door tom

Gerelateerd bericht

Je miste

Flask and JSON

A webserver from the command line

Use the node.js server as restful app server

Reading a CSV file and translate into dataframe