Strange characters

In some cases, your database returns unexpected results such as: Testù�Summary. What you expected was TestùęSummary, the value that was actually inserted. Apparently the symbol ę was not recognised and was replaced by �.
A likely reason is that the code page (character set) of the column is wrong. Characters like ę are not part of the common Latin character set, so an extended character set such as Unicode must be used.
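The effect can be imitated outside the database. The snippet below is only a sketch: it assumes that Python's latin-1 codec is a reasonable stand-in for a limited column character set, and the function name store_in_charset is made up for this illustration. Every character that the chosen character set cannot represent is replaced by the replacement character �.

```python
def store_in_charset(text: str, charset: str = "latin-1") -> str:
    """Imitate a column with a limited character set (hypothetical helper).

    Characters that cannot be encoded in `charset` are replaced by
    U+FFFD, the Unicode replacement character.
    """
    stored = []
    for ch in text:
        try:
            ch.encode(charset)       # raises if the charset lacks this character
            stored.append(ch)
        except UnicodeEncodeError:
            stored.append("\ufffd")  # the character is lost
    return "".join(stored)


if __name__ == "__main__":
    # latin-1 contains ù but not ę, so only ę is lost
    print(store_in_charset("TestùęSummary"))
    # UTF-8 can represent every Unicode character, so nothing is lost
    print(store_in_charset("TestùęSummary", "utf-8"))
```

A real DBMS may substitute a different character or raise an error instead; the point is only that the information is gone once the value has been stored.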
Fortunately, most DBMSs support Unicode. As an example, we take Teradata. Look at this code:

CREATE SET  TABLE SAN_D_FAAPOC_01.TestUnicode ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
Ident VARCHAR(255) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
Serial INTEGER,
Node VARCHAR(64) CHARACTER SET UNICODE NOT CASESPECIFIC)
PRIMARY INDEX (Serial);

insert into SAN_D_FAAPOC_01.TestUnicode(ident,node,serial)
values('TestùęSummary','TestùęSummary',1235);

SELECT	Ident, Serial, Node
FROM	SAN_D_FAAPOC_01.TestUnicode;

The results are:

	Ident	Serial	Node
1	Testù�Summary	1,235	TestùęSummary

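The result shows the difference: the LATIN column Ident lost the ę, while the UNICODE column Node kept it. The nice thing about Teradata is that the character set can be chosen per column. Hence declaring a column as CHARACTER SET UNICODE is all that is needed to store such characters.
A similar situation exists in MySQL. In that DBMS, too, we may store data in columns that are defined as Unicode. Note that in MySQL the name utf8 is an alias for utf8mb3, which stores at most 3 bytes per character; characters that need 4 bytes require utf8mb4. As an example, one may use this code snippet, in which col0 falls back to the default character set and col1 is explicitly Unicode: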

CREATE TABLE t1
(
    col0 CHAR(10),
    col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci
);

insert into t1 values('TestùęSumm','TestùęSumm');

This again illustrates the purpose of Unicode. It extends the standard ASCII character set with the aim of covering the characters of all living languages. I understand that even Gothic and musical symbols are included. ASCII is a subset of Unicode; all other characters are added on top of it.
I understand there are different encodings of Unicode. One of them is UTF-8. It uses one byte to store the common Latin characters such as 'A', 'B', '1', etc. For more exotic characters, more bytes are used: ę takes 2 bytes, the relatively recent "€" takes 3 bytes, and other characters take 4 bytes.
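These byte counts are easy to check. The following is a small sketch in Python; the helper name utf8_length and the choice of sample characters (including the musical G clef, U+1D11E) are mine and serve only as illustration.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for the given text."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # An ASCII letter, a digit, the Polish ę, the euro sign and a musical symbol
    for ch in ["A", "1", "\u0119", "\u20ac", "\U0001D11E"]:
        print(f"U+{ord(ch):04X} {ch}: {utf8_length(ch)} byte(s)")
```

The ASCII characters need 1 byte, ę needs 2, the euro sign 3 and the G clef 4.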
On average, a Western text is stored quite efficiently in UTF-8. As most characters use only 1 byte, the file size in bytes is roughly equal to the number of characters.
Another encoding is UTF-16, which uses 2 bytes for most characters and 4 bytes for the rarer ones. In that case the file size in bytes is about double the number of characters. A Western text written in UTF-16 is then roughly twice as big as it would have been in UTF-8.
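The size difference can be measured directly. The sketch below compares both encodings for the same text; the function name encoded_sizes is hypothetical, and the codec "utf-16-le" is used on purpose so that the 2-byte byte order mark that Python's plain "utf-16" codec prepends does not distort the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the character count and the size in bytes per encoding."""
    return {
        "characters": len(text),
        "utf-8": len(text.encode("utf-8")),
        # little-endian UTF-16 without a byte order mark
        "utf-16": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # 13 characters, of which ù and ę each need 2 bytes in UTF-8
    print(encoded_sizes("Test\u00f9\u0119Summary"))
    # A character outside the 2-byte range needs 4 bytes in both encodings
    print(encoded_sizes("\U0001D11E"))
```

For the test string this gives 15 bytes in UTF-8 against 26 bytes in UTF-16, close to the factor of two described above.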
As an example, I include two texts, one written in ASCII and one in UTF-8:

By tom
