Encoding

ASCII, Unicode, UTF-8

As you know, a computer stores all data in binary, as sequences of 0s and 1s.

Americans invented the ASCII encoding system, which contains 128 characters (letters, digits, punctuation and control characters). That is enough for English, but not for other languages: the French character é, for example, can't be expressed in ASCII.

Chinese is a bigger problem. There are more than 10k Chinese characters, far too many to fit in a single byte, which can hold only 256 values. Chinese characters are therefore expressed with two bytes each in the GB2312 encoding system (which has nothing to do with Unicode or UTF-8).
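Both limits are easy to check for yourself. The snippet below is only a small sketch, assuming Python 3 and its built-in "ascii" and "gb2312" codecs (the choice of Python is an assumption for illustration, the text above does not depend on it). It tries to encode é as ASCII, then encodes one Chinese character as GB2312.

```python
# ASCII only defines codes 0..127, so 'é' has no ASCII encoding.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# GB2312 stores one Chinese character in two bytes.
zhong = "中".encode("gb2312")

print(ascii_ok)     # False
print(len(zhong))   # 2
print(zhong.hex())  # d6d0
```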

Unicode

There are many encoding systems in the world, so the same binary number can be interpreted as different characters depending on which encoding is used to read it. That's why you sometimes see garbage characters. Now imagine one encoding system that includes all the characters in the world and gives every character its own unique code: this is Unicode.
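To see how garbage characters come about, here is a minimal sketch, again assuming Python 3 and its built-in codecs ("latin-1", "cp1251" and "gb2312" are picked only as examples). The same bytes are decoded with different encodings and give different characters.

```python
# One byte, two interpretations under two single-byte encodings.
one_byte = b"\xe9"
as_latin1 = one_byte.decode("latin-1")  # Western European reading
as_cp1251 = one_byte.decode("cp1251")   # Cyrillic reading

# Two bytes written as GB2312, then read back with the wrong encoding.
two_bytes = "中".encode("gb2312")
right = two_bytes.decode("gb2312")
wrong = two_bytes.decode("latin-1")

print(as_latin1, as_cp1251)  # é й
print(right, wrong)          # 中 ÖÐ
```

Nothing is damaged in the bytes themselves; the garbage appears only because the reader guessed a different encoding from the writer.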

Unicode is a huge set: it has room for more than 1 million code points, of which well over 100,000 have been assigned to characters so far. You can visit unicode.org for more detailed information.

Unicode Issue

You should be aware that Unicode only provides the character set and the number (code point) assigned to each character. It doesn't tell you how that number is stored as bytes.
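The difference between a character's number and its stored bytes can be sketched like this (assuming Python 3; "utf-8", "utf-16-be" and "utf-32-be" are standard Python codec names, used here only as examples of storage schemes). One code point is stored three different ways.

```python
ch = "é"
code_point = ord(ch)  # the number Unicode assigns to the character

# Three different ways of storing that same number as bytes.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(f"U+{code_point:04X}")                 # U+00E9
print(utf8.hex(), utf16.hex(), utf32.hex())  # c3a9 00e9 000000e9
```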

There are two chief issues. The first is how to distinguish Unicode from ASCII: how does the computer know that 3 bytes express one character and not 3 separate characters? The second is storage. We all know one English character needs only one byte; if every character were stored in 3 or 4 bytes, that would be a big waste, and that isn't acceptable.
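UTF-8 is one widely used answer to both issues. It is a variable-length encoding in which ASCII characters keep their single byte, and the leading bits of each byte say whether it stands alone (0...), starts a multi-byte character (110..., 1110..., 11110...) or continues one (10...). The sketch below assumes Python 3; `utf8_bits` is a hypothetical helper written only for this illustration.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as 8-bit binary strings."""
    return [f"{b:08b}" for b in ch.encode("utf-8")]

text = "hello"
fixed = len(text.encode("utf-32-be"))  # 4 bytes for every character
variable = len(text.encode("utf-8"))   # 1 byte for each ASCII character

print(fixed, variable)  # 20 5
print(utf8_bits("A"))   # ['01000001']
print(utf8_bits("中"))  # ['11100100', '10111000', '10101101']
```

The first byte of 中 starts with 1110, which tells the decoder that this byte and the next two belong to one character, while A starts with 0 and is read as plain ASCII.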

UTF-8/Unicode

Unicode is a worldwide character-encoding standard. Compared to the older mechanisms for handling character and string data, using Unicode to represent character and string data in your applications enables universal data exchange for global markets, because a single binary file can handle every possible character code.

 

Posted by

690130229

coder, likes quiet, likes reading, wechat: leslie-liya
