The characters and glyphs of natural languages are stored in computer memory and on storage media as numbers, as is all information. Many schemes for encoding characters have been devised, all with the same object: the mapping of characters to unique numerical values. Early encoding schemes focused on the character sets used in those parts of the world (most notably Europe and the Americas) where computers were first introduced. As use of the computer spread to new regions of the world and new problem domains, the demand for representing larger and more diverse character sets grew. The response to this need was the invention of new and increasingly complex encoding schemes.
A variety of encoding schemes are used to map the symbols of our spoken, mathematical, and graphical languages into numerical codes. Each of these schemes defines a one-to-one mapping between a symbol and a number or sequence of numbers. The encoding schemes (e.g., ASCII, ANSI) for the relatively small character sets used in Latin-like written languages are simple and compact, with integer values chosen so that they fit neatly into a single byte. The integer values for these simple encoding schemes may be viewed as indices into character set tables. The desire to encode the symbols of all the world's languages, including the special-purpose notations of mathematics, music, and finance and the scripts of ancient languages, has spurred the development of a variety of more capable encoding methods.
The original ASCII character set included 128 characters. The set included punctuation marks, upper and lowercase Latin alphabetic characters, the digits 0 through 9, and a set of control characters. The ASCII mapping of this character set to 7-bit numeric values was standardized by the American National Standards Institute (ANSI) and its predecessor, the American Standards Association, to simplify the task of connecting character-oriented peripherals (e.g., printers, teletype machines, monitors, etc.) to computers built by different manufacturers.
| Dec | Hex | Char | Name | Dec | Hex | Char | Name | Dec | Hex | Char | Name |
| 0 | 00 | NUL | Null character | 43 | 2B | + | plus | 86 | 56 | V | Upper V |
| 1 | 01 | SOH | Start of Heading | 44 | 2C | , | comma | 87 | 57 | W | Upper W |
| 2 | 02 | STX | Start of Text | 45 | 2D | - | hyphen | 88 | 58 | X | Upper X |
| 3 | 03 | ETX | End of Text | 46 | 2E | . | period | 89 | 59 | Y | Upper Y |
| 4 | 04 | EOT | End of Transmission | 47 | 2F | / | forward slash | 90 | 5A | Z | Upper Z |
| 5 | 05 | ENQ | Enquiry | 48 | 30 | 0 | zero | 91 | 5B | [ | left bracket |
| 6 | 06 | ACK | Acknowledge | 49 | 31 | 1 | one | 92 | 5C | \ | backslash |
| 7 | 07 | BEL | Bell | 50 | 32 | 2 | two | 93 | 5D | ] | right bracket |
| 8 | 08 | BS | Backspace | 51 | 33 | 3 | three | 94 | 5E | ^ | caret |
| 9 | 09 | HT | Horizontal Tab | 52 | 34 | 4 | four | 95 | 5F | _ | underscore |
| 10 | 0A | LF | Line Feed | 53 | 35 | 5 | five | 96 | 60 | ` | left single quote |
| 11 | 0B | VT | Vertical Tab | 54 | 36 | 6 | six | 97 | 61 | a | Lower a |
| 12 | 0C | FF | Form Feed | 55 | 37 | 7 | seven | 98 | 62 | b | Lower b |
| 13 | 0D | CR | Carriage Return | 56 | 38 | 8 | eight | 99 | 63 | c | Lower c |
| 14 | 0E | SO | Shift Out | 57 | 39 | 9 | nine | 100 | 64 | d | Lower d |
| 15 | 0F | SI | Shift In | 58 | 3A | : | colon | 101 | 65 | e | Lower e |
| 16 | 10 | DLE | Data Link Escape | 59 | 3B | ; | semicolon | 102 | 66 | f | Lower f |
| 17 | 11 | DC1 | Device Control 1 | 60 | 3C | < | less than | 103 | 67 | g | Lower g |
| 18 | 12 | DC2 | Device Control 2 | 61 | 3D | = | equal | 104 | 68 | h | Lower h |
| 19 | 13 | DC3 | Device Control 3 | 62 | 3E | > | greater than | 105 | 69 | i | Lower i |
| 20 | 14 | DC4 | Device Control 4 | 63 | 3F | ? | question mark | 106 | 6A | j | Lower j |
| 21 | 15 | NAK | Neg. Acknowledgement | 64 | 40 | @ | at symbol | 107 | 6B | k | Lower k |
| 22 | 16 | SYN | Synchronous Idle | 65 | 41 | A | Upper A | 108 | 6C | l | Lower l |
| 23 | 17 | ETB | End Transmission Blk. | 66 | 42 | B | Upper B | 109 | 6D | m | Lower m |
| 24 | 18 | CAN | Cancel | 67 | 43 | C | Upper C | 110 | 6E | n | Lower n |
| 25 | 19 | EM | End of Medium | 68 | 44 | D | Upper D | 111 | 6F | o | Lower o |
| 26 | 1A | SUB | Substitute | 69 | 45 | E | Upper E | 112 | 70 | p | Lower p |
| 27 | 1B | ESC | Escape | 70 | 46 | F | Upper F | 113 | 71 | q | Lower q |
| 28 | 1C | FS | File Separator | 71 | 47 | G | Upper G | 114 | 72 | r | Lower r |
| 29 | 1D | GS | Group Separator | 72 | 48 | H | Upper H | 115 | 73 | s | Lower s |
| 30 | 1E | RS | Record Separator | 73 | 49 | I | Upper I | 116 | 74 | t | Lower t |
| 31 | 1F | US | Unit Separator | 74 | 4A | J | Upper J | 117 | 75 | u | Lower u |
| 32 | 20 | SP | Space | 75 | 4B | K | Upper K | 118 | 76 | v | Lower v |
| 33 | 21 | ! | Exclamation mark | 76 | 4C | L | Upper L | 119 | 77 | w | Lower w |
| 34 | 22 | " | Double quote | 77 | 4D | M | Upper M | 120 | 78 | x | Lower x |
| 35 | 23 | # | Pound sign | 78 | 4E | N | Upper N | 121 | 79 | y | Lower y |
| 36 | 24 | $ | Dollar sign | 79 | 4F | O | Upper O | 122 | 7A | z | Lower z |
| 37 | 25 | % | percent sign | 80 | 50 | P | Upper P | 123 | 7B | { | left brace |
| 38 | 26 | & | ampersand | 81 | 51 | Q | Upper Q | 124 | 7C | \| | vertical bar |
| 39 | 27 | ' | single quote | 82 | 52 | R | Upper R | 125 | 7D | } | right brace |
| 40 | 28 | ( | left paren. | 83 | 53 | S | Upper S | 126 | 7E | ~ | tilde |
| 41 | 29 | ) | right paren. | 84 | 54 | T | Upper T | 127 | 7F | DEL | Delete |
| 42 | 2A | * | asterisk | 85 | 55 | U | Upper U | | | | |
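As a rough illustration of treating the table as a lookup, the following Python sketch uses the built-in `ord` and `chr` functions to move between characters and their codes. The helper name `ascii_code` is an assumption made for this example only.

```python
def ascii_code(ch: str) -> int:
    """Return the 7-bit ASCII code of one character, or raise ValueError."""
    code = ord(ch)                      # numeric value of the character
    if code > 0x7F:                     # anything above 127 is outside ASCII
        raise ValueError(f"{ch!r} is not in the 7-bit ASCII set")
    return code

# Print decimal, hexadecimal, and 7-bit binary forms for a few table rows.
for ch in ("A", "P", "a", "~"):
    code = ascii_code(ch)
    print(f"{ch!r}: dec {code:3d}  hex {code:02X}  bits {code:07b}")

# chr() performs the reverse lookup, from code to character.
print(chr(80))
```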
The ASCII numerical codes representing characters use the seven low-order bits of a byte memory unit. The high-order bit (i.e., most significant bit) of each of these character bytes was reserved to record byte parity (a primitive error detection code whose value is set to one if the number of set bits in a memory unit is odd).
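The parity rule described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the parity bit is computed over the seven code bits, so that the full byte always ends up with an even number of set bits. The function names are hypothetical.

```python
def with_even_parity(code: int) -> int:
    """Place a parity bit in bit 7 of a 7-bit ASCII code."""
    if not 0 <= code <= 0x7F:
        raise ValueError("expected a 7-bit value")
    ones = bin(code).count("1")         # number of set bits in the code
    parity = ones & 1                   # 1 when that count is odd
    return (parity << 7) | code

def parity_ok(byte: int) -> bool:
    """True when the 8-bit value has an even number of set bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0

sent = with_even_parity(ord("C"))       # 'C' is 1000011: three set bits
print(f"{sent:02X}", parity_ok(sent))
corrupted = sent ^ 0x01                 # flip one bit "in transit"
print(f"{corrupted:02X}", parity_ok(corrupted))
```

A single flipped bit changes the count of set bits from even to odd, which is why the check fails on the corrupted byte. Two flipped bits would go undetected.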
One response to the growing need for representing new characters was the extension of the ASCII character set from 128 to 256 characters. By using the high-order (parity) bit of the memory bytes used to contain ASCII character codes, the size of the character set could be doubled. Putting this (mostly unused) parity bit into service reduced the impact the expansion of the character set would have on existing hardware and software.
There has been, however, no universal agreement on which characters to include in the character set extension. Many different sets have been defined and are in use today. Consequently, there is no standard ASCII extension. One instance of an ASCII extension, known as the ANSI character set, is shown in Table 2. ANSI characters with code values 32 to 127 correspond to those in the 7-bit ASCII character set.
Another popular ASCII extension is the one defined by ISO 8859-1, a standard developed by the International Organization for Standardization (ISO). While there is no ASCII extension regarded as "the standard", ISO 8859-1 is one of the ASCII extensions governed by a formal international standards document. ISO 8859-1 is also referred to as the ISO Latin-1 set, and is widely used throughout North and South America, Western Europe, Africa, and those countries in Asia which use Latin-like alphabets.
| Dec | Hex | Char | Name | Dec | Hex | Char | Name | Dec | Hex | Char | Name |
| 128 | 80 | € | Euro symbol | 171 | AB | « | Left, double angle quote | 214 | D6 | Ö | Upper O with diaeresis |
| 129 | 81 | | Unassigned | 172 | AC | ¬ | Not symbol | 215 | D7 | × | Multiplication symbol |
| 130 | 82 | ‚ | Single low-9 quote | 173 | AD | | Soft hyphen | 216 | D8 | Ø | Upper O with stroke |
| 131 | 83 | ƒ | Lower f with hook | 174 | AE | ® | Registered symbol | 217 | D9 | Ù | Upper U with grave |
| 132 | 84 | „ | Double low-9 quote | 175 | AF | ¯ | Macron | 218 | DA | Ú | Upper U with acute |
| 133 | 85 | … | Horizontal ellipsis | 176 | B0 | ° | Degree symbol | 219 | DB | Û | Upper U with circumflex |
| 134 | 86 | † | Dagger | 177 | B1 | ± | Plus-minus symbol | 220 | DC | Ü | Upper U with diaeresis |
| 135 | 87 | ‡ | Double dagger | 178 | B2 | ² | Superscript two | 221 | DD | Ý | Upper Y with acute accent |
| 136 | 88 | ˆ | Circumflex accent modifier | 179 | B3 | ³ | Superscript three | 222 | DE | Þ | Upper Thorn |
| 137 | 89 | ‰ | Per mille symbol | 180 | B4 | ´ | Acute accent | 223 | DF | ß | Lower sharp s |
| 138 | 8A | Š | Upper S with caron | 181 | B5 | µ | Micro symbol | 224 | E0 | à | Lower a with grave accent |
| 139 | 8B | ‹ | Left angle quote | 182 | B6 | ¶ | Pilcrow symbol | 225 | E1 | á | Lower a with acute accent |
| 140 | 8C | Œ | Latin capital ligature OE | 183 | B7 | · | Middle dot | 226 | E2 | â | Lower a with circumflex |
| 141 | 8D | | Unassigned | 184 | B8 | ¸ | Cedilla | 227 | E3 | ã | Lower a with tilde |
| 142 | 8E | Ž | Upper Z with caron | 185 | B9 | ¹ | Superscript one | 228 | E4 | ä | Lower a with diaeresis |
| 143 | 8F | | Unassigned | 186 | BA | º | Masculine ordinal indicator | 229 | E5 | å | Lower a with ring |
| 144 | 90 | | Unassigned | 187 | BB | » | Right double angle quote | 230 | E6 | æ | Lower ae |
| 145 | 91 | ‘ | Left single quote | 188 | BC | ¼ | Fraction=one quarter | 231 | E7 | ç | Lower c with cedilla |
| 146 | 92 | ’ | Right single quote | 189 | BD | ½ | Fraction=one half | 232 | E8 | è | Lower e with grave accent |
| 147 | 93 | “ | Left double quote | 190 | BE | ¾ | Fraction=three quarters | 233 | E9 | é | Lower e with acute accent |
| 148 | 94 | ” | Right double quote | 191 | BF | ¿ | Inverted question mark | 234 | EA | ê | Lower e with circumflex |
| 149 | 95 | • | Bullet | 192 | C0 | À | Upper A with grave | 235 | EB | ë | Lower e with diaeresis |
| 150 | 96 | – | En dash | 193 | C1 | Á | Upper A with acute | 236 | EC | ì | Lower i with grave accent |
| 151 | 97 | — | Em dash | 194 | C2 | Â | Upper A with circumflex | 237 | ED | í | Lower i with acute accent |
| 152 | 98 | ˜ | Small tilde | 195 | C3 | Ã | Upper A with tilde | 238 | EE | î | Lower i with circumflex |
| 153 | 99 | ™ | Trademark symbol | 196 | C4 | Ä | Upper A with diaeresis | 239 | EF | ï | Lower i with diaeresis |
| 154 | 9A | š | Lower s with caron | 197 | C5 | Å | Upper A with ring | 240 | F0 | ð | Lower eth |
| 155 | 9B | › | Right angle quote | 198 | C6 | Æ | Upper AE | 241 | F1 | ñ | Lower n with tilde |
| 156 | 9C | œ | Latin small ligature oe | 199 | C7 | Ç | Upper C with cedilla | 242 | F2 | ò | Lower o with grave accent |
| 157 | 9D | | Unassigned | 200 | C8 | È | Upper E with grave | 243 | F3 | ó | Lower o with acute accent |
| 158 | 9E | ž | Lower z with caron | 201 | C9 | É | Upper E with acute | 244 | F4 | ô | Lower o with circumflex |
| 159 | 9F | Ÿ | Upper Y with diaeresis | 202 | CA | Ê | Upper E with circumflex | 245 | F5 | õ | Lower o with tilde |
| 160 | A0 | | Non-breaking space | 203 | CB | Ë | Upper E with diaeresis | 246 | F6 | ö | Lower o with diaeresis |
| 161 | A1 | ¡ | Inverted exclamation mark | 204 | CC | Ì | Upper I with grave | 247 | F7 | ÷ | Division symbol |
| 162 | A2 | ¢ | Cent symbol | 205 | CD | Í | Upper I with acute | 248 | F8 | ø | Lower o with stroke |
| 163 | A3 | £ | Pound symbol | 206 | CE | Î | Upper I with circumflex | 249 | F9 | ù | Lower u with grave accent |
| 164 | A4 | ¤ | Currency symbol | 207 | CF | Ï | Upper I with diaeresis | 250 | FA | ú | Lower u with acute accent |
| 165 | A5 | ¥ | Yen symbol | 208 | D0 | Ð | Upper Eth | 251 | FB | û | Lower u with circumflex |
| 166 | A6 | ¦ | Broken bar | 209 | D1 | Ñ | Upper N with tilde | 252 | FC | ü | Lower u with diaeresis |
| 167 | A7 | § | Section symbol | 210 | D2 | Ò | Upper O with grave | 253 | FD | ý | Lower y with acute accent |
| 168 | A8 | ¨ | Diaeresis | 211 | D3 | Ó | Upper O with acute | 254 | FE | þ | Lower thorn |
| 169 | A9 | © | Copyright symbol | 212 | D4 | Ô | Upper O with circumflex | 255 | FF | ÿ | Lower y with diaeresis |
| 170 | AA | ª | Feminine ordinal indicator | 213 | D5 | Õ | Upper O with tilde | | | | |
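The difference between the two extensions can be seen by decoding the same byte values under each. The sketch below assumes that the ANSI set tabulated above corresponds to the codec Python names `cp1252` (Windows code page 1252). That correspondence is an assumption of this example, as is the helper name `decode_ansi`.

```python
from typing import Optional

def decode_ansi(byte: int) -> Optional[str]:
    """Decode one byte as cp1252; return None for an unassigned code."""
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return None

for byte in (0x41, 0x80, 0x81, 0x99, 0xE9):
    ansi = decode_ansi(byte)
    latin1 = bytes([byte]).decode("latin-1")   # ISO 8859-1 assigns every byte
    shown = "unassigned" if ansi is None else f"U+{ord(ansi):04X}"
    print(f"{byte:02X}: cp1252 {shown:<10}  latin-1 U+{ord(latin1):04X}")
```

The two sets agree for codes 160 through 255 and differ in the range 128 through 159, where ISO 8859-1 places control codes rather than printable characters.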
Unicode is a still-evolving computing industry standard whose designers aim to have it eventually replace older character encoding schemes that are incapable of representing many of the complex writing systems (e.g., Chinese) of the world. The Unicode Consortium manages the development of this standard. Copies of the most recent version of the Unicode Standard are available at their website.
The Unicode standardization project reserves a range of integer values for identifying characters and glyphs. These reserved values lie in the closed interval [0, 10FFFF] (hexadecimal). This range, or codespace, includes 1,114,112 values. Each value, referred to as a code point, is associated with a distinct character or glyph. The codespace is divided into seventeen planes, numbered 0 to 16, each containing 65,536 code points. These planes may be subdivided into blocks of varying sizes and used to encode symbols for a particular language or group of languages. The zeroth plane is referred to as the Basic Multilingual Plane (BMP), with code points in the closed interval [0, FFFF]. Most of the 65,536 code points in the BMP have already been assigned to characters.
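Because each plane holds exactly 10000 (hexadecimal) code points, the plane number of a code point is simply its value shifted right by 16 bits. The following is a small sketch of that arithmetic. The constant and function names are chosen for this example.

```python
PLANE_SIZE = 0x10000          # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF     # last value in the Unicode codespace

def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) containing a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return code_point >> 16   # same as code_point // PLANE_SIZE

print((MAX_CODE_POINT + 1) // PLANE_SIZE)   # number of planes
print(plane_of(0x2122))                     # trademark symbol
print(plane_of(0x1D160))                    # a musical symbol
```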
The assignment of the first 256 code points in the Unicode codespace is identical to the assignments made in the ISO 8859-1 standard (see the section entitled Extended ASCII and the ANSI Character Set). This choice simplifies the conversion of ASCII-encoded text to the Unicode standard, and reduces the impact of the Unicode standard on legacy systems.
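This identity is easy to check. The sketch below decodes every possible byte value with Python's `latin-1` codec (its name for ISO 8859-1) and compares the resulting code points with the byte values themselves.

```python
# Decode all 256 byte values as ISO 8859-1 and collect the code points.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Each byte value maps to the code point with the same numeric value.
print(code_points == list(range(256)))
```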
The Unicode standard defines two general methods for mapping code points to variable-length sequences of memory units (8-bit, 16-bit, or 32-bit). These memory units are referred to as code units. The sequences produced by these encoding methods may be from one to four code units in length. The first of these general methods is referred to as the Unicode Transformation Format (UTF) encoding method. Several variants of this method are defined, including UTF-8, UTF-16, and UTF-32. The value appearing after the hyphen indicates the size, in bits, of the code unit in the encoded sequences.
The second basic method is referred to as the Universal Character Set (UCS) encoding method. The two variants of this general encoding scheme are the UCS-2 and the UCS-4 mapping methods. Here the value following the hyphen in the method name indicates the number of bytes produced by the method during the mapping of a code point to a multi-byte sequence. The UCS-2 method is now obsolete, and the UCS-4 and UTF-32 methods are essentially equivalent.
UTF-8 and UTF-16 are the most widely employed methods for mapping Unicode code points to their memory-resident representations.
The UTF-8 method maps code points to a sequence of bytes ranging in length from 1 to 4 bytes. Each byte within the sequence contains both control bits and non-control bits. The control bits indicate how many bytes there are in a given sequence, and whether a given byte is the first in the sequence or one of the "trailing" bytes. The figure below illustrates how these control bits are interpreted.
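A minimal sketch of how the control bits might be read follows. It inspects only the leading bit pattern of a byte and does not perform the further validity checks a full decoder would apply. The function name and the labels it returns are assumptions made for this example.

```python
def classify(byte: int) -> str:
    """Classify one UTF-8 byte by its high-order control bits."""
    if (byte & 0b1000_0000) == 0b0000_0000:
        return "single"      # 0xxxxxxx: a complete 1-byte sequence
    if (byte & 0b1100_0000) == 0b1000_0000:
        return "trailing"    # 10xxxxxx: continuation byte
    if (byte & 0b1110_0000) == 0b1100_0000:
        return "lead-2"      # 110xxxxx: first byte of a 2-byte sequence
    if (byte & 0b1111_0000) == 0b1110_0000:
        return "lead-3"      # 1110xxxx: first byte of a 3-byte sequence
    if (byte & 0b1111_1000) == 0b1111_0000:
        return "lead-4"      # 11110xxx: first byte of a 4-byte sequence
    return "invalid"         # 11111xxx never appears in UTF-8

for b in "\u2122".encode("utf-8"):      # the trademark symbol
    print(f"{b:08b}  {classify(b)}")
```

The number of leading 1 bits in the first byte gives the length of the sequence, and every trailing byte starts with the bits 10.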
The non-control bits of each byte in a sequence are used to record the character code value (i.e., code point) assigned by the Unicode Standard. The way this is accomplished is most easily explained by giving an example. The Unicode integer value assigned to the trademark symbol, ™, is 8,482 base 10. Expressed as a hexadecimal number, the value is 2122. The Unicode convention for expressing this code point is U+2122. The questions needing answers are these: "How is this value encoded using the UTF-8 method?"; and "How many bytes will be required?" The answers to these questions are found by first considering the binary representation of the hexadecimal number 2122, keeping in mind the encoding details depicted in Figure 1.
At the top of Figure 2, the binary representation of the code value for the trademark symbol is given. Its representation requires 14 bits (leading zeros may be ignored). Each byte of a UTF-8 code sequence, except for the first, has six bit positions available for containing the Unicode character value (i.e., code point). The least significant six bits of the binary representation of the trademark code are inserted in the final byte of the UTF-8 sequence. The next six bits of the code are moved to the next-to-last byte of the UTF-8 sequence. This leaves only the two most significant bits of the trademark code to insert. These two bits are inserted in the low-order bit positions of a third byte, which becomes the first (leading) byte of the sequence. The control bits, 5 through 7, of this leading byte are set to indicate that the resulting UTF-8 code sequence is of length 3. Control bit 4 is reset to zero to mark the end of the initial chain of 1 bits.
Thus, the Unicode code point, U+2122, requires three bytes for its UTF-8 representation, and this three-byte sequence expressed using hexadecimal digits is,
E2 84 A2.
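The bit manipulation described above can be sketched in Python and checked against the language's built-in UTF-8 codec. The function `utf8_encode_3` is a hypothetical helper that handles only the three-byte case. It does not reject the surrogate range D800 to DFFF, which a complete encoder would.

```python
def utf8_encode_3(code_point: int) -> bytes:
    """Encode a code point in the range 0800..FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not a three-byte code point")
    lead = 0b1110_0000 | (code_point >> 12)            # 1110xxxx: top 4 bits
    mid = 0b1000_0000 | ((code_point >> 6) & 0x3F)     # 10xxxxxx: next 6 bits
    last = 0b1000_0000 | (code_point & 0x3F)           # 10xxxxxx: low 6 bits
    return bytes([lead, mid, last])

encoded = utf8_encode_3(0x2122)
print(encoded.hex(" ").upper())
print(encoded == "\u2122".encode("utf-8"))   # compare with the built-in codec
```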
An examination of Figure 1 reveals that a 4-byte UTF-8 sequence provides a total of 21 non-control bits. These 21 bits can be used to represent values in the closed interval [0, 1FFFFF], equivalent to the decimal range 0..2,097,151. However, the Unicode Standard in its current form does not associate characters with all these possible values, since its codespace ends at 10FFFF.
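The capacity of each sequence length follows directly from counting non-control bits. The short sketch below does that count. The names `PAYLOAD_BITS` and `max_value` are chosen for this example.

```python
# Non-control bits in a UTF-8 sequence of each length: the leading byte
# contributes 7, 5, 4, or 3 bits, and each trailing byte contributes 6.
PAYLOAD_BITS = {1: 7, 2: 5 + 6, 3: 4 + 2 * 6, 4: 3 + 3 * 6}

def max_value(length: int) -> int:
    """Largest value that fits in the payload bits of a sequence."""
    return (1 << PAYLOAD_BITS[length]) - 1

for length in PAYLOAD_BITS:
    print(length, PAYLOAD_BITS[length], f"{max_value(length):X}")
```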
The UTF-16 encoding method maps code points into either one or two 16-bit code units. Characters in the Basic Multilingual Plane (BMP) (i.e., code points in the range 0 to FFFF) are mapped directly to a single 16-bit word. For all other characters, the UTF-16 transformation of code points yields a pair of 16-bit words referred to as a surrogate pair. The 16-bit word containing the most significant bits of a code point is referred to as the leading or high surrogate, and the word containing the least significant bits of the code point is called the trailing or low surrogate.

The method for mapping code points to surrogate pairs is depicted in Figure 3 using, as an example, the Unicode code point U+1D160, representing the musical eighth note symbol (musical symbols: http://www.unicode.org/charts/PDF/U1D100.pdf). The high and low surrogates are first initialized to the hexadecimal values D800 and DC00, respectively. The value of the most significant five bits of the code point, decremented by one, fits in four bits and is moved to bit positions 6 through 9 of the high surrogate. The sixteen least significant bits of the code point are distributed between bit positions 0 through 5 of the high surrogate and positions 0 through 9 of the low surrogate, as indicated in Figure 3.
Program logic expressed in both the C and Ada programming languages that illustrates the UTF-16 encoding scheme may be downloaded (C_version, Ada_version).
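For readers who prefer to see the steps inline, here is a Python sketch of the surrogate-pair construction described in this section. It is not the downloadable C or Ada logic. The function name `to_surrogates` is an assumption of this example.

```python
from typing import Tuple

def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Map a code point above the BMP to a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("surrogate pairs apply only to 10000..10FFFF")
    high, low = 0xD800, 0xDC00           # initial surrogate values
    top5 = code_point >> 16              # most significant five bits (1..16)
    low16 = code_point & 0xFFFF          # sixteen least significant bits
    high |= (top5 - 1) << 6              # four bits into positions 6..9
    high |= low16 >> 10                  # upper six of the low sixteen bits
    low |= low16 & 0x3FF                 # remaining ten bits
    return high, low

high, low = to_surrogates(0x1D160)       # musical eighth note
print(f"{high:04X} {low:04X}")
```

Subtracting one from the top five bits is equivalent to subtracting 10000 (hexadecimal) from the code point, which is how the computation is often described elsewhere.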
The equivalent UTF-32 and UCS-4 encoding methods employ a simple and very direct method to represent code points. All code points, regardless of value, are mapped directly into 32-bit code units.
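Since no bit rearrangement is involved, a UTF-32 encoder amounts to writing the code point as a 32-bit integer. A minimal sketch follows, assuming big-endian byte order. The helper name `utf32_be` is chosen for this example.

```python
import struct

def utf32_be(code_point: int) -> bytes:
    """Store a code point directly in one big-endian 32-bit code unit."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return struct.pack(">I", code_point)   # ">I": big-endian unsigned 32-bit

print(utf32_be(0x1D160).hex(" ").upper())
print(utf32_be(0x1D160) == "\U0001D160".encode("utf-32-be"))
```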
Endianness refers to the order in which the bytes within a code unit are arranged in memory. A UTF-16 encoding in which the high-order byte of each code unit (including the high and low surrogates) precedes the low-order byte is said to be in Big Endian (abbreviated BE) order. The Big Endian ordering of the UTF-16 encoding of the musical eighth note symbol shown in Figure 3 would be,
D834 DD60.
A UTF-16 encoding in which the low-order byte of each code unit precedes the high-order byte is said to be in Little Endian (abbreviated LE) order. The equivalent Little Endian ordering of the UTF-16 encoding of the musical eighth note symbol would be,
34D8 60DD.
The endianness of UTF-16 encodings is indicated by appending the suffix "BE" or "LE" to the method name. The Unicode mapping method yielding UTF-16 encodings that have the low-order byte of each code unit appearing before the high-order byte is designated UTF-16LE. The mapping method in which the "natural" (high-order byte first) order is preserved is designated either UTF-16BE or simply UTF-16.
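Both byte orders can be produced with Python's built-in codecs, which are named `utf-16-be` and `utf-16-le`. The sketch below encodes the musical eighth note symbol both ways and prints the resulting bytes.

```python
note = "\U0001D160"                      # musical eighth note, U+1D160

big = note.encode("utf-16-be")           # high-order byte of each unit first
little = note.encode("utf-16-le")        # low-order byte of each unit first

print("BE:", big.hex(" ").upper())
print("LE:", little.hex(" ").upper())

# The byte order changes only within each 16-bit code unit; the order of
# the code units themselves (high surrogate, then low) stays the same.
swapped = bytes([big[1], big[0], big[3], big[2]])
print(swapped == little)
```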
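A Byte Order Mark (BOM), the code point U+FEFF, may be placed at the start of a text stream to indicate the byte order of the UTF-16 encodings that follow. When the first two bytes of the stream are FE FF, the character encodings adhere to the UTF-16BE encoding scheme; when they are FF FE, the byte ordering follows the UTF-16LE scheme. The byte-swapped value, U+FFFE, is deliberately not assigned to any character, so a program that reads it as the first code unit can conclude that the stream is in the opposite byte order.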
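A byte-order check of this kind can be sketched with the BOM constants in Python's standard `codecs` module, where `BOM_UTF16_BE` is the byte pair FE FF and `BOM_UTF16_LE` is FF FE. The function name `detect_utf16_order` is hypothetical, and the sketch makes no attempt to distinguish UTF-16 from UTF-32 streams.

```python
import codecs
from typing import Optional

def detect_utf16_order(data: bytes) -> Optional[str]:
    """Return 'big', 'little', or None when no UTF-16 BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "little"
    return None

stream = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(stream.hex(" ").upper(), detect_utf16_order(stream))
```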