Unicode and .NET
Scope of this page
This is a big topic. Don't expect this page to do more than
scratch the surface - indeed, if you believe you're already fairly experienced
and knowledgeable about character encodings and the like, this page may well
not have anything new or useful for you. However, there are still many
people who don't understand the difference between binary and text, or know
what a character encoding is, etc. It is for these people that this page has
been written. It mentions a few advanced topics, but only to make the reader
aware of their existence, rather than to give much guidance on them.
The links below are probably all at least as useful as this page, and probably
more so - but there's more to read in them, too. I referred to all of them
(and more) when writing this page. There's a lot of good information, and while
there may be some inaccuracies on this page (if you spot any, please mail me at
firstname.lastname@example.org) these resources should be fairly accurate:
- The Unicode Web Site Main Page
The definitive resource about Unicode, this is somewhat intimidating but
will have all the answers you need about Unicode itself - somewhere! Some
of the links below are just helpful pages from the site.
- The Unicode Glossary
At-a-glance definitions of many of the terms used when discussing
character encoding (etc) issues.
- The Unicode FAQ
Answers to hundreds of common questions, divided into sections.
- Unix/Linux UTF-8/Unicode FAQ
Don't be put off by the title if you don't like Unix/Linux - most of the information here
is very relevant to .NET issues.
- The Unicode Character Encoding Model
Gives more information about precise meanings of "character encoding scheme" etc.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
A page somewhat similar to this one, but without the .NET emphasis.
- On the goodness of Unicode
Another introductory page which is worth a read.
Binary and text - a big distinction
Most modern computer languages (and some older ones) make a big distinction
between "binary" content and "character" (or "text") content. The difference is
largely the same as the instinctive one, but for the purposes of clarity, I'll
define it here as:
Binary content is a sequence of octets (bytes in common parlance) with
no intrinsic meaning attached. Even though there may be external means
of understanding a piece of binary content to be, say, a picture, or
an executable file, the content itself is just a sequence of bytes.
(Note for pedantic readers: from now on, I won't use the word "octet".
I'll use "byte" instead, even though strictly speaking a byte needn't be
an octet. There have been architectures with 9-bit bytes, for instance.
I don't believe that's a particularly relevant or useful distinction to
make in this day and age, and readers are likely to be more comfortable
with the word "byte".)
Character content is a sequence of characters.
The Unicode Glossary defines a character as:
- The smallest component of written language that has semantic value;
refers to the abstract meaning and/or shape, rather than a specific
shape (see also glyph), though in code tables some form of visual
representation is essential for the reader's understanding.
- Synonym for abstract character. (See Definition D3 in Section 3.3,
Characters and Coded Representations.)
- The basic unit of encoding for the Unicode character encoding.
- The English name for the ideographic written elements of Chinese origin.
(See ideograph (2).)
That may or may not be a terribly useful definition to you, but for the most
part you can again use your instinctive understanding - a character is something like
"the capital letter A", "the digit 1" etc. There are other characters which are less
obvious, such as: combining characters such as "an acute accent", control characters
such as "newline", and formatting characters (invisible, but affect surrounding
characters). The important thing is that these are fundamentally "text" in some form
or other. They have some meaning attached to them.
Now, unfortunately in the past, this distinction has been very blurred - C programmers
are often used to thinking of "byte" and "char" as being interchangeable, to the extent
that they will talk about reading a certain number of characters, even when the content
is entirely binary. In modern environments such as .NET and Java, where the distinction
is clear and present in the IO libraries, this can lead to people attempting to copy
binary files by reading and writing characters, resulting in corrupt output.
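To make the corruption concrete, here is a minimal sketch of what happens when arbitrary bytes are pushed through a text decoder and back. The class and method names are made up for illustration, and the four sample bytes are just an arbitrary sequence which isn't valid UTF-8.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BinaryAsTextDemo
{
    // Decodes arbitrary bytes as UTF-8 text and re-encodes them, which is
    // effectively what copying a binary file "as characters" does.
    public static byte[] RoundTripThroughText(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        return Encoding.UTF8.GetBytes(text);
    }

    public static void Main()
    {
        // Sample "binary" content: 0xff can never appear in valid UTF-8.
        byte[] original = { 0xff, 0xd8, 0xff, 0xe0 };
        byte[] copy = RoundTripThroughText(original);

        Console.WriteLine(BitConverter.ToString(original));
        Console.WriteLine(BitConverter.ToString(copy));
        Console.WriteLine(original.SequenceEqual(copy)); // False: the copy is corrupt
    }
}
```

Content which happens to be plain ASCII survives the trip, which is partly why this mistake can go unnoticed for a while.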
Where does Unicode come in?
The Unicode Consortium is a body trying to standardise the handling of character data,
including its transformation to and from binary form (otherwise known as encoding and
decoding). There is also a set of ISO standards (10646 in various versions) which do
similar things; Unicode and ISO 10646 can largely be regarded as "the same thing" in
that they are compatible in almost all respects. (In theory ISO 10646 defines a larger
potential set of characters, but this is never likely to become an issue.) Most modern
computer languages and environments, such as .NET and Java, use Unicode for character
representations. Unicode defines, amongst other things, an
abstract character repertoire (the set of characters it covers), a
coded character set (a mapping from each character in the repertoire to a
non-negative integer), some character encoding forms (mappings from the
non-negative integers in the coded character set to sequences of "code units" (e.g. bytes)),
and some character encoding schemes (mappings from sequences of code units into
serialized byte sequences). The difference between a character encoding form and a
character encoding scheme is slightly subtle, but takes account of things like endianness.
(For instance, the single UCS-2 code unit 0xc2a9 may be serialized as the bytes 0xc2 0xa9 or
0xa9 0xc2, and it's the character encoding scheme that decides that.)
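The endianness point can be seen directly in .NET. This is only a sketch (the class and helper names are mine), using the two built-in 16-bit encodings to serialize the same single code unit in both byte orders.

```csharp
using System;
using System.Text;

public static class EndiannessDemo
{
    // Returns the encoded bytes as a hex string such as "C2-A9".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string text = "\uc2a9"; // one char, i.e. one 16-bit code unit

        Console.WriteLine(Hex(Encoding.BigEndianUnicode, text)); // C2-A9
        Console.WriteLine(Hex(Encoding.Unicode, text));          // A9-C2
    }
}
```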
The Unicode abstract character repertoire can, in theory, hold up to 1114112 characters,
although many are reserved to be invalid and the rest aren't all likely to ever be
assigned. Each character is coded as an integer between 0 and 1114111 (0x10ffff).
For instance, capital A is coded as 65. Until a few years ago, it was hoped that only
characters in the range 0 to 2^16-1 would be required, which would have meant that each
character would only have required 2 bytes to be represented. Unfortunately, more characters
were needed, so surrogate pairs were introduced. They confuse things significantly (at least,
they confuse me significantly) and most of the rest of this page will ignore
their existence - I'll cover them briefly in the "nasty bits" section.
What does .NET provide?
If all of this sounds rather confusing, don't worry. It's worth being aware of the
distinctions above, but they don't often actually come to the fore. Most of the time
you just want to convert some bytes into some characters, and vice versa. This is where the
System.Text.Encoding class comes in, along with the System.Char structure (aka char in C#)
and the System.String class (aka string in C#).
char is the most basic character type. Each char is a single Unicode character. It takes
2 bytes in memory, and can take a value of 0-65535. Note that not all of these values are
actually valid Unicode characters.
string is just a sequence of chars, fundamentally. It's immutable, which means that once
you've created a string instance (however you've done it) you can't change it - the various
methods in the string class which suggest that they're changing the string in fact just
return a new string which is the original character sequence with the changes applied.
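A few lines of C# show the points above about char and string. This is just an illustrative sketch; the class name is arbitrary.

```csharp
using System;

public static class CharAndStringDemo
{
    public static void Main()
    {
        char a = 'A';
        Console.WriteLine((int) a);             // 65
        Console.WriteLine(sizeof(char));        // 2
        Console.WriteLine((int) char.MaxValue); // 65535

        // "Changing" a string really creates a new one; the original is untouched.
        string original = "hello";
        string upper = original.ToUpperInvariant();
        Console.WriteLine(original); // hello
        Console.WriteLine(upper);    // HELLO
    }
}
```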
The System.Text.Encoding class provides facilities for converting
arrays of bytes to arrays of characters, or strings, and vice versa. The class itself
is abstract; various implementations are provided by .NET and can easily be instantiated,
and users can write their own derived classes if they wish. (This is quite a rare
requirement, however - most of the time you'll be fine with the built-in implementations.)
An encoding can also provide separate encoders and decoders, which maintain state between
calls. This is necessary for multi-byte character encoding schemes, where you may not be
able to decode all the bytes you have so far received from a stream. For instance, if a
UTF-8 decoder receives 0x41 0xc2, it can return the first character (a capital A) but must
wait for the third byte to determine what the second character is.
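The example in the previous paragraph can be reproduced with a Decoder obtained from Encoding.UTF8. The helper below is a sketch of my own rather than anything from the framework; it feeds each chunk of bytes to the same decoder and records what each call produced.

```csharp
using System;
using System.Text;

public static class DecoderDemo
{
    // Feeds each chunk to a single stateful decoder, returning the text
    // produced by each call.
    public static string[] DecodeChunks(params byte[][] chunks)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        string[] results = new string[chunks.Length];
        for (int i = 0; i < chunks.Length; i++)
        {
            char[] buffer = new char[Encoding.UTF8.GetMaxCharCount(chunks[i].Length)];
            int count = decoder.GetChars(chunks[i], 0, chunks[i].Length, buffer, 0);
            results[i] = new string(buffer, 0, count);
        }
        return results;
    }

    public static void Main()
    {
        string[] parts = DecodeChunks(new byte[] { 0x41, 0xc2 }, new byte[] { 0xa9 });
        Console.WriteLine(parts[0]);                          // A
        Console.WriteLine(((int) parts[1][0]).ToString("X")); // A9 (copyright sign)
    }
}
```

The first call returns only the capital A; the decoder holds on to 0xc2 until the next byte arrives and completes the character.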
Built-in encoding schemes
.NET provides various encoding schemes "out of the box". What follows below is a
description (as far as I can find) of the various different encoding schemes,
and how they can be retrieved.
ASCII
ASCII is one of the most commonly known and frequently misunderstood character encodings.
Contrary to popular belief, it is only 7 bit - there are no ASCII characters above 127.
If anyone says that they wish to encode (for example) "ASCII 154" they may well not
know exactly which encoding they actually mean. If pressed, they're likely to say it's
"extended ASCII". There is no encoding scheme called "extended ASCII". There are many
8-bit encodings which are supersets of ASCII, and usually it is one of these which is
meant - commonly whatever Windows Code Page is the default for their computer. Every ASCII
character has the same value in the ASCII encoding as in the Unicode coded character set
- in other words, ASCII x is the same character as Unicode x for all characters
within ASCII. The .NET ASCIIEncoding class (an instance of which can be easily retrieved
using the Encoding.ASCII property) was slightly odd in .NET 1.x, in my view, as it appeared
to encode by merely stripping away all bits above the bottom 7. This meant that, for instance,
Unicode character 0xb5 ("micro sign") after encoding and decoding would become Unicode 0x35
("digit five"), rather than some character showing that it was the result of encoding
a character not contained within ASCII. From .NET 2.0 onwards the default behaviour is to
replace any such character with a question mark instead.
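On .NET 2.0 and later, as far as I'm aware, Encoding.ASCII substitutes a question mark for anything outside ASCII rather than stripping bits. The sketch below shows that, along with one way of detecting non-ASCII text by asking for an encoding which throws instead. The class and method names are mine.

```csharp
using System;
using System.Text;

public static class AsciiDemo
{
    // Encodes to ASCII and decodes again using the default behaviour.
    public static string RoundTrip(string text)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
    }

    // Returns true only if every character can be represented in ASCII.
    public static bool IsPureAscii(string text)
    {
        Encoding strict = Encoding.GetEncoding("us-ascii",
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("5\u00b5"));   // 5? (micro sign replaced)
        Console.WriteLine(IsPureAscii("hello"));   // True
        Console.WriteLine(IsPureAscii("5\u00b5")); // False
    }
}
```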
UTF-8
UTF-8 is a good general-purpose way of representing Unicode characters. Each character is
encoded as a sequence of 1-4 bytes. (All the characters < 65536 are encoded in 1-3 bytes;
.NET encodes a valid surrogate pair as a single sequence of 4 bytes rather than as two
sequences of 3 bytes.) It can represent all characters, and it is "ASCII-compatible"
in that any sequence of characters in the ASCII set is encoded in UTF-8 to exactly the same
sequence of bytes as it would be in ASCII. In addition, the first byte is sufficient to say how
many additional bytes (if any) are required for the whole character to be decoded. UTF-8
itself needs no byte-ordering mark (BOM) although it could be used as a way of giving
evidence that the file is indeed in UTF-8 format. The UTF-8 encoded BOM is always
0xef 0xbb 0xbf. Obtaining a UTF-8 encoding in .NET is simple - use the Encoding.UTF8
property. In fact, a lot of the time you don't even need to do that - many classes
(such as StreamWriter) use UTF-8 by default when no encoding is specified.
(Don't be misled by Encoding.Default - that's something else entirely!) I suggest
always specifying the encoding however, just for the sake of readability.
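For a feel of the 1-4 byte range, this small sketch prints the encoded size of a few sample strings, plus the BOM that Encoding.UTF8 reports as its preamble. The sample characters and the helper name are my own choices.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Number of bytes needed to encode the text as UTF-8 (without a BOM).
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Length("A"));          // 1
        Console.WriteLine(Utf8Length("\u00e9"));     // 2 (e with acute accent)
        Console.WriteLine(Utf8Length("\u20ac"));     // 3 (euro sign)
        Console.WriteLine(Utf8Length("\U0001D11E")); // 4 (one surrogate pair)

        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF
    }
}
```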
UTF-16 and UCS-2
UTF-16 is effectively how characters are maintained internally in .NET. Each
character is encoded as a sequence of 2 bytes, other than those represented
by surrogate pairs, which take 4 bytes. The ability to use surrogates is the
only difference between UTF-16 and UCS-2 (also known as just "Unicode"), the latter of
which can only represent characters 0-0xffff. UTF-16 can be big-endian,
little-endian, or machine-dependent with optional BOM (0xff 0xfe for
little-endianness, and 0xfe 0xff for big-endianness). In .NET itself, I
believe the surrogate issues are effectively forgotten, and each value in
the surrogate pair is treated as an individual character, making UCS-2 and
UTF-16 "the same" in a fuzzy sort of way. (The exact differences between
UCS-2 and UTF-16 rely on deeper understanding of surrogates than I have,
I'm afraid - if you need to know details of the differences, chances are
you'll know more than I do anyway.) A big-endian encoding may be
retrieved using Encoding.BigEndianUnicode, and a little-endian encoding
may be retrieved using Encoding.Unicode. Both are instances of
System.Text.UnicodeEncoding, which can also be constructed directly with
appropriate parameters for whether or not to emit the BOM and which
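endianness to use when encoding. Note that decoding a byte array with one
of these encodings does not examine a BOM in the content: the endianness of
the encoding is always used, and a leading BOM is decoded like any other
character. StreamReader, on the other hand, looks for a BOM at the start of
the stream by default and switches encoding accordingly, so when reading
that way the programmer doesn't need to do any extra work if they either
know the endianness or the content contains a BOM.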
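Here is a quick check of the BOM handling, as a sketch with made-up names. It asks a StreamReader for little-endian UTF-16 but lets it detect a BOM (the third constructor argument), and compares that with decoding the same bytes directly through the encoding.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf16BomDemo
{
    // Reads all the text via a StreamReader which is told to expect
    // little-endian UTF-16 but is allowed to detect a BOM.
    public static string ReadWithBomDetection(byte[] data)
    {
        using (StreamReader reader =
            new StreamReader(new MemoryStream(data), Encoding.Unicode, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        byte[] bigEndianWithBom = { 0xfe, 0xff, 0x00, 0x41 }; // BOM, then "A"
        Console.WriteLine(ReadWithBomDetection(bigEndianWithBom)); // A

        // The encoding on its own just treats all four bytes as little-endian data.
        Console.WriteLine(Encoding.Unicode.GetString(bigEndianWithBom).Length); // 2
    }
}
```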
UTF-7
UTF-7 is rarely used, in my experience, but encodes Unicode (possibly only the first 65535
characters) entirely into ASCII characters (not bytes!). This can be useful for mail where
the mail gateway may only support ASCII characters, or some subset of ASCII (in, for example,
the EBCDIC encoding). This description sounds fairly woolly for a reason: I haven't looked
into it in any detail, and don't intend to. If you need to use it, you'll probably understand
it reasonably well anyway, and if you don't absolutely have to use it, I'd suggest steering
clear. An encoding instance in .NET can be retrieved using the Encoding.UTF7 property.
Windows/ANSI Code Pages
Windows Code Pages are usually either single or double byte character sets, encoding
up to 256 or 65536 characters respectively. Each is numbered, and an encoding for a known
code page number can be retrieved using Encoding.GetEncoding(int). Code
pages are mostly useful for legacy data which is often stored in the "default code page".
An encoding for the default code page can be retrieved using Encoding.Default. Again, I try
to avoid using code pages where possible. More information is available in the MSDN.
ISO-8859-1 (Latin-1)
Like ASCII, every character in Latin-1 has the same code there as in Unicode. I haven't
been able to ascertain for certain whether or not Latin-1 has a "hole" of undefined
characters from 128 to 159, or whether it contains the same control characters there that
Unicode does. (I had begun to lean towards the "hole" idea, but
Wikipedia disagrees, so I'm still
sitting on the fence). Latin-1 is also code page 28591, so obtaining an encoding for it is
simple - use Encoding.GetEncoding(28591).
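Whatever the standard itself says about 128 to 159, the .NET encoding for code page 28591 appears to map every byte straight to the Unicode character with the same number. The sketch below (names are mine) checks all 256 values.

```csharp
using System;
using System.Text;

public static class Latin1Demo
{
    // True if every byte value decodes to the Unicode character
    // with the same numeric value.
    public static bool BytesMatchCodePoints()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        for (int b = 0; b < 256; b++)
        {
            string decoded = latin1.GetString(new byte[] { (byte) b });
            if (decoded.Length != 1 || decoded[0] != b)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes("\u00e9"))); // E9
        Console.WriteLine(BytesMatchCodePoints());                           // True
    }
}
```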
Streams, readers and writers
Streams are by their nature binary - they read and write bytes, fundamentally.
Anything which takes a string is going to do some kind of conversion to bytes, which may
or may not be what you want. The equivalents of streams for reading and writing text are
System.IO.TextReader and System.IO.TextWriter respectively. If
you have a stream already, you can use System.IO.StreamReader (which derives from
TextReader) and System.IO.StreamWriter (which derives from
TextWriter) respectively, constructing them with the stream and
the encoding you wish to use. If you don't specify the encoding, UTF-8 is assumed.
Here is some example code to convert a file from UTF-8 to UCS-2:
    using System;
    using System.IO;
    using System.Text;
    public class FileConverter
    {
        const int BufferSize = 8096;
        public static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine ("Usage: FileConverter <input file> <output file>");
                return;
            }
            // Read characters as UTF-8; write them out as little-endian 16-bit Unicode.
            using (TextReader input = new StreamReader (new FileStream (args[0], FileMode.Open), Encoding.UTF8))
            using (TextWriter output = new StreamWriter (new FileStream (args[1], FileMode.Create), Encoding.Unicode))
            {
                char[] buffer = new char[BufferSize];
                int len;
                while ( (len = input.Read (buffer, 0, BufferSize)) > 0)
                    output.Write (buffer, 0, len);
            }
        }
    }
Note that this demonstrates using the constructors for StreamReader and
StreamWriter which take streams. There are also constructors which take
filenames as parameters, so that you don't have to manually open a
FileStream in your code. Other parameters, such as the buffer size
and whether or not to detect a BOM if present, are available - see the documentation for
more details. Finally, as of .NET 2.0 you should also look at the
File class for all kinds of convenience methods.
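As an example of those convenience methods, File.WriteAllText and File.ReadAllText both have overloads taking an encoding. The helper below is a sketch of my own which round-trips some text through a temporary file.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileHelpersDemo
{
    // Writes the text to a temporary file in the given encoding,
    // reads it back with the same encoding, then deletes the file.
    public static string WriteThenRead(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, encoding);
            return File.ReadAllText(path, encoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string text = "caf\u00e9 \u20ac5";
        Console.WriteLine(WriteThenRead(text, Encoding.UTF8) == text);    // True
        Console.WriteLine(WriteThenRead(text, Encoding.Unicode) == text); // True
    }
}
```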
The nasty bits
Okay, so those are the basics of Unicode. There are then lots of extra bits, some of
which have already been hinted at, and which people ought to be aware of, even
if they deem them too unlikely to be relevant for their application to be worth sorting
out. I don't offer any general techniques or guiding principles here - I'm just trying
to raise some awareness. This is by no means an exhaustive list, either - these are
just some of the nasty bits. It's important to recognise that a lot of the difficulty
here is in no way the fault of the Unicode Consortium - just as with dates and times
and any number of other internationalisation problems, humanity has got itself into
a fundamentally tricky situation over the course of its history.
Culture-sensitive searching and casing
These are covered in my article on .NET string handling.
Surrogate pairs
Now that Unicode has more than 65536 characters, it can't be represented in two bytes.
This means that a .NET
char value can't store all possible values. The
solution UTF-16 uses is that of surrogate pairs: pairs of 16-bit values where each
value is between 0xd800 and 0xdfff. In other words, two "sort of" characters make
one "real" character. (UCS-4 and UTF-32 get round this problem entirely by having
wider values to start with - when everything's four bytes, you can get all possible
characters in.) This is basically a headache - it means that a string of 10 chars can
actually represent anywhere between 5 and 10 "real" Unicode characters. Fortunately,
most applications which don't involve scientific/mathematical notation and Han characters
are unlikely to need to worry too much about them. Whether or not that applies to you
is a different matter - and exactly which bits of your code are sensitive to surrogates
will also vary between applications.
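To see the mismatch between chars and "real" characters, here is a sketch which counts code points by treating each surrogate pair as one. The method name is mine, and the sample character (U+20B20) is just one which happens to lie beyond 0xffff.

```csharp
using System;

public static class SurrogateDemo
{
    // Counts Unicode code points, treating each surrogate pair as one.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                i++; // skip the low surrogate
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "a\U00020B20b";
        Console.WriteLine(s.Length);           // 4
        Console.WriteLine(CountCodePoints(s)); // 3

        Console.WriteLine(((int) s[1]).ToString("X"));              // D842 (high surrogate)
        Console.WriteLine(((int) s[2]).ToString("X"));              // DF20 (low surrogate)
        Console.WriteLine(char.ConvertToUtf32(s, 1).ToString("X")); // 20B20
    }
}
```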
Combining characters
Not all characters should result in a single character being drawn on the screen.
An accented character can be represented as the unaccented character followed by a
combining accent character. Some GUI systems will support combining characters, some
won't - and the impact on your application will depend on what assumptions you're making.
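One assumption which is easy to make by accident is that a string's Length is the number of things a user would see. As a sketch (the helper name is mine), System.Globalization.StringInfo can count "text elements" instead, which keeps a base character together with its combining characters.

```csharp
using System;
using System.Globalization;

public static class CombiningDemo
{
    // Counts text elements: a base character plus any combining
    // characters which follow it counts as one.
    public static int TextElementCount(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    public static void Main()
    {
        string accented = "e\u0301"; // e followed by a combining acute accent
        Console.WriteLine(accented.Length);            // 2
        Console.WriteLine(TextElementCount(accented)); // 1
    }
}
```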
Normalization
Partly due to things like combining characters, there can be several ways of representing
what is in some senses a single character. Character sequences can be normalised to use
combining characters wherever possible, or to avoid using combining characters wherever
possible. Should your application treat two different sequences representing the same
actual character as different or the same? Do any components you need rely on sequences
being normalized in one particular way?
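In .NET, string.Normalize converts between these forms. The sketch below (names are mine) shows that the two spellings of an accented e differ under an ordinal comparison but match once both are normalized to the same form.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Compares two strings after normalizing both to form C
    // (which avoids combining characters where possible).
    public static bool SameWhenNormalized(string a, string b)
    {
        return string.Equals(a.Normalize(NormalizationForm.FormC),
                             b.Normalize(NormalizationForm.FormC),
                             StringComparison.Ordinal);
    }

    public static void Main()
    {
        string composed = "\u00e9";    // single accented character
        string decomposed = "e\u0301"; // e plus combining acute accent

        Console.WriteLine(composed == decomposed);                  // False
        Console.WriteLine(SameWhenNormalized(composed, decomposed)); // True
        Console.WriteLine(composed.Normalize(NormalizationForm.FormD).Length); // 2
    }
}
```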
It can be cumbersome to work out some of the details of this by hand, so you can use the
tool on this page to examine whatever text you
enter into the text field. Currently I don't have any support for going the other way (e.g.
from UTF-16 code units to text) but hopefully this is still useful.
This table breaks down the text in the text-box into Unicode characters. It does not
perform any kind of normalization, so an accented character may appear as one character or more,
depending on whether it is entered as a single character including the accent (e.g. é), or a non-accented
character followed by combining characters (e.g. é - yes, that really is different to the previous example; copy and paste them both to see!).
However, it does break the input into Unicode characters
instead of just UTF-16 code units; a surrogate pair is treated as a single character. For example,
𠬠 (which apparently isn't a valid Unicode character, but appears to have a commonly understood
meaning and glyph) is shown as U+20B20.
The first column simply displays the character. The second column displays the Unicode code point
(U+0000 to U+10FFFF), suitable for looking up in Unicode code charts.
The third column displays the UTF-16 code units which make up the character. For characters in the
Basic Multilingual Plane this will just be a single code unit; for other characters it will be the surrogate pair (high then low).
The fourth column displays the UTF-8 representation of the character in bytes.