Binary floating point and .NET

Lots of people are at first surprised when some of their arithmetic comes out "wrong" in .NET. This isn't something specific to .NET in particular - most languages/platforms use something called "floating point" arithmetic for representing non-integer numbers. This is fine in itself, but you need to be a bit aware of what's going on under the covers, otherwise you'll be surprised at some of the results.

It's worth noting that I am not an expert on this matter. Since writing this article, I've found another one - this time written by someone who really is an expert, Jeffrey Sax. I strongly recommend that you read his article on floating point concepts too.

What is floating point?

Computers always need some way of representing data, and ultimately those representations will always boil down to binary (0s and 1s). Integers are easy to represent (with appropriate conventions for negative numbers, and with well-specified ranges to know how big the representation is to start with) but non-integers are a bit more tricky. Whatever you come up with, there'll be a problem with it. For instance, take our own normal way of writing numbers in decimal: that can't (in itself) express a third. You end up with a recurring 3. Whatever base you come up with, you'll have the same problem with some numbers - and in particular, "irrational" numbers (numbers which can't be represented as fractions) like the mathematical constants pi and e are always going to give trouble.

You could store all the rational numbers exactly as two integers, with the number being the first integer divided by the second - but the integers can grow quite large quite quickly even for "simple" operations, and things like square roots will tend to produce irrational numbers. There are various other schemes which also pose problems, but the one most systems use in one form or other is floating point. The idea is that you have one value (the mantissa, ultimately stored as an integer) which gives some scaled representation of the number, and another (the exponent) which says what the scale is, in terms of "where does the dot go". For instance, 34.5 could be represented in "decimal floating point" as mantissa 3.45 with an exponent of 1, whereas 3450 would have the same mantissa but an exponent of 3 (as 34.5 is 3.45 x 10¹, and 3450 is 3.45 x 10³). Now, that example is in decimal just for simplicity, but the most common floating point formats are binary. For instance, the binary mantissa 1.1 with an exponent of -1 would mean decimal 0.75 (binary 1.1 == decimal 1.5, and the exponent of -1 means "divide by 2" in the same way that a decimal exponent of -1 means "divide by 10").
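As a rough sketch of the "mantissa scaled by an exponent" idea, the C# below rebuilds the binary example from the paragraph above. The Scale helper and the class name are made up for illustration, and a real floating point type stores both parts as bit fields rather than calling Math.Pow.

```csharp
using System;

class ScaleDemo
{
    // Hypothetical helper: mantissa x 2^exponent, as in the binary example above.
    public static double Scale(double mantissa, int exponent)
    {
        return mantissa * Math.Pow(2, exponent);
    }

    static void Main()
    {
        // Binary 1.1 is decimal 1.5; an exponent of -1 halves it.
        Console.WriteLine(Scale(1.5, -1) == 0.75); // True
        // An exponent of 3 multiplies by 8 instead.
        Console.WriteLine(Scale(1.5, 3) == 12.0);  // True
    }
}
```

Both comparisons are safe here only because 1.5, 0.75 and 12 all happen to be exactly representable in binary.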

It's very important to understand that in the same way that you can't represent a third exactly in a (finite) decimal expansion, there are lots of numbers which look simple in decimal, but which have long or infinite expansions in binary. This means that (for instance) a binary floating point variable can't have the exact value of decimal 0.1. Instead, suppose you have some code like this:

double x = 0.1d;

The variable x will actually store the closest available double to that value. Once you can get your head round that, it becomes obvious why some calculations seem to be "wrong". If you were asked to add a third to a third, but could only represent the thirds using 3 decimal places, you'd get the "wrong" answer: the closest you could get to a third is 0.333, and adding two of those together gives 0.666, rather than 0.667 (which is closer to the exact value of two thirds). An example in binary floating point is that 3.65d+0.05d != 3.7d (although it may be displayed as 3.7 in some situations).
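To see this for yourself, here's a minimal sketch which prints the values involved with 17 significant digits (enough to tell any two doubles apart). The Show helper and class name are my own for illustration; "G17" and CultureInfo.InvariantCulture are used so the output doesn't depend on the runtime's default formatting or the machine's culture.

```csharp
using System;
using System.Globalization;

class ClosestDoubleDemo
{
    // Hypothetical helper: formats a double with 17 significant digits.
    public static string Show(double value)
    {
        return value.ToString("G17", CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        double x = 0.1d;
        Console.WriteLine(Show(x));               // 0.10000000000000001
        Console.WriteLine(3.65d + 0.05d == 3.7d); // False
        Console.WriteLine(Show(3.65d + 0.05d));   // 3.6999999999999997
        Console.WriteLine(Show(3.7d));            // 3.7000000000000002
    }
}
```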

What floating point types are available in .NET?

The C# standard only lists double and float as floating point types (those being the C# shorthand for System.Double and System.Single), but the decimal type (shorthand for System.Decimal) is really a floating point type too - it's just that it's decimal floating point, and the ranges of exponents are interesting. The decimal type is described in another article, so this one doesn't go into it any further - we're concentrating on double and float. Both of these are binary floating point types, conforming to IEEE 754 (a standard defining various floating point types). float is a 32 bit type (1 bit of sign, 23 bits of mantissa, and 8 bits of exponent), and double is a 64 bit type (1 bit of sign, 52 bits of mantissa and 11 bits of exponent).
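The difference in mantissa size is easy to observe. This small sketch (the Widen helper is hypothetical, just for illustration) shows the sizes of the two types and what the float closest to 0.1 looks like once it's widened to a double, which is always an exact conversion.

```csharp
using System;
using System.Globalization;

class FloatVersusDouble
{
    // Hypothetical helper: widens a float to double and shows 17 significant digits.
    public static string Widen(float value)
    {
        return ((double)value).ToString("G17", CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        Console.WriteLine(sizeof(float));        // 4 (bytes, so 32 bits)
        Console.WriteLine(sizeof(double));       // 8 (bytes, so 64 bits)
        Console.WriteLine(Widen(0.1f));          // 0.10000000149011612
        Console.WriteLine((double)0.1f == 0.1d); // False
    }
}
```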

Isn't it bad that results aren't what I'd expect?

Well, that depends on the situation. If you're writing financial applications, you probably have very rigidly defined ways of treating errors, and the amounts are also intuitively represented as decimal - in which case the decimal type is more likely to be appropriate than float or double. If, however, you're writing a scientific app, the link with the decimal representation is likely to be weaker, and you're also likely to be dealing with less precise amounts to start with (a dollar is exactly a dollar, but if you've measured a length to be a metre, that's likely to have some sort of inaccuracy in it to start with).
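As a quick illustration of why decimal tends to suit money-like values, here's a sketch comparing the same sum in both types. The method and class names are invented for this example.

```csharp
using System;

class MoneyDemo
{
    // Binary floating point: none of 0.1, 0.2 or 0.3 is stored exactly.
    public static bool DoubleAddsUp()
    {
        double a = 0.1d, b = 0.2d;
        return a + b == 0.3d;
    }

    // Decimal floating point: all three values are stored exactly.
    public static bool DecimalAddsUp()
    {
        decimal a = 0.1m, b = 0.2m;
        return a + b == 0.3m;
    }

    static void Main()
    {
        Console.WriteLine(DoubleAddsUp());  // False
        Console.WriteLine(DecimalAddsUp()); // True
    }
}
```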

Comparing floating point numbers

One consequence of all of this is that you should very, very rarely be comparing binary floating point numbers for equality directly. It's usually fine to compare in terms of greater-than or less-than, but when you're interested in equality you should always consider whether what you actually want is near equality: is one number almost the same as another. One simple way of doing this is to subtract one from the other, use Math.Abs to find the absolute value of the difference, and then check whether this is lower than a certain tolerance level.
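The subtract-and-compare approach just described might look something like this. NearlyEqual is a hypothetical helper, and the tolerance of 1e-9 is purely an assumption for the example - a sensible value depends on the magnitude of the numbers you're dealing with.

```csharp
using System;

class NearEquality
{
    // True if a and b differ by less than the given absolute tolerance.
    public static bool NearlyEqual(double a, double b, double tolerance)
    {
        return Math.Abs(a - b) < tolerance;
    }

    static void Main()
    {
        double sum = 0.1d + 0.2d;
        Console.WriteLine(sum == 0.3d);                  // False
        Console.WriteLine(NearlyEqual(sum, 0.3d, 1e-9)); // True
    }
}
```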

There are some cases which are particularly pathological though, and these are due to JIT optimisations. Look at the following code:

using System;

class Test
{
    static float f;
    
    static void Main(string[] args)
    {
        f = Sum (0.1f, 0.2f);
        float g = Sum (0.1f, 0.2f);
        Console.WriteLine (f==g);
//        g = g+1;
// Or...  Console.WriteLine(g);
// Or...  GC.KeepAlive(g);
    }
    
    static float Sum (float f1, float f2)
    {
        return f1+f2;
    }
}

It should always print True, right? Wrong, unfortunately. When running under debug, where the JIT can't make as many optimisations as normal, it will print True. When running normally, the JIT can store the result of the sum more accurately than a float can really represent - it can use the default x86 80-bit representation, for instance, for the sum itself, the return value, and the local variable. See the ECMA CLI spec, partition 1, section 12.1.3 for more details. Uncommenting one of the commented out lines in the above may make the JIT behave a bit more conservatively, leading to a result of True. However, this depends on the exact implementation, CLR version, processor etc - it's not something you should rely on. (Indeed, in some environments only some of the commented-out lines will affect the results.) This is another reason to avoid equality comparisons even if you're really sure that the results should be the same.

How does .NET format floating point numbers?

There's no built-in way to see the exact decimal value of a floating point number in .NET, although you can do it with a bit of work. (See the bottom of this article for some code to do this.) By default, .NET formats a double to 15 significant digits, and a float to 7. (In some cases it will use scientific notation; see the MSDN page on standard numeric format strings for more information. Note that .NET Core 3.0 and later instead default to the shortest string which round-trips.) If you use the round-trip format specifier ("r"), it formats the number to the shortest form which, when parsed (to the same type), will get back to the original number. If you are storing floating point numbers as strings and the exact value is important to you, you should definitely use the round-trip specifier, as otherwise you are very likely to lose data.
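Here's a minimal sketch of storing a double as a string and getting it back unchanged. The helper names are made up, and I've assumed the invariant culture so that the decimal separator is the same when writing and reading. The printed text isn't shown in a comment because it can vary between runtime versions; the point is that the parsed value compares equal to the original.

```csharp
using System;
using System.Globalization;

class RoundTripDemo
{
    // Hypothetical helper: formats with the round-trip specifier.
    public static string ToRoundTripString(double value)
    {
        return value.ToString("R", CultureInfo.InvariantCulture);
    }

    // Hypothetical helper: parses using the same culture as above.
    public static double FromString(string text)
    {
        return double.Parse(text, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        double original = 0.1d + 0.2d;
        string stored = ToRoundTripString(original);
        Console.WriteLine(stored);
        Console.WriteLine(FromString(stored) == original); // True
    }
}
```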

What exactly does a floating point number look like in memory?

As it says above, a floating point number basically has a sign, an exponent and a mantissa. All of these are integers, and the combination of the three of them specifies exactly what number is being represented. There are various classes of floating point number: normalised, subnormal, infinity and not a number (NaN). Most numbers are normalised, which means that the first bit of the binary mantissa is assumed to be 1, which means you don't actually need to store it. For instance, the binary number 1.01101 could be expressed as just .01101 - the leading 1 is assumed, as if it were 0 a different exponent would be used. That technique only works when the number is in the range where you can choose the exponent suitably. Numbers which don't lie in that range (very, very small numbers) are called subnormal (or sometimes denormalised), and no leading bit is assumed. "Not a number" (NaN) values are for things like the result of dividing 0 by 0, etc. There are various different classes of NaN, and there's some odd behaviour there as well.

The actual representation of the sign, exponent and mantissa at the bit level is for each of them to be an unsigned integer, with the stored value being just the concatenation of the sign, then the exponent, then the mantissa. The "real" exponent is biased - for instance, in the case of a double, the exponent is biased by 1023, so a stored exponent value of 1026 really means 3 when you come to work out the actual value. The table below shows what each combination of sign, exponent and mantissa means, using double as an example. The same principles apply for float, just with slightly different values (such as the bias). Note that the exponent value given here is the stored exponent, before the bias is applied. (That's why the bias is shown in the "value" column.)

Sign (s, 1 bit)	Stored exponent (e, 11 bits)	Mantissa (m, 52 bits)	Type of number	Value
Any	1-2046	Any	Normal	(-1)^s x 1.m (binary) x 2^(e-1023)
Any	0	Non-zero	Subnormal	(-1)^s x 0.m (binary) x 2^(-1022)
0	0	0	Zero	+0
1	0	0	Zero	-0
0	2047	0	Infinity	Positive infinity
1	2047	0	Infinity	Negative infinity
Any	2047	Non-zero	NaN	n/a
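The table can be turned into code fairly directly. This is only a sketch, with helper names of my own choosing: it uses BitConverter.DoubleToInt64Bits to get at the raw 64 bits, then masks out the three fields and classifies the number from the stored exponent and mantissa.

```csharp
using System;

class DoubleBits
{
    // Top bit: 0 for positive, 1 for negative.
    public static int Sign(double value)
    {
        return (int)((BitConverter.DoubleToInt64Bits(value) >> 63) & 1);
    }

    // Next 11 bits: the stored (biased) exponent.
    public static int StoredExponent(double value)
    {
        return (int)((BitConverter.DoubleToInt64Bits(value) >> 52) & 0x7FF);
    }

    // Bottom 52 bits: the mantissa, without any implicit leading 1.
    public static long Mantissa(double value)
    {
        return BitConverter.DoubleToInt64Bits(value) & 0xFFFFFFFFFFFFFL;
    }

    public static string Classify(double value)
    {
        int e = StoredExponent(value);
        long m = Mantissa(value);
        if (e == 2047) return m == 0 ? "Infinity" : "NaN";
        if (e == 0) return m == 0 ? "Zero" : "Subnormal";
        return "Normal";
    }

    static void Main()
    {
        // 8 is 1.0 (binary) x 2^3, so the stored exponent is 3 + 1023.
        Console.WriteLine(StoredExponent(8.0));           // 1026
        Console.WriteLine(Classify(8.0));                 // Normal
        Console.WriteLine(Classify(double.Epsilon));      // Subnormal
        Console.WriteLine(Sign(double.NegativeInfinity)); // 1
        Console.WriteLine(Classify(double.NaN));          // NaN
    }
}
```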

Worked example

Consider the following 64-bit binary number:
0100000001000111001101101101001001001000010101110011000100100011

As a double, this is split into:

Sign: 0
Exponent: 10000000100 binary = 1028 decimal
Mantissa: 0111001101101101001001001000010101110011000100100011

This is therefore a normal number of value
(-1)⁰ x 1.0111001101101101001001001000010101110011000100100011 (binary) x 2^(1028-1023)
which is more simply represented as
1.0111001101101101001001001000010101110011000100100011 (binary) x 2⁵
or
101110.01101101101001001001000010101110011000100100011

In decimal, this is 46.42829231507700882275457843206822872161865234375, but .NET will display it by default as 46.428292315077 or with the "round-trip" format specifier as 46.428292315077009.
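If you'd like to check the worked example on your own machine, something like the following should do it. FromBitString is a hypothetical helper; it relies on Convert.ToInt64 accepting a base-2 string and on BitConverter.Int64BitsToDouble reinterpreting the bits. I've left the printed value out of the comments, as the formatted text can differ between runtime versions - compare what you see with the figures quoted above.

```csharp
using System;
using System.Globalization;

class WorkedExample
{
    // Hypothetical helper: reinterprets a 64-character binary string as a double.
    public static double FromBitString(string bits)
    {
        long raw = Convert.ToInt64(bits, 2);
        return BitConverter.Int64BitsToDouble(raw);
    }

    static void Main()
    {
        string bits = "0100000001000111001101101101001001001000010101110011000100100011";
        double value = FromBitString(bits);

        // The stored exponent is the 11 bits after the sign bit.
        Console.WriteLine((BitConverter.DoubleToInt64Bits(value) >> 52) & 0x7FF); // 1028
        Console.WriteLine(value.ToString("G17", CultureInfo.InvariantCulture));
        Console.WriteLine(value.ToString("R", CultureInfo.InvariantCulture));
    }
}
```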

Sample code

DoubleConverter.cs: this is a fairly simple class which allows you to convert a double to its exact decimal representation as a string. Note that although finite decimals don't always have finite binary expansions, all finite binaries have a finite decimal expansion (because 2 is a factor of 10, essentially). The class is extremely simple to use - just call DoubleConverter.ToExactString(value) and the exact string representation for value is returned.
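To give a flavour of how such a conversion can work, here's a rough sketch of my own. It is not the DoubleConverter class itself, and the names are hypothetical. It assumes System.Numerics.BigInteger is available (which may need a reference to System.Numerics on older frameworks), and uses the fact that dividing by 2^n is the same as multiplying by 5^n and then moving the decimal point n places.

```csharp
using System;
using System.Numerics;

class ExactDecimalSketch
{
    // Hypothetical helper: exact decimal expansion of a finite double.
    public static string ToExactDecimal(double value)
    {
        if (double.IsNaN(value) || double.IsInfinity(value))
        {
            throw new ArgumentException("Only finite values have a decimal expansion.");
        }

        long bits = BitConverter.DoubleToInt64Bits(value);
        bool negative = bits < 0;
        int storedExponent = (int)((bits >> 52) & 0x7FF);
        long mantissa = bits & 0xFFFFFFFFFFFFFL;

        // Normal numbers have an implicit leading 1; subnormals and zero don't,
        // and are scaled as if the stored exponent were 1.
        if (storedExponent == 0)
        {
            storedExponent = 1;
        }
        else
        {
            mantissa |= 1L << 52;
        }

        // Treating the mantissa as an integer: value = mantissa x 2^exponent.
        int exponent = storedExponent - 1023 - 52;

        string digits;
        if (exponent >= 0)
        {
            digits = (new BigInteger(mantissa) * BigInteger.Pow(2, exponent)).ToString();
        }
        else
        {
            // mantissa / 2^n == (mantissa x 5^n) / 10^n
            int n = -exponent;
            string scaled = (new BigInteger(mantissa) * BigInteger.Pow(5, n)).ToString();
            scaled = scaled.PadLeft(n + 1, '0');
            string whole = scaled.Substring(0, scaled.Length - n);
            string fraction = scaled.Substring(scaled.Length - n).TrimEnd('0');
            digits = fraction.Length == 0 ? whole : whole + "." + fraction;
        }
        return negative ? "-" + digits : digits;
    }

    static void Main()
    {
        Console.WriteLine(ToExactDecimal(0.5d)); // 0.5
        Console.WriteLine(ToExactDecimal(0.1d));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```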

NaNs

NaNs are odd beasts. There are two types of NaNs - signalling and quiet, or SNaN and QNaN for short. In terms of the bit pattern, a quiet NaN has the top bit of the mantissa set, whereas a signalling NaN has it cleared. Quiet NaNs are used to signify that the result of a mathematical operation is undefined, whereas signalling NaNs are used to signify exceptions (where the operation was invalid, rather than just having an indeterminate outcome).

The strangest thing most people are likely to see about NaNs is that they're not equal to themselves. For instance, Double.NaN==Double.NaN evaluates to false. Instead, you need to use Double.IsNaN to check whether a value is not a number. Fortunately, most people are unlikely to encounter NaNs at all except in articles like this.
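A short sketch of both points, with invented helper names. AreEqual takes two parameters purely so that the compiler doesn't warn about comparing a variable with itself, and HasQuietBit checks the top bit of the 52-bit mantissa as described above.

```csharp
using System;

class NaNDemo
{
    // Plain == comparison, wrapped in a method to avoid a self-comparison warning.
    public static bool AreEqual(double a, double b)
    {
        return a == b;
    }

    // True if the top bit of the 52-bit mantissa is set.
    public static bool HasQuietBit(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        return (bits & (1L << 51)) != 0;
    }

    static void Main()
    {
        double zero = 0.0;
        double nan = zero / zero;
        Console.WriteLine(AreEqual(nan, nan));      // False
        Console.WriteLine(double.IsNaN(nan));       // True
        Console.WriteLine(HasQuietBit(double.NaN)); // True
    }
}
```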

Conclusion

Binary floating point arithmetic is fine so long as you know what's going on: don't expect values to be exactly the decimal ones you put in your program, and don't expect calculations involving binary floating point numbers to necessarily yield precise results. Even if two numbers are both exactly represented in the type you're using, the result of an operation involving those two numbers won't necessarily be exactly represented. This is most easily seen with division (e.g. 1/10 isn't exactly representable despite both 1 and 10 being exactly representable) but it can happen with any operation - even seemingly innocent ones such as addition and subtraction.

If you particularly want precise decimal numbers, consider using the decimal type instead - but expect to pay a performance penalty for doing so. (One very quickly devised test came out with multiplication of doubles being about 40 times faster than multiplication of decimals; don't pay particular heed to this exact figure, but take it as an indication that binary floating point is generally faster on current hardware than decimal floating point.)

In my experience, most business applications are likely to have the kinds of value which are better represented as decimal floating point numbers than binary floats. In particular, almost anything to do with money is likely to be more appropriately represented in decimal.