# Binary floating point and .NET

Lots of people are at first surprised when some of their arithmetic comes out "wrong" in .NET. This isn't something specific to .NET in particular - most languages/platforms use something called "floating point" arithmetic for representing non-integer numbers. This is fine in itself, but you need to be a bit aware of what's going on under the covers, otherwise you'll be surprised at some of the results.

It's worth noting that I am *not* an expert on this matter. Since
writing this article, I've found another one - this time written by
someone who really *is* an expert, Jeffrey Sax. I strongly
recommend that you read his
article on floating point concepts too.

## What is floating point?

Computers always need some way of representing data, and ultimately
those representations will always boil down to binary (0s and 1s).
Integers are easy to represent (with appropriate conventions for
negative numbers, and with well-specified ranges to know how big the
representation is to start with) but non-integers are a bit more
tricky. Whatever you come up with, there'll be a problem with it.
For instance, take our own normal way of writing numbers in decimal:
that can't (in itself) express a third. You end up with a recurring 3.
Whatever base you come up with, you'll have the same problem with
some numbers - and in particular, "irrational" numbers (numbers which
can't be represented as fractions) like the mathematical constants
*pi* and *e* are always going to give trouble.

You *could* store all the rational numbers exactly as two integers,
with the number being the first number divided by the second - but the
integers can grow quite large quite quickly even for "simple" operations,
and things like square roots will tend to produce irrational numbers.
There are various other schemes which also pose problems, but the one most
systems use in one form or other is *floating point*. The idea of
this is that basically you have one integer (the *mantissa*) which
gives some scaled representation of the number, and another (the *
exponent*) which says what the scale is, in terms of "where does the
dot go". For instance, 34.5 could be represented in "decimal floating
point" as mantissa 3.45 with an exponent of 1, whereas 3450 would have the
same mantissa but an exponent of 3 (as 34.5 is 3.45x10^{1}, and
3450 is 3.45x10^{3}). Now, that example is in decimal just for
simplicity, but the most common formats of floating point are for binary.
For instance, the binary mantissa 1.1 with an exponent of -1 would mean
decimal 0.75 (binary 1.1==decimal 1.5, and the exponent of -1 means
"divide by 2" in the same way that a decimal exponent of -1 means "divide
by 10").

It's very important to understand that in the same way that you can't represent
a third exactly in a (finite) decimal expansion, there are lots of numbers
which look simple in decimal, but which have long or infinite expansions
in a binary expansion. This means that (for instance) a binary floating point
variable *can't* have the exact value of decimal 0.1. Instead, suppose you
have some code like this:

`double x = 0.1d;`

The variable `x`

will actually
store the closest available double to that value. Once you can get your head
round that, it becomes obvious why some calculations seem to be "wrong". If
you were asked to add a third to a third, but could only represent the thirds
using 3 decimal places, you'd get the "wrong" answer: the closest you could get
to a third is 0.333, and adding two of those together gives 0.666, rather than
0.667 (which is closer to the exact value of two thirds). An example in binary
floating point is that 3.65d+0.05d != 3.7d (although it may be *displayed*
as 3.7 in some situations).

## What floating point types are available in .NET?

The C# standard only lists `double`

and `float`

as floating
points available (those being the C# shorthand for `System.Double`

and `System.Single`

), but the
`decimal`

type (shorthand for `System.Decimal`

) is also a floating
point type really - it's just it's *decimal* floating point, and the ranges of exponents
are interesting. The `decimal`

type is described in another article,
so this one doesn't go into it any further - we're concentrating on
`double`

and `float`

. Both of these are
binary floating point types, conforming to IEEE 754 (a standard defining
various floating point types). `float`

is a 32 bit type (1 bit of sign,
23 bits of mantissa, and 8 bits of exponent), and `double`

is a 64 bit type
(1 bit of sign, 52 bits of mantissa and 11 bits of exponent).

## Isn't it bad that results aren't what I'd expect?

Well, that depends on the situation. If you're writing financial applications,
you probably have very rigidly defined ways of treating errors, and the amounts
are also *intuitively* represented as decimal - in which case the
`decimal`

type is more likely to be appropriate than `float`

or `double`

. If, however, you're writing a scientific app, the link
with the decimal representation is likely to be weaker, and you're also likely
to be dealing with less precise amounts to start with (a dollar is exactly a dollar,
but if you've measured a length to be a metre, that's likely to have some sort
of inaccuracy in it to start with).

## Comparing floating point numbers

One consequence of all of this is that you should very, *very* rarely
be comparing binary floating point numbers for equality directly. It's usually fine to
compare in terms of greater-than or less-than, but when you're interested in
equality you should always consider whether what you actually want is *near*
equality: is one number *almost* the same as another. One simple way of
doing this is to subtract one from the other, use `Math.Abs`

to find the
absolute value of the difference, and then check whether this is lower than a certain
tolerance level.

There are some cases which are particularly pathological though, and these are due to JIT optimisations. Look at the following code:

```
using System;
class Test
{
static float f;
static void Main(string[] args)
{
f = Sum (0.1f, 0.2f);
float g = Sum (0.1f, 0.2f);
Console.WriteLine (f==g);
// g = g+1;
// Or... Console.WriteLine(g);
// Or... GC.KeepAlive(g);
}
static float Sum (float f1, float f2)
{
return f1+f2;
}
}
```

It should always print `True`

, right? Wrong, unfortunately.
When running under debug, where the JIT can't make as many optimisations as normal,
it will print `True`

. When running normally, the JIT can store the result
of the sum more accurately than a float can really represent - it can use the default
x86 80-bit representation, for instance, for the sum itself, the return value, and
the local variable. See the ECMA CLI spec, partition 1, section 12.1.3 for more details.
Uncommenting one of the commented out lines in the above *may* make the JIT
behave a bit more conservatively, leading to a result of `True`

. However,
this depends on the exact implementation, CLR version, processor etc - it's not something you
should rely on. (Indeed, in some environments only *some* of the commented-out lines
will affect the results.) This is another reason to avoid equality comparisons
even if you're really sure that the results should be the same.

## How does .NET format floating point numbers?

There's no built-in way to see the exact decimal value of a floating
point number in .NET, although you can do it with a bit of work. (See the
bottom of this article for some code to do this.) By default, .NET formats
a `double`

to 15 decimal places, and a `float`

to 7.
(In some cases it will use scientific notation; see the MSDN page on
standard numeric format strings for more information.) If you use the
round-trip format specifier ("`r`

"), it formats the number to
the shortest form which, when parsed (to the same type), will get back to
the original number. If you are storing floating point numbers as strings
and the exact value is important to you, you should definitely use the
round-trip specifier, as otherwise you are very likely to lose data.

## What exactly does a floating point number look like in memory?

As it says above, a floating point number basically has a sign, an
exponent and a mantissa. All of these are integers, and the combination of
the three of them specifies exactly what number is being represented.
There are various classes of floating point number: *normalised*, *
subnormal*, *infinity* and *not a number (NaN)*. Most numbers
are normalised, which means that the first bit of the binary mantissa is
assumed to be 1, which means you don't actually need to store it. For
instance, the binary number 1.01101 could be expressed as just .01101 -
the leading 1 is assumed, as if it were 0 a different exponent would be
used. That technique only works when the number is in the range where you
can choose the exponent suitably. Numbers which don't lie in that range
(very, very small numbers) are called subnormal, and no leading bit is
assumed. "Not a number" (NaN) values are for things like the result of
dividing 0 by 0, etc. There are various different classes of NaN, and
there's some odd behaviour there as well. Subnormal numbers are also
sometimes called denormalised numbers.

The actual representation of the sign, exponent and mantissa at the bit
level is for each of them to be an unsigned integer, with the stored value
being just the concatenation of the sign, then the exponent, then the
mantissa. The "real" exponent is biased - for instance, in the case of a
`double`

, the exponent is biased by 1023, so a stored exponent
value of 1026 really means 3 when you come to work out the actual value.
The table below shows what each combination of sign, exponent and mantissa
means, using `double`

as an example. The same principles apply
for `float`

, just with slightly different values (such as the
bias). Note that the exponent value given here is the stored exponent,
before the bias is applied. (That's why the bias is shown in the "value"
column.)

Sign (s, 1 bit) | Stored exponent (e, 11 bits) | Mantissa (m, 52 bits) | Type of number | Value |
---|---|---|---|---|

Any | Non-zero | Any | Normal | (-1)^{s} x 1.m (binary) x 2^{e-1023} |

0 | 0 | 0 | Zero | +0 |

1 | 0 | 0 | Zero | +0 |

0 | 2047 | 0 | Infinity | Positive infinity |

1 | 2047 | 0 | Infinity | Negative infinity |

0 | 2047 | Non-zero | NaN | n/a |

## Worked example

Consider the following 64-bit binary number:

0100000001000111001101101101001001001000010101110011000100100011

As a double, this is split into:

- Sign: 0
- Exponent: 10000000100 binary = 1028 decimal
- Mantissa: 0111001101101101001001001000010101110011000100100011

This is therefore a normal number of value

(-1)^{0} x 1.0111001101101101001001001000010101110011000100100011 (binary) x 2^{1028-1023}

which is more simply represented as

1.0111001101101101001001001000010101110011000100100011 (binary) x 2^{5}

or

101110.01101101101001001001000010101110011000100100011

In decimal, this is 46.42829231507700882275457843206822872161865234375, but .NET will display it by default as 46.428292315077 or with the "round-trip" format specifier as 46.428292315077009.

## Sample code

DoubleConverter.cs: this is a fairly simple class
which allows you to convert a double to its exact decimal representation as a string.
Note that although finite decimals don't always have finite binary expansions, all finite
binaries have a finite decimal expansion (because 2 is a factor of 10, essentially). The
class is extremely simple to use - just call `DoubleConverter.ToExactString(value)`

and the exact string representation for `value`

is returned.

## NaNs

NaNs are odd beasts. There are two types of NaNs - *signalling* and
*quiet*, or *SNaN* and *QNaN* for short. In terms of the bit pattern,
a quiet NaN has the top bit of the mantissa set, whereas a signalling NaN has it cleared.
Quiet NaNs are used to signify that the result of a mathematical operation is undefined,
whereas signalling NaNs are used to signify exceptions (where the operation was invalid,
rather than just having an indeterminate outcome).

The strangest thing most people are likely to see about NaNs is that they're not equal
to themselves. For instance, `Double.NaN==Double.NaN`

evaluates to false.
Instead, you need to use `Double.IsNaN`

to check whether a value is not a number.
Fortunately, most people are unlikely to encounter NaNs at all except in articles like
this.

## Conclusion

Binary floating point arithmetic is fine so long as you know what's going on and don't expect values to be exactly the decimal ones you put in your program, and don't expect calculations involving binary floating point numbers to necessarily yield precise results. Even if two numbers are both exactly represented in the type you're using, the result of an operation involving those two numbers won't necessarily be exactly represented. This is most easily seen with division (eg 1/10 isn't exactly representable despite both 1 and 10 being exactly representable) but it can happen with any operation - even seemingly innocent ones such as addition and subtraction.

If you particularly want precise decimal numbers, consider using the
`decimal`

type instead - but expect to pay a performance penalty for doing
so. (One very quickly devised test came out with multiplication of doubles
being about 40 times faster than multiplication of decimals; don't pay
particular heed to this exact figure, but take it as an indication that
binary floating point is generally faster on current hardware than decimal
floating point.)

In my experience, most business applications are likely to have the kinds of
value which are better represented as decimal floating point numbers than
binary floats. In particular, almost anything to do with money is likely to
be more appropriately represented in `decimal`

.