How C++ handles floating point numbers

Warning: If you find mathematics scary, you might not like this tutorial. You don't need to know this to be able to program in C++, but it will help you to understand why your C++ programs sometimes don't calculate things as you might expect.

Please be aware that I'm not a mathematician; I'm a programmer. This tutorial is based upon what I've learned by experience and by research. For some strange reason this kind of stuff interests me.

C++ allows you to declare three different data types for dealing with floating point numbers: float, double and long double (in increasing order of precision).

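Curiously, C++ typically only needs 4 bytes to store a float, which can hold numbers with up to 38 digits before the decimal point. The way it does this is by limiting the accuracy of the floating point numbers that it handles: a float keeps only about 7 significant decimal digits.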

In accordance with IEEE Standard 754 (an international standard that defines how floating point numbers should be handled by computers, and which virtually all modern C++ compilers follow), C++ treats floating point numbers as three separate components: the sign, the mantissa and the exponent.

The sign indicates whether the number is positive or negative, the mantissa is the significant portion of the number, and the exponent is the power that the mantissa must be multiplied by in order to return the number to its normal form. In decimal scientific notation that is a power of ten; inside the computer it is a power of two. If you understand scientific notation in mathematics this may be familiar to you already.

Converting to scientific notation from standard form

3456000000.0

There are nine digits after the 3 at the beginning.

3.4560000000

3.456

3.456 x 10^9

The problem with floating point numbers on a computer

The biggest problem with floating point numbers is that the computer doesn't handle floating point numbers as precisely as you might like. Take a look at this simple C++ code:

#include <iostream>

using namespace std;

int main(){
    // Create a floating point variable and assign 1.2112432 to it
    float fNumberA = 1.2112432f;
    // Create a floating point variable and assign 4.3443123 to it
    float fNumberB = 4.3443123f;
   
    // Set the output to 8 fixed decimal places
    cout.setf(ios::fixed,ios::floatfield);
    cout.precision(8);

    // Output the two numbers added together
    cout << "fNumber = " << float(fNumberA + fNumberB) << endl;
}

I've deliberately used the numbers 1.2112432 and 4.3443123 because I know that when they're added together the result will be 5.5555555. Or it should be...

What does the computer think the result should be?

The program doesn't output what you might expect. When I run it, my computer seems to think that the answer should be 5.55555534.

What on earth is going on? Aren't computers supposed to be calculators?

How computers handle floating point numbers

Although the example above suggests that computers are not good at adding floating point numbers, the real explanation is a little more complex. What is actually going on here is that the computer has to convert the number that you have provided from base-10 (decimal) into base-2 (binary), and most decimal fractions cannot be converted exactly.

This is achieved by splitting the 32 bits available for a floating point number into three parts; the sign bit, the exponent bits and the mantissa bits.

Diagram showing how the 32 bits are split up to represent a floating point number

The sign bit (bit 31)

The sign bit (the 32nd or left-most bit) is used to denote whether the overall number is positive or negative. If the sign bit is 0 then the number is positive; if it's 1 then the number is negative.

The exponent bits (bits 23 - 30)

Because the exponent needs to be able to be positive or negative (to represent very large numbers and small fractions), there needs to be some way to show this in the exponent bits. This is achieved by assigning a bias to the exponent. For a float the bias is 127; for a double it's 1023.

For a float, to find the exponent, the bias is subtracted from the binary value of the exponent bits. So, if the exponent bits add up to 209, the exponent would be 209 - 127 = 82 (i.e. x 2^82).

If the exponent bits add up to 114 then the exponent would be 114 - 127 = -13 (i.e. x 2^-13).

Representing infinity

When all the exponent bits add up to 255 (i.e. they're all set to 1) and all the mantissa bits are set to zero then the number is positively or negatively infinite (depending upon the sign bit).

Positive infinity
The bit pattern of positive infinity (All ones in the exponent bits and all zeros in the mantissa bits)

Negative infinity
When all the exponent bits are set to one and all the mantissa bits are set to zero (and the sign bit is one) then the number is negative infinity.

 

NaN (Not a number)

When all of the exponent bits are set to 1 (as they were for infinity above) but at least one mantissa bit is set to 1 then the number has been flagged as NaN (not a number). This is an error state that indicates that the number is invalid.

NaN (Not a Number)
The bit pattern of NaN (all ones in the exponent bits and at least one mantissa bit set to 1)
Also NaN (only the sign bit has changed)

 

The mantissa bits (bits 0 - 22)

The mantissa bits represent the significant figures of the floating point number and are usually stored in normalised form.

For a normalised number (one whose exponent bits are not all zero) the first binary digit of the mantissa is assumed to be one. It is not stored; it is followed by a binary point and then by the 23 stored binary digits of the mantissa.

And that's where the problems really start. We like to work in decimal (base 10) and computers don't. Computers work in base-2, and base-2 arithmetic has just the same sorts of quirks that occur in decimal mathematics. Here's an example from decimal:

10 / 3 = 3.33333333333333333333333333333333333333 (recurring forever)

Binary has the same sorts of quirks, and they affect the conversion to or from decimal. For example, the decimal fraction 0.1 recurs forever in binary. This means it's a very bad idea to make a program-critical test (e.g. a while loop condition) dependent upon a float being exactly equal to a particular value.

Calculating the stored number

float = sign x mantissa x 2^(e - E)

where e = the value stored in the exponent bits and E = the bias (127 for a float). Subtracting the bias is what allows the exponent to be positive or negative.

[ I'll expand this section when I have a bit more time, showing some examples ]