How C++ handles floating point numbers
Please be aware that I'm not a mathematician; I'm a programmer. This tutorial is based upon what I've learned by experience and by research. For some strange reason this kind of stuff interests me.
C++ allows you to declare three different data types for dealing with floating point numbers: float, double and long double (in increasing order of precision).
Curiously, C++ only needs 4 bytes (on most platforms) to store a float that could be as large as about 3.4 x 10^38, a number with 39 digits. The way it does this is by limiting the precision of the floating point numbers that it handles: a float only keeps about 7 significant decimal digits.
In accordance with IEEE Standard 754 (an international standard that defines how floating point numbers should be handled by computers), most C++ implementations treat floating point numbers as three separate components: the sign, the mantissa and the exponent.
The sign indicates whether the number is positive or negative, the mantissa is the significant portion of the number and the exponent is the power that the base is raised to before the mantissa is multiplied by it to return the number to its normal form. The base is ten in scientific notation and two in the binary format that the computer uses. If you understand scientific notation in mathematics this may be familiar to you already.
Converting to scientific notation from standard form
- Using the number:
3456000000.0
- Count how many digits are after the first digit to find the exponent.
There are nine digits after the 3 at the beginning.
- Move the decimal point nine places to the left (leaving it between the 3 and the 4).
3.4560000000
- Get rid of all the zeros at the end to get the mantissa
3.456
- Append the exponent, x 10^9, to show that the number needs to have the decimal point moved 9 places to the right when displayed in standard form. (In mathematical terms that means the number must be multiplied by 10 to the power of 9, or by 1000000000, to return it to its standard form.)
3.456 x 10^9
The problem with floating point numbers on a computer
The biggest problem with floating point numbers is that the computer doesn't handle floating point numbers as precisely as you might like. Take a look at this simple C++ code:
#include <iostream>
using namespace std;
int main(){
    // Create a floating point variable and assign 1.2112432 to it
    float fNumberA = 1.2112432f;
    // Create a floating point variable and assign 4.3443123 to it
    float fNumberB = 4.3443123f;
    // Set the output to 8 fixed decimal places
    cout.setf(ios::fixed,ios::floatfield);
    cout.precision(8);
    // Output the two numbers added together
    cout << "fNumber = " << float(fNumberA + fNumberB) << endl;
    return 0;
}
I've deliberately used the numbers 1.2112432 and 4.3443123 because I know that when they're added together the result will be 5.5555555. Or it should be...
What does the computer think the result should be?
Take a look at the screenshot of my program running to the right. My computer seems to think that the answer should be 5.55555534.
What on earth is going on? Aren't computers supposed to be calculators? 
How computers handle floating point numbers
Although the example above suggests that computers are not good at adding floating point numbers the real explanation is a little more complex. What is actually going on here is that the computer has to convert the number that you have provided from base-10 (decimal) into base-2 (binary).
The result is stored by splitting the 32 bits available for a float into three parts: the sign bit, the exponent bits and the mantissa bits.
![]()
The sign bit (bit 31)
The sign bit (the 32nd or left-most bit) is used to denote whether the overall number is positive or negative. If the sign bit is 0 then the number is positive; if it's 1 then the number is negative.
The exponent bits (bits 23 - 30)
Because the exponent needs to be able to be positive or negative (to represent very large numbers and small fractions), there needs to be some way to show this in the exponent bits. This is achieved by assigning a bias to the exponent. For a float the bias is 127; for a double it's 1023.
For a float, to find the exponent, the bias is subtracted from the binary value of the exponent bits. So, if the exponent bits add up to 209, the exponent would be 209 - 127 = 82 (i.e. x 2^82). Note that the stored exponent is a power of two, not a power of ten.
If the exponent bits add up to 114 then the exponent would be 114 - 127 = -13 (i.e. x 2^-13).
Representing infinity
When all the exponent bits add up to 255 (i.e. they're all set to 1) and all the mantissa bits are set to zero then the number is positively or negatively infinite (depending upon the sign bit).
Positive infinity
![]()
Negative infinity
![]()
When all of the exponent bits are set to 1 (as they were for infinity above) but at least one mantissa bit is set to one then the number has been flagged as NaN (not a number). This is an error state that indicates that the number is invalid.
NaN (Not a Number)
![]()
Also NaN (only the sign bit has changed)
![]()
The mantissa bits (bits 0 - 22)
The mantissa bits represent the significant figures of the floating point number and are usually stored in normalised form.
If the exponent bits are neither all zero nor all one then the number is normalised: the first binary digit of the mantissa is assumed to be 1 (it isn't stored), followed by the binary point, followed by the 23 stored mantissa bits.
And that's where the problems really start. We like to work in decimal (base 10) and computers don't. Computers work in base-2 and base-2 arithmetic has just the same sorts of quirks that occur in decimal mathematics. Here's an example from decimal:
10 / 3 = 3.33333333333333333333333333333333333333 (recurring forever)
Binary has the same sorts of quirks: many decimal fractions (0.1, for example) recur forever in binary, so they are rounded when they are converted to or from decimal. This means it's a very bad idea to make a program-critical test (e.g. a while loop condition) dependent upon the exact value of a float.
Calculating the stored number
float = sign x mantissa x 2^(e - E)
where sign is +1 or -1, e = the value of the exponent bits and E = the bias (127 for a float) that allows the stored exponent to represent both positive and negative exponents.
[ I'll expand this section when I have a bit more time, showing some examples ]
