Computers represent numbers in binary. For positive integers, this is just a straightforward conversion from decimal to binary. But sometimes we need to deal with numbers that are not integers: floating point numbers, which involve a decimal point. How do computers represent these numbers? We’ll explore this question in this article.
The Method
Floating point numbers are represented in two parts: the significand and the exponent. The significand, multiplied by the base raised to the exponent, gives the number being represented. Take the number 96.66, for example. This number can be written as
96.66 = 9666 * 10^(-2)
where 9666 is the significand (or the mantissa) and -2 is the exponent. This example uses base 10 rather than binary, but this is essentially how the system works: a floating point number is converted into a pair of integers.
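In fact, Python lets you inspect this kind of split directly. Below is a quick sketch using the built-in math.frexp function (not part of the example above); note that frexp normalizes the significand to lie between 0.5 and 1 and uses base 2, rather than keeping an integer significand and base 10:

>>> import math
>>> math.frexp(96.66)   # (m, e) such that 96.66 == m * 2 ** e, with 0.5 <= m < 1
(0.75515625, 7)
>>> 0.75515625 * 2 ** 7
96.66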
IEEE 754 Standard
In the commonly-used IEEE 754 standard, a floating point number can be stored in 32 bits: 1 bit is the sign bit (0 for positive, 1 for negative), 23 bits are the significand bits, and 8 bits are the exponent bits. Note that the significand stored here is not an integer like 9666 above; the 23 bits encode the fractional part of a value between 1 and 2. That means the significand can be expressed by:
significand = 1 + (value of the 23 significand bits) / (2^23)
Note that the format therefore provides 24 bits of significand precision, since the leading 1 is implicit, even though only 23 bits are explicitly stored.
The 8-bit exponent also doesn’t store the exponent itself. Instead, the exponent is calculated by:
exponent = (value of the 8 exponent bits) - 127
There is also a double-precision format involving a 64-bit floating point number instead of 32 bits. The bit assignments are as follows:
- 1 sign bit
- 52 significand bits (53-bit precision)
- 11 exponent bits (with an exponent bias of 1023 instead of 127)
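To make this layout concrete, here is a minimal sketch (assuming Python's standard struct module; the helper name double_fields is just for illustration) that pulls the three fields out of a 64-bit double:

>>> import struct
>>> def double_fields(x):
...     bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the raw 64 bits as an integer
...     sign = bits >> 63                   # 1 sign bit
...     exponent = (bits >> 52) & 0x7FF     # 11 exponent bits
...     fraction = bits & ((1 << 52) - 1)   # 52 explicitly stored significand bits
...     return sign, exponent, fraction
...
>>> double_fields(1.0)   # 1.0 = (1 + 0 / 2**52) * 2**(1023 - 1023)
(0, 1023, 0)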
Example
Take 0.05 as an example of the 32-bit floating point representation. First, scale 0.05 by a power of 2 so that the result lies between 1 and 2: 0.05 * 2^5 = 1.6, which means 0.05 = 2^(-5) * 1.6. That means the exponent should be -5, and the significand should be 1.6.
To get the exponent to be -5, the 8 bits in the exponent should be 01111010, which means 122 = 127 - 5 in decimal. And to get the significand to be 1.6, the 23 bits should represent a value of 0.6.
However, if you convert 0.6 into a fraction (3/5), you will realize that the denominator is not a power of 2. What does that mean? This means that we cannot express it perfectly in a finite number of binary digits! In fact, the significand value that gets stored is 5033165 converted into binary, as 5033165 / 8388608 is closest to 0.6 among fractions with denominator 2^23 = 8388608. This results in an error known as a floating-point error, which we will discuss in the next section.
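You can check this nearest-numerator step quickly in Python (a small sketch, just to verify the arithmetic above):

>>> round(0.6 * 2 ** 23)   # nearest numerator over 2^23 = 8388608
5033165
>>> bin(5033165)           # the 23 significand bits that get stored
'0b10011001100110011001101'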
And note that the standard stores the sign bit first, then the exponent bits, and finally the significand bits. So the number 0.05 will be represented as:

0 01111010 10011001100110011001101

(sign bit, exponent bits, significand bits, reading from left to right).
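If you want to double-check that bit pattern, a quick sketch with Python's struct module (again just an illustration, not part of the standard itself) packs 0.05 into a 32-bit float and prints the raw bits:

>>> import struct
>>> format(struct.unpack(">I", struct.pack(">f", 0.05))[0], "032b")
'00111101010011001100110011001101'

Reading from the left, that is the sign bit 0, the exponent bits 01111010, and the significand bits 10011001100110011001101.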
Floating Point Errors
It turns out that, when we sum up floating point numbers, we don’t always get the answer we expect! For example, it is obvious that 0.1 + 0.2 = 0.3. However, if we sum 0.1 and 0.2 in Python (or in fact many other programming languages), we actually get:
0.30000000000000004
Instead of exactly 0.3. Why does that happen? As we have mentioned above, floating point numbers have finite precision, which is not enough to represent the uncountably infinite set of real numbers. Even some simple fractions cannot be stored exactly in finite precision.
Let’s say we go back to base 10, and want to sum up 1/7 and 3/7, but we only have 10 decimal places of precision. This means that 1/7 will get represented by:
1/7 = 0.142857142857142857…… ≈ 0.1428571429
And 3/7 will be represented by:
3/7 = 0.428571428571428571…… ≈ 0.4285714286
And so the expected sum will be 4/7, which is
4/7 = 0.571428571428571428…… ≈ 0.5714285714
But if we add the two 10-digit approximations instead, we get
1/7 + 3/7 ≈ 0.1428571429 + 0.4285714286 = 0.5714285715
which differs from what we expect in the last digit.
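You can reproduce this base-10 experiment in Python with the standard decimal module (a small sketch; setting the precision to 10 significant digits matches the 10 decimal places above, since all of these values start right after the decimal point):

>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 10            # keep only 10 significant digits
>>> Decimal(1) / 7 + Decimal(3) / 7   # sum of the two rounded approximations
Decimal('0.5714285715')
>>> Decimal(4) / 7                    # what we actually wanted
Decimal('0.5714285714')
>>> Decimal(1) / 7 + Decimal(3) / 7 == Decimal(4) / 7
False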
Let’s go back to 0.1 and 0.2.
When we type 0.1 into our code, it gets converted into the 64-bit double-precision format. In that case, the significand bits, converted into decimal, read 2702159776422298, while the exponent bits read 1019 (remember that the exponent bias for the 64-bit format is 1023, not 127). Likewise, since 0.2 = 0.1 * 2, the significand bits are the same, but the exponent bits read 1020. Thus, we have:
0.1 + 0.2
≈ (1 + (2702159776422298 / (2 ^ 52))) * (2 ^ (1019 - 1023)) + (1 + (2702159776422298 / (2 ^ 52))) * (2 ^ (1020 - 1023))
= (1 + (2702159776422298 / (2 ^ 52))) * (2 ^ (-4)) + (1 + (2702159776422298 / (2 ^ 52))) * (2 ^ (-3))
= 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125
= 0.3000000000000000166533453693773481063544750213623046875
≈ 0.30000000000000004 (this exact sum is rounded to the nearest value the 53-bit significand can represent, which displays as 0.30000000000000004)
That explains why floating point errors occur when summing up these numbers.
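You can also see these exact stored values from Python itself: passing a float to Decimal shows the full decimal expansion of the double that is actually stored (a quick sketch using the standard decimal module):

>>> from decimal import Decimal
>>> Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')
>>> Decimal(0.2)
Decimal('0.200000000000000011102230246251565404236316680908203125')
>>> Decimal(0.1 + 0.2)   # the exact sum above, rounded once more to 53 bits
Decimal('0.3000000000000000444089209850062616169452667236328125')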
Solutions to Floating Point Errors
Understandably, when we carry out calculations involving floating point values, these errors accumulate and may cause deviations from the correct answer. For example, the statement 0.1 + 0.2 == 0.3 should intuitively be true. But when we evaluate it:
>>> 0.1 + 0.2 == 0.3
False
It returns False! How do you deal with this? One option is to use fractions instead of floating point numbers. Rather than converting a number approximately into a pair of integers, a fraction converts it exactly into two integers: a numerator and a denominator.
Take 0.3, for example. In the 64-bit format above, 0.3 is converted into the pair of integers 900719925474099 and 1021, as
0.3 ≈ (1 + 900719925474099 / 2^52) * 2^(1021 - 1023).
However, in a fractional representation, 0.3 is converted nicely into the numbers 3 and 10, because
0.3 = 3 / 10
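The difference is easy to see in Python (a sketch using the standard fractions module): building a Fraction from the float 0.3 recovers the pair of integers behind the stored 64-bit double, whereas building it from the integers 3 and 10 keeps the value exact:

>>> from fractions import Fraction
>>> Fraction(0.3)     # the exact ratio stored by the 64-bit format
Fraction(5404319552844595, 18014398509481984)
>>> Fraction(3, 10)   # the exact value we actually meant
Fraction(3, 10)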
If we change the 0.1 + 0.2 == 0.3 statement such that the numbers are fractions instead of floating points, we get a “true” response.
>>> from fractions import Fraction
>>> Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
True
However, this solution only works for rational numbers. Irrational numbers cannot be represented by fractions, so we can only rely on floating point calculations in that case. The only thing we can do is choose a precision that balances calculation speed against accuracy: the lower the precision, the faster the calculation (since there are fewer bits to process).
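For example (a quick sketch using Python's math.sqrt), squaring the floating point approximation of the irrational number √2 does not give back exactly 2, because the intermediate value has to be rounded to 53 bits of precision:

>>> import math
>>> math.sqrt(2) * math.sqrt(2)   # the rounded approximation of an irrational number
2.0000000000000004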
Also, processing in fractions tends to be slower than processing floating point numbers, especially when the calculation process involves very large denominators.
>>> from time import time
>>> start = time(); Fraction(2978772, 986476) + Fraction(378237, 2608921); end = time(); end - start
Fraction(2036125636956, 643409488099)
0.00043582916259765625
>>> # the fractional method takes 0.00044 seconds to execute
>>> start = time(); 2978772 / 986476 + 378237 / 2608921; end = time(); end - start
3.1645875210387104
0.00012612342834472656
>>> # the floating point method takes 0.00012 seconds to execute
Conclusion
In this article, we’ve discussed how computers store floating point numbers, which are not whole integers, in binary, using the significand, the exponent, and the IEEE 754 standard. We have also discussed the errors that arise when storing floating point numbers, and some of the ways to account for them. If you have any questions on this issue, feel free to leave them in the comments below.