Basics of Floating-Point: Part 1

Prashant Pandey
Apr 13, 2020


I have been trying to understand David Goldberg’s seminal article What Every Computer Scientist Should Know About Floating-Point Arithmetic for some time now. As a programmer with no formal background in numerical analysis, I have always found floating-point intimidating (and I still do). And rightly so: floating-point is difficult and arcane, and if a programmer wants to write correct FP programs, they need to understand how it works. In this article, I have tried to cover some of the basics of floating-point.

First, why floating-point? As we all know, a computer is a finite-state machine: it can assume only so many states because it has limited memory. The real numbers, on the other hand, are infinite. Between 1.0 and 1.1 you can fit infinitely many numbers. Computers therefore have to compromise by approximating real numbers. A real number that cannot be represented exactly in a given floating-point format is rounded so that it can be represented by a floating-point number. The set of floating-point numbers is countable, just like the set of all integers (for a fixed format it is in fact finite). A number as simple as 0.1 cannot be represented exactly on modern computers.
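Here is a quick way to see this for yourself. This is only a minimal sketch, assuming double maps to an IEEE 754 double on your platform: it prints the value the compiler actually stores for 0.1.

#include <stdio.h>

int main(void) {
    double x = 0.1;        // the stored value is only the closest double to 0.1
    printf("%.20f\n", x);  // prints 0.10000000000000000555 on IEEE 754 systems
    return 0;
}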

A floating-point format is specified by the following three parameters:

  1. The base (or radix) β.
  2. The precision p, i.e. the number of significand digits.
  3. The exponent range, from e_min to e_max.

A floating-point number would then be represented as

± d.dd…d × β^e (p digits in total, each digit between 0 and β − 1, with e_min ≤ e ≤ e_max)

Note that the total precision includes the digit before the dot. For the IEEE 754 Single-Precision, it’s 24. For Double-Precision, it’s 53. The specifics of the format will be covered later.
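If you want to check these precisions on your own machine, the constants in float.h report them. A small sketch, assuming float and double map to IEEE single and double precision (as they do on most platforms):

#include <stdio.h>
#include <float.h>

int main(void) {
    // FLT_MANT_DIG and DBL_MANT_DIG give the precision p,
    // counting the digit before the radix point
    printf("single precision p = %d\n", FLT_MANT_DIG); // 24
    printf("double precision p = %d\n", DBL_MANT_DIG); // 53
    return 0;
}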

Let us construct a toy floating-point system with base β = 2, precision p = 3 and exponent range −1 ≤ e ≤ 2.

Here’s a list of all the possible numbers in this system. Not great: we have just 33 floating-point numbers with which to approximate all the real numbers.

All possible positive floating-point numbers in our toy system (each number has a negative counterpart):

0.5, 0.625, 0.75, 0.875, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5, 4, 5, 6, 7

[Figure: the numbers above shown on a number line.]

A few quick observations:

  1. All the numbers have been normalised (the leading bit is 1). Normalisation gives us a unique representation for each number; otherwise the same number could have multiple representations, obtained by shifting the radix point and changing the exponent.
  2. The spacing between two adjacent floating-point numbers increases as we move away from zero in either direction.
  3. Including 0, we have a total of 33 floating-point numbers (16 positive and 16 negative) in this system.
  4. Adjacent floating-point numbers are separated by a distance of 1 ULP. ULP stands for “Units In The Last Place”: it is the place value of the last digit of the significand. So, for the decimal (p = 3) number 1.01 × 10²³, the value of the ULP is 0.01 × 10²³. In our toy system, for the numbers with exponent 1,

1 ULP = 0.01 × 2¹ = 0.5
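As a sketch of observations 2 and 4 on real hardware: the standard library function nextafterf returns the adjacent single-precision number, so the difference between it and the starting point is exactly 1 ULP, and the gap changes at an exponent boundary.

#include <stdio.h>
#include <math.h>

int main(void) {
    // nextafterf(x, y) is the float adjacent to x in the direction of y
    float below = nextafterf(1.0f, 0.0f);  // one ULP below 1.0
    float above = nextafterf(1.0f, 2.0f);  // one ULP above 1.0
    printf("ULP just below 1.0: %e\n", 1.0f - below); // 2^-24
    printf("ULP just above 1.0: %e\n", above - 1.0f); // 2^-23
    return 0;
}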

Relative error is another way to measure the error that results from an approximation. It’s defined as

relative error = |approximated value − actual value| / |actual value|

If you observe closely, the above formula tells us how many “actuals” fit in the absolute error. So if the formula returns 1, it means you can fit one “actual” in the absolute error (implying a 100% error).
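As a tiny illustration, here is the absolute and relative error of storing 0.1 in single precision. This is just a sketch: it uses the double value of 0.1 as a stand-in for the exact “actual” value, which is good enough for the comparison.

#include <stdio.h>
#include <math.h>

int main(void) {
    double actual = 0.1;             // stand-in for the exact value
    float approx = 0.1f;             // 0.1 rounded to single precision
    double abs_err = fabs(approx - actual);
    double rel_err = abs_err / fabs(actual); // how many "actuals" fit in the absolute error
    printf("absolute error: %e\n", abs_err); // about 1.5e-9
    printf("relative error: %e\n", rel_err); // about 1.5e-8
    return 0;
}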

Rounding

As stated before, a real number that cannot be represented exactly as a floating-point number will be rounded to one. A number needs to be rounded when:

  1. It lies outside the range of the floating-point system (say, 7.2 in our toy system).
  2. It lies within the range, but requires more precision to be represented than is available.

Here are some examples of numbers that need to be rounded in the single-precision FP system:

The first number is larger than the largest single-precision number. The second is smaller than the smallest normalised number, and the last one requires 31 bits of precision. All of these numbers need to be approximated.

The second number is an interesting one: although it lies below the normalised range, it is a subnormal number and can be represented exactly. Subnormal numbers provide gradual underflow, so that a number smaller than the smallest normalised floating-point number does not immediately round off to 0 (the same rule applies to negative numbers).
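The article’s original example numbers are not reproduced here, so here is a hypothetical trio in the same spirit: one value beyond the largest single-precision number, one below the smallest normalised number (which survives as a subnormal), and one that needs more than 24 bits of precision.

#include <stdio.h>
#include <float.h>

int main(void) {
    float too_big   = FLT_MAX * 2.0f; // beyond the largest float: overflows to inf
    float subnormal = FLT_MIN / 2.0f; // below the smallest normalised float, but not zero
    float too_fine  = 16777217.0f;    // 2^24 + 1 needs 25 bits: rounds to 16777216

    printf("%e\n", too_big);          // inf
    printf("%e\n", subnormal);        // ~5.9e-39: gradual underflow at work
    printf("%.1f\n", too_fine);       // 16777216.0
    return 0;
}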

Consider the following definition for a real number x:

Rounding Modes

The IEEE standard defines four rounding modes:

  1. Round to nearest, with ties going to the value whose last digit is even (the default).
  2. Round toward 0 (truncation).
  3. Round toward +∞ (round up).
  4. Round toward −∞ (round down).

You can refer to Theorem 5 in Goldberg’s paper to see for yourself why round-to-even is the preferred method. With round up, calculations can drift upwards. We’ll be using round-to-even in the rest of the article.
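The modes can be switched at run time through fenv.h. This is only a sketch: depending on your compiler and optimisation flags you may need -frounding-math (or the FENV_ACCESS pragma) for the mode switch to be respected.

#include <stdio.h>
#include <fenv.h>

int main(void) {
    volatile float a = 1.0f, b = 3.0f; // volatile keeps the division at run time

    fesetround(FE_DOWNWARD);           // round toward -infinity
    printf("round down:    %.10f\n", a / b);

    fesetround(FE_UPWARD);             // round toward +infinity
    printf("round up:      %.10f\n", a / b);

    fesetround(FE_TOWARDZERO);         // truncate
    printf("round to zero: %.10f\n", a / b);

    fesetround(FE_TONEAREST);          // the default: round to nearest, ties to even
    printf("round nearest: %.10f\n", a / b);
    return 0;
}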

Let us approximate real numbers in the above system by rounding to the nearest value (there’s an exception to this rounding mode, which I’ll cover later in this section). Note that with this rounding mode, the error can be no larger than 0.5 ULP. Why? Take 4.2, for instance. In our toy system we round it to 4, so the error here is 0.2 ULP. As soon as we cross 4.5, we approximate with 5. So the maximum error occurs when we approximate 4.5 itself, and it is 0.5 ULP. (In binary, this is like approximating 1.001 × 2², which needs four bits, by the three-bit value 1.00 × 2² or 1.01 × 2².)

Generally speaking, this happens when a real number of the form

d.dd…dd(β/2) × β^e (the first p digits, followed by the digit β/2)

is approximated by the p-digit floating-point number

d.dd…dd × β^e

The error in this case becomes:

0.00…0(β/2) × β^e

Note that the above value has p 0s before the digit β/2. Generally speaking, this error is:

(β/2) × β^(−p) × β^e

This is the maximum absolute error that can occur when approximating a real number using the round-to-nearest mode. Note that this value is the same for every number with a given exponent in the FP system. The approximated value itself satisfies the relation:

β^e ≤ d.dd…dd × β^e < β^(e+1)

Dividing the absolute error by the maximum and the minimum values of this range, we get:

(1/2) × β^(−p) ≤ relative error ≤ (1/2) × β^(1−p)

The upper bound of the relative error, (β/2) × β^(−p), is called the machine epsilon.

Note that if we don’t round to nearest, the error can approach 1 ULP, and in this case the upper bound of the relative error becomes β^(1−p).
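In C, FLT_EPSILON from float.h is defined as the distance from 1.0 to the next representable float, i.e. 2⁻²³ = β^(1−p) for single precision; the round-to-nearest bound that the article calls the machine epsilon is half of that. A quick check, assuming IEEE single precision:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    printf("FLT_EPSILON            = %e\n", FLT_EPSILON);     // 2^-23, the 1-ULP bound
    printf("2^-23                  = %e\n", powf(2.0f, -23.0f));
    printf("round-to-nearest bound = %e\n", FLT_EPSILON / 2); // 2^-24
    return 0;
}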

It is useful to remember that adjacent floating-point numbers differ by 1 ULP within a given exponent range. Take the IEEE 754 single-precision system: we have 24 bits of precision. So if we want to know the floating-point number next to 1.0000…0000 × 2⁰, we just add 1 ULP = 0.0000…0001 × 2⁰ and get 1.0000…0001 × 2⁰.

Calculating the next floating-point number:

1.0000…0000 × 2⁰ + 0.0000…0001 × 2⁰ = 1.0000…0001 × 2⁰ (i.e. 1 + 2⁻²³)

So the above sum is actually the value of the floating-point number next to 1 in IEEE 754 single precision. Now let us have a look at the following calculation:

1.0000…0000 × 2⁰ + 0.0000…0000 1 × 2⁰ (the 1 now sits in the 24th place after the point, i.e. we are adding 2⁻²⁴)

In the above calculation, we try to add 1/2 ULP to 1. The exact sum ends up exactly between 1 and the next floating-point number (1 + 2⁻²³). Therefore, it has to be rounded. Using round-to-nearest, we would round to the closest number, but since this is a tie, we round to whichever of the two neighbours has an even (zero) last bit. Therefore, we round it down to 1.

Now you should be able to answer questions like these with ease: what do 1 + 2⁻²³, 1 + 2⁻²⁴, 1 + 2⁻²⁵ and 1 + 1.1 × 2⁻²⁴ round to in single precision?

Here’s a C program that demonstrates the above:

#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
    float a = 1;

    float b = powf(2, -23);
    float c = a + b;
    printf("%.16f\n", c); // exactly representable, will not be rounded

    float d = powf(2, -24);
    c = a + d;
    printf("%.16f\n", c); // will be rounded to 1

    float e = powf(2, -25);
    c = a + e;
    printf("%.16f\n", c); // will be rounded to 1

    float f = 1.1 * powf(2, -24);
    c = a + f;
    printf("%.16f\n", c); // will be rounded to the next floating-point number, 1 + 2^-23

    return 0;
}

The output is:

1.0000001192092896
1.0000000000000000
1.0000000000000000
1.0000001192092896

As you can see, the result is exactly what we expected.

Now we discuss some subtleties of rounding. If x > Nmax then, ideally, we should always round it to Nmax, since x is always closer to Nmax than it is to positive infinity. However, always rounding to Nmax can give rise to misleading results. Imagine adding the largest possible floating-point number to itself and getting the same number back. That would be wrong. The standard therefore rounds sufficiently large results up to +∞ under the default rounding mode.
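A small sketch of this behaviour, assuming IEEE single precision and the default round-to-nearest mode:

#include <stdio.h>
#include <float.h>

int main(void) {
    float big = FLT_MAX;       // the largest finite single-precision number (Nmax)
    float sum = big + big;     // overflows: rounded to +infinity, not clamped to Nmax
    printf("%e\n", sum);       // inf
    printf("%d\n", sum > big); // 1: the result is not "the number back"
    return 0;
}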

(Values below −Nmax are handled similarly.)

We have already derived the upper bound on the relative error when the round-to-even mode is used. If we rewrite that formula, we get:

relative error ≤ (β/2) × β^(−p) = (1/2) × β^(1−p)

Note that the relative error is independent of the exponent; it depends only on the base and the precision. Also, as you can clearly see, for round-to-even the error bound is half of what it is when we always round up.

I’ll end this section by demonstrating how weird floating-point arithmetic can be. Consider the program below:

#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
    // demonstrates that floating-point addition is not associative
    float x = 1;
    float y = powf(2, -25);
    float z = powf(2, -24);

    // (x + y) + z
    float s1 = x + y;
    float res1 = s1 + z;
    printf("%.16f\n", res1);

    // x + (y + z)
    float s2 = y + z;
    float res2 = s2 + x;
    printf("%.16f\n", res2);

    printf("%d\n", res1 == res2);
    return 0;
}

What do you think the result will be? Addition, as we know it, is associative: a + (b + c) = (a + b) + c. But if you run this program, you’ll see that this does not hold for floating-point arithmetic.

Here’s a follow-up question. Is floating-point addition commutative?
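Rather than spoil the answer, here is a small experiment you can run yourself. The operand pairs are just ones I picked for illustration; feel free to try others.

#include <stdio.h>
#include <math.h>

int main(void) {
    // compare a + b with b + a for a few awkward operand pairs
    float pairs[][2] = {
        {1.0f, powf(2, -24)},
        {powf(2, -25), powf(2, -24)},
        {1e30f, -1e30f},
    };
    for (int i = 0; i < 3; i++) {
        float a = pairs[i][0], b = pairs[i][1];
        printf("a + b == b + a ? %d\n", (a + b) == (b + a));
    }
    return 0;
}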

Correctly Rounded Operations

The IEEE 754 standard mandates that operations be correctly rounded. That is, writing ⊕, ⊖, ⊗ and ⊘ for the machine operations and round() for the rounding function of the current mode,

a ⊕ b = round(a + b)
a ⊖ b = round(a − b)
a ⊗ b = round(a × b)
a ⊘ b = round(a ÷ b)

…and for other operations defined in the standard.

Correctly rounded operations are nothing but operations where you carry out the calculation on both operands with infinite precision and then round the final result to fit the given floating-point format. Here’s an example in single precision: add

1.100…0001 × 2⁰ and 1.000…0000 × 2⁻²⁴

We always shift the smaller operand right to match its exponent with the larger one. Note that, as mentioned before, we don’t really care about precision at this point, because every intermediate value is written out with infinite precision (you might notice the 24th bit in the smaller number after shifting). Adding both, we get:

Addition with infinite precision:

1.100…0001 1 × 2⁰ (the final 1 sits in the 24th place after the point, beyond single precision)

We’ll now use the round-to-even rule to round this value, as mandated by the standard. Note that this number is exactly 1/2 ULP away from 1.100…0001 × 2⁰ (23 bits after the point). So we round to even: the final result is 1.100…0001 × 2⁰ + 1 ULP = 1.100…0010 × 2⁰, whose last bit is 0. This is the final value after rounding.
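The operands above aren’t spelled out in full, so here is an analogous pair in C: a has an odd last significand bit, and b is exactly half a ULP of a, so the correctly rounded sum goes to the even neighbour.

#include <stdio.h>
#include <math.h>

int main(void) {
    float a = 1.0f + powf(2, -23); // last significand bit is 1 (odd)
    float b = powf(2, -24);        // exactly half a ULP of a

    float s = a + b;               // a tie: rounds to the even neighbour, 1 + 2^-22
    printf("%.16f\n", s);
    printf("%d\n", s == 1.0f + powf(2, -22)); // 1
    return 0;
}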

Now, since computers work in a finite number of bits (a sentence borrowed from the paper), using infinite precision is not possible in practice. Can we simply drop everything after the 23rd bit blindly? No, as it can lead to some very inaccurate results. Floating-point hardware designers realised this, and the problem was solved by preserving a small number of extra bits beyond the permitted precision. I’ll cover these guard bits in the next part of the article.

References:

  1. What Every Computer Scientist Should Know About Floating-Point Arithmetic (https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

  2. Michael L. Overton, Numerical Computing With IEEE Floating Point Arithmetic, 2001.
