Introduction
In scientific computing we use floating-point numbers a lot. This article is a guide to picking the right floating-point representation for your needs. Most programming languages offer two built-in precisions: 32-bit (single precision) and 64-bit (double precision). In the C family of languages these are known as float and double, and those are the names I will use in this article. There are other precisions, such as half and quad. I won't cover these here, but much of the discussion also applies to half vs float or double vs quad. So to be clear: I will only talk about 32-bit and 64-bit IEEE 754 here.
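To make the two precisions concrete, here is a minimal sketch in Python, chosen purely for illustration. Note that Python's built-in float is a 64-bit value (what this article calls double), so the sketch uses the standard struct module to round a value to 32 bits and back. The helper name to_float32 is hypothetical, not part of any library.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x64 = 0.1              # stored as a 64-bit IEEE 754 value
x32 = to_float32(0.1)  # the same decimal, rounded to 32 bits

# Storage size of each format, in bits, followed by the stored value.
print(struct.calcsize("<f") * 8, "bits:", repr(x32))
print(struct.calcsize("<d") * 8, "bits:", repr(x64))

# Relative gap between the two roundings of 0.1.
rel_err = abs(x32 - x64) / x64
print("relative difference:", rel_err)
```

Neither format stores 0.1 exactly; the 32-bit value is simply a coarser rounding of it. Values that are exactly representable in 32 bits, such as 0.5, survive the round trip unchanged.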