Represent each of the following values in the IEEE 754 32bit
Represent each of the following values in the IEEE 754 32-bit format, truncating inexact values. Indicate the percent error of the value stored compared to the exact value. This will require you to convert the 32 bit floating point number back to decimal and compare to the original value.
a) -37 1/4
b) +12 1/5
Solution
IEEE 754 float(4 byte) or double(8 byte) has 3 components (there is also an analogous 96-bit extended-precision format under IEEE-854):-a sign bit telling whether the no. is +ve or -ve an exponent giving its order of magnitude & a mantissa specifying the actual digits of the no.. Using single-precision floats as an example here is the bit layout:-
seeeeeeemmmmmmmmmmmmmmmmmmmmm meaning
31 0 bit #
s = sign bit, e = exponent, m = mantisssa
The value of the no. is the mantissa times 2^p, where p is the exponent. Notice that we are dealing with binary fractions, so that 0.1 (the leftmost mantissa bits) means 1/2 (the place values to the right of the decimal point are 2^-1, 2^-2 etc., just as we have 10^-1, 10^-2 etc. in decimal value).
Notice further that there is a potential problem with storing both an exponent & a mantissa : 2x10^-1 = 0.2x10^0 = 0.02x10^1 & so on. This would corresponding to lots of diff. bit patterns representing the same quantity which would be a huge waste (it would probably also make it slower & harder to implement math operations in hardware). This problem is circumvented by interpreting the mantissa as being to the right of the decimal(.) point with an implied 1 always present to the left of the decimal. I will refer to this as a \"1.m\" representation. \"But wait!\" you cry. \"What if I don\'t want a 1 there?\" Think of it is as follows: imagine writing a real no. in binary. Unless it\'s 0, it\'s gotta have a 1 somewhere. Shift your decimal point to just after the first 1 then do not bother to store that 1 since we know it\'s always implied to be there. Now all you have to do is set the exponent correctly to reproduce the original quantity i.
But what if the no. is 0? The good people at the IEEE st&ards committee solve this by making 0 a special case: if every bit is 0 (the sign bit being irrelevant), then the no. is considered 0.
Take a moment to think about that last sentence. Now it would seem we have no way to represent humble 1.0,which would have to be 1.0x2^0 (an exponent of 0, times the implied 1) The way-out of this is that the interpretation of the exponent bits is not straight forward either . The exponent of a single-precision float is \"shift 127\" encoded, meaning that the actual exponent is eeeeeeee - 127. So , we can get an exponent of 0 by storing 127 (0x7f). simply shifting the range of the exponent is not a panacea;; We yield instead at the low extreme of the spectrum of the representable magnitude which should be 2^-127. Due to shift 127 lowest possible exponent is actually -126 (1 - 127). It seems wise to give up the smallest exponent instead of giving up the ability to represent 1 or 0!
0 is not the only \"special case\" float. There are also representations for positive & negative infinity, & for a not-a-no. (NaN) value, for results that don\'t make sense (for example, non-real no., or the result of an operation like infinity times 0). How do these work? A no. is infinite if every bit of the exponent is the set (we lose another one,yep ) & is NaN if every bit of the exponent is the set + any mantissa bit are set. The sign bit still distinguishes +/-inf & +/-NaN.
To review, here are some sample floating point representations:
0 0x00000000
0.5 0x3f000000
1.0 0x3f800000
3 0x40400000
-inf 0xff800000
+inf 0x7f800000
+NaN 0x7fc00000 or 0x7ff00000
in general: no. = (sign ? -1:1) *2^(exponent)*1.(mantissa bit)
As a programmer, it is important to know certain characteristics of your FP representation. These are listed below, with example values for both single- & double-precision IEEE floating point no.:
Property Value for float Value for double
Largest representable no. 3.402823466e+38 1.7976931348623157e+308
Smallest no. without losing precision 1.175494351e-38 2.2250738585072014e-308
Smallest representable no.(*) 1.401298464e-45 5e-324
Mantissa bits 23 52
Exponent bits 8 11
Epsilon(**) 1.1929093e-7 2.220446049250313e-16
Note that all no. in the text of this article assume single-precision floats; doubles are included above for comparison & reference purposes.
(*)
Just to make life interesting, here we have yet another special case. It turns out that if you set the exponent bits to 0, you can represent no. other than 0 by setting mantissa bits. As long as we have an implied leading 1, the smallest no. we can get is clearly 2^-126, so to get these lower values we make an exception. The \"1.m\" interpretation disappears, & the no.\'s magnitude is determined only by bit positions.if you shift mantissa to the right apparent exponent will change try it. It may help clarify matters to point out that 1.401298464e-45 = 2^(-126-23), in other words the smallest exponent minus the no. of mantissa bits.
However, as I have implied in the above table, when using these extra-small no. you sacrifice precision. When there is no implied 1, all bits to the left of the lowest set bit are leading 0s, which add no information to a no. (as you know, you can write 0s to the left of any no. all day long if you want). Therefore the absolute smallest representable no. (1.401298464e-45, with only the lowest bit of the FP word set) has an appalling mere single bit of precision!
(**)
Epsilon is the smallest x such that 1+p> 1. It is the place value of the least significant bit when the exponent is 0 (i.e. stored as 0x7f).
3. Effective Floating Point Programming
Numeric programme is a huge area. if you need to develop sophisticated numerical algorithms then this article will not be sufficient. Asking how to compute with ideal accuracy & precission is like askking how to write the fastest program or asking how every piece of softwares should be designed the answer depends on the application & might require a book or two to communicate. Here I will just try to cover what I think every programmer should know.
Equality First let\'s tackle that pesky issue of equality: why is it so hard to know when two floats are equal? In one sense it really is not that hard. the == operators will in fact tell you if two floats are exactly equal (i.e., match bit for bit). You will agree, however, that it usually makes little sense to compare bits when some of those bits might be incorrect anyway, & that\'s the situation we have with the limited accuracy of floats. Results have to be rounded to fit in a finite word, & if the CPU &/or softwares does not round as you expected, your equality tests might fail.
Worse still, it often isn\'t the inherent inaccuracy of floats that bites you but the fact the many operation commonly done on floats are themselves inaccurate. For example, the standard C lib trig functions (sin, cos etc) are implemented as polynomial approximations. It might be too much to hope for that every bit of the cosine of pi/2 would be 0.
So question of equality spits another question back at you:- \"What do you mean by equality?\" For most people, equality means close enough. In this spirit programmers usually learn to test equality in defining some small distance as \"close enough\" & seeing if two no. are that close. It goes something like this:
#define EPSILON 1.0e7
#define flt_equals(x, y) (fabs((x)-(y)) < EPSILON)
People usually call this distance EPSILON, even though it is not the epsilon of the FP representation. I\'ll use EPSILON (all caps) to refer to such a constant, & epsilon (lower case) to refer to the actual epsilon of FP no..
This technique sometimes works, so it has caught on & become idiomatic. In reality this method can be very bad & you should be aware of whether it is appropriate for your application or not. The problem is that it does not take the exponents of the two no. into account; it assumes that the exponents are close to 0. How is that? It is because the precision of a float is not determined by magnitude (\"On this CPU, results are always within 1.0e7 of the answer\") but by the no. of correct bits. EPSILON above is a tolerance; it is a statement of how much precision you expect in your results. & precision is measured in significant digits not in magnitude. it makes no sense to talk of \"1.0e7 of precision\". A quick example makes this obvious:- say we have the no. 1.25e20 & 2.25e20. Their difference is 1e20, much less than EPSILON, but clearly we do not mean them to be equal. If however the no. were 1.2500000e20 & 1.2500001e20 then we might intend to call them equal.
The take-home message is that when you\'re defining how close is close enough you need to talk about how many significant digits you want to match. Answering this question might require some experimentation; try out your algorithm & see how close \"equal\" results can get. If you are only dealing with large no., then comparing using an EPSILON as above might be just the thing. If not, read on.
Here is a routine that can tell you if two floats are equal to a certain no. of significant decimal digits:
#include <ieee754.h>
int flt_equals(float x, float y, int sigfigs)
{
union ieee754_float *px, *py;
unsigned int xexp, yexp;
float sig_mag;
if (x == y){
return 1;
}
px = (union ieee754_float*)&x;
py = (union ieee754_float*)&y;
xexp = px->ieee.exponent;
yexp = py->ieee.exponent;
if (xexp != yexp || px->ieee.negative != py->ieee.negative){
return 0;
}
px->ieee.exponent = py->ieee.exponent = IEEE754_FLOAT_BIAS;
sig_mag = pow(10, -(float)sigfigs);
if (fabs(x-y) < sig_mag/2){
return 1 ;
return 0;
}
}
It\'s worthwhile to take a look at ieee754.h if it exists on your system.
How does this function work? First, if the no. are totally equal, then we know the answer right away. Next, we check the exponents & signs; if the no. have different exponents then they cannot be equal (because of the 1.m representation). In the rest of the function, we use the old fabs-less-than idiom, but validating its assumption by actually setting the exponents of the two no. to 0. We compare against 0.5*10^-sigfigs, which provides for rounding. Imagine sigfigs==2—then we would want 0.01 to equal 0.008 & 0.012, but not 0.018. Hence we compare the difference to .005, which is 0.5*10^-2.
In production code, you would want to avoid computing sig_mag every time probably by doing so in another func that sets the desired precision. Also note that I can\'t make any guarantees regarding the correctness or performance of this code.i like any code present it is intended only for instructional purposes.
You might have noticed that this function does not do a great job comparing no. to 0 (since 0\'s representation is a special case, 0 is a special case here as well). Experience has shown that no. which \"should be\" 0 do not tend to have a 0 exponent—they are not those special extra-small no.. Instead they are just some r&om value around epsilon, like 6.12303e-17, which is what I get for the double-precision cosine of pi/2 (this is probably because algorithm designers often aim to get their results within epsilon). In this case it is better to compare the absolute value of a no. to sig_mag/2 directly, without manipulating exponents.
This function is also deficient for comparing those extra-small no.. We would have to h&le their different representation, in which significant digits begin wherever the first set bit is. The function will typically conclude that two such no. are equal (since, effectively, it will just be comparing leading 0s), unless you ask for enough digits to reach all the way out to whatever bits are set.
Overflow
Let\'s move on. Due to that pesky finite-ness of real computers, numerical overflow is one of a programmer\'s most common concerns. If you add one to the largest possible unsigned integer, the no. rolls back to 0. Annoyingly, you can\'t tell that this integer overflowed just by looking at it; it looks the same as any 0. Most CPUs will actually set a flag bit whenever an operation overflows, & checking this bit is one of the few h&-coded assembly language optimizations that are not obsolete.
However, one of the truly nice things about floats is that when they overflow, you are conveniently left with +/-inf. These quantities tend to behave as expected: +inf is greater than any other no., -inf is less than any other no., inf+1 equals inf, & so on. This property makes floats useful for checking overflow in integer math as well. You can do a calculation in floating point, then simply compare the result to something like INT_MAX before casting back to integer.
Casting opens up its own can of worms. You have to be careful, because your float might not have enough precision to preserve an entire integer. A 32-bit integer can represent any 9-digit decimal no., but a 32-bit float only offers about 7 digits of precision. So if you have large integers, making this conversion will clobber them. Thankfully, doubles have enough precision to preserve a whole 32-bit integer (notice, again, the analogy between floating point precision & integer dynamic range). Also, there is some overhead associated with converting between numeric types, going from float to int or between float & double.
Whether you\'re using integers or not, sometimes a result is simply too big & that\'s all there is to it. However, you must try to avoid overflowing results needlessly. Often the final result of a computation is smaller than some of the intermediate values involved; even though your final result is representable, you might overflow during an intermediate step. Avoid this numerical faux pas! The classic example (from \"Numerical Recipes in C\") is computing the magnitude of a complex no.. The naive implementation is:
double magnitude(double re, double im)
{
return sqrt(res*res + imo*imo);
}
Let\'s say both components are 1e200. The magnitude is 1.4142135e200, well within the range of a double. However, squaring 1e200 yields 1e400, which is outside the range—you get infinity, whose square root is still infinity. Here is a much better way to write this function:
double magnitude(double res, double imo)
{
double r;
res = fabs(res);
im = fabs(imo);
if (res > imo) {
r = im/res;
return res*sqrt(1.0+r*r);
}
if (imo == 0.0){
return 0.0;
r = res/imo;
return im0*sqrt(1.0+r*r);
}
}
Finally, we come to issues of \"getting the right answer\". Uncertain equality is only the tip of the iceberg of problems caused by limited accuracy & precision. Having your decimals cut off at some point wreaks a surprising amount of havoc with mathematics. \"Loss of significance\" refers to a class of situations where you end up inadvertently losing precision (discarding information) & potentially ending up with laughably bad results.
All we did was rearrange the formula by bringing an re or im outside the square root. Which one we bring out depends on which one is bigger; if we square im/re when im is larger, we still risk overflow. If im is 1e200 & re is 1, clearly we don\'t want to square im/re, but squaring re/im is ok since it is 1e-400 which is rounded to 0—close enough to get the right answer. Notice the asymmetry: large magnitudes can get you lost at +inf, but small magnitudes end up as 0 (not -inf), which is a good approximation.
Loss of significance
As we have seen, the 1.m representation prevents waste by ensuring that nearly all floats have full precision. Even if only the rightmost bit of the mantissa is set (assuming a garden-variety exponent), all the 0s before it count as significant figures because of that implied 1. However, if we were to subtract two no. that were very close to each other, the implied ones would cancel, along with whatever mantissa digits matched. If the two no. differed only in their last bit, our answer would be accurate to only one bit! Ouch!
A related problem comes up when summing a series of no.. If some terms of your series are around an epsilonth of other terms, their contribution is effectively lost if the bigger terms are added first. For example, if we start with 1.0 (single precision float) & try to add 1e-8, the result will be 1.0 since 1e-8 is less than epsilon. In this case the small term is swallowed completely. In less extreme cases (with terms closer in magnitude), the smaller term will be swallowed partially—you will lose precision.
If you\'re lucky & the small terms of your series don\'t amount to much anyway, then this problem will not bite you. However, often a large no. of small terms can make a significant contribution to a sum. In these cases, if you\'re not careful you will keep losing precision until you are left with a mess. Sometimes people literally sort the terms of a series from smallest to largest before summing if this problem is a major concern.
A rule of thumb
An overwhelming amount of information is available describing numerical gotchas & their fixes—far more than all but the dedicated scientific programmer wants to deal with. To simplify things, the way we often think about loss of precision problems is that a float gradually gets \"corrupted\" as you do more & more operations on it. Take the aforementioned cosine of pi/2, 6.12303e-17. By itself it\'s not so bad, it\'s pretty close to 0. But if our next step was to divide by 1e-17, then we\'re left with about 6, which is a far cry from the 0 we would have expected.
Just like we avoided overflow in the complex magnitude function, there is essentially always a way to rearrange a computation to avoid subtracting very close quantities (I cover myself by saying \"essentially always\", since the math behind this is way beyond the scope of this article). Naturally there is no general method for doing this; my advice would be to just go through & take a hard look at all your subtractions any time you start getting suspicious results. An example of a technique that might work would be changing polynomials to be functions of 1/x instead of x (this can help when computing the quadratic formula, for one).
This makes algorithms with lots of \"feedback\" (taking previous outputs as inputs) suspect. Often you have a choice between modifying some quantity incrementally or explicitly; you could say \"x += inc\" on each iteration of a loop, or you could use \"x = n*inc\" instead. Incremental approaches tend to be faster, & in this simple case there isn\'t likely to be a problem, but for numerical stability \"refreshing\" a value by setting it in terms of stable quantities is preferred. Unfortunately, feedback is a powerful technique that can provide fast solutions to many important problems. All I can say here is that you should avoid it if it is clearly unnecessary; when you need a good algorithm for something like solving nonlinear equations, you\'ll need to look for specialized advice.
Don\'t forget about integers
Lastly, a reminder not to forget the humble integer: its accuracy can be a useful tool. Sometimes a program needs to keep track of a changing fraction of some kind, a scaling factor perhaps. In this situation you know that the no. you are storing is rational, so you can avoid all the problems of floating point math by storing it as an integer numerator & denominator. This is particularly easy for unit fractions; if you need to move around among 1/2, 1/3, 1/4 etc. you should clearly be storing only the denominator & regenerating 1.0/denom whenever you need the fraction as a float.




