Examples of floating point problems

255 points · by grappler · over 2 years ago

29 comments

WalterBright · over 2 years ago

> NaN/infinity values can propagate and cause chaos

NaN is the most misunderstood feature of IEEE floating point. Most people react to a NaN like they'd react to the dentist telling them they need a root canal. But NaN is actually a very valuable and useful tool!

NaN is just a value that represents an invalid floating point value. The result of any operation on a NaN is a NaN. This means that NaNs propagate from the source of the original NaN to the final printed result.

"This sounds terrible," you might think.

But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.

What's the result of sqrt(-1.0)? It's not a number, so it's a NaN. If a NaN appears in your results, you know you've got a mistake in your algorithm or initial values. Yes, I know, it can be clumsy to trace it back to its source, but I submit it is *better* than having a bad result go unrecognized.

NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.

This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.

NaN is your friend!

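A minimal Go sketch of that propagation behavior (the sensor readings are made-up values, not from the comment above):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        // sqrt of a negative number has no real result, so it yields NaN.
        x := math.Sqrt(-1.0)
        fmt.Println(x, math.IsNaN(x)) // NaN true

        // NaN propagates: once it enters a computation, the result is NaN.
        readings := []float64{1.5, 2.5, math.NaN(), 3.5} // NaN marks a bad sensor
        sum := 0.0
        for _, r := range readings {
            sum += r
        }
        fmt.Println(sum) // NaN: the total is flagged as coming from bad data

        // NaN is also the only value that is not equal to itself.
        fmt.Println(x == x) // false
    }
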
svat · over 2 years ago

If you have only a couple of minutes to develop a mental model of floating-point numbers (and you have none currently), the most valuable thing IMO would be to spend them staring at a diagram like this one: https://upload.wikimedia.org/wikipedia/commons/b/b6/FloatingPointPrecisionAugmented.png (uploaded to Wikipedia by user Joeleoj123 in 2020, made using Microsoft Paint) — it already covers the main things you need to know about floating-point, namely that there are only finitely many discrete representable values (the green lines), and that the gaps between them are narrower near 0 and wider further away.

With just that understanding, you can understand the reason for most of the examples in this post. You avoid both the extreme of thinking that floating-point numbers are mathematical (exact) real numbers, and the extreme of "superstition" like believing that floating-point numbers are some kind of fuzzy, blurry values and that any operation always has some error / is "random", etc. You won't find it surprising that 0.1 + 0.2 ≠ 0.3, that 1.0 + 2.0 will always give 3.0, and yet that 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-) (Sure, this confidence may turn out to be dangerous, but it's better than "superstition".) The second-most valuable thing, if you have 5–10 minutes, may be to go to https://float.exposed/ and play with it for a while.

Anyway, great post as always from Julia Evans. Apart from the technical content, her attitude is really inspiring to me as well, e.g. the contents of the "that's all for now" section at the end.

The page layout example ("example 7") illustrates the kind of issue because of which Knuth avoided floating-point arithmetic in TeX (except where it doesn't matter) and does everything with scaled integers (fixed-point arithmetic). (It was even worse back then, before IEEE 754.)

I think things like fixed-point arithmetic, decimal arithmetic, and maybe even exact real arithmetic / interval arithmetic are actually more feasible these days, and it's no longer obvious to me that floating-point should be the default that programming languages guide programmers towards.

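A quick Go sketch of those three comparisons (using float64 variables, since Go evaluates untyped constant expressions with extra precision):

    package main

    import "fmt"

    func main() {
        a, b, c := 0.1, 0.2, 0.3
        fmt.Println(a+b == c) // false: 0.1 and 0.2 each round to a nearby double

        x, y := 1.0, 2.0
        fmt.Println(x+y == 3.0) // true: small whole numbers are exact in float64

        // Far from zero the representable values are spaced much further apart,
        // so the sum lands on a neighbouring double instead of "3e23".
        big1, big2, big3 := 1e23, 2e23, 3e23
        fmt.Println(big1+big2 == big3) // false
        fmt.Printf("%.0f\n", big1+big2)
    }
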
guyomes · over 2 years ago

Example 4 mentions that the result might be different with the same code. Here is an example that is particularly counter-intuitive.

Some CPUs have the instruction FMA(a,b,c) = ab + c, and it is guaranteed to be rounded to the nearest float. You might think that using FMA will lead to more accurate results, which is true most of the time.

However, assume that you want to compute a dot product between 2 orthogonal vectors, say (u,v) and (w,u) where w = -v. You will write:

p = uv + wu

Without FMA, that amounts to two products and an addition between two opposite numbers. This results in p = 0, which is the expected result.

With FMA, the compiler might optimize this code to:

p = FMA(u, v, wu)

That is one FMA and one product. Now the issue is that wu is rounded to the nearest float, say x, which is not exactly -vu. So the result will be the nearest float to uv + x, which is not zero!

So even for a simple formula like this, testing whether two vectors are orthogonal would not necessarily work by testing whether the result is exactly zero. One recommended workaround in this case is to test whether the dot product has an absolute value smaller than a small threshold.

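A small Go sketch of the effect, with math.FMA standing in for the compiler-generated contraction (the values are made up for illustration):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        u, v := 0.1, 0.3
        w := -v // the vectors (u, v) and (w, u) are exactly orthogonal

        // Two rounded products and a subtraction of equal magnitudes: exactly 0.
        plain := u*v + w*u
        fmt.Println(plain) // 0

        // With a fused multiply-add, w*u is rounded once, but u*v inside the
        // FMA is not rounded separately, so the cancellation is no longer exact.
        fused := math.FMA(u, v, w*u)
        fmt.Println(fused)                   // a tiny non-zero value
        fmt.Println(fused == 0)              // false
        fmt.Println(math.Abs(fused) < 1e-15) // true: the threshold test still works
    }
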
kilotaras · over 2 years ago

Story time.

Back in university I was taking part in a programming competition. I don't remember the exact details of the problem, but it was expected to be solved as a dynamic programming problem with dp[n][n] as the answer, n < 1000. But by wrangling some numbers around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, and the answer was just the sum of the first n terms of the harmonic series. Unluckily for us, the intended solution had worse precision and our solution failed.

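A tiny Go sketch of why two mathematically identical ways of computing such a sum can disagree (the term count is arbitrary):

    package main

    import "fmt"

    func main() {
        const n = 1000

        // Sum the harmonic series largest-term-first and smallest-term-first.
        // The results are mathematically equal but usually differ by a few
        // ulps in float64, enough to fail an exact-match checker.
        forward, backward := 0.0, 0.0
        for i := 1; i <= n; i++ {
            forward += 1.0 / float64(i)
        }
        for i := n; i >= 1; i-- {
            backward += 1.0 / float64(i)
        }
        fmt.Printf("%.17f\n%.17f\n", forward, backward)
        fmt.Println(forward == backward) // usually false
    }
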
kloch · over 2 years ago

> if you add very big values to very small values, you can get inaccurate results (the small numbers get lost!)

There is a simple workaround for this:

https://en.wikipedia.org/wiki/Kahan_summation

It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.

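A compact Go sketch of Kahan (compensated) summation next to naive summation, using 0.01 added ten million times as a made-up workload:

    package main

    import "fmt"

    // kahanSum adds the values while carrying a compensation term that
    // captures the low-order bits lost in each addition.
    func kahanSum(values []float64) float64 {
        sum, comp := 0.0, 0.0
        for _, v := range values {
            y := v - comp
            t := sum + y
            comp = (t - sum) - y // the part of y that didn't make it into t
            sum = t
        }
        return sum
    }

    func main() {
        values := make([]float64, 10_000_000)
        for i := range values {
            values[i] = 0.01
        }

        naive := 0.0
        for _, v := range values {
            naive += v
        }

        fmt.Printf("naive: %.10f\n", naive)            // drifts away from 100000
        fmt.Printf("kahan: %.10f\n", kahanSum(values)) // stays at (or within an ulp of) 100000
    }
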
dunham · over 2 years ago

I had one issue where pdftotext would produce different output on different machines (Linux vs Mac). It broke some of our tests.

I tracked down where it was happening (involving an ==), but it magically stopped when I added print statements or looked at it in the debugger.

It turns out the x86 was running the math at a higher precision and truncating when it moved values out of registers - as soon as it hit memory, things were equal. MacOS was defaulting to -ffloat-store to get consistency (their UI library is float based).

There were too many instances of == in that code base (which IMO is a bad idea with floats), so I just added -ffloat-store to the Linux build and called it a day.

jordigh · over 2 years ago

One thing that pains me about this kind of zoo of problems is that people often have the takeaway, "floating point is full of unknowable, random errors, never use floating point, you will never understand it."

Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats. You should use it! And you should understand that most of its limitations are inherent, fundamental mathematical limitations; it is logically impossible to do better on most of them:

1. Numerical error is a fact of life; you can only delay it or move it to another part of your computation, but you cannot get rid of it.

2. You cannot avoid working with very small or very large things, because your users are going to try, and floating point or not, you'd better have a plan ready.

3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).

But sure, don't use floats for ID numbers; that's always a problem. In fact, don't use bigints either, nor any other arithmetic type, for something you won't be doing arithmetic on.

ogogmad · over 2 years ago

Related: In numerical analysis, I found the distinction between forwards and backwards numerical error to be an interesting concept. The forwards error initially seems like the only right kind, but is often impossible to keep small in numerical linear algebra. In particular, the Singular Value Decomposition cannot be computed with small forwards error. But the SVD can be computed with small backwards error.

Also: The JSON example is nasty. Should IDs then always be strings?

owisd · over 2 years ago

My 'favourite' is that the quadratic formula (-b ± sqrt(b² - 4ac)) / (2a) falls apart when you solve for the positive solution using floating point in cases where ε = 4ac/b² is small, the workaround being to use the binomial expansion -b/(2a)·(0.5ε + 0.125ε² + O(ε³)).

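A Go sketch of the cancellation with made-up coefficients; the rationalized form 2c / (-b - sqrt(b² - 4ac)) is shown as one common stable alternative for this root:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        // x² + 1e8·x + 1 = 0: one root is ≈ -1e-8, the other ≈ -1e8.
        a, b, c := 1.0, 1e8, 1.0
        d := math.Sqrt(b*b - 4*a*c)

        // Textbook formula for the small-magnitude root: -b + d cancels badly.
        naive := (-b + d) / (2 * a)

        // Rationalized form avoids subtracting two nearly equal numbers.
        stable := (2 * c) / (-b - d)

        fmt.Println(naive)  // not even one correct significant digit
        fmt.Println(stable) // ≈ -1e-08, close to the true root
    }
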
BeetleB · over 2 years ago

> example 4: different languages sometimes do the same floating point calculation differently

It's worse than that: you can use the same language, compiler, library, and machine, and still get different results if your OS is different.

I forget all the details, but it boils down to how intermediate results are handled. When you compute certain functions, there are several intermediate calculations before it spits out the result. You get more accuracy if you allow those intermediate calculations to happen in a higher precision format (e.g. you're computing in 32 bits, so it will compute the intermediate values in 64 bits). But that is also slower.

OSes make a "default" choice. I think Linux defaults to slower but more accurate, and BSD defaults to faster but less accurate.

There may be flags you can set to force one configuration regardless of the OS, but you shouldn't assume your libraries do that.

> In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:

> math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl

IEEE 754 doesn't mandate a certain level of accuracy for transcendental functions like sin/log. You shouldn't expect different libraries to give you the same value. If you're doing 64-bit calculations, I would imagine most math libraries will give results accurate enough for 99.99% of math applications, even if only the first 45 bits are correct (and that would be considered "very inaccurate" by FP standards).

asicsp · over 2 years ago

See also "Floating Point visually explained": https://fabiensanglard.net/floating_point_visually_explained/

mochomocha · over 2 years ago

Regarding denormal/subnormal numbers mentioned as "weird": the main issue with them is that their hardware implementation is awfully slow, to the point of being unusable for most computation cases with even moderate FLOPs.

mikehollinger · over 2 years ago

Love it. I actually use Excel, which even power users take for granted, to highlight that people *really* need to understand the underlying system, or the system needs to have guard rails to prevent people from stubbing their toes. Microsoft even had to write a page explaining what might happen [1] with floating point weirdness.

[1] https://docs.microsoft.com/en-us/office/troubleshoot/excel/floating-point-arithmetic-inaccurate-result

BeetleB · over 2 years ago

> addition isn't associative ((x + (y + z)) is different from ((x + y) + z))

A thousand thanks for not saying "addition is not commutative".

(Addition *is* commutative in floating point. It merely is not associative.)

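A one-screen Go sketch of both claims (the values are picked arbitrarily to trigger the effect):

    package main

    import "fmt"

    func main() {
        x, y, z := 1e20, -1e20, 1.0

        left := (x + y) + z  // 0 + 1 = 1
        right := x + (y + z) // y + z rounds back to -1e20, so the sum is 0

        fmt.Println(left, right)   // 1 0
        fmt.Println(left == right) // false: not associative

        // Commutativity does hold: a + b == b + a for (non-NaN) floats.
        a, b := 0.1, 0.3
        fmt.Println(a+b == b+a) // true
    }
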
dahfizz · over 2 years ago

> Javascript only has floating point numbers – it doesn't have an integer type.

Can anyone justify this? Do JS developers prefer not having exact integers, or is this something that everyone just kinda deals with?

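For what it's worth, the practical consequence can be sketched in Go with float64, the same format JavaScript numbers use (the ID value is made up):

    package main

    import "fmt"

    func main() {
        // float64 has a 53-bit significand, so whole numbers are only
        // guaranteed exact up to 2^53.
        var maxSafe float64 = 1 << 53 // 9007199254740992

        fmt.Println(maxSafe+1 == maxSafe) // true: 2^53 + 1 is not representable
        fmt.Println(maxSafe+2 == maxSafe) // false: 2^53 + 2 is representable

        // An integer ID above 2^53 silently changes value when stored as float64.
        id := int64(9007199254740993)
        fmt.Printf("%d %.0f\n", id, float64(id)) // 9007199254740993 9007199254740992
    }
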
evancox100 · over 2 years ago

Example 7 really got me, can anyone explain that? I'm not sure how the "modulo" operation would be implemented in hardware, or whether it is a native instruction or not, but one would hope it would give a result consistent with the matching divide operation.

Edit: x87 has FPREM1, which can calculate a remainder (accurately, one hopes), but I can't find an equivalent in modern SSE or AVX. So I guess you are at the mercy of your language's library and/or compiler? Is this a library/language bug rather than a floating point gotcha?

svnpenn · over 2 years ago

Here is a Go version. It works exactly as expected, no surprises. People just need to grow up and use a modern language, not a 50-year-old, out-of-date language:

    package main

    import "fmt"

    func main() {
        var i, iterations, meters float64
        for iterations = 100_000_000; i < iterations; i++ {
            meters += 0.01
        }
        // Expected: 1000.000000 km
        fmt.Printf("Expected: %f km\n", 0.01*iterations/1000)
        // Got: 1000.000001 km
        fmt.Printf("Got: %f km \n", meters/1000)
    }

Lind5 · over 2 years ago

AI already has led to a rethinking of computer architectures, in which the conventional von Neumann structure is replaced by near-compute and at-memory floorplans. But novel layouts aren't enough to achieve the power reductions and speed increases required for deep learning networks. The industry also is updating the standards for floating-point (FP) arithmetic. https://semiengineering.com/will-floating-point-8-solve-ai-ml-overhead/

cratermoon · over 2 years ago

Muller's Recurrence is my favorite example of floating point weirdness. See https://scipython.com/blog/mullers-recurrence/ and https://latkin.org/blog/2014/11/22/mullers-recurrence-roundoff-gone-wrong/

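A short Go sketch of the recurrence discussed in those posts, x(n) = 108 - (815 - 1500/x(n-2)) / x(n-1) with x(0) = 4 and x(1) = 4.25: the exact sequence converges to 5, but the float64 iteration is pulled toward 100.

    package main

    import "fmt"

    func main() {
        prev, cur := 4.0, 4.25
        for i := 2; i <= 25; i++ {
            next := 108 - (815-1500/prev)/cur
            prev, cur = cur, next
            // Early iterations hover near 5, then rounding error takes over
            // and the sequence heads for 100.
            fmt.Printf("x_%d = %.10f\n", i, cur)
        }
    }
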
marshallward · over 2 years ago

We work hard to retain floating point reproducibility in climate models. I have a presentation on this, if anyone is interested.

https://www.marshallward.org/fortrancon2021/#/title-slide

https://m.youtube.com/watch?v=wQQuFXm6ZqU&t=27s

dkarl · over 2 years ago

I'm not on Mastodon, so I'll share here: I inherited some numerical software that was used primarily to prototype new algorithms and check errors for a hardware product that solved the same problem. It was known that different versions of the software produced slightly different answers, for seemingly no reason. The hardware engineer who handed it off to me didn't seem to be bothered by it. He wasn't using version control, so I couldn't dig into it immediately, but I couldn't stop thinking about it.

Soon enough I had two consecutive releases in hand, which produced different results, and which had *identical numerical code*. The only code I had changed that ran during the numerical calculations was code that ran *between* iterations of the numerical parts of the code. IIRC, it printed out some status information like how long it had been running, how many calculations it had done, the percent completed, and the predicted time remaining.

How could that be affecting the numerical calculations??? My first thought was a memory bug (the code was in C-flavored C++, with manual memory management) but I got nowhere looking for one. Unfortunately, I don't remember the process by which I figured out the answer, but at some point I wondered what instructions were used to do the floating-point calculations. The Makefile didn't specify any architecture at all, and for that compiler, on that architecture, that meant using x87 floating-point instructions.

The x87 instruction set was originally created for floating point coprocessors that were designed to work in tandem with Intel CPUs. The 8087 coprocessor worked with the 8086, the 287 with the 286, the 387 with the 386. Starting with the 486 generation, the implementation was moved into the CPU.

Crucially, the x87 instruction set includes a stack of eight 80-bit registers. Your C code may specify 64-bit floating point numbers, but since the compiled code has to copy those values into the x87 registers to execute floating-point instructions, the calculations are done with 80-bit precision. Then the values are copied back into 64-bit registers. If you are doing multiple calculations, a smart compiler will keep intermediate values in the 80-bit registers, saving cycles and gaining a little bit of precision as a bonus.

Of course, the number of registers is limited, so intermediate values may need to be copied to a 64-bit register temporarily to make room for another calculation to happen, rounding them in the process. And that's how code interleaved with numerical calculations can affect the results even if it semantically doesn't change any of the values. Calculating percent completed, printing a progress bar -- the compiler may need to move values out of the 80-bit registers to make room for these calculations, and when the code changes (like you decide to also print out an estimated time remaining), the compiler might change which intermediate values are bumped out of the 80-bit registers and rounded to 64 bits.

It was silly that we were executing these ancient instructions in 2004 on Opteron workstations, which supported SSE2, so I added a compiler flag to enable SSE2 instructions, and voila, the numerical results matched exactly from build to build. We also got a considerable speedup.

I later found out that there's a bit you can flip to force x87 arithmetic to always round results to 64 bits, probably to solve exactly the problem I encountered, but I never circled back to try it.

fuzzfactor · over 2 years ago

On the hardware, it's fundamentally integer arithmetic under the hood.

Floating point itself is an implementation on top of that.

With a slide rule it's basically truncated integers all the way; you're very limited on significant figures, you float the point in your head or some other way, and apply it afterward. You're constantly aware of any step where your significant figures drop below 3, because that's such a drastic drop from 3 down to 2. Or down to 1, which can really slap you in the face.

The number of decimal places is also naturally an integer, which is related to scientific notation. Whether a result is positive or negative is a simple flag.

On a proper computer, integer addition, subtraction, and multiplication are exact. It's the division which produces integers which are not exact, since they can usually be truncated results, as they are written to memory at the bitness in use at the time, storing the whole-number portion of the quotient only.

Integer arithmetic can usually be made as exact as you want, and it helps to consciously minimize divisions and, if necessary, offset/scale the operands beforehand such that the full range of quotients is never close enough to zero for the truncation to affect your immediate result within the number of significant figures necessary for your ultimate result to be completely reliable.

The number of significant figures available on even an 8-bit computer just blows away the slide rule, but not if you don't take full advantage of it.

What sometimes happens is that zero-crossing functions, when the quotients are not offset, will fluctuate between naturally taking advantage of the enhanced bitness of the hardware for large values (blowing away the slide rule a huge amount of the time), while "periodically" dropping below the accuracy of a slide rule when some quotient, especially an intermediate one, is too near zero. Floating point or integer.

If there's nobody keeping track of not just the magnitude of the possible values, but also the significant figures being carried at each point, your equations might not come out as good as a slide rule sometimes.

Edit: IOW, when the final reportable result is actually a floating point number where you need an accurate idea of how many figures to the right of the decimal point are truly valid, it might be possible to use all-integer calculations to your advantage from the beginning, and confidently place the decimal point as a final act.

Waterluvian · over 2 years ago

Regarding example 2.1:

I thought the JSON spec was that a number is of any precision and that it's up to the parser to say, "best I can do is 754."

That is, I think you can deserialize very long decimal numbers to higher accuracy in some languages.

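Go's standard library is one example: the decoder defaults to float64, but json.Number preserves the original digits (a small sketch; the oversized ID is made up):

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    func main() {
        const payload = `{"id": 9007199254740993}`

        // Default decoding turns every JSON number into a float64,
        // so the ID silently changes.
        var lossy map[string]interface{}
        json.Unmarshal([]byte(payload), &lossy)
        fmt.Printf("%.0f\n", lossy["id"]) // 9007199254740992

        // UseNumber keeps the number as its original decimal string.
        dec := json.NewDecoder(strings.NewReader(payload))
        dec.UseNumber()
        var exact map[string]interface{}
        dec.Decode(&exact)
        fmt.Println(exact["id"]) // 9007199254740993
    }
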
tails4e · over 2 years ago

Got bitten by a big number being added to a small number when converting code from double precision to single precision. Hard to track down when subtle errors get introduced due to this.

toolslive · over 2 years ago

In computer graphics, some people try to accumulate the transformations in one matrix, so A_{n+1} = T_{n} * A_{n}, where T_{n} is a small transformation like a rotation around an axis.

They learn by experience that they also slowly accumulate errors and end up with a transformation matrix A that's no longer orthogonal and will skew the image.

Or people try to solve large linear systems of floating point numbers with a naive Gaussian elimination approach and end up with noise. Same with naive iterative eigenvector calculations.

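A toy Go sketch of the first point, accumulating a million small 2D rotations (the angle and iteration count are arbitrary): the product should stay a rotation (determinant 1, orthogonal columns), but in float64 it drifts.

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        theta := 0.001
        c, s := math.Cos(theta), math.Sin(theta)

        // A = [[a, b], [p, d]] starts as the identity and is repeatedly
        // multiplied on the left by the rotation [[c, -s], [s, c]].
        a, b, p, d := 1.0, 0.0, 0.0, 1.0
        for i := 0; i < 1_000_000; i++ {
            a, b, p, d = c*a-s*p, c*b-s*d, s*a+c*p, s*b+c*d
        }

        det := a*d - b*p
        dot := a*b + p*d // dot product of the two columns; 0 if orthogonal
        fmt.Println(det, dot) // both drift slightly away from 1 and 0
    }
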
ape4 · over 2 years ago

All numbers in JavaScript are floats, unless you make an array with Int8Array(): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Int8Array

I wonder if people sometimes make a one-element integer array this way so they can have an integer to work with.

lifefeed · over 2 years ago

My favorite floating point weirdness is that 0.1 can't be exactly represented in floating point.

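You can see the nearest double to 0.1 by asking for more digits than the default formatting shows, e.g. in Go:

    package main

    import "fmt"

    func main() {
        // 0.1 has no finite binary expansion, so float64 stores the closest
        // representable value instead.
        fmt.Printf("%.30f\n", 0.1) // 0.100000000000000005551115123126
    }
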
weakfortress · over 2 years ago

Used to run into these problems all the time when I was doing work in numerical analysis.

The PATRIOT missile error (it wasn't a *disaster*) was more due to the handling of timestamps than just floating point deviation. There were several concurrent failures that allowed the SCUD to hit its target. IIRC the clock drift was significant and was magnified by being converted to floating point and, importantly, *truncated* into a 24-bit register. Moreover, they weren't "slightly off". The clock drift alone put the missile considerably off target.

While I don't claim that floating point didn't have a hand in this error, it's likely that correct handling of timestamps would not have introduced the problem in the first place. Unlike the other examples given, this one is a better example of knowing your system and problem domain rather than simply forgetting to calculate a delta or being unaware of the limitations of IEEE 754. "Good enough for government work" struck again here.

Aardwolf · over 2 years ago

> but I wanted to mention it because:

> 1. it has a funny name

Reasoning accepted!
