How Numbers Behave on a Chip With No FPU

Before you read: This reflects my current understanding. I am still learning and may have gotten things wrong. If something looks off, I would genuinely appreciate the correction. This is a practical companion to Unit 1 of The Mathematics ML Runs On, where I first wrote these claims down.

adimd/how-numbers-behave-no-fpu All five phases, build instructions, and the full serial output


What's on this page

Why I built this
How to read each probe
Phase 1: building the float, and watching it round
Phase 2: the float starts to break
When my own code made the same mistake
Phase 3: the two edges of the float
Phase 4: underneath the float, the integers
Phase 5: fixed-point, the engineered answer
What the chip taught me
The things that went wrong

Why I built this

Unit 1 of The Mathematics ML Runs On is where I wrote about the numbers themselves: how a real value gets squeezed into 32 bits, why 0.1 cannot be stored exactly, what a machine integer really is, and how fixed-point trades one kind of error for another. Writing it, I kept stating things as fact. The gap between representable floats grows as the numbers get larger. Adding one to a big enough float does nothing. Integers wrap around rather than saturating.

All of that is well documented, and I believe it is correct. But I had mostly described the behaviour rather than watched it, and I wanted to be sure I had it right. So I picked up an ESP32-C3-mini that had been sitting in a drawer and used it to check my own claims, one at a time, against the hardware.

The C3 is a good place for this because it runs a single 32-bit RISC-V core with no floating-point unit. Every floating-point operation is emulated in software, one library routine at a time, while integer and fixed-point math runs natively. A chip with no FPU does not hide the cost of floating-point behind dedicated silicon, so if a claim from the unit has a consequence, it tends to show up plainly here, sometimes as a wrong number and at least once as the board resetting itself.

Chip: ESP32-C3-mini · RISC-V
Clock speed: 160 MHz
Floating-point unit: None (software emulated)
Toolchain: ESP-IDF v5.5.3 · GCC 14.2.0
Optimization: -O3 (deliberate)
Output: Serial only, 115200 baud

The whole approach is to treat the chip as a measuring instrument: state a claim, write the smallest program that forces the chip to confirm or deny it, and read the result off the serial port. What follows is that investigation in order, including the places where my own code tripped over the very things I was studying.

How to read each probe

From here on, every test (I call them probes) follows the same four beats, so it is always clear what is being claimed and whether the chip backed it up:

Hypothesis. The specific claim being put to the test.
The probe. The part of the code that forces the chip to reveal the answer.
Observed. What actually came back over the serial port.
Verdict. Confirmed, confirmed with a twist, or diverges, and the reason why.

One detail that matters throughout: I print floats as their raw IEEE-754 bits rather than as decimals. The chip's minimal C library cannot reliably format floats, and more importantly the bits are the ground truth while a decimal is only a lossy rendering of them. When a decimal and the bits seem to disagree, trust the bits. I also compiled the whole project at -O3, the most aggressive optimization level, which is not the usual default. The reason becomes clear in Phase 4, where it is the one choice that makes the final result visible at all.

Phase 1: building the float, and watching it round

Why start here? Because a float is the default container for almost every number in a machine-learning model: every weight, every activation, every input pixel. If you do not know exactly what that container can and cannot hold, you cannot reason about anything built on top of it. So before testing how floats break, I wanted to confirm what a float actually is on this chip, down to the bit.

A 32-bit float is three fields packed into one word: one sign bit, eight exponent bits, and twenty-three fraction bits, plus a "hidden 1" in front of the fraction that is never stored but always assumed.

Probe 1.1 — a float is exactly its bits

Hypothesis

A float's value can be reconstructed entirely from its three stored fields, with nothing else needed. If that is true, the format is exactly what the unit describes.

The probe

Reinterpret the float's memory as a raw integer, split out the three fields by shifting and masking, then rebuild the value as significand × 2^(exponent − 127) and check it matches.

uint32_t b = f32_bits(6.75f);     // reinterpret the 4 bytes as a raw integer
uint32_t sign = b >> 31;          // 1 sign bit
uint32_t exp  = (b >> 23) & 0xFF;   // 8 exponent bits
uint32_t frac =  b & 0x7FFFFF;       // 23 fraction bits
// value = (1 + frac/2^23) * 2^(exp - 127)

Observed

VALUE:    6.75
          raw bits = 0 10000001 10110000000000000000000
          exp_raw = 129 (=> 2) | mantissa = 0x580000
          rebuilt from bits alone = 6.750

Verdict

Confirmed. 6.75 is 1.6875 × 2². The exponent field reads 129, which is 2 after subtracting the bias of 127, and the fraction plus the hidden 1 gives a significand of 1.6875. The bits alone rebuild the number exactly, so the float really is nothing but those three fields read by a fixed rule.
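The decode-and-rebuild step can be written out in full. This is a minimal sketch, assuming f32_bits is a plain memcpy reinterpretation (the real helper in the repo may differ); f32_rebuild_normal is a hypothetical name, and it only handles normal numbers, not zero, subnormals, infinity, or NaN.

```c
#include <stdint.h>
#include <string.h>

/* Assumed implementation: copy the float's 4 bytes into an integer. */
static uint32_t f32_bits(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    return b;
}

/* Hypothetical helper: rebuild a NORMAL float (exponent field 1..254)
   from its three stored fields and nothing else. */
static double f32_rebuild_normal(uint32_t b) {
    uint32_t sign = b >> 31;                          /* 1 sign bit */
    int      exp  = (int)((b >> 23) & 0xFF) - 127;    /* 8 exponent bits, bias removed */
    uint32_t frac = b & 0x7FFFFF;                     /* 23 fraction bits */

    double value = 1.0 + (double)frac / 8388608.0;    /* hidden 1 + frac / 2^23 */
    for (; exp > 0; exp--) value *= 2.0;              /* scale by 2^exp (exact in double) */
    for (; exp < 0; exp++) value /= 2.0;
    return sign ? -value : value;
}
```

Calling f32_rebuild_normal(f32_bits(6.75f)) walks the same path as the probe: exponent field 129, fraction 0x580000, significand 1.6875, scaled by 2².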

Concept explainer

What is the "hidden 1," and why does the format save it?

A binary number in normalized form always begins with a 1 before the point, like 1.0110…, for the same reason scientific notation always writes a single non-zero digit before the decimal point: you shift the exponent until exactly one leading digit remains. In binary the only non-zero digit is 1, so that leading digit is always a 1.

Because it is always 1, the format does not store it. It assumes it and stores only the fraction that follows, which is why 23 stored bits buy 24 bits of precision. The one exception is when the exponent field is all zeros, the reserved flag for "no hidden 1 here," which covers zero and the subnormal numbers in Phase 3.

Probe 1.2 — most decimals are not on the grid

Hypothesis

A value like 0.1 has no exact binary form, so the chip cannot store it. It should keep the nearest representable value instead and report a small error.

The probe

Ask for 0.1, read back what was actually stored, print the difference, and dump the stored bits so the rounding is visible in the mantissa.

float stored = 0.1f;                  // request 0.1
double err = (double)stored - 0.1;     // how far off the stored value is
print_f32_fields_binary(f32_bits(stored));

Observed

VALUE:    0.1
          requested = 0.100000000
          stored    = 0.100000001   bits = 0 01111011 10011001100110011001101
          error     = 0.000000001

Verdict

Confirmed. One-tenth in binary is 0.0001100110011… repeating forever, and you can see that repeating 1001 pattern in the stored mantissa, cut off and rounded at the last bit (the trailing …101 is the round-up). The chip stored 0.100000001, the nearest value it can represent, and handed that back without comment. I also checked 0.2 and 0.3 (both snapped) and 0.5 (exact, zero error), because one-half is 2⁻¹ and lands precisely on the grid. The reason this matters: almost every decimal constant you type, a 0.1 learning rate, a 0.3 threshold, is stored slightly wrong before a single calculation runs, and every error in the next phase grows from this one.
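The same measurement works for any decimal constant. A minimal sketch, with f32_storage_error as a hypothetical helper name; it assumes the double literal is close enough to the true decimal to serve as the reference, which holds here because a double's own error on 0.1 is many orders of magnitude smaller than the float's.

```c
/* Hypothetical helper: how far the stored float lands from the requested value.
   A result of exactly 0.0 means the value sits on the float grid. */
static double f32_storage_error(double requested) {
    float stored = (float)requested;      /* snap to the nearest representable float */
    return (double)stored - requested;    /* widening float -> double is exact */
}
```

f32_storage_error(0.5) returns exactly zero, while f32_storage_error(0.1) returns a small positive value of about 1.5 × 10⁻⁹, which is what the probe's nine-decimal printout shows as 0.000000001.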

Concept explainer

What it means for a number to be "on the grid"

Because the fraction has a fixed number of bits, a float can only represent a finite set of values, like marks ruled on a number line. When you ask for a value between two marks, the chip stores the nearest mark instead and silently gives you that. The substitution is the rounding error, and it happens to almost every decimal fraction you type.

The part that catches people out, and the thing the whole project keeps returning to, is that the marks are not evenly spaced. They are packed tightly near zero and spread out as the numbers grow. The picture below is the single most useful thing to hold in your head for everything that follows.

Figure: The float grid is not evenly spaced. The marks are dense near zero and spread out as the numbers grow, doubling in gap at every power of two. A value that falls between two marks is stored as the nearest one, which is the rounding "snap." Every strange thing in the next phase comes from this uneven spacing.

Phase 2: the float starts to break

Phase 1 showed that every stored number starts slightly wrong. Phase 2 is where those small errors compound, and it matters because the thing machine learning does most is add up many numbers: a dot product sums hundreds of terms, a loss accumulates across a whole dataset, a weight is nudged by millions of tiny gradient steps. Every one of those is a long running sum, which is exactly the situation where floats misbehave. It is also where one of my probes failed in a way that taught me something.

Probe 2.1 — the gap eventually swallows +1

Hypothesis

Past 2²⁴ (about 16.7 million), the gap between adjacent floats grows larger than 1. If so, adding 1 to such a number should change nothing, because there is no representable value one unit away.

The probe

Take 2²⁴, add 1.0, and check whether the result is actually different from where it started. Compare against 2²³, where the gap is still exactly 1.

float x = 16777216.0f;             // 2^24
float y = x + 1.0f;                 // try to step up by one
printf("changed? %s\n", (y != x) ? "yes" : "NO");

Observed

          gap just above 2^23 (8388608)  = 1.000000
          gap just above 2^24 (16777216) = 2.000000
          2^23 + 1: 8388608  + 1 = 8388609    (changed? yes)
          2^24 + 1: 16777216 + 1 = 16777216   (changed? NO)

Verdict

Confirmed. Just above 2²³ the gap is exactly 1, so +1 still works. One octave higher the gap has doubled to 2, so 16777216 + 1 rounds straight back to 16777216. The increment is smaller than the distance to the next mark, so it has nowhere to land.

Concept explainer

Why the gap doubles at every power of two

The 24-bit significand always carries 24 significant bits, wherever the number sits. Between 2²³ and 2²⁴ those bits are enough to name every integer, so the gap is 1. To represent numbers between 2²⁴ and 2²⁵, the value is twice as large while the significand still has 24 bits, so the smallest step it can express is now 2. The next octave makes it 4, then 8, and so on. Precision is a fixed number of significant figures, which means the absolute spacing grows right along with the size of the numbers. This is the same doubling drawn in the grid figure above.
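The doubling can be measured directly by stepping the bit pattern, the same trick the probes use. A minimal sketch; f32_gap_above is a hypothetical name, and it assumes a positive, finite, normal input.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from x to the next float above it.
   Assumes x is positive, finite and normal. */
static float f32_gap_above(float x) {
    uint32_t b;
    float next;
    memcpy(&b, &x, sizeof b);         /* read the raw bits */
    b += 1;                           /* next bit pattern = next float up */
    memcpy(&next, &b, sizeof next);
    return next - x;                  /* exact: both values share one exponent */
}
```

Walking x through 2²³, 2²⁴, 2²⁵ gives gaps of 1, 2, 4, and at x = 1.0 the same function returns machine epsilon, 2⁻²³.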

Probe 2.2 — machine epsilon is the grid step at 1.0

Hypothesis

Machine epsilon, 2⁻²³ for float32, is the gap between 1.0 and the next representable float. Adding exactly epsilon to 1.0 should land on that next float, while adding half of it or less should round back to 1.0.

The probe

Find the very next representable float above 1.0 by adding 1 to its bit pattern, subtract to get the step, and compare against the library constant. Then test that adding epsilon moves and adding half-epsilon does not.

uint32_t b = f32_bits(1.0f) + 1;       // next float above 1.0
float nextUp; memcpy(&nextUp, &b, 4);
float eps = nextUp - 1.0f;             // the grid step at 1.0

Observed

          next float above 1.0 - 1.0 = 0.000000119209
          FLT_EPSILON from <float.h>   = 0.000000119209
          1.0 + eps/2 == 1.0 ? yes (vanished)
          1.0 + eps   == 1.0 ? no  (stuck)

Verdict

Confirmed. The measured step matched the published constant to the digit. Adding exactly epsilon moves to the next mark. Adding exactly half a step lands midway between two marks, and the default round-to-nearest, ties-to-even rule sends it back to 1.0, whose last mantissa bit is 0. Epsilon is simply the grid spacing at 1.0.

Probe 2.3 — accumulation freezes, creeps, or works

Hypothesis

A long running sum of a small increment will behave differently depending on how the increment compares to the local grid gap: below half the gap it freezes, just above epsilon it creeps and loses accuracy, and well above the gap it accumulates cleanly.

The probe

Add an increment a million times and compare how far the value actually moved against the exact expected total.

float v = start;
for (int i = 0; i < 1000000; i++) v += inc;  // a million tiny additions
// compare (v - start) against inc * 1000000

Observed

  inc < eps/2 : should add 0.05, actually moved 0.000000   (FROZEN)
  inc > eps   : should add 10.0, actually moved 9.870076   (CREEPS)
  inc >> gap  : should add 1000000, actually moved 1000000 (CLEAN)

Verdict

Confirmed. Same loop, three different fates. When the increment was below half the gap, a million additions moved the value by exactly nothing, because every single add rounded straight back. Just above epsilon it crept, but each addition was rounded to the nearest mark, and those per-step errors piled up until it ended at 9.87 instead of 10.0. Well above the gap, every addition landed exactly. The outcome is decided entirely by the increment versus the local grid spacing.
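The frozen and clean cases are easy to reproduce on any machine with IEEE-754 single precision. A minimal sketch: accumulate is a hypothetical helper, and the start values and increments in the note below are my own choices for illustration, not the ones the probe used.

```c
/* Hypothetical helper: add inc to start n times, rounding to float every step.
   volatile forces each partial sum to be stored as a real 32-bit float. */
static float accumulate(float start, float inc, int n) {
    volatile float v = start;
    for (int i = 0; i < n; i++) {
        v = v + inc;
    }
    return v;
}
```

Starting at 1.0 with an increment of 5e-8 (below half of epsilon), a million additions return exactly 1.0. Starting at 0.0 with an increment of 1.0, a million additions return exactly 1000000, because every partial sum stays below 2²⁴. An increment such as 1e-5 from 1.0 falls in between and drifts, by an amount that depends on the start value.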

Probe 2.4 — addition is not associative (and the test that fooled me)

Hypothesis

Summing the same numbers in a different order gives a different total, because where rounding happens depends on the running sum.

The probe

Sum a large value plus a long list of small ones, once smallest-to-largest and once largest-to-smallest, and print the difference.

float asc = 0, desc = 0;
for (int i = 0;   i < n;  i++) asc  += vals[i];  // small -> large
for (int i = n-1; i >= 0; i--) desc += vals[i];  // large -> small
printf("discrepancy = %f\n", asc - desc);

First attempt: it failed

My first run used values too close together in magnitude, and both orders produced an identical total. The probe printed a discrepancy of zero and reported that order did not matter, the exact opposite of what I was trying to show.

What went wrong: The effect only appears when the magnitudes are far enough apart that the small values fall below the large value's grid gap. My data was too gentle, so nothing was lost in either order. The chip had quietly handed me a precondition that the usual one-line phrasing of "addition is not associative" leaves out.

Observed (after pushing the large value above 2²⁴)

VALUE:    1e8 plus 1,000,000 x 1.0 (big above 2^24)
          exact (double)        = 101000000.000
          float small-to-large  = 101000000.000   (err 0.000)
          float large-to-small  = 100000000.000   (err -1000000.000)
          discrepancy (asc-desc)= 1000000.000

Verdict

Confirmed, with the precondition spelled out. The same million-and-one numbers differ by exactly one million depending on order. Small-first, the little values pile up into a sum big enough to register against the large one. Large-first, every later +1 falls into the big value's gap and disappears, one at a time, a million times over. The "wrong" answer is the format being honest about what it can hold. This is the reason a sum run on two different machines, or split across two different numbers of threads, can give two different totals from identical inputs, and why exact reproducibility in numerical code is harder than it looks.
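The two orders can be isolated into a pair of functions. A minimal sketch under the same setup as the observed run (one large value plus many copies of a small one); the function names are hypothetical, and it assumes the code is not built with -ffast-math, which would let the compiler reorder the additions.

```c
/* Hypothetical helpers: the same n+1 numbers, summed in two orders. */

/* Large value first: each small add is rounded against the big running total. */
static float sum_large_first(float big, float small, int n) {
    float s = big;
    for (int i = 0; i < n; i++) s += small;
    return s;
}

/* Small values first: they accumulate among themselves before meeting big. */
static float sum_small_first(float big, float small, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += small;
    return s + big;
}
```

With big = 1e8, small = 1.0 and n = 1000000, the gap at 1e8 is 8, so sum_large_first returns 100000000 while sum_small_first returns 101000000.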

Concept explainer

How Kahan summation rescues the lost bits

When a small number is added to a large running total, the bottom bits of the small number fall off the end and are lost. Kahan summation keeps a second variable that tracks exactly those discarded bits, and adds the leftover back in before the next step, so the bits that would have vanished get a second chance to count. It is bookkeeping, not magic. On ten thousand copies of 0.1, the naive sum was off by 0.097 and Kahan brought the error down to 0.000014, about six thousand times closer to the true total.
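The bookkeeping fits in four lines inside the loop. This is the textbook form of Kahan summation, offered as a sketch rather than as the project's exact code; the function names are hypothetical, and it must not be compiled with -ffast-math, because that permits the compiler to simplify the correction term away.

```c
/* Plain running sum, for comparison. */
static float sum_naive(const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += x[i];
    return sum;
}

/* Kahan (compensated) summation. */
static float sum_kahan(const float *x, int n) {
    float sum = 0.0f;
    float c   = 0.0f;              /* running estimate of the bits lost so far */
    for (int i = 0; i < n; i++) {
        float y = x[i] - c;        /* feed the previously lost part back in */
        float t = sum + y;         /* big + small: low bits of y fall off here */
        c = (t - sum) - y;         /* recover exactly what fell off */
        sum = t;
    }
    return sum;
}
```

On an array of ten thousand copies of 0.1f, sum_kahan lands far closer to 1000 than sum_naive does, at the cost of three extra float operations per element, which on a chip with no FPU means three extra software routines.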

When my own code made the same mistake

This is the failure that taught me the most, because the tool failed for exactly the reason the project exists.

The C3's minimal C library does not reliably print floats with printf("%f"), so I wrote my own routine: take the integer part, then peel off decimal digits one at a time. It worked for ordinary numbers. Then I asked it to print 10000000.0, and later FLT_MAX (about 3.4 × 10³⁸), and it printed 2147483647, which is exactly the largest value a 32-bit signed integer can hold.

I had stored the float's value in a 32-bit integer to do the digit-peeling. A 32-bit signed integer tops out at 2147483647, so a float beyond about two billion has no integer to convert to. In C that out-of-range conversion is undefined behaviour, and on this chip it came back pinned at the integer maximum. It happened three times, on three different oversized values, before I taught the printer to recognise when a value is too big for it and say so:

OBSERVED: FLT_MAX        = (>2^31, see bits)
          FLT_MAX bits   = 0 11111110 11111111111111111111111

Now it refuses to print a pinned integer that only looks like a value and points at the raw bits instead, which are always correct. It is the clearest reason the rest of the project trusts bits over decimals.
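The guard amounts to a range check before the conversion. A minimal sketch of the idea, not the printer's actual code; f32_to_i32_checked is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert only when the float's integer part fits in int32.
   2^31 is exactly representable as a float, so the bounds below are exact.
   The negated comparison also rejects NaN, which fails every ordering test. */
static bool f32_to_i32_checked(float v, int32_t *out) {
    if (!(v >= -2147483648.0f && v < 2147483648.0f)) {
        return false;              /* too big (or NaN): caller should print the bits */
    }
    *out = (int32_t)v;             /* in range, so the conversion is well defined */
    return true;
}
```

A printer built on this would fall back to a message like the "(>2^31, see bits)" line whenever the function returns false.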

Phase 3: the two edges of the float

The edges matter because they are where a program does not just compute a slightly wrong answer, it falls over. Anyone who has trained a model has watched the loss suddenly print nan and every number afterward turn to garbage; that failure lives in this phase. Values that explode run into infinity at the top, values that vanish underflow to zero at the bottom, and a single undefined operation produces NaN, which then spreads to everything it touches. Phase 3 walks the float off both edges and meets the value that refuses to behave like a number.

Probe 3.1 — overflow saturates to infinity

Hypothesis

There is a largest finite float. Doubling past it should not wrap or error; it should saturate to a reserved value, +infinity, encoded as an all-ones exponent with a zero mantissa.

The probe

Start near FLT_MAX, double repeatedly, and print the bits at each step to catch the moment the value stops being finite.

float x = FLT_MAX * 0.5f;
for (int s = 0; s < 4; s++) { x *= 2.0f; print_bits(f32_bits(x)); }

Observed

            step 0: x2 -> (>2^31, see bits)   bits = 0 11111110 11111111111111111111111
            step 1: x2 -> +infinity           bits = 0 11111111 00000000000000000000000
          infinity + 1   == infinity ? 1
          infinity - infinity        -> NaN (undefined)

Verdict

Confirmed. The largest finite float has exponent 11111110 and an all-ones mantissa. One doubling later the exponent becomes all ones and the mantissa goes to zero, the reserved code for infinity. Infinity behaves like a limit: infinity + 1 stays infinity, while infinity − infinity produces NaN, which sets up the last probe in this phase.

Probe 3.2 — underflow ramps down through the subnormals

Hypothesis

Below the smallest normal float, the format does not jump straight to zero. It enters the subnormal range, switching off the hidden 1 and losing one bit of precision per step, until it finally flushes to zero.

The probe

Start at the smallest normal float and halve repeatedly, printing the bits so you can watch the single significant bit walk to the right.

float v = FLT_MIN;                 // smallest normal float
while (v != 0.0f) { v /= 2.0f; print_bits(f32_bits(v)); }

Observed

            step  1: bits = 0 00000000 10000000000000000000000  -> subnormal
            step  3: bits = 0 00000000 00100000000000000000000  -> subnormal
            step 11: bits = 0 00000000 00000000001000000000000  -> subnormal
            step 24: bits = 0 00000000 00000000000000000000000  -> ZERO (flushed)

Verdict

Confirmed. Once the exponent field hits zero the hidden 1 is gone, and each halving shifts the lone significant bit one place right, holding fewer meaningful bits each time, until at step 24 the mantissa empties and the value is true zero. It is the mirror image of the overflow climb, at the other end of the scale.

Concept explainer

Why subnormals exist at all

Without them, the smallest normal float would sit some distance from zero, with nothing but zero below it, leaving a sudden cliff next to the origin. A subtraction of two close small numbers could then snap to zero even when the true answer was small but non-zero. Subnormals fill that cliff with a ramp: by switching off the hidden 1, the format keeps producing ever-smaller values, losing precision gradually rather than all at once. Watching the single bit walk right is watching the ramp descended one rung at a time.
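The length of the ramp can be counted. A minimal sketch, with halvings_until_zero as a hypothetical name; it assumes subnormals are enabled, which is the default unless a flush-to-zero mode or -ffast-math is in effect.

```c
#include <float.h>

/* Hypothetical helper: halve FLT_MIN until it reaches zero and count the steps.
   volatile keeps each halving as a real float operation. */
static int halvings_until_zero(void) {
    volatile float v = FLT_MIN;    /* smallest normal float, 2^-126 */
    int steps = 0;
    while (v != 0.0f) {
        v = v / 2.0f;              /* one rung down the subnormal ramp */
        steps++;
    }
    return steps;
}
```

The count is 24, matching the probe: 23 halvings walk the single bit across the 23 fraction positions, and the 24th lands exactly halfway between the smallest subnormal and zero, where the tie rounds to zero.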

The second failure: software floats trip the watchdog

While the underflow probe was running, the board reset itself with a "task watchdog" error. The watchdog expects the operating system's idle task to get a slice of CPU time periodically; if too long passes without that, it assumes the program has hung and recovers the system.

What went wrong: Nothing had hung. The Phase 2 accumulation loops, each a million software-emulated float operations strung back to back, were monopolising the processor so completely that the idle task never ran. Floating-point is slow enough on a chip with no FPU that an ordinary loop can starve the operating system. The cost of the missing hardware showed up not as a wrong number but as a system fault.

The fix was to yield the processor briefly between heavy probes so the idle task, and therefore the watchdog, stays satisfied. The lesson was the more useful takeaway: on this chip, "just add some floats in a loop" is not free, and the absence of an FPU reaches all the way up to the scheduler.
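One way to express that yield under ESP-IDF is a short blocking delay between probes. This is a sketch under assumptions, not the project's code: probe_yield, run_probes, and example_probe are hypothetical names, the 10 ms figure is arbitrary, and the ESP_PLATFORM guard is only there so the same file also builds on a desktop compiler.

```c
#ifdef ESP_PLATFORM                      /* defined by the ESP-IDF build system */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#endif

/* Hypothetical helper: block briefly so the lowest-priority idle task can run. */
static void probe_yield(void) {
#ifdef ESP_PLATFORM
    vTaskDelay(pdMS_TO_TICKS(10));       /* assumed delay; any blocking call would do */
#endif
}

typedef void (*probe_fn)(void);

/* Stand-in for a heavy probe: a burst of software-emulated float adds. */
static void example_probe(void) {
    volatile float v = 1.0f;
    for (int i = 0; i < 1000; i++) v = v + 1.0f;
}

/* Run each probe, yielding after every one. Returns how many ran. */
static int run_probes(const probe_fn *probes, int count) {
    int ran = 0;
    for (int i = 0; i < count; i++) {
        probes[i]();
        probe_yield();
        ran++;
    }
    return ran;
}
```

A blocking delay is used rather than a bare yield because the idle task has the lowest priority and only runs when every other task is blocked.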

Probe 3.3 — NaN is not equal to itself

Hypothesis

The result of an undefined operation like 0.0/0.0 is NaN, a value that fails its own equality test, is unordered against everything, and contaminates any arithmetic it touches.

The probe

Make a NaN honestly from 0.0/0.0 (using volatile so the optimizer cannot fold it away) and test equality, ordering, and contamination, with infinity as a control.

volatile float zero = 0.0f;
float nan = zero / zero;             // 0/0 is undefined -> NaN
printf("nan == nan ? %d\n", (nan == nan));

Observed

          0.0/0.0 bits = 0 11111111 10000000000000000000000
          nan == nan ? 0     (every other value equals itself)
          nan <  1.0 ? 0
          nan >  1.0 ? 0     (both false: unordered, not just 'big')
          nan * 0.0  -> NaN  (not 0! NaN wins)
          (contrast) inf == inf ? 1

Verdict

Confirmed. NaN is the one value where x == x is false. It is neither greater nor less than 1.0; it sits outside the ordering entirely. And it is contagious: NaN × 0 is NaN, not zero, so once it enters a computation it spreads. The control is the telling part. Infinity shares the same all-ones exponent yet equals itself; the only difference between "infinity" and "not a number" is the mantissa, zero versus non-zero, and the chip shows that directly in the bits.
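The mantissa test that separates the two can be written straight from the bit layout. A minimal sketch; f32_class and f32_classify are hypothetical names (the standard library's fpclassify does the same job where it is available).

```c
#include <stdint.h>
#include <string.h>

typedef enum { F32_FINITE, F32_INFINITE, F32_NAN } f32_class;

/* Hypothetical helper: classify a float from its raw fields alone. */
static f32_class f32_classify(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    uint32_t exp  = (b >> 23) & 0xFF;
    uint32_t frac = b & 0x7FFFFF;
    if (exp != 0xFF) return F32_FINITE;          /* any other exponent is an ordinary value */
    return frac == 0 ? F32_INFINITE : F32_NAN;   /* all-ones exponent: mantissa decides */
}
```

Because it reads bits instead of comparing values, this check keeps working in the one case where x == x cannot be trusted.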

Phase 4: underneath the float, the integers

Floats are the expensive guest on this hardware; integers are native and fast. That gap is the whole reason quantized models exist: to run a network on a chip like this, you convert its floats into small integers and do the arithmetic in integer units, trading a little accuracy for a large amount of speed and memory. So the integer's own quirks are not a side topic, they are how efficient inference actually computes. Phase 4 drops to the integer world, which has edges of its own, and ends with the result that stuck with me most.

Probe 4.1 — signed integers wrap, they do not saturate

Hypothesis

A signed 8-bit integer holds −128 to 127. Pushing past the top should not saturate like the float did; it should wrap around to the most negative value.

The probe

Compute values just past the maximum and cast them into an int8_t, printing each as both a signed number and a raw byte.

int8_t v = (int8_t)128;        // one past the max of 127
printf("128 as int8 = %d (0x%02X)\n", v, (uint8_t)v);

Observed

          int8: 100 + 50 = -106   (150 doesn't fit in -128..127)
          127 as int8 =  127  (bits 0x7F)
          128 as int8 = -128  (bits 0x80)  <- WRAPPED to negative
          129 as int8 = -127  (bits 0x81)

Verdict

Confirmed. 100 + 50 comes out as −106. The maximum 0x7F and the minimum 0x80 are adjacent bit patterns, so counting one past the top lands you at the bottom. This is the integer mirror of overflow: the float hit a wall and stuck at infinity, the integer loops cleanly back to the most negative value.

Concept explainer

Why two's complement wraps instead of saturating

Signed integers are stored in two's complement, which lets one ordinary adder handle both positive and negative numbers without special cases. The bit patterns are arranged so that counting past the largest positive value rolls over to the most negative one; the arithmetic is really being done modulo 2ⁿ, like a clock face whose top half is labelled negative. So 100 + 50 genuinely is 150 inside the byte, but 150 in eight bits is 0x96, and that pattern means −106 when read as signed. The bits are right; the interpretation is what surprises you.
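The clock-face arithmetic can be spelled out without relying on a cast. A minimal sketch, with wrap_to_int8 as a hypothetical name: keep the low eight bits, then relabel the top half of the dial as negative.

```c
/* Hypothetical helper: what an 8-bit two's-complement byte makes of any int. */
static int wrap_to_int8(int value) {
    unsigned low = (unsigned)value & 0xFFu;          /* value modulo 256 */
    return low >= 128u ? (int)low - 256 : (int)low;  /* 0x80..0xFF read as negative */
}
```

wrap_to_int8(100 + 50) gives -106, because 150 is 0x96 and 0x96 − 256 = −106. Likewise 128 becomes −128 and 129 becomes −127, matching the observed output.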

Probe 4.2 — integer division truncates toward zero

Hypothesis

Integer division rounds toward zero rather than toward negative infinity, the remainder takes the sign of the dividend, and the identity quotient × divisor + remainder == dividend always holds.

The probe

Run all four sign combinations and check the quotient, the remainder, and the reconstructing identity each time.

printf("%d / %d = %d, rem %d\n", -7, 2, -7/2, -7%2);

Observed

          -7 / 2  = -3   (not -4; truncates toward zero)
          -7 % 2  = -1   (remainder follows the dividend's sign)
          identity: -3*2 + -1 = -7   (== dividend? yes)
          -1 / 2  =  0   (not -1; the array-index trap)

Verdict

Confirmed. -7 / 2 is -3, not the -4 you would get from rounding down, and -1 / 2 is 0, the case that quietly breaks hand-rolled indexing and hashing. The reconstructing identity held in all four sign quadrants, so the rules are consistent; they are simply not the rules most people assume.
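When code needs the round-down behaviour instead, for example to map negative coordinates onto buckets, it has to be built on top of the truncating operators. A minimal sketch with hypothetical names floor_div and floor_mod; both assume a non-zero divisor and do not guard the INT_MIN / -1 case.

```c
/* Hypothetical helpers: division that rounds toward negative infinity,
   and the matching remainder, which takes the sign of the divisor. */
static int floor_div(int a, int b) {
    int q = a / b;                                   /* truncates toward zero */
    int r = a % b;
    if (r != 0 && ((r < 0) != (b < 0))) q -= 1;      /* step down when signs differ */
    return q;
}

static int floor_mod(int a, int b) {
    int r = a % b;                                   /* sign follows the dividend */
    if (r != 0 && ((r < 0) != (b < 0))) r += b;      /* move into the divisor's sign */
    return r;
}
```

With these, floor_div(-7, 2) is -4 and floor_div(-1, 2) is -1, while the built-in operators keep giving -3 and 0.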

Probe 4.3 — the chip and the compiler disagree

Hypothesis

Signed integer overflow is undefined behaviour, not a defined wrap. So the same expression can give one answer when the hardware is forced to run it and a different answer when the optimizer is free to assume overflow never happens.

Concept explainer

What is undefined behaviour?

Some operations in C are left without a defined result by the language standard. Signed integer overflow is the classic case. The standard does not promise it wraps; it says the behaviour is undefined, which means the compiler is entitled to assume the situation never arises and optimize on that assumption. This is different from the friendly 8-bit cast above: converting an out-of-range value to a narrower signed type is implementation-defined rather than undefined, and GCC documents the result as the value reduced modulo 2ⁿ. Overflow of a plain int mid-expression is a promise you made and then broke, and once broken the compiler owes you nothing in particular, including agreement with the hardware.

The probe

Compute the same thing two ways at INT_MAX. One path uses a volatile variable, forcing the compiler to emit a real add. The other puts (x + 1) > x in a function the optimizer can reason about.

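volatile int x = INT_MAX;      // volatile: the compiler must load x and emit a real add
int wrapped = x + 1;           // still undefined in C; this shows what the hardware add does

// in a separately optimized function, on a plain (non-volatile) int:
static int plus_one_is_greater(int n) { return (n + 1) > n; }  // optimizer may fold to 'return 1'
int assumed = plus_one_is_greater(INT_MAX);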

Observed

          INT_MAX = 2147483647
          runtime  INT_MAX + 1 = -2147483648   (the chip wrapped to INT_MIN)
          compiler (x+1) > x at INT_MAX returns 1   (assumed no overflow)

Verdict

Diverges, and that is the point. Forced to execute, the chip wrapped INT_MAX + 1 to the most negative integer. Asked to reason about (x + 1) > x, the optimizer returned true, because it assumed adding 1 always makes a number larger. The hardware says the addition wrapped negative; the compiler says it grew. Both came from the same toolchain, on the same chip, in one run, and both are legal, because the operation has no defined answer. The practical lesson is blunt: you cannot rely on signed overflow "just wrapping," because the compiler is allowed to assume it never happens and quietly delete the code that handles it.

Why I compiled at -O3: at a gentle optimization level the compiler might just compute the wrap and print the expected answer, and the disagreement would never appear. The aggressive setting gives the optimizer both the room and the reason to act on its assumption and contradict the hardware out loud. The optimization level was the instrument's sensitivity dial, not a performance choice.
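The safe alternative is to test for overflow before performing the addition, using only comparisons that cannot themselves overflow. A minimal sketch; add_overflows is a hypothetical name.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: would a + b leave the range of int?
   Every expression here stays in range, so nothing is undefined
   and the optimizer has no assumption to exploit. */
static bool add_overflows(int a, int b) {
    if (b > 0) return a > INT_MAX - b;   /* would pass the top */
    if (b < 0) return a < INT_MIN - b;   /* would pass the bottom */
    return false;                        /* adding zero never overflows */
}
```

Unlike (x + 1) > x, the check add_overflows(x, 1) gives the same answer at every optimization level, because the overflowing addition is never executed.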

Phase 5: fixed-point, the engineered answer

Every phase so far showed a number system misbehaving. Fixed-point is the deliberate response, and it is worth understanding because it is essentially what a quantized model does: keep the speed of integers while recovering the ability to represent fractions, by agreeing on a hidden scale. On a chip with no FPU this is not an academic alternative to floating-point, it is the practical way to do fractional math at all. The question this phase answers is the one the whole project has been circling: given everything wrong with floats, why not always use fixed-point instead?

Concept explainer

How Q16.16 turns an integer into a fraction-holder

In Q16.16 you fix a scale of 2¹⁶ (65536) and agree that the integer you store represents the real value times that scale. To store 3.5 you keep 3.5 × 65536 = 229376; to read it back you divide by 65536. The decimal point is never stored, it lives in your agreement about the scale. Adding two values is just an integer add, since both share the scale; multiplying is an integer multiply followed by a shift right by 16 to remove the doubled scale. No floating-point instruction is involved, so on this chip there is no slow software emulation: it is the fast, native path to fractional math.
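The whole scheme fits in a handful of one-line functions. A minimal sketch with hypothetical names; the double-based conversions are meant for setup and debugging (on this chip they would themselves be software-emulated), and the multiply assumes right-shifting a negative 64-bit value is an arithmetic shift, which GCC provides but the C standard leaves implementation-defined.

```c
#include <stdint.h>

typedef int32_t q16_16;                  /* stored integer = real value * 65536 */
#define Q_ONE 65536

/* Conversion helpers: round to the nearest step of 1/65536. */
static q16_16 q_from_double(double x) {
    return (q16_16)(x * Q_ONE + (x >= 0 ? 0.5 : -0.5));
}
static double q_to_double(q16_16 q) {
    return (double)q / Q_ONE;
}

/* Both operands share the scale, so addition is a plain integer add. */
static q16_16 q_add(q16_16 a, q16_16 b) {
    return a + b;
}

/* The product carries the scale twice (2^32); widen to 64 bits so it fits,
   then shift right by 16 to get back to a single scale. */
static q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * b) >> 16);
}
```

For example, q_from_double(3.5) is 229376, and multiplying it by the Q16.16 form of 2.0 gives 458752, which reads back as 7.0. Neither q_add nor q_mul checks for overflow.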

Probe 5.1 — fixed-point's 0.1 is worse than the float's

Hypothesis

Storing 0.1 in Q16.16 will also produce an error, since one-tenth is not a clean multiple of 1/65536 either. The interesting question is how that error compares to the float's.

The probe

Scale 0.1 to its nearest Q16.16 integer, read it back, and measure the error.

int32_t q = (int32_t)(0.1 * 65536 + 0.5);   // store 0.1 as a scaled integer
double back = (double)q / 65536.0;             // read it back

Observed

          float  0.1  error = 0.000000001       (from Phase 1)
          Q16.16 0.1  stored = 6554, error = 0.00000610

Verdict

Confirmed, and surprising. Fixed-point's error on 0.1 is about four thousand times larger than the float's: roughly 0.0000061 against roughly 0.0000000015 (the float figure above is shown to only nine decimal places, which cuts it to 0.000000001). That feels backwards until you remember the grid figure from Phase 1: near zero, the float's marks are extraordinarily fine, far finer than Q16.16's fixed step of 1/65536. The next probe shows the price the float pays for that.

Probe 5.2 — uniform grid versus uneven grid

Hypothesis

The float's grid spacing grows with magnitude, while Q16.16's spacing is constant everywhere. Neither is strictly better; they trade range against uniformity.

The probe

Measure the gap to the next representable value for both formats at several magnitudes.

uint32_t b = f32_bits(m) + 1; float nx; memcpy(&nx, &b, 4);
double float_gap = (double)nx - m;   // grows with m
double q_gap = 1.0 / 65536.0;        // always the same

Observed

magnitude   float gap to next   Q16.16 gap
0.5         0.00000006          0.00001526
1.0         0.00000012          0.00001526
100         0.00000763          0.00001526
100000      0.00781250          out of range
2^24        2.00000000          out of range

Verdict

Confirmed, and neither format wins. The float is finer near zero but grows coarser as the numbers climb, out to a gap of 2.0 near 2²⁴. Q16.16 keeps one constant fine step everywhere, but runs out of range entirely past about ±32768, and one step beyond its maximum it wraps, exactly like the integers in Phase 4. The float buys enormous range by spending precision unevenly; fixed-point buys uniform precision by giving up range. Every misbehaviour from the earlier phases turns out to be one half of this single trade-off.

What the chip taught me

The confirmations were reassuring. 0.1 snaps, the gap grows, addition is not associative, integers wrap, NaN breaks equality. Every claim from the unit held. Seeing them happen, in raw bits rather than as sentences I had written, made me understand them better than writing the unit had.

The failures taught me more, and most of them were not in the unit. The decimal printer overflowed three times because I kept asking a 32-bit integer to hold a float's value, the exact mismatch I was studying. The summation probe proved nothing the first time because my test data was too gentle, which taught me that "addition is not associative" carries a precondition most explanations skip. The board tripped its watchdog because software floating-point is slow enough to starve an operating system. And the signed-overflow probe was the one result with no single ground truth: the chip and the compiler disagreed about the same expression.

There is also one thread I did not chase on the board but understand more clearly now. The famous 0.1 + 0.2 != 0.3 demonstration is almost always shown in double precision. In single-precision float the rounding lands differently: under IEEE 754 round-to-nearest, 0.1f + 0.2f rounds to the same bit pattern as 0.3f, so the classic inequality does not appear at all. Even the canonical examples are precision-specific, and "floating-point is weird" is too blunt to be useful. Which floating-point, and at what magnitude, is the real question, and it is one I want to answer more carefully the next time I write about it.
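For anyone who wants to check that precision dependence directly, here is a small sketch. It is not part of the project and I have not run it on the C3; it assumes IEEE 754 arithmetic with the default round-to-nearest mode, and it stores every intermediate in a volatile variable so the compiler cannot quietly keep extra precision. The function names are hypothetical.

```c
#include <stdbool.h>

/* Does 0.1 + 0.2 compare equal to 0.3 in single precision? */
static bool tenth_sum_equal_f32(void) {
    volatile float a = 0.1f, b = 0.2f, c = 0.3f;
    volatile float sum = a + b;   /* forced to round to a 32-bit float */
    return sum == c;
}

/* The same question in double precision. */
static bool tenth_sum_equal_f64(void) {
    volatile double a = 0.1, b = 0.2, c = 0.3;
    volatile double sum = a + b;  /* forced to round to a 64-bit double */
    return sum == c;
}
```

On an IEEE 754 platform the single-precision version reports equality and the double-precision version does not, because the rounding errors in the three constants happen to cancel at 24 bits of significand but not at 53.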

That is what doing this on hardware gave me that writing the unit did not. The math in Unit 1 is correct, but correctness on paper does not make you feel that a tenth costs precision differently at a hundred than at a hundred thousand, or that an ordinary loop can reset the board, or that INT_MAX + 1 has two legal answers depending on who you ask. The chip already knew all of that. I just had to build the instrument and ask.

Pulling it together, the five phases are one story told from different angles. A finite number of bits forces a trade between range, precision, and speed, and every quirk in this project is that single trade showing its hand. The float reaches enormous magnitudes and pays with uneven precision and dramatic edges. The integer is exact and fast but small, and wraps without warning. Fixed-point splits the difference, uniform and quick but boxed into a narrow range. These are not separate curiosities; they are the same constraint seen from three sides, and they are exactly the choices that sit under every model: floating-point for training, where range matters most, and integers or fixed-point for inference on hardware like this, where speed and size matter most.

That is also why this was worth building rather than only writing. The whole premise of The Mathematics ML Runs On is that the math is not abstract; it runs on real silicon with real limits. A chip with no FPU makes those limits impossible to look past, and turning each claim into a probe turned a unit I had written into something I had actually seen happen. If you have a board sitting in a drawer, I would recommend the exercise. It is humbling in the best way, and you come away trusting the numbers a little less and understanding them a lot more.

The things that went wrong

Gathered in one place, because they are the part you will not find in a tutorial, and they slowed me down more than the probing did.

Stumble 01

The printer that overflowed three times

My integer-based decimal printer held a float's value in a 32-bit int and overflowed on every oversized input: first 1e7, then FLT_MAX, then the reciprocal of machine epsilon. Three times the tooling failed in the exact manner of the thing it was measuring. The fix was a guard that detects values beyond what the integer can hold and prints (>2^31, see bits) instead of a confident, wrapped lie.
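A guard of that kind might look like the sketch below. It is a simplified stand-in, not the printer from the repository: fits_int32 and format_int_part are hypothetical names, and this version only handles the integer part of the value, whereas the real printer also deals with the fractional digits.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* True when v can be converted to int32_t without overflow.
   2^31 is exactly representable as a float, so the bounds are exact.
   NaN fails both comparisons and is rejected as well. */
static bool fits_int32(float v) {
    return v >= -2147483648.0f && v < 2147483648.0f;
}

/* Write the integer part of v into out, or a placeholder when the
   value is too large for a 32-bit integer to hold. */
static int format_int_part(float v, char *out, size_t n) {
    if (!fits_int32(v)) {
        return snprintf(out, n, "(>2^31, see bits)");
    }
    return snprintf(out, n, "%ld", (long)(int32_t)v);
}
```

The point of the design is that the check happens in floating-point, before the cast, because converting an out-of-range float to an integer is undefined behaviour in C and cannot be detected afterwards.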

Stumble 02

The summation test that proved the opposite

My first attempt at non-associative addition used values too close in magnitude, so both orders produced the same total and the probe reported that order did not matter. The real lesson was inside the failure: the effect needs the small values to fall below half of the large value's grid gap, so that they round away entirely. For addends of about 1.0, that only happens once the large value reaches 2²⁴.
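The precondition is easy to see in a two-addend sketch. This is my own reduced illustration with hypothetical function names, not the summation probe itself, and it assumes IEEE 754 single precision with round-to-nearest-even; the volatile accumulators keep every intermediate rounded to a 32-bit float.

```c
/* Add the small value to the large one twice: (big + s) + s. */
static float sum_big_first(float big, float s) {
    volatile float acc = big;
    acc = acc + s;   /* each step rounds to the float grid near big */
    acc = acc + s;
    return acc;
}

/* Combine the small values first, then add the result: big + (s + s). */
static float sum_small_first(float big, float s) {
    volatile float small = s + s;
    volatile float acc = big + small;
    return acc;
}
```

With big = 1000.0f and s = 1.0f both orders return 1002, which is the "too gentle" case. With big = 16777216.0f (2²⁴) the grid gap is 2, each lone 1.0 rounds away, and the first order returns 16777216 while the second returns 16777218.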

Stumble 03

Software floats tripped the task watchdog

Millions of emulated float operations back to back starved the idle task and reset the board. Nothing had hung; the math was simply that slow. Yielding the CPU between heavy probes fixed it, and the episode became a finding of its own: on a chip with no FPU, the cost of floating-point can surface as a full system fault rather than a wrong number.

Stumble 04

A RAM wall and a stricter compiler than I expected

A large static test array overflowed the chip's small RAM and had to be rewritten as an allocation-free probe. Separately, a print format specifier that is harmless on a desktop is a hard compile error on this target, because int32_t is a long on this toolchain rather than an int, and the build treats the resulting format mismatch as fatal. Both are the kind of detail you only meet by building for the hardware rather than reading about it.
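One portable way around the format-specifier problem is the set of macros in inttypes.h, which expand to whatever specifier matches the fixed-width type on the current target. The sketch below assumes the offending call was a plain %d applied to an int32_t; the wrapper name format_i32 is hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 32-bit signed value without guessing whether int32_t is
   an int or a long on this target. PRId32 expands to the right
   conversion specifier for the platform. */
static int format_i32(int32_t v, char *out, size_t n) {
    return snprintf(out, n, "%" PRId32, v);
}
```

The same pattern covers the unsigned and hexadecimal cases with PRIu32 and PRIx32, so a probe that prints raw bit patterns can compile unchanged on both a desktop and the board.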

View the full source on GitHub: all five phases, build instructions, and the complete serial output.

Environment: ESP-IDF v5.5.3 · riscv32-esp-elf GCC 14.2.0 · Board: ESP32-C3-mini · Target: esp32c3 · Compiled at -O3