The Mathematics ML Runs On · Class 5

When 100 + 50 Is Negative

We leave floating point behind for the world the edge actually runs on: plain integers, and their lightly disguised cousin, fixed-point. The route passes a clock that wraps without warning, a hidden scale that gives the integer its fractions back, and in the end the exact arithmetic that lets a neural network run on a chip with no floating-point unit at all. This one is long, because the goal is to leave nothing hand-waved.

Two things a computer should not be able to do

Here is a temperature accumulator on an 8-bit sensor. Nothing exotic, just a small integer with a number added to it.

int8_t t = 100;
t = t + 50;
printf("%d\n", t);   // -106

One hundred plus fifty is one hundred and fifty. The machine printed −106. Not a big number, not a rounded number, a negative number, from adding two cheerful positives. And a quieter one, from the same integer world:

printf("%d\n", 1 / 2); // 0

Half of one is zero. Both of these are doors into how a chip really keeps a number, and especially how it keeps a negative one and a fractional one. By the end you will see that these same humble integers, dressed with a single idea, are exactly how machine learning runs when there is no floating-point unit to lean on. You already met the destination at the bottom of Class 4: the subnormals turned out to be integers times a frozen scale. This class makes that the whole show.

Section 1 · A whole number lives on a clock

Before negatives, before fractions, the most basic question there is: when a chip holds the number five, what is physically sitting there? A register is a row of switches, each either off or on, 0 or 1. That is the entire vocabulary the hardware has, so "storing a number" can only ever mean choosing a pattern of ons and offs. The whole game is which patterns mean which numbers.

The simplest agreement, called unsigned, is just ordinary binary counting. Each switch is worth a power of two, doubling as you move left, and the number is the sum of the switches that are on. For 8 bits the place values are 128, 64, 32, 16, 8, 4, 2, 1. Click the bits below and watch a pattern turn into a number.

A pattern of switches becomes a number

This is just base-ten place value with the base changed to two, because a switch has two states, not ten. Nothing here is special to computers yet. The next idea is what makes it strange.

So far this feels infinite, like the number line from Class 1. It is not, and here is where the physical reality bites. The register has a fixed number of switches, decided by the hardware and unchangeable, and a fixed number of switches can strike only a fixed number of patterns. Eight switches give exactly 2⁸ = 256 patterns, no more, so an 8-bit unsigned integer is precisely the 256 values 0 through 255. It is Class 1's combination lock: a four-dial lock has exactly ten thousand settings and there is no ten-thousand-and-first, because there is nowhere to put it.

Now ask the question the hardware cannot dodge: what is 255 + 1 in 8 bits? 255 is 11111111, every switch on. Add one and the carry would ripple up to light a ninth switch, but there is no ninth switch. The carry falls off the end and is gone, leaving 00000000, which is 0. This is not an error the chip noticed; each bit position was added correctly, the answer just did not fit. Going down, 0 - 1 wraps the other way to 255.

Once a carry off the top vanishes, the numbers stop behaving like a line and start behaving like a clock. On a 12-hour clock, 11 o'clock plus two hours is 1, not 13. An 8-bit register is a clock with 256 positions, and every addition and subtraction happens modulo 256: do the true arithmetic, then keep only where you land on the clock face. So 200 + 100 is 300, which is once around plus 44, landing on 44. The chip is not computing the wrong sum, it is reporting only your final position. Hold that clock, because it is what makes negatives almost free.
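The clock is easy to check in C, where unsigned arithmetic is defined to wrap. This is a minimal sketch; the helper names add_u8 and sub_u8 are made up for illustration.

```c
#include <stdint.h>

/* Add two 8-bit unsigned values on the 256-position clock.
   The operands are promoted to int, so a + b is the true sum (up to 510);
   converting back to uint8_t keeps only the low 8 bits, the sum mod 256. */
uint8_t add_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Subtract on the same clock: stepping below 0 comes back around from 255. */
uint8_t sub_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a - b);
}
```

With these, add_u8(255, 1) is 0, add_u8(200, 100) is 44, and sub_u8(0, 1) is 255, the three landings described above.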

Section 2 · Two's complement: where the negatives live

We want some of those 256 patterns to mean negative numbers, and the hardware has no minus sign. A negative has to be one of the same patterns, read differently. The lazy idea is to steal the top bit as a sign flag, so 00000011 is +3 and 10000011 is −3. It reads beautifully and breaks two ways. First, it gives two zeros, 00000000 and 10000000, a wasted pattern. Second, and worse for hardware, addition stops being one operation: to compute 5 + (-3) you must notice the second is negative, switch to subtraction, compare magnitudes, and pick the sign. Your adder grows a decision tree, and on a chip counting transistors that is a real cost.

So let me ask the question the hardware designer actually wants answered: is there a way to assign the negatives so that subtraction is just addition, on the same adder, with no sign checking? There is, and Section 1's clock is the tool. On a 256-position clock, stepping back by b lands in the same place as stepping forward by 256 - b, the way going back 3 hours equals going forward 9 on a 12-hour clock. So represent -b by the pattern 256 - b. Then a + (256 - b) on the wrapping clock is a - b, with one adder and no branches. This is two's complement.

And it computes without a real subtraction. Write 256 - b as (255 - b) + 1. Now 255 is all ones, and subtracting b from all ones never borrows, it simply flips each bit of b. So negation is flip every bit and add one:

// negate 5 in 8 bits
   5  = 00000101
  ~5  = 11111010      // flip every bit
~5+1  = 11111011  =  -5   // add one
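The flip-and-add-one recipe fits in one line of C. A small sketch, with negate_u8 as a hypothetical helper name, working on the raw 8-bit pattern so that every step is well-defined unsigned arithmetic:

```c
#include <stdint.h>

/* Two's complement negation of an 8-bit pattern: flip every bit, add one.
   The final cast keeps only the low 8 bits, so the result stays on the
   256-position clock. */
uint8_t negate_u8(uint8_t pattern) {
    return (uint8_t)(~pattern + 1u);
}
```

negate_u8(0x05) gives 0xFB, the pattern 11111011 from the listing, and adding that pattern back to 5 on the 8-bit clock lands on 0, which is what it means to be -5.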

Here is the whole 4-bit world on its clock, since 16 positions are few enough to see at once. The single ring of patterns carries two readings: the outer number is the plain unsigned value, the inner number is the two's complement signed value. Step it and watch the seam.

The integer clock: one ring, two readings (outer number: unsigned; inner number: signed, two's complement)

Below 8 the two readings agree. Cross into the top half and the inner number turns negative: the negatives fill the upper half of the clock, so every one of them starts with a 1, and the top bit reads as a sign without anyone wiring it that way.

That split is worth drawing on its own, because it is why the "sign bit" exists at all.

The byte, cut in half

There is a −128 but no +128, because 0 ate a slot on the non-negative side. So negating −128 flips and adds one and lands back on −128: a number that is its own negative, because its positive partner fell off the edge. Taking the absolute value of the most negative integer hands you back a negative number, a real and famous bug.
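Both facts, the upper half reading as negative and −128 being its own negative, can be seen in a few lines. A sketch under the assumption of an 8-bit pattern held in a uint8_t; the names as_signed8 and negate_signed8 are illustrative only.

```c
#include <stdint.h>

/* Read an 8-bit pattern as a two's complement value:
   patterns 0..127 mean themselves, patterns 128..255 mean pattern - 256. */
int as_signed8(uint8_t pattern) {
    return pattern < 128 ? (int)pattern : (int)pattern - 256;
}

/* Negate within 8 bits (flip every bit, add one), then read the
   resulting pattern as a signed value. */
int negate_signed8(int value) {
    uint8_t pattern = (uint8_t)value;             /* value mod 256 */
    uint8_t negated = (uint8_t)(~pattern + 1u);   /* flip and add one */
    return as_signed8(negated);
}
```

negate_signed8(5) is -5 as expected, but negate_signed8(-128) comes back as -128: the pattern 10000000 flips to 01111111, and adding one returns it to 10000000.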

Section 3 · Running off the edge, silently

Now the opening obstacle resolves exactly. 100 + 50 wants 150, but a signed 8-bit value stops at +127, and the pattern that would be 150 sits in the upper half, so it reads as 150 - 256 = -106. The addition was perfect; the result landed on a part of the clock we read as negative, with no warning. Hold this against Class 4: the float saturated to a loud inf you can test for, while the integer wraps to an ordinary-looking wrong number with no sentinel. The float spent reserved bit patterns to afford that sentinel; the integer spent nothing, so silence is what it has.

Two ways to run out of room

Overflow arrives faster than you expect, and it depends on the width, the signedness, and the operation. Multiplication leaps to the edge where addition only walks. Try it yourself: pick a width and a kind, feed it two numbers, and see the true answer, the stored answer, and whether it wrapped.

Overflow lab (controls: width 8-bit or 16-bit; kind signed or unsigned; op a + b or a × b)

Try 100 + 50 in signed 8-bit (the opener). Try 200 × 200 in signed 16-bit and watch multiplication blow past the range two modest numbers should never reach. Switch to unsigned and the same patterns read differently.
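If you would rather run the experiment than click it, here is a rough sketch of the same idea in C. It does the true arithmetic in 64 bits and then reduces the answer onto a clock of the chosen width. The type lab_result and the functions lab_add and lab_mul are invented for this example, and it assumes a width of 8 or 16.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int64_t true_value;   /* the mathematically correct answer */
    int64_t stored;       /* what a register of the given width would hold */
    bool    wrapped;      /* true when the two differ */
} lab_result;

/* Reduce a wide value onto a clock of `width` bits, then read the
   pattern as signed (two's complement) or unsigned. */
static int64_t wrap_to_width(int64_t v, int width, bool is_signed) {
    int64_t modulus = (int64_t)1 << width;
    int64_t r = v % modulus;                 /* C's % can be negative */
    if (r < 0) r += modulus;                 /* now 0 <= r < modulus */
    if (is_signed && r >= modulus / 2) {
        r -= modulus;                        /* upper half reads as negative */
    }
    return r;
}

static lab_result make_result(int64_t true_value, int width, bool is_signed) {
    lab_result res;
    res.true_value = true_value;
    res.stored = wrap_to_width(true_value, width, is_signed);
    res.wrapped = (res.stored != true_value);
    return res;
}

lab_result lab_add(int64_t a, int64_t b, int width, bool is_signed) {
    return make_result(a + b, width, is_signed);
}

lab_result lab_mul(int64_t a, int64_t b, int width, bool is_signed) {
    return make_result(a * b, width, is_signed);
}
```

lab_add(100, 50, 8, true) stores -106, the opener. lab_mul(200, 200, 16, true) stores -25536 even though the true answer is 40000, while the same call with is_signed set to false stores 40000 without wrapping.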

The sharp nuance: signed overflow is undefined, and your check can vanish

On real two's complement hardware a signed integer physically wraps, exactly as the lab shows. But the C language declares signed overflow undefined behavior, which an optimizing compiler reads as a promise from you that it never happens. The consequence is startling, so here it is as a real program. Both functions ask the same question, "did adding one make the number smaller?", which can only be true if the addition wrapped:

int      signed_check(int x)      { return x + 1 < x; }
unsigned unsigned_check(unsigned x) { return x + 1 < x; }

Compiled three ways and fed the maximum value, the signed one lies:

gcc -O0          signed_check(INT_MAX) = 0     unsigned_check(UINT_MAX) = 1
gcc -O2          signed_check(INT_MAX) = 0     unsigned_check(UINT_MAX) = 1
gcc -O2 -fwrapv  signed_check(INT_MAX) = 1     unsigned_check(UINT_MAX) = 1

At INT_MAX the addition truly wraps on the hardware, so the honest answer is 1, yet the compiler returns 0, and it does so even with optimization fully off. The proof is in what it actually emitted for the signed function at -O2:

signed_check:                 unsigned_check:
    xorl  %eax, %eax              cmpl  $-1, %edi   ; is the input UINT_MAX?
    ret                          sete  %al
                                 ret

The signed function is xorl %eax,%eax; ret, which means "put 0 in the return register and leave." The input is never even examined. The compiler reasoned "x + 1 < x could only be true on overflow, overflow cannot happen, so this is always false," and deleted the check. The unsigned function kept a real comparison, because unsigned wrap is defined behavior it must honor. The flag -fwrapv revokes the promise, defining signed overflow as wrapping, and the moment it does, the check returns. So the practical rule is firm: never detect signed overflow after the operation, because that test can be compiled away. Check before (if (b > 0 && a > INT_MAX - b)), and use unsigned types when you actually want guaranteed wrapping.
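The check-before rule can be wrapped up once and reused. A minimal sketch, with checked_add as a made-up name; it extends the one-sided test above with the mirror-image test for negative b.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true and writes a + b to *sum when the result fits in an int.
   Returns false, leaving *sum untouched, when it would overflow.
   Every comparison happens before the addition, so no signed overflow
   ever occurs and the compiler has nothing it is allowed to delete. */
bool checked_add(int a, int b, int *sum) {
    if (b > 0 && a > INT_MAX - b) return false;   /* would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b) return false;   /* would go below INT_MIN */
    *sum = a + b;
    return true;
}
```

GCC and Clang also provide __builtin_add_overflow, which does the same job as a compiler builtin; the portable version here needs nothing beyond limits.h.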

And the fraction just disappears

The second integer surprise is gentler. Between consecutive whole numbers the integers hold nothing, so a fractional result has nowhere to land, and division copes by truncating toward zero.

Division truncates toward zero

Both arrows point toward zero, not downward: -7/2 is -3, not -4. The leftover is handed back separately as the remainder, a % b.

Because the remainder is dropped the instant the division runs, order of operations starts to matter, and a few everyday expressions turn into bugs:

// order matters: do multiplies before divides
7 / 2 * 2  =  6     // divide first, lose the .5, then double the loss
7 * 2 / 2  =  7     // multiply first, nothing lost

// the percentage bug
part/total*100  =  0    // 3/7 = 0 first, then 0 * 100
part*100/total  =  42   // 300 / 7 = 42, correct

// rounding a division instead of truncating: add half the divisor
7 / 2        =  3      (7 + 1) / 2  =  4

// the negative-modulo index bug
-1 % 8           =  -1    // negative subscript, crashes or corrupts
((-1 % 8) + 8) % 8  =  7   // safe wrap into 0..7
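The fixes above are small enough to keep as helpers. A sketch with hypothetical names (div_round, wrap_index, percent) and the assumptions each one relies on stated in its comment:

```c
/* Divide and round to nearest instead of truncating.
   Assumes a >= 0 and b > 0; negative operands need separate handling. */
int div_round(int a, int b) {
    return (a + b / 2) / b;
}

/* Wrap any index, including a negative one, into 0..n-1. Assumes n > 0. */
int wrap_index(int i, int n) {
    return ((i % n) + n) % n;
}

/* Percentage with the multiply done first, so the fraction survives
   until the single division. Assumes part * 100 fits in an int. */
int percent(int part, int total) {
    return part * 100 / total;
}
```

div_round(7, 2) is 4, wrap_index(-1, 8) is 7, and percent(3, 7) is 42, matching the listing.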

Every one of these pains comes from there being nothing between the whole numbers. That is the precise hole the rest of the class fills.

Section 4 · Fixed-point: teaching the integer to hold a fraction

We now have two tools, and for a small chip that needs fractions, both disappoint. The integer is fast and exact and predictable, but cannot hold a fraction at all. The float holds fractions and a staggering range, but on a chip with no FPU it is slow (every operation is the software ritual from Class 2) and its grid is uneven, with precision a moving target. We want fractions, at integer speed, on a predictable grid. That is the hole fixed-point fills, with one move.

Agree on a hidden scale, and let the stored integer count multiples of it. The value you mean is stored_integer × scale. The integer is completely real, an ordinary int16 the hardware adds at full speed. The scale is not in the bits anywhere; it lives in your head and your code. The register holds 384, you and your code agree it means 384 × (1/256) = 1.5. You already do this: a shop writes whole cents and keeps the divide-by-100 in its head; a thermostat stores 215 and means 21.5 degrees.

For a binary scale we write the format Qm.n: m integer bits, n fraction bits, scale 2⁻ⁿ. The "point" is imagined sitting n bits from the right, and decoding is literally "divide the stored integer by 2ⁿ," which for a power of two is a single bit shift, the cheapest thing a chip owns. That is exactly why chips prefer binary fixed-point over the shopkeeper's decimal cents.

Anatomy of a Q8.8 number

The point is not stored anywhere. The bits are a plain 16-bit integer; only you and your code know to divide by 256. Move the point and you change the format.

The two operations you need are trivial. To encode, divide by the scale and round: raw = round(value / scale). To decode, multiply back: value = raw × scale. A batch through Q8.8:

  value      store (round v*256)    read back      error
  0.5     ->     128          ->    0.500000      0          exact
  0.125   ->      32          ->    0.125000      0          exact
 -2.75    ->    -704          ->   -2.750000      0          exact
100.0     ->   25600          ->  100.000000      0          exact
 21.55    ->    5517          ->   21.550781      0.000781
  3.14159 ->     804          ->    3.140625      0.000965
  0.1     ->      26          ->    0.101562      0.001562
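The encode and decode rules are two one-liners in C. This is a sketch, not a library: q8_8_encode and q8_8_decode are names invented here, the double-precision input stands in for "the value you mean," and rounding is done by hand so the block needs no math library.

```c
#include <stdint.h>

#define Q8_8_ONE 256   /* 2^8: the raw integer that means 1.0 */

/* Encode: divide by the scale (that is, multiply by 256) and round
   to the nearest integer. Assumes the value fits the Q8.8 range,
   roughly -128.0 to +127.996. */
int16_t q8_8_encode(double value) {
    double scaled = value * Q8_8_ONE;
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Decode: multiply the raw integer back by the scale 1/256. */
double q8_8_decode(int16_t raw) {
    return (double)raw / Q8_8_ONE;
}
```

q8_8_encode(21.55) returns 5517 and q8_8_decode(5517) returns 21.55078125, so the round trip costs about 0.00078, while 0.5, 0.125, and -2.75 come back exactly.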

Encoding rounds, so you land exactly only on clean multiples of the scale; everything else snaps to the nearest grid point with a small error. That error is something you dial by choosing where the point sits. Slide the fraction bits below and store any value to watch the format, the scale, the range, and the error all move together.

Fixed-point explorer: move the point, store a value (controls: fraction bits n; store value)

A 16-bit budget. Every bit you give the fraction is one you take from the range. Slide left for a huge range and coarse steps (a map coordinate), slide right for a fine step and a tiny range (an audio sample in -1 to 1). There is no exponent choosing this for you; you set the point by hand.

Because each fraction bit halves the step, it roughly halves the error. The same number, pi, at three precisions:

 4 fraction bits -> 3.125000   error 0.017
 8 fraction bits -> 3.140625   error 0.001
12 fraction bits -> 3.141602   error 0.00001
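To try other precisions yourself, the encoder only needs the number of fraction bits as a parameter. A sketch with invented names (fixed_encode, fixed_decode, fixed_error); the choice of 4, 8, and 12 fraction bits in the note below is just an illustration.

```c
#include <stdint.h>

/* Encode a value with n fraction bits (scale 2^-n), rounding to nearest.
   Assumes 0 <= n <= 15 and that the result fits in 32 bits. */
int32_t fixed_encode(double value, int n) {
    double scaled = value * (double)(1 << n);
    return (int32_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Decode: multiply the raw integer back by the scale 2^-n. */
double fixed_decode(int32_t raw, int n) {
    return (double)raw / (double)(1 << n);
}

/* Absolute error introduced by storing `value` with n fraction bits. */
double fixed_error(double value, int n) {
    double back = fixed_decode(fixed_encode(value, n), n);
    return back > value ? back - value : value - back;
}
```

For pi, fixed_encode gives raw values 50, 804, and 12868 at 4, 8, and 12 fraction bits, which decode to 3.125, 3.140625, and about 3.141602. Because encoding rounds to nearest, the error can never exceed half a step, 0.5 × 2⁻ⁿ.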

Now the defining property, the deep contrast with all of Unit 1. A float's grid breathes: the gap grows with magnitude, giving constant relative precision. Fixed-point is the opposite: a plain ruler with the same absolute gap everywhere.

The two grids, side by side

In Q8.8, consecutive values near 0.3 differ by 1/256, and near 100 they also differ by 1/256, identical. A float's gap at those two places differs by a factor of hundreds. This is exactly the subnormal ruler from Class 4, promoted to the whole number system.

A value is exact only when it is a clean multiple of the scale:

binary Q.8   0.5   -> 128/256 = 0.50000000   exact
binary Q.8   0.25  ->  64/256 = 0.25000000   exact
binary Q.8   0.125 ->  32/256 = 0.12500000   exact
binary Q.8   0.3   ->  77/256 = 0.30078125   not exact   (0.3 repeats in base 2, like a float)
decimal/100  0.3   ->  30/100 = 0.30         exact      (power-of-ten scale)
any scale    1/3   = 0.3333...               never exact (repeats in base 2 and base 10)

So fixed-point makes exact whatever fractions are clean in the base of its scale. A binary scale stumbles on 0.1 and 0.3 exactly as a float does; a decimal scale nails those (which is why money is integer cents) but not 1/3.

Section 5 · Computing with fixed-point

Storing is half the job. The moment you compute, the hidden scale starts to matter.

Add and subtract are free, as long as the scales match, because the scale factors out: a×s + b×s = (a+b)×s. So you add the raw integers, nothing else. But the scale is invisible, so combining two different scales is silently wrong, and you must align them first:

// 1.5 in two formats has two different raw integers
1.5 in Q8.8 = 384        1.5 in Q4.4 = 24

naive: 384 + 24 = 408  ->  read as Q8.8 = 1.59     // nonsense
align: 24 << 4 = 384  (Q4.4 up to Q8.8),  then 384 + 384 = 768 -> 3.0
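In C the alignment has to be written out, because nothing in the types records the scale. A small sketch, assuming Q4.4 lives in an int8_t and Q8.8 in an int16_t; the function names are illustrative.

```c
#include <stdint.h>

/* Convert a Q4.4 raw value to Q8.8: four more fraction bits means
   multiplying the raw integer by 2^4. Multiplying rather than shifting
   keeps the code well-defined for negative values too. */
int16_t q4_4_to_q8_8(int8_t raw_q4_4) {
    return (int16_t)(raw_q4_4 * 16);
}

/* Add a Q8.8 value and a Q4.4 value, returning Q8.8.
   Assumes the sum fits the Q8.8 range. */
int16_t add_q8_8_q4_4(int16_t a_q8_8, int8_t b_q4_4) {
    return (int16_t)(a_q8_8 + q4_4_to_q8_8(b_q4_4));
}
```

add_q8_8_q4_4(384, 24) returns 768, which is 3.0 in Q8.8. Encoding the format in the function name is one cheap way to make the hidden scale a little less hidden.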

This is the heart of fixed-point: you are now the exponent. A float carried its scale in its own bits and aligned operands automatically. Fixed-point has no such field, so you track every value's scale and shift things into a common scale by hand. It is dimensional analysis, with no compiler help and no warning when you get it wrong.

Multiply is where the scale stops being free. Multiplying the values multiplies the scales too, so the raw product carries the square of the scale. Track it like units: Q8.8 × Q8.8 = Q16.16, then a shift returns you home.

A fixed-point multiply, tracking the scale

Two cautions ride along with that shift. The intermediate product overflows the obvious register (two 16-bit values need up to 32 bits), so you form it wider first. And the shift simply discards the low bits, which always rounds down (toward negative infinity, which for negative results is not the same as toward zero), a one-directional bias like Class 3's, so you add half a unit before shifting to round to nearest. Step a real multiply through and watch all of it.

Q8.8 multiply, step by step (controls: a; b; round the shift)

Try 1.1 times 1.1 with the rounding toggle on and off, and watch the truncated answer sit a notch low while the rounded one lands closer.
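The same multiply, widen, round, shift sequence in C looks roughly like this. The name q8_8_mul and the boolean rounding switch are inventions for this sketch, mirroring the toggle described above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Multiply two Q8.8 values. The product of the raw integers is Q16.16,
   so it is formed in 32 bits and then brought back to Q8.8 by dropping
   8 fraction bits. Assumes the true product fits the Q8.8 range. */
int16_t q8_8_mul(int16_t a, int16_t b, bool round_to_nearest) {
    int32_t wide = (int32_t)a * (int32_t)b;   /* Q16.16, needs 32 bits */
    if (round_to_nearest) {
        wide += 1 << 7;                       /* half of 2^8, added before the shift */
    }
    /* Right-shifting a negative value is implementation-defined in C;
       mainstream compilers use an arithmetic shift here. */
    return (int16_t)(wide >> 8);
}
```

1.1 is stored as raw 282, and 282 × 282 = 79524. Shifted without rounding that is 310 (about 1.2109); with the half-unit added first it is 311 (about 1.2148), the nearer neighbour of the exact product of the two stored values, which is about 1.2135.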

Divide is the mirror: dividing cancels the scales, so the fraction truncates away unless you shift the numerator up by n first, again in a wider register.

naive: 3.0 / 2.0  ->  768 / 512 = 1   ->  read as Q8.8 = 0.0039   // fraction gone
pre-shift: (768 << 8) / 512 = 196608 / 512 = 384  ->  1.5         // correct
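As a function, the pre-shifted divide is short. A sketch under the stated assumptions, with q8_8_div as a hypothetical name:

```c
#include <stdint.h>

/* Divide two Q8.8 values. Dividing the raw integers cancels the scale,
   so the numerator is first scaled up by 2^8 in a 32-bit register.
   Assumes b != 0 and that the quotient fits the Q8.8 range.
   The result truncates toward zero, like any C integer division. */
int16_t q8_8_div(int16_t a, int16_t b) {
    int32_t wide = (int32_t)a * 256;   /* same effect as a << 8, defined for negatives */
    return (int16_t)(wide / b);
}
```

q8_8_div(768, 512) returns 384, which is 1.5, where the naive 768 / 512 returns 1.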

And a running total is still integer addition, so it overflows like any integer, which means accumulators stay wider than their inputs: 400 additions of raw 100 overflow an int16 to −25536, while an int32 reaches the true 40000. Inputs narrow, accumulators wide: the recurring shape of careful fixed-point code, and in a moment, of every quantized neural network.
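You can reproduce those two totals without relying on undefined behavior by doing the 16-bit sum on an unsigned pattern and reading it as two's complement at the end. A sketch; both function names are made up.

```c
#include <stdint.h>

/* Sum `count` copies of `raw` in a 16-bit accumulator. The arithmetic is
   done on uint16_t so the wrap is well-defined, then the final pattern is
   read as two's complement, which is what an int16 register would show. */
int32_t sum_in_16_bits(int16_t raw, int count) {
    uint16_t acc = 0;
    for (int i = 0; i < count; i++) {
        acc = (uint16_t)(acc + (uint16_t)raw);
    }
    return acc < 32768 ? (int32_t)acc : (int32_t)acc - 65536;
}

/* The same sum with a 32-bit accumulator, which has room to spare. */
int32_t sum_in_32_bits(int16_t raw, int count) {
    int32_t acc = 0;
    for (int i = 0; i < count; i++) {
        acc += raw;
    }
    return acc;
}
```

sum_in_16_bits(100, 400) returns -25536 and sum_in_32_bits(100, 400) returns 40000.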

Section 6 · The payoff: exact money, no-FPU speed, and quantized ML

Two questions remain, and they justify the whole class. Did fixed-point fix Class 1's original sin? And why does this matter for machine learning?

The sin first. 0.1 + 0.2 in three number systems:

float32       : 0.30000001   // Class 1's wrong answer (0.1 repeats in base 2)
binary Q.8    : 0.30078125   // still wrong, same base-2 reason
decimal cents : 0.30         // exact: integers 10 + 20 = 30, nothing to round

The arithmetic never left the integers, so there was nothing to round. This is why banks and ledgers conventionally store money as integer counts of the smallest unit, such as cents, rather than as floats.

Now the real payoff. On a chip with no FPU, a float operation is the software ritual from Class 2, many instructions for one multiply, while an integer multiply is a single native instruction. And a small integer is smaller: four int8 values take 4 bytes where four float32 values take 16, a clean fourfold saving. So to run a neural network on integer hardware you quantize: store every weight and activation as a small integer with a chosen scale, which is exactly fixed-point with the scale picked per tensor from the data. Quantize one value to see it land on the int8 grid:

Quantization explorer: a real value onto the int8 grid (controls: value; range ±)

Symmetric int8 quantization: scale = range / 127, code = round(value / scale), and the value comes back as code times scale. The largest value in range maps to 127; everything else snaps to one of 255 integer levels.
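Those three formulas translate almost word for word. A minimal sketch, with function names chosen here for illustration. Clamping values that fall outside the range to ±127 is an assumption added in this sketch, not something the formulas above specify.

```c
#include <stdint.h>

/* Symmetric int8 quantization over [-range, +range]. */
double quant_scale(double range) {
    return range / 127.0;
}

/* code = round(value / scale). Values outside the range are clamped
   to -127..127 (an assumption of this sketch). */
int8_t quantize(double value, double scale) {
    double scaled = value / scale;
    double rounded = scaled >= 0 ? scaled + 0.5 : scaled - 0.5;
    if (rounded > 127.0)  rounded = 127.0;
    if (rounded < -127.0) rounded = -127.0;
    return (int8_t)rounded;   /* conversion drops any remaining fraction */
}

/* The value comes back as code times scale. */
double dequantize(int8_t code, double scale) {
    return code * scale;
}
```

With a range of 1.0, the value 1.0 maps to code 127, 0.25 maps to 32, and 0.3 maps to 38, which reads back as 38/127, about 0.2992. The round trip is never off by more than half a step, scale / 2, for values inside the range.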

The operation a network spends its life on is the dot product, and here is the whole pipeline, every step of it something you built in this class.

A quantized dot product, the heart of a layer

That loop quantizes (Section 4's scale), multiplies int8 by int8 and accumulates into a wider int32 (Section 5's widen-the-accumulator), and rescales once at the end by the product of the two scales, because multiplying the values multiplied their scales. The rounding we insisted on is why good quantization rounds rather than truncates.
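In C, that pipeline might be sketched as below. The function names are hypothetical, the inputs are assumed to be already quantized to int8, and the final rescale uses a double purely for readability; a kernel on a chip with no FPU would typically do that last step with an integer multiplier and a shift instead.

```c
#include <stdint.h>
#include <stddef.h>

/* Integer part of the dot product: int8 x int8 products summed in int32.
   With codes in -127..127 each product is at most 16129 in magnitude. */
int32_t dot_accumulate(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;                              /* wide accumulator */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Full quantized dot product: accumulate in integers, then rescale once
   by the product of the two scales, because multiplying the values
   multiplied their scales. */
double quantized_dot(const int8_t *a, const int8_t *b, size_t n,
                     double scale_a, double scale_b) {
    return (double)dot_accumulate(a, b, n) * scale_a * scale_b;
}
```

For codes {127, -64, 32} and {10, 20, -30} the accumulator ends at 1270 - 1280 - 960 = -970, and with scales 0.5 and 0.25 the real-valued result is -970 × 0.125 = -121.25.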

The accumulator width is not optional. Each int8 × int8 product can reach 127 × 127 = 16129, and a real layer sums hundreds of them, so the worst case for 256 terms is over four million (256 × 16129 = 4,129,024). An int16 accumulator, ceiling 32767, holds two such terms (32258) and overflows on the third, silently, the Section 3 wrap with no warning. Multiply small, accumulate wide.

Where Unit 1 ends

The hardware shadow here is the most direct of all: fixed-point is not a separate number system the silicon implements, it is just integers plus a scale you carry in your head, so on a chip with no FPU it runs at full native integer speed while floats crawl through software. That single fact is why the entire edge-ML world reaches for quantization. And it closes Unit 1. We began with the smooth real line and the finite list, built the float up from its three fields, learned how it rounds and where it cliffs, then stepped underneath it to the integer and fixed-point world the float is made of and the edge actually runs on. The number system is now understood top to bottom, from the real line down to a single transistor's wrap-around clock.

A fixed-width integer lives on a two's complement clock where subtraction is just addition and overflow silently wraps instead of saturating. Bolt on a hidden scale and you get fixed-point, a uniform grid where add and subtract are plain integer ops and multiply is a multiply-then-shift through a wider register, with no exponent field, so you align the scales and manage the point yourself. That exact machinery, scaled integers multiplied small and accumulated wide, is the quantization that lets a neural network run on integer hardware with no FPU.
