The question I started with
Espressif makes a chip called the ESP32-S3. It costs about three dollars. It goes into smart home devices, IoT sensors, small robots, and a thousand other things. In their marketing, Espressif calls it an AI-capable chip. They point to a hardware feature called PIE (Processor Instruction Extensions) that can process eight 16-bit multiplications simultaneously in a single clock cycle.
That sounds impressive. But I wanted to know if it actually matters. Does that hardware acceleration translate into a real speedup on real code? And if so, at what point does it stop helping? What are the limits of calling this chip "AI-capable"?
The way to answer this properly is to benchmark the operation that AI relies on most: matrix multiplication. Build it four different ways, from the naive schoolbook version all the way up to using the dedicated hardware, measure each one on the actual chip, and see what the numbers say.
That is what this project is.
What is matrix multiplication and why does AI care about it
A matrix is just a grid of numbers. Matrix multiplication takes two of these grids and produces a third one. If you have matrix A and matrix B, multiplying them gives you C. Each number in C is computed by taking an entire row from A, an entire column from B, multiplying them element-by-element, and summing up all those products. This operation is called a dot product.
Here is the simplest possible version of the code:
for (int i = 0; i < n; i++)             // for each row of A
    for (int j = 0; j < n; j++)         // for each column of B
        for (int k = 0; k < n; k++)     // dot product of row i and column j
            C[i][j] += A[i][k] * B[k][j];
Three nested loops. For an n×n matrix, this does n³ multiplications. Double n, and you do eight times as much work. This is why large matrix multiplications are expensive.
Now, why does AI care? A neural network is, at its core, a series of matrix multiplications. Every layer of a neural network takes an input vector, multiplies it by a weight matrix, and passes the result to the next layer. The "learning" in machine learning is the process of finding the right values for all those weight matrices. When you run a trained model to make a prediction, you are multiplying your input through every layer's weight matrix in sequence.
This means that if a chip can do matrix multiplication fast, it can run neural networks fast. If it struggles with matrix multiplication, it struggles with everything AI-related. Matrix multiplication is the benchmark that matters.
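To make the connection concrete, here is a minimal sketch of a single dense layer as a matrix-vector product. The names are illustrative, not from the project, and a real layer would also add a bias vector and apply an activation function:

```c
#include <assert.h>
#include <stddef.h>

/* One dense layer: out = W * in, where W is rows x cols.
   Each output element is one dot product, exactly the inner
   loop of the matrix multiplication above. */
static void dense_layer(const float *W, const float *in, float *out,
                        size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < cols; k++)
            acc += W[i * cols + k] * in[k];
        out[i] = acc;
    }
}
```

Running an n-layer model is n of these in sequence, which is why the cost of the underlying multiply dominates inference time.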
The hardware and why memory is everything
The ESP32-S3 is a small microcontroller chip built around two Xtensa LX7 processor cores running at up to 240 MHz. On the chip itself there is 512 KB of SRAM. This is the chip's internal memory, fast and immediately accessible. There is also 8 MB of external PSRAM connected to the chip via a high-speed bus. This is much larger but meaningfully slower to access.
The most important thing to understand about this chip, and honestly about most chips, is the memory hierarchy. The CPU can compute at 240 million cycles per second. But the data it operates on has to come from somewhere, and where that data lives makes an enormous difference to how fast things actually run.
What is a cache and why does it exist?
Imagine the CPU as a chef working in a kitchen. The CPU registers (where actual computation happens) are like the chef's hands. Whatever the chef is currently working with is right there, instantly. The L1 cache is like the countertop. Small, right in front of the chef, very fast to grab from. Internal SRAM is like a shelf across the room, a little slower to walk to but still in the same kitchen. PSRAM is like a storage room down the hallway. More space, but every trip takes real time.
A cache works by keeping recently or frequently used data on the countertop so you don't have to keep walking to the storage room. When the CPU needs a piece of data, it first checks the cache. If it is there (a cache hit) it gets it instantly. If it is not there (a cache miss) it has to fetch it from slower memory, which takes many more cycles. The art of writing fast code on memory-constrained hardware is largely the art of making sure the data you need is already in cache when you need it.
On the ESP32-S3, the data cache is 32 KB. This was confirmed by inspecting the sdkconfig file. It is configurable between 16 KB and 32 KB and defaults to 32 KB in ESP-IDF v5.x. The cache line size is 32 bytes, meaning that when the CPU fetches a piece of data that is not in cache, it loads 32 consecutive bytes at a time from slower memory.
The number in this hierarchy that drives everything in this benchmark is the PSRAM access penalty: roughly 40 cycles per miss. The CPU can do a multiplication in 1 cycle, but if the data it needs is in PSRAM, it sits idle for about 40 cycles waiting. For a fully memory-bound workload that is 1 cycle of work for every 40 cycles of waiting, which means the chip is running at roughly 2.5% of its theoretical compute speed.
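The 2.5% figure is simple arithmetic, one useful cycle out of every forty-one. A one-line sketch of that calculation:

```c
#include <assert.h>

/* Effective CPU utilization when every operand access misses to
   slow memory: compute cycles divided by total cycles, where the
   total includes the stall penalty per access. */
static double utilization(int compute_cycles, int miss_penalty) {
    return (double)compute_cycles / (double)(compute_cycles + miss_penalty);
}
```

With a 1-cycle multiply and a 40-cycle PSRAM miss this gives 1/41, about 2.4%, which is the "roughly 2.5%" figure above.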
What is PIE SIMD and what does it actually do?
SIMD stands for Single Instruction, Multiple Data. Normally, one multiplication instruction multiplies two numbers and produces one result. A SIMD instruction multiplies many pairs of numbers simultaneously in a single clock cycle using wider registers in the CPU.
The ESP32-S3's PIE extension includes 128-bit vector registers. Since a 16-bit integer takes 16 bits, you can fit 8 of them into one 128-bit register. The instruction EE.VMULAS.S16.QACC takes two of these packed registers, multiplies all 8 pairs simultaneously, and accumulates the results. This is what "8-wide SIMD" means. One instruction does the work of eight.
The theoretical speedup from this is 8×. But whether you actually get 8× depends entirely on whether the CPU can be fed with fresh data fast enough. A SIMD unit that is waiting for data from PSRAM is no faster than a scalar unit waiting for the same data. The hardware is only as fast as the pipeline feeding it.
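To see exactly what one such instruction computes, here is the semantics of an 8-lane 16-bit multiply-accumulate, emulated in portable C. This is a behavioral model only, not how the hardware instruction is invoked; the real EE.VMULAS.S16.QACC accumulates into the dedicated QACC register, which the array here merely stands in for:

```c
#include <assert.h>
#include <stdint.h>

/* Behavioral model of one 8-wide multiply-accumulate step:
   acc[lane] += a[lane] * b[lane] for all 8 lanes. The hardware
   does all eight products in a single clock cycle. */
static void mac8_s16(const int16_t a[8], const int16_t b[8], int32_t acc[8]) {
    for (int lane = 0; lane < 8; lane++)
        acc[lane] += (int32_t)a[lane] * (int32_t)b[lane];
}
```

On the hardware, the loop body above is one instruction instead of eight multiplies, eight additions, and the loop overhead.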
The four implementations, explained from scratch
Each implementation isolates exactly one variable. The goal is to see the contribution of each layer individually: the compiler, the memory access pattern, and the dedicated hardware. All matrices were allocated in PSRAM since that is where you would have to put them for any realistically sized model. Internal SRAM is far too small.
naive_O0: the unoptimized baseline
The textbook three-loop implementation. No tricks, no restructuring. The compiler is told explicitly to do nothing with -O0: no loop unrolling, no register reuse, no reordering of operations. This is what you get if you copy the algorithm out of a textbook and compile it with optimizations disabled.
The critical problem with this implementation is how it accesses matrix B. In the inner loop, we read B[k][j]. As k increases, we are jumping across rows of B, reading memory locations that are far apart from each other. Each step in the inner loop touches a piece of memory 2×n bytes away from the previous one. For large matrices this means almost every access is a cache miss, and every cache miss is a 40-cycle trip to PSRAM. The CPU is mostly waiting.
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        int32_t acc = 0;
        for (int k = 0; k < n; k++) {
            acc += (int32_t)A[i*n+k] * (int32_t)B[k*n+j]; // B[k][j]: jumping rows = cache miss
        }
        C[i*n+j] = (int16_t)acc;
    }
}
opt_O3: the same algorithm, full compiler optimization
Identical code, compiled with -O3. This tells GCC to apply every optimization it knows: unroll loops so multiple iterations happen in one pass, schedule instructions so the CPU pipeline stays busy, keep values in registers as long as possible, and eliminate redundant work. The algorithm has not changed; only the compiler's output has.
This isolates the compiler's contribution cleanly. Any difference between naive and opt is purely the compiler's work. Any remaining gap between opt and the other implementations is the algorithmic work that the compiler cannot do on its own.
The key insight this implementation teaches: the compiler is very good at instruction-level optimization, but it cannot fix a bad memory access pattern. If your data is in PSRAM and you are accessing it in a stride-jumping pattern, the compiler cannot reorganize your algorithm well enough to change that fundamental bottleneck.
tiled_O3: cache-blocking
This is where things get interesting. The idea is simple: instead of iterating over the full matrices in one pass, break them into small sub-matrices called tiles, and process one tile-triple at a time. Choose a tile size that fits in the L1 cache, and now all the data for the inner computation is already resident in cache. No PSRAM trips during the hot inner loop.
The intuition behind cache-blocking
Think about the naive algorithm. For each output element C[i][j], you touch one entire row of A and one entire column of B. For a 256×256 matrix, a single column of B is 256 elements × 2 bytes = 512 bytes. The full matrix B is 256 × 512 bytes = 128 KB. This does not fit in a 32 KB cache. So as you compute each output element, you are repeatedly fetching the same data from PSRAM over and over.
Tiling reorders the computation. Instead of finishing all of C before moving on, you compute a small block of C at a time, called a tile. The key property is that while computing one tile of C, you only need a tile of A and a tile of B. If those three tiles together fit in the cache, every element gets loaded once from PSRAM, used many times in cache, and that is it. The PSRAM traffic is dramatically reduced.
#define TILE 32  // 3 tiles x 32^2 x 2 bytes = 6 KB total, fits well in the 32 KB cache

for (int ii = 0; ii < n; ii += TILE)
    for (int jj = 0; jj < n; jj += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            // inner 3 loops work on one 32x32 block of each matrix, all in cache
            for (int i = ii; i < ii+TILE; i++)
                for (int j = jj; j < jj+TILE; j++) {
                    int32_t acc = C[i*n+j];
                    for (int k = kk; k < kk+TILE; k++)
                        acc += (int32_t)A[i*n+k] * (int32_t)B[k*n+j];
                    C[i*n+j] = (int16_t)acc;
                }
The tile size T=32 is not arbitrary. Three 32×32 tiles of i16 (16-bit integers) use 3 × 32² × 2 bytes = 6 KB. That fits comfortably inside the 32 KB cache, leaving plenty of room for the stack and code.
I also tried T=64, which looked good on paper. 24 KB fits within 32 KB. It was 2.3× slower at n=256. The reason: the 32 KB cache is not exclusively yours. FreeRTOS, the benchmark harness code, the call stack, and program instructions all share the same cache. The actual amount of cache available for your matrix data is considerably less than 32 KB. T=32 fits reliably because it is conservative enough to survive the competition. T=64 was right on the edge and lost. This is exactly why you measure instead of calculate.
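The footprint arithmetic behind both tile sizes can be written down directly. A sketch, reproducing the 6 KB and 24 KB figures from the text:

```c
#include <assert.h>
#include <stddef.h>

/* Cache footprint of one tile-triple: one T x T tile each from
   A, B, and C, with elem_size bytes per element. */
static size_t tile_footprint(size_t T, size_t elem_size) {
    return 3 * T * T * elem_size;
}
```

Both footprints fit the nominal 32 KB, which is precisely why this calculation alone cannot pick the winner: the cache lines lost to code, stacks, and the RTOS are invisible to it, and only measurement reveals that T=64 sits on the wrong side of that hidden line.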
espdsp: PIE SIMD through the esp-dsp library
This implementation uses Espressif's esp-dsp library, which provides a hand-optimized dot product function dsps_dotprod_s16 that internally uses the EE.VMULAS.S16.QACC PIE instruction. This is the 8-wide SIMD instruction that processes eight 16-bit multiplications simultaneously.
Before the multiplication, matrix B is transposed into a temporary matrix Bt. This is important: the dot product function needs both row vectors to be contiguous in memory. One row of A and one row of Bt (which represents a column of B). By transposing B first, every access in the inner loop is sequential, which is exactly what the SIMD unit needs to stay fed.
Only i16 (16-bit integers) benefit from this path. The PIE unit has no equivalent instruction for 32-bit integers or floats. For those types, this implementation falls back to the same scalar code as opt_O3, which is why the i32 and f32 numbers are identical between the two.
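The structure of this implementation can be sketched with a portable scalar stand-in for the library call. This is a sketch of the shape of the computation, not the project's actual file; the real dsps_dotprod_s16 takes the two vectors, an output pointer, a length, and a right-shift amount, which the stand-in mirrors:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for dsps_dotprod_s16: dot product of two
   contiguous int16 vectors, result right-shifted before store.
   On the chip, this call is where the 8-wide SIMD happens. */
static void dotprod_s16(const int16_t *a, const int16_t *b,
                        int16_t *dest, int len, int shift) {
    int32_t acc = 0;
    for (int k = 0; k < len; k++)
        acc += (int32_t)a[k] * (int32_t)b[k];
    *dest = (int16_t)(acc >> shift);
}

/* Transpose B so every column becomes a contiguous row, then the
   whole multiply is n*n dot products over two sequential vectors,
   exactly what the SIMD unit needs to stay fed. */
static void matmul_dotprod(const int16_t *A, const int16_t *B,
                           int16_t *C, int16_t *Bt, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dotprod_s16(&A[i * n], &Bt[j * n], &C[i * n + j], n, 0);
}
```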
The original plan was to write this implementation as hand-written inline assembly, but that turned out to be a dead end. The Xtensa assembler does not understand the .option directive; that is RISC-V syntax and does not exist in the Xtensa toolchain. GCC's inline assembly constraint system does not recognize the Q registers (the PIE vector registers) as named clobbers, because they are TIE extension registers that exist outside GCC's standard model of the machine. After several failed attempts, I fell back to Espressif's own esp-dsp library, which wraps all of this correctly and is what Espressif themselves recommend for production use. The file is called matmul_espdsp.c rather than matmul_asm.c to be honest about what it actually does.
Results
All measurements were taken at 240 MHz. Each data point is the median of 10 runs. Timing was done using the CCOUNT register, a hardware cycle counter that increments every clock cycle, cross-validated against esp_timer, which confirmed a calibration of 239.8 cycles per microsecond.
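The cycle counter read looks roughly like this. A sketch: the rsr instruction is Xtensa-specific, and the host fallback exists only so the snippet compiles elsewhere:

```c
#include <assert.h>
#include <stdint.h>

/* Read the Xtensa CCOUNT cycle counter, which increments once per
   clock. Off-target, a monotonic stub keeps the sketch compilable. */
static inline uint32_t ccount(void) {
#if defined(__XTENSA__)
    uint32_t c;
    __asm__ volatile ("rsr %0, ccount" : "=r"(c));
    return c;
#else
    static uint32_t fake;
    return fake += 1000;
#endif
}
```

Usage is `start = ccount(); /* work */; elapsed = ccount() - start;`. The unsigned subtraction gives the right answer across a single counter wrap, and at 240 MHz the 32-bit counter wraps roughly every 17.9 seconds, so any single measurement shorter than that is safe.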
i16 benchmark, raw cycles
| n | naive_O0 | opt_O3 | tiled_O3 | espdsp |
|---|---|---|---|---|
| 8×8 | 25,288 | 5,994 | 4,441 | 6,648 |
| 16×16 | 190,248 | 41,458 | 26,745 | 22,256 |
| 32×32 | 1,481,994 | 312,066 | 187,321 | 98,248 |
| 64×64 | 11,694,619 | 2,427,644 | 1,498,096 | 516,551 |
| 128×128 | 103,655,528 | 29,446,056 | 13,270,140 | 4,541,825 |
| 256×256 | 2,438,528,277 | 2,058,759,211 | 106,912,458 | 148,902,291 |
Speedup over naive_O0 (i16)
| n | opt_O3 | tiled_O3 | espdsp |
|---|---|---|---|
| 8×8 | 4.2× | 5.7× | 3.8× |
| 16×16 | 4.6× | 7.1× | 8.5× |
| 32×32 | 4.7× | 7.9× | 15.1× |
| 64×64 | 4.8× | 7.8× | 22.6× |
| 128×128 | 3.5× | 7.8× | 22.8× |
| 256×256 | 1.2× | 22.8× | 16.4× |
The opt_O3 speedup collapsing from 4.8× at n=64 down to 1.2× at n=256 is the most telling single number in the whole benchmark. The compiler did not get worse. The hardware changed. The data left the cache.
Speedup breakdown at 128×128 i16, the sweet spot
256×256 across all data types
| impl | i16 | i32 | f32 |
|---|---|---|---|
| naive_O0 | 2,438M | 2,715M | 2,439M |
| opt_O3 | 2,058M | 2,072M | 2,060M |
| tiled_O3 | 106M | 442M | 430M |
| espdsp | 148M | 2,072M | 2,072M |
Notice that i32 and i16 are nearly identical for naive and opt. At 256×256 all variants are bottlenecked by PSRAM and the data type does not matter when you are spending most of your time waiting for memory. For tiled, i16 is 4× faster than i32 and f32 because the smaller element size means more elements fit in a tile, giving better cache utilization per byte. For espdsp, i32 and f32 have no SIMD path at all. They run exactly as fast as opt_O3.
What the numbers actually say
The compiler is powerful, until the data leaves the cache
From n=8 up to n=64, -O3 delivers a consistent 4.7–4.8× speedup. This is real and meaningful. Loop unrolling, instruction scheduling, and register allocation all contribute. The compiler is doing genuine work.
Then at n=128 the speedup drops to 3.5×. At n=256 it is 1.2×. The compiler has not gotten worse. What changed is that the three 256×256 i16 matrices together occupy 384 KB, more than ten times the 32 KB cache. Every access to B in the inner loop has to fetch from PSRAM, taking about 40 cycles. The CPU is stalled waiting for data roughly 97% of the time.
The lesson: compiler optimization operates on instructions. When the bottleneck is memory latency, no amount of instruction-level cleverness can compensate. The compiler cannot rewrite your algorithm to have a better memory access pattern. That is the programmer's job.
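One example of the kind of rewrite only the programmer can make: interchanging the j and k loops (the classic ikj ordering) so that the inner loop walks B along a row instead of down a column. This variant is not one of the four benchmarked implementations; it is shown only to illustrate an access-pattern change that the compiler will not perform for you:

```c
#include <assert.h>
#include <stdint.h>

/* ikj loop order: the inner loop reads B[k*n+j] with j varying,
   i.e. sequentially along a row, so consecutive accesses land on
   the same cache lines instead of striding 2*n bytes apart.
   The caller must zero C first. */
static void matmul_ikj(const int16_t *A, const int16_t *B,
                       int32_t *C, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            int32_t a = A[i * n + k];   /* one A element reused across the row */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * (int32_t)B[k * n + j];
        }
}
```

The arithmetic is identical to the naive version; only the order of the memory traffic changes, which is the whole point.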
Memory layout mattered more than hardware acceleration
This was the most surprising result. Tiling delivers 22.8× over naive and 19× over -O3 alone at 256×256. The dedicated SIMD hardware maxes out at 22.8× over naive at its best. They end up at the same headline number, but tiling gets there by addressing the memory bottleneck, while SIMD gets there only when the memory bottleneck is already absent.
Put differently: if you could only apply one optimization to a large matrix multiplication on this chip, the right answer is cache-blocking, not SIMD. No new hardware required. Just reorganize the computation to respect the memory hierarchy.
This is a specific instance of a general principle called the memory wall. The gap between CPU compute speed and memory bandwidth has been widening for decades, and on memory-constrained embedded hardware like this, memory access patterns dominate performance more than anything else.
There is a cliff at 256×256 that looks nothing like normal scaling
Matrix multiplication is an O(n³) algorithm. Double the matrix dimension and the work goes up by 8×. This is very nearly what we see for every size transition up to 128×128: the naive implementation scales at 7.5–8.9×, close to the theoretical prediction. At the 128 to 256 transition, it scales 23.5×. Nearly three times worse than theory.
| Transition | Actual scaling (naive i16) | Theoretical (O(n³)) |
|---|---|---|
| 8 → 16 | 7.5× | 8× |
| 16 → 32 | 7.8× | 8× |
| 32 → 64 | 7.9× | 8× |
| 64 → 128 | 8.9× | 8× |
| 128 → 256 | 23.5× | 8× |
The three 128×128 i16 matrices total 96 KB, small enough that the 32 KB cache still captures substantial reuse across the inner loops. The three 256×256 matrices total 384 KB, and at that size almost nothing survives in cache between uses: nearly every access pays the full trip to PSRAM. This is the memory cliff. It is visible in every implementation except tiled, because tiled was designed specifically to avoid it. The cliff does not represent the algorithm getting worse. It represents the hardware's memory hierarchy becoming the dominant factor.
PIE SIMD gives 2.9× over tiling, but only when data is in cache
At n=64 and n=128, where the tiled working set fits in cache and the SIMD unit can be fed continuously, espdsp is 2.9× faster than tiled. This is a real and meaningful hardware advantage. At 128×128 this is 4.5M cycles vs 13.3M, completing in 18.9 ms at 240 MHz. Fast enough for keyword detection, gesture classification, and small sensor-based inference in real time.
The theoretical maximum from 8-wide SIMD is 8×. The actual speedup is 2.9×, about 36% efficiency. The gap comes from several real costs: B must be transposed before the computation (adding overhead that scales with n²), the dot product function is called once per output element (n² calls total), each call has function call overhead, and the loop boundaries around the SIMD instructions add extra work per tile. A more efficient implementation would use a kernel that batches many output elements per call. But even at 36% efficiency, the hardware is clearly doing real work.
SIMD loses to tiling at 256×256, and this is worth understanding
At n=256, espdsp (148M cycles) is noticeably slower than tiled (106M cycles). This seems counterintuitive. Shouldn't hardware SIMD always help? But it makes complete sense once you think about what is happening.
At n=256, the matrices are in PSRAM. The espdsp implementation calls the dot product function 65,536 times (once per output element of a 256×256 matrix). Each call incurs overhead: function prologue and epilogue, argument passing, the transpose access pattern. And crucially, the SIMD unit is still stalling on PSRAM fetches, just like the scalar code. The vector unit can multiply 8 numbers at once, but it cannot hide the 40-cycle PSRAM latency any better than a scalar multiply can.
Tiling, by contrast, reorganizes the computation so that PSRAM is accessed as infrequently as possible. It wins because it addressed the root cause. SIMD accelerated the compute but left the memory bottleneck intact.
A SIMD engine is only as fast as the data pipeline feeding it.
What this means for the AI claim
Espressif says the ESP32-S3 is AI-capable. Based on these measurements, that claim is accurate, but the accuracy comes with real limits that are worth understanding clearly.
For matrix sizes up to 128×128 with i16 data, the chip delivers 22.8× over a naive baseline and completes a full multiplication in under 20 ms. That is fast enough for:
- Keyword spotting (small audio classification models)
- Gesture detection (IMU-based, lightweight)
- Anomaly detection on sensor streams
- Small image classifiers on low-resolution inputs
Beyond 128×128, the chip hits a fundamental memory wall. PSRAM bandwidth cannot feed the CPU fast enough regardless of how the code is structured. At 256×256 the best implementation (tiled) is still 22.8× faster than naive, but the absolute numbers are much worse, and real models with larger layers will compound this.
The ESP32-S3 is not a competitor to dedicated inference accelerators, to the Raspberry Pi, or to anything designed to run modern computer vision or language models. It was never meant to be. It was designed for small, always-on inference at milliwatt power levels and three-dollar price points. At that job, the hardware earns its label honestly.
The deeper lesson from this benchmark has nothing to do with the ESP32-S3 specifically. It is a general principle that shows up everywhere in systems programming: on hardware with a significant gap between compute speed and memory bandwidth, the bottleneck is almost never the CPU. Optimizing memory access patterns gave a larger speedup than hardware SIMD acceleration. Writing code that respects the memory hierarchy, thinking carefully about where data lives and how it moves, is the most important performance skill on memory-constrained hardware.
The things that went wrong
Two things slowed this project down more than the actual benchmarking work. Both are worth documenting because they are the kind of thing you do not find in tutorials.
GCC 14.2.0 internal compiler error
Early in the project, the build would crash with an internal compiler error deep inside GCC's IRA (Integrated Register Allocator) pass, triggered by esp_lcd_panel_rgb.c. This only happened with a specific combination of flags: -mdisable-hardware-atomics (required for octal PSRAM), combined with -Og (debug optimization), combined with having the esp_lcd component included in the build.
The fix was to exclude esp_lcd entirely from the build in CMakeLists.txt, since the project does not use it. This is a known bug in that version of the Xtensa GCC toolchain. It is not a project configuration error. The compiler itself is crashing. The lesson was to always check whether a build error is coming from your code or from the toolchain, and to not spend hours debugging your own files when the problem is in a component you do not even use.
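For reference, the exclusion can be expressed in the project's top-level CMakeLists.txt. A sketch, with a placeholder project name; EXCLUDE_COMPONENTS must be set before project.cmake is included, and the exact form can vary between ESP-IDF versions:

```cmake
# Top-level CMakeLists.txt: drop esp_lcd from the build entirely,
# sidestepping the internal compiler error in esp_lcd_panel_rgb.c.
cmake_minimum_required(VERSION 3.16)
set(EXCLUDE_COMPONENTS "esp_lcd")
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(matmul_bench)  # placeholder name
```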
Xtensa PIE inline assembly turned out to be a dead end
The original plan was to write the SIMD implementation using custom inline assembly, calling the PIE instructions directly. This hit two walls immediately. First, the Xtensa assembler does not understand the .option directive. That directive is RISC-V syntax and simply does not exist in the Xtensa toolchain. Second, GCC's inline assembly constraint system requires you to declare which registers your assembly code reads and writes. The PIE vector registers (Q0–Q7) are TIE extension registers that exist completely outside GCC's standard machine description. There is no valid constraint string for them. GCC simply does not know they exist.
After trying several approaches, standalone assembly files, compiler intrinsics, different versions of the toolchain, the conclusion was clear: using Espressif's own esp-dsp library was the right answer all along. The library is production-tested, actively maintained, and wraps the PIE instructions correctly. The final implementation is named matmul_espdsp.c specifically to be honest about this: it uses the library, not hand-written assembly.