It's a shame the article chose to compare solely against AMD CPUs, because AMD and Intel have very different L3 architectures. AMD CPUs have their cores organised into groups, called CCXs, each of which has its own small L3 cache. For example the Turin-based 9755 has 16 CCXs, each with 32MB of L3 cache. That's far less cache per core than the mainframe CPU being described. In contrast, Intel uses an approach that's a little closer to the Telum II CPU being described - a Granite Rapids AP chip such as the 6960P has 432 MB of L3 cache shared between 72 physical cores, each with its own 2MB L2 cache. This is still considerably less cache, but it's not quite as stark a difference as the picture painted by the article.<p>This doesn't really detract from the overall point - stacking a huge per-core L2 cache and using cross-chip reads to emulate L3 with clever saturation metrics and management is very different to what any x86 CPU I'm aware of has ever done, and I wouldn't be surprised if it works extremely well in practice. It's just that it'd have made a stronger article IMO if it had instead compared dedicated L2 + shared L2 (IBM) against dedicated L2 + shared L3 (Intel), instead of dedicated L2 + sharded L3 (AMD).
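For a back-of-the-envelope feel for the gap, here is the cache-per-core arithmetic using the figures above. The Telum II numbers (36MB private L2 per core, roughly 360MB of virtual L3 per chip) are from the article and approximate, and the 1MB-per-core Zen 5 L2 figure is my own addition, not from the comment:

```python
# Rough cache-per-core arithmetic for the three designs discussed above.
# AMD/Intel L3 figures are from the comment; Telum II figures are from the
# article and approximate; AMD's 1 MB L2/core is an assumption (Zen 5).

designs = {
    # name: (shared last-level cache in MB, physical cores, private L2 MB per core)
    "AMD Turin 9755 (16 x 32MB CCX L3)":   (16 * 32, 128, 1),
    "Intel 6960P (shared L3)":             (432, 72, 2),
    "IBM Telum II (virtual L3 from L2s)":  (360, 8, 36),
}

for name, (llc_mb, cores, l2_mb) in designs.items():
    print(f"{name}: {llc_mb / cores:.1f} MB shared LLC/core, {l2_mb} MB private L2/core")
```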
Most impressive.<p>I would enjoy an ELI5 on the market differences between commodity chips and these mainframe-grade CPUs. Stuff like design, process, and supply chain, anything of interest to a general (nerd) audience.<p>IBM sells 100s of Z mainframes per year, right? Each can have a bunch of CPUs, right? So Samsung is producing 1,000s of Telums per year? That seems incredible.<p>Given such low volumes, that's a lot more verification and validation, right?<p>Foundries have to keep running to be viable, right? So does Samsung bang out all the Telums for a year in one burst, then switch to something else? Or do they keep producing a steady trickle?<p>Not that this info would change my daily work or life in any way. I'm just curious.<p>TIA.
The mainframe in 2025 is absolutely at the edge of technology. For some ML algorithms where massive GPU parallelism is not a benefit, it could even make a strong comeback.<p>I got so jealous of some colleagues that I once even considered getting into mainframe work. CPU at 5.5 GHz continuously (not peak...), massive caches, really, really non-stop...<p>Look at this tech porn: "IBM z17 Technical Introduction" - <a href="https://www.redbooks.ibm.com/redbooks/pdfs/sg248580.pdf" rel="nofollow">https://www.redbooks.ibm.com/redbooks/pdfs/sg248580.pdf</a>
> Telum II and prior IBM mainframe chips handle server tasks like financial transactions, but curiously seem to prioritize single threaded performance.<p>IBM was doing SMT (aka Hyperthreading) for decades, long before x86 did. I can't get a number for Telum II, but the original Telum implemented 2-way SMT per core[0], so your 8 core Telum can run 16 hardware threads. I expect similar from Telum II.<p>[0] <a href="https://www.ibm.com/linuxone/telum" rel="nofollow">https://www.ibm.com/linuxone/telum</a>
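As an aside, on any Linux box (including Linux on Z) you can sanity-check the SMT factor by parsing lscpu output. A quick illustrative sketch, assuming lscpu is installed; nothing Telum-specific:

```python
# Print the SMT-relevant lines from lscpu: threads per core, cores per socket,
# and total logical CPUs. Purely illustrative; requires lscpu on the host.
import subprocess

out = subprocess.run(["lscpu"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if line.startswith(("Thread(s) per core", "Core(s) per socket", "CPU(s)")):
        print(line)
```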
It must be fun being a hardware engineer for IBM mainframes: cost constraints for your designs can mostly be left aside, as there's no competition, and your still-existing customers have been domesticated to pay you top dollar every upgrade cycle, and frankly, they don't care.<p>Cycle times are long enough that you can thoroughly refine your design.<p>Marketing pressures are probably extremely well thought out, as anyone working on mainframe marketing is probably either an ex-engineer or almost an engineer by osmosis.<p>And the product is different enough from anything else that you can try novel ideas, but not so different that your design skills are useless elsewhere or that you can't leverage others' advances.
What's particularly impressive about Telum II isn’t just the cache size or even the architecture—it’s the deliberate trade-off IBM makes for ultra-low latency L2, almost as a replacement for traditional L3. That decision makes a lot of sense in a mainframe context where deterministic performance, low tail latency, and tight SLA adherence matter more than broad throughput per watt.<p>It also feels like a return to form: IBM has always optimized for workload-specific performance over general-purpose benchmarks. While x86 designs scale horizontally and are forced to generalize across consumer and datacenter workloads, IBM is still building silicon around enterprise transaction processing. In a way, this lets them explore architectural territory others wouldn’t touch—like banking on huge, fast L2 caches and deferring cross-core coordination to clever interconnect and software layers.
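A toy average memory access time (AMAT) model makes that trade-off concrete. All hit rates and latencies below are invented purely for illustration (they are not Telum II or x86 measurements); the point is just that a much larger private L2 can win on average latency if it captures enough of the working set before the slower shared level is consulted:

```python
# Toy AMAT model: each level is (hit_rate_given_reached, latency_cycles),
# with the final level catching everything. All numbers are made up.

def amat(levels):
    """Average access latency for a list of (hit_rate, latency) levels;
    the last level must have hit_rate 1.0 (e.g. DRAM)."""
    total, remaining = 0.0, 1.0
    for hit_rate, latency in levels:
        total += remaining * hit_rate * latency
        remaining *= (1.0 - hit_rate)
    return total

# Conventional-ish hierarchy: small fast L2, big shared L3, then DRAM.
conventional = [(0.80, 14), (0.70, 45), (1.00, 250)]
# Huge private L2, then a slower cross-chip "virtual L3", then DRAM.
big_l2       = [(0.95, 20), (0.60, 90), (1.00, 250)]

print(f"conventional AMAT ~ {amat(conventional):.1f} cycles")
print(f"big-L2 AMAT       ~ {amat(big_l2):.1f} cycles")
```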
Virtual L3 and L4 swinging gigabytes around to keep data at the hot end of the memory-storage hierarchy even post L2 or L3 eviction? Impressive! Exactly the kind of sophisticated optimizations you should build when you have billions of transistors at your disposal. Les Bélády's spirit smiles on.
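For anyone who hasn't run into Bélády: his MIN policy is the yardstick here, evict the line whose next use lies farthest in the future. No real cache can see the future, which is exactly why heuristics like the saturation-based virtual L3/L4 management get used instead. A small illustrative sketch on a toy trace, nothing IBM-specific:

```python
# Bélády's MIN (optimal offline) replacement: on a miss with a full cache,
# evict the address whose next reference is farthest away (or never comes).

def belady_misses(accesses, capacity):
    cache, misses = set(), 0
    for i, addr in enumerate(accesses):
        if addr in cache:
            continue
        misses += 1
        if len(cache) < capacity:
            cache.add(addr)
            continue

        def next_use(a):
            # Index of a's next reference after position i, or infinity if none.
            try:
                return accesses.index(a, i + 1)
            except ValueError:
                return float("inf")

        cache.remove(max(cache, key=next_use))
        cache.add(addr)
    return misses

trace = [1, 2, 3, 1, 2, 4, 1, 2, 5, 3]
print(belady_misses(trace, capacity=3))  # lower bound on misses for this trace
```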
Interesting to compare this to ZFS's ARC / MFU vs. MRU / ghost lists / L2ARC / etc. strategy for (disk) caching. IIRC, those were mostly IBM-developed technologies.
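For context, ARC (Megiddo and Modha, IBM Almaden) keeps a recency list and a frequency list plus ghost lists of recently evicted keys, and adapts the split between them based on ghost hits. Below is a greatly simplified sketch of that shape; the real ARC, and ZFS's implementation, bound the ghost lists and adapt the target differently:

```python
# Greatly simplified ARC-style cache: t1 (seen once, recently), t2 (seen twice+),
# ghost lists b1/b2 of evicted keys, and an adaptive target p for t1's size.
from collections import OrderedDict

class TinyARC:
    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                   # target size for t1 (the recency side)
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def _evict(self, from_b2):
        # Evict from t1 when it exceeds its target (or t2 is empty), else from t2.
        if self.t1 and (not self.t2 or len(self.t1) > self.p
                        or (from_b2 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def access(self, key):
        if key in self.t1:           # second touch: promote to the frequency side
            del self.t1[key]
            self.t2[key] = None
            return "hit"
        if key in self.t2:
            self.t2.move_to_end(key)
            return "hit"
        if key in self.b1:           # ghost hit: recency side was too small, grow p
            self.p = min(self.c, self.p + 1)
            self._evict(from_b2=False)
            del self.b1[key]
            self.t2[key] = None
            return "miss"
        if key in self.b2:           # ghost hit: frequency side was too small, shrink p
            self.p = max(0, self.p - 1)
            self._evict(from_b2=True)
            del self.b2[key]
            self.t2[key] = None
            return "miss"
        if len(self.t1) + len(self.t2) >= self.c:
            self._evict(from_b2=False)
        self.t1[key] = None
        return "miss"

cache = TinyARC(capacity=4)
print([cache.access(k) for k in [1, 2, 3, 1, 4, 5, 2, 1, 1, 2]])
```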