I got to talk with Ivan after he gave an earlier version of this talk at Asilomar, and the good news was he's the real deal; pretty much every question I could think to throw at him he had a solid answer for. The bad news was I felt the same way about the Mill architecture as I did about Intel's iAPX 432 architecture [1], which was elegant to a fault.<p>That said, I got the sense that this was what Intel was going for when they did Larrabee [2] and just missed because of the focus on graphics. Unlike Larrabee, I suspect OOTBC will need to build it themselves, like Chip did for the Propeller [3].<p>That said, the challenge with these bespoke architectures is the requirement for software, or first a GCC port :-). I believe Ivan said they had a port that talked to their simulator, but I don't know if that was an optimizing thing like SGI's compiler for Itanium or a proof-of-concept thing.<p>The weird thing is of course "Why?", and one might say "But Chuck, faster and cheaper, why not?" and I look at the explosion of ARM SoCs (cheaper, not necessarily faster than x86) and look back at ARM and think 99% of this was getting the ecosystem built, not the computer architecture. So who can afford to invest the billions to build the ecosystem? Who would risk that investment? (Google might, but that is a different spin on things.)<p>So playing around with the Zynq-7020 (the same chip that is on the Parallella, but without the Epiphany co-processor), I can see a nice dual-core Cortex-A9 where you have full-speed access to a bespoke DSP if you want for the 'tricky' bits. Will that be "good enough" for the kinds of things this would also excel at? I don't know, so I don't know how to handicap OOTBC's chances for success. But I really enjoy novel computer architectures, like this one and Chuck Moore's '1000 Forth chickens' chip [4] (a reference to Seymour Cray's quote, "Would you have 1,000 chickens pull your plow, or an ox?").<p>A really interesting time will be had when 'paid off' fab capacity is sitting idle and the cost for a wafer start becomes a function of how badly the fab wants to keep the line busy.<p>[1] <a href="http://en.wikipedia.org/wiki/Intel_iAPX_432" rel="nofollow">http://en.wikipedia.org/wiki/Intel_iAPX_432</a><p>[2] <a href="http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Differences_with_CPUs" rel="nofollow">http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Di...</a><p>[3] <a href="http://www.parallax.com/propeller/" rel="nofollow">http://www.parallax.com/propeller/</a><p>[4] <a href="http://www.colorforth.com/S40.htm" rel="nofollow">http://www.colorforth.com/S40.htm</a>
First off, I love the post! Super meaty goodness.<p>I read-ish the whole set of slides and it sounds pretty good (the devil is always in the details), but I got a little worried about VLIW-ish issues when he said, on slide #57:<p><pre><code> The compiler controls when ops issue
</code></pre>
One of the big issues with VLIW was that the compiler had to be intimately aware of the processor architecture. So when you upgraded your '886 to a '986 you needed new binaries, because the '986 had more registers or execution units. [I assume Itanium fixed some of this, but it also sunk my interest in VLIW.]<p>Is this architecture going to face the same issue?<p>Edit: I was watching the video and heard that "nearly all of what the superscalar is doing is [not calculating]". One of the other VLIW issues was that chip area came to be dominated by cache, so all the stuff about [not calculating] shrank and shrank in relative terms as cache area grew (see: <a href="http://techreport.com/review/15818/intel-core-i7-processors" rel="nofollow">http://techreport.com/review/15818/intel-core-i7-processors</a>). This claim concerns me.<p>Edit V2: but damn... Exciting stuff.
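<p>A hypothetical sketch of that binary-compatibility worry (the '886/'986 names and field layout are made up, not the Mill's encoding): a classic VLIW instruction word bakes the number and kind of issue slots into the binary itself, so code encoded for the narrower chip can't fill, or even decode on, the wider one.<p><pre><code>#include <stdint.h>

/* Three issue slots, encoded side by side in one bundle.    */
struct bundle_886 {
    uint32_t alu0, alu1, mem;
};

/* Five slots: the word is wider and the slot mix different, */
/* so an '886 binary must be recompiled to fill it.          */
struct bundle_986 {
    uint32_t alu0, alu1, alu2, mem0, mem1;
};
</code></pre>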
I like this guy and his presentation style, and it is good to see someone delivering seriously on an old idea: queue machines. (The "Belt" is the queue machine model revisited; I can't see a theoretical breakthrough here.) But the Mill's realization, which combines a queue model with VLIW and embedded spiller machinery, is worth attention.<p>Some criticism, however: no sign yet of how variable latencies in memory will be tolerated. Requiring fixed latency for FUs is problematic with cache misses and FU sharing between pipelines. Also, his comparison between a 64-item belt and the 300+ registers of OoOE is unfair, since the 300+ registers will likely tolerate more latency than the smaller belt.<p>I wrote a review of what I got from his first two talks here: <a href="http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/" rel="nofollow">http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/</a>
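<p>For readers who haven't met queue machines, a minimal sketch of the textbook model (assuming nothing about the Mill's actual semantics): operands are consumed from the front of a FIFO and results appended at the back, in contrast to a stack machine, which pushes and pops at one end:<p><pre><code>#include <stdio.h>

#define QMAX 16
static int q[QMAX], head, tail;

static void enq(int v) { q[tail++ % QMAX] = v; }    /* result -> back   */
static int  deq(void)  { return q[head++ % QMAX]; } /* operand <- front */

int main(void) {
    /* Evaluate (a+b)*(a-b) with a=6, b=2, queue-machine style. */
    enq(6); enq(2); enq(6); enq(2);   /* load a, b, a, b        */
    int a1 = deq(), b1 = deq();
    int a2 = deq(), b2 = deq();
    enq(a1 + b1);                     /* a+b joins the back     */
    enq(a2 - b2);                     /* a-b joins the back     */
    int s = deq(), d = deq();
    printf("%d\n", s * d);            /* prints 32              */
    return 0;
}
</code></pre>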
Wow, the one submission I've made anywhere that hit the front page, and I'm kicking myself for the title. 'Belt' is really their name for the category of machine architecture in the sense of register and stack machines. 'Mill' is the name of their specific architecture/family.<p>As far as the meat goes, I'm gratified that I'm not the only one to think this is mindbogglingly elegant. I see some misconceptions in this thread, but I'm finding it difficult to explain the divergences without beginning to inexpertly regurgitate large portions of this and the previous talk (on their instruction encoding).
As someone who has worked implementing the same algorithms on a VLIW DSP (TI C64x) and on ARM and x86, a few comments after watching the beginning of the talk. It's been a while, so this is from memory...<p>The different VLIW execution units can each only process a subset of the instruction set. That means you need the right mix of instructions in your algorithm to take advantage of the full throughput. If you have any sort of serial dependency you won't be able to take advantage of all the execution slots. It basically excels at signal processing (and even then, only a subset of it). That said, when you hit the sweet spot it's pretty good.<p>When someone like TI compares their DSP to ARM they usually tend to ignore SIMD (very conveniently). SIMD (NEON on ARM or SSE on x86) can buy you almost another order of magnitude of performance on those superscalar CPUs if you're dealing with vectors of smaller quantities (16-bit or 8-bit). So while on paper the VLIW DSP should blow the general-purpose superscalar out of the water for signal processing algorithms at comparable clock rates, it's not quite like that when you can use SIMD. It also takes a lot of manual optimization to get the most out of both architectures.<p>So when you're in the VLIW's sweet spot, your performance/power/price is pretty good. But the general-purpose processors can get you performance in and out of that sweet spot (you're probably sucking more power, but that's life).<p>You really can't look at "peak" instructions per second on the VLIW as any sort of reliable metric, and you need any comparison to include the SIMD units...<p>EDIT: Another note is that for many applications the external memory bandwidth is really important. The DSPs benefit from their various DMAs and from being able to parallelize processing and data transfer, but generally x86 blows away all those DSPs and ARM. I guess in a modern system you may also want to throw the GPU into the comparison mix.
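<p>To make the SIMD point concrete, a minimal sketch (assuming SSE2; the function and its name are illustrative, not from the talk): a single saturating-add instruction processes sixteen 8-bit lanes at once, versus one element per iteration in the scalar tail, which is where that near order-of-magnitude gap on small quantities comes from:<p><pre><code>#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

void add_sat_u8(uint8_t *dst, const uint8_t *a,
                const uint8_t *b, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {      /* 16 lanes per instruction */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_adds_epu8(va, vb));
    }
    for (; i < n; i++) {                /* scalar tail, one at a time */
        int s = a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}
</code></pre>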
Very interesting that the architecture has no condition codes, or any global state for that matter. Also, that the belts are logical, in that there really isn't a chunk of memory where the belt lives, but the members of a belt live where they live and know "which belt" they belong to and "where on that belt" they are.<p>Too bad they can't release their emulators/simulators, but I sympathize with their desire to be first-to-file to protect what sounds like years of work.
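<p>A toy rendering of that "logical belt" idea (all names hypothetical; this is a reading of the talk, not their design): operands stay wherever the producing unit left them, and a small tag records which belt (frame) and which position they logically occupy:<p><pre><code>#include <stdint.h>

struct operand {
    uint64_t value;   /* lives in whatever latch/buffer produced it */
    uint16_t frame;   /* which belt (function frame) it belongs to  */
    uint8_t  pos;     /* logical position on that belt              */
};
</code></pre>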
This looks fascinating - and "the real deal" (not crackpottery) - but I'm hesitant about a number of items.<p>It seems like quite a reach to pick a 'latest and greatest' IA processor (why is this $885? Because it <i>can</i> be) as a point of contrast. We can hit half the performance envelope and/or about an eighth the price while keeping many of the features that make a desktop processor what it is (like, for example, a whopping big cache hierarchy). Picking on x86 seems like a good way of overstating the case; if we're getting a new architecture, we may as well pick on ARM or Tilera or what have you.<p>I am also, as an early believer in the Itanium, somewhat nervous about static scheduling for any purpose. Dynamic branch prediction is very accurate on today's architectures; that does not mean static scheduling can emulate this accuracy and enjoy the same benefits.
This was a very nice presentation indeed. Too bad they could not tell us how they handle the memory hierarchy, because that part is crucial for any statically scheduled architecture.<p>Also, the implementation of the belt itself could be quite nasty. Just look at the face of the guy who asked about it when he hears the answer. ;-)<p>Finally, they will not get anywhere near peak performance with general-purpose code. The parallelism is just not there at the instruction level. They would do well on high-performance computing or digital signal processing, with enough floating point units and memory bandwidth.<p>Instead they seem to target big data. An interesting move. It will be interesting to see actual performance numbers (even from a simulator). I wish them luck.
At 17 minutes in, when he talked about register renaming, it reminded me of SSA (static single assignment). It's quite fascinating to see that something like that happens at the hardware level as well.
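<p>For anyone who hasn't seen the parallel, a minimal illustration (my own, not from the talk) of how the SSA transform a compiler does mirrors what a rename unit does in hardware: every write to the same architectural name gets a fresh physical name, so false write-after-write dependencies disappear:<p><pre><code>/* Source reuses one name "x" across two definitions: */
int before(int a, int b, int c) {
    int x;
    x = a + b;       /* def 1 of x                    */
    x = x * c;       /* def 2 of x overwrites it      */
    return x;
}

/* SSA form: each definition gets its own name, just as a
 * rename unit maps each write to a fresh physical register. */
int after_ssa(int a, int b, int c) {
    int x1 = a + b;
    int x2 = x1 * c;
    return x2;
}
</code></pre>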
Interesting idea.
A limitation of the temporal addressing is that variables that are live on entry to a basic block have to be in the same position on the belt on <i>each path</i> that could enter the block. This is okay for loops, but it's going to require lots of shuffling about in conditional code. It could still be worth it, of course.
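<p>A toy model of that merge problem (hypothetical semantics, just to show the shuffling): each dropped result pushes everything else one belt slot back, so two paths that drop different numbers of results leave a live value at different positions, and each path has to re-drop it so the positions conform at the join:<p><pre><code>#include <stdio.h>

#define B 8
static int belt[B];

static void drop(int v) {    /* new result enters at the front */
    for (int i = B - 1; i > 0; i--) belt[i] = belt[i - 1];
    belt[0] = v;
}

int main(void) {
    drop(42);                /* v is live at belt[0] on entry   */
    int taken = 1;
    if (taken) {
        drop(7); drop(8);    /* two results: v slid to belt[2]  */
        drop(belt[2]);       /* shuffle: re-drop v to belt[0]   */
    } else {
        drop(5);             /* one result: v slid to belt[1]   */
        drop(belt[1]);       /* re-drop v to belt[0]            */
    }
    printf("v = %d\n", belt[0]);  /* both paths agree: v = 42   */
    return 0;
}
</code></pre>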
I don't get how this is different from an in-order, register-based VLIW architecture, except that it makes branching more difficult to deal with because you need to synchronize belt deltas between branches (which sounds like the dual of a register move to me).<p>I work daily with a VLIW architecture and the only place I ever see a plain "move" instruction is in loops. Everywhere else, the compiler just churns through the registers as if they were in a belt anyway – just the names are absolute instead of relative.<p>I can imagine this might simplify the processor logic some – results are primarily read from one of the first few belt locations, and are always written to the first location. "Register moves" aside, is this the primary benefit?
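<p>A hypothetical sketch of that "absolute vs. relative names" point, with the same dataflow written both ways (the belt here is the same toy model as above, nothing Mill-specific):<p><pre><code>#include <stdio.h>

int main(void) {
    int a = 3;

    /* Register style: the compiler churns through fresh
     * absolute names for each result.                      */
    int r1 = a * a;           /* r1 = 9                     */
    int r2 = r1 + a;          /* r2 = 12                    */

    /* Belt style: the newest result is always position 0,
     * older values slide back; only the naming differs.    */
    int b[3] = { a };         /* a enters at b[0]           */
    b[1] = b[0]; b[0] = b[1] * b[1];              /* drop a*a   */
    b[2] = b[1]; b[1] = b[0]; b[0] = b[1] + b[2]; /* drop a*a+a */

    printf("%d %d\n", r2, b[0]);   /* prints: 12 12         */
    return 0;
}
</code></pre>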
My takeaway from this presentation:<p>- To achieve instruction-level parallelism, traditional architectures (his example: Haswell) often employ very messy techniques like register renaming, which create a huge amount of complexity and increase power consumption.<p>- He has focused on one such technique, Very Long Instruction Word (VLIW), and has taken it to an extreme: he proposes to throw away general-purpose registers and replace them with a "belt": a write-once linear tape of memory (implemented using stacks).<p>- He then points out various advantages of this model, including in-order execution (ILP traditionally requires reordering), a short pipeline, and overall simplification of the architecture.<p>All this looks fine on paper, but I don't see a proposed instruction set, or any indication of what a realization of this model will require. In short, it's a cute theoretical exercise.<p>So, let's look at what the intractable <i>problems</i> with Haswell are, and what's being done to fix them. Nobody seems to be able to figure out how to reduce power consumption on performant x86 microarchitectures beyond a point. It's a very old architecture, and I'm hopeful about the rise of ARM. The solution is not to throw away general-purpose registers, but rather to cut register renaming and make ILP easier to implement by using a weak memory model (which is exactly what ARM does). ARM64 is emerging and the successes of x86 are slowly percolating to it [1]. Moreover, the arch/arm64 tree in linux.git is 1.5 years old and under active development; we even got virt/kvm/arm merged in recently (3 months ago), although I'm not sure how it works without processor extensions (it's not pure pvops). ARM32 already rules embedded and mobile devices, and manufacturers are interested in taking it to heavy computing. In short, the roadmap is quite clear: ARM64 is the future.<p>The core of the Linux memory barriers model described in Documentation/memory-barriers.txt [2] (heavily inspired by the relaxed DEC Alpha) should tell you what you need to know about why a weak memory model is desirable.<p>[1]: <a href="http://www.arm.com/files/downloads/ARMv8_Architecture.pdf" rel="nofollow">http://www.arm.com/files/downloads/ARMv8_Architecture.pdf</a><p>[2]: <a href="https://github.com/torvalds/linux/blob/master/Documentation/memory-barriers.txt" rel="nofollow">https://github.com/torvalds/linux/blob/master/Documentation/...</a>
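<p>To illustrate the weak-memory point in portable terms, a minimal message-passing sketch (my example, using C11 atomics rather than the kernel's barrier macros): on a weakly ordered core the release/acquire pair is what keeps the data write from being reordered past the flag, and it is precisely because the hardware is free to reorder everything else that it can run fast:<p><pre><code>#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;              /* plain data               */
static atomic_int ready;         /* the flag                 */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                /* 1: write the data        */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);  /* 2: publish */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until published     */
    printf("%d\n", payload);     /* guaranteed to print 42   */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
</code></pre>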
This guy is awesome, and so is his talk! Can't wait to have them produce many-core Mill CPUs that I can throw algorithms onto. Then I'm gonna Buy Buy Buy.
Wow, this addresses quite a number of my concerns with aging CPU architectures - specifically, the need to burn throwaway registers on fundamentally low-latency operations like chained ALU results.
After Bret Victor's inspiring talk yesterday, it's fitting to see a new CPU architecture make it to the top of HN.<p>Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.