
The End of Moore’s Law and Faster General Purpose Computing, and a Road Forward [pdf]

80 points by banjo_milkman · over 5 years ago

17 comments

AtlasBarfed · over 5 years ago
We've built up layers and layers and layers of inefficiencies in the entire OS and software stack since the gigahertz wars took us from 66 MHz to multiple GHz in the 90s.

The software industry is awful at conserving code and approaches, given the every-five-years total redo of programming languages and frameworks. Or less than that for Javascript.

That churn also means optimization from hardware --> program execution doesn't happen. Instead we plow through layers upon layers of both conceptual abstraction and actual software execution barriers.

Also, why the hell aren't standardized libraries more ... standardized? I get that lots of languages differ in mechanics and syntax. But a standardized library set could be repeatedly optimized behind the interface, optimized at the hardware/software level, etc.

Why have Ruby, Python, Javascript, C#, Java, Rust, C++, etc. not evolved toward an efficient common underpinning and design? Linux, Windows, Android, and iOS need to converge on this too. It would mean less wasted space in memory, less wasted OS complexity, less wasted app complexity and size. I guess ARM/Intel/AMD would also need to get in the game to optimize down to the chip level.

Maybe that's what he means by "DSLs", but to me DSLs are an order of magnitude more complex in infrastructure and coordination if we are talking about dedicated hardware for dedicated processing tasks while still keeping general-purpose capability. DSLs just seem to constrain too much freedom.
omarhaneef · over 5 years ago
For those who have not looked yet: this is a John Hennessy presentation. It argues, in a lot of detail, that Moore's law has closed out, that energy efficiency is the next key metric, and that specialized hardware (like the TPU) might be the future.

When I buy a machine, I am now perfectly happy buying an old CPU, and I think this shows why. You can buy something from as far back as 2012 and you're okay.

However, I do look for fast memory. SSDs at least, and I wish he had added a slide about the drop in memory speed. Am I at an inflection point?

Perhaps the future is: you buy an old laptop with specs like today's, and then you buy one additional piece of hardware (TPU, ASIC, graphics for gaming, etc.).
banjo_milkman · over 5 years ago
This ties in nicely with chiplets: https://semiengineering.com/the-chiplet-race-begins/ - a way to integrate dies in a package, where the dies can use specialized processes for different functions, e.g. analog or digital or memory or accelerators or CPUs or networking. This would make it easier to iterate memory/CPU/GPU/FPGA/accelerator designs at different rates, and reduce development costs (you don't need to support/have IP for every function, just an accelerated set of operations on an optimized process within each chiplet). But it will need progress on inter-chiplet PHY/interface standardization.
deepnotderp · over 5 years ago
So yes, if you compare matrix multiply in Python vs SIMD instructions, you will find a big improvement. It's much harder to do that for more general-purpose workloads.

And it doesn't scale: https://spectrum.ieee.org/nanoclast/semiconductors/processors/the-accelerator-wall-a-new-problem-for-a-post-moores-law-world

And in many cases, if you normalize all the metrics (precision, process node, etc.), you'll find that the advantage of ASICs is greatly exaggerated and is often within ~2-4x of the more general-purpose processor. E.g. the small GEMM cores in the Volta GPU actually beat the TPUv2 on a per-chip basis. Anton 2, normalized for process, is within 5x or so of manycore MIMD processors in energy efficiency.

In other cases, e.g. the marquee example of bitcoin ASICs, it only works because of extremely low memory and memory-bandwidth requirements.
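As a rough illustration of the Python-vs-SIMD gap mentioned above, here is a minimal sketch (my own, not from the slides; the matrix size and timing method are arbitrary choices):

    # Interpreted triple-loop matmul vs. NumPy's BLAS-backed dot,
    # which uses SIMD/vectorized kernels under the hood.
    import time
    import numpy as np

    def matmul_naive(a, b):
        n, k, m = len(a), len(b), len(b[0])
        out = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                s = 0.0
                for p in range(k):
                    s += a[i][p] * b[p][j]
                out[i][j] = s
        return out

    n = 128
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    t0 = time.perf_counter()
    matmul_naive(a.tolist(), b.tolist())
    t1 = time.perf_counter()
    np.dot(a, b)
    t2 = time.perf_counter()
    print(f"naive Python: {t1 - t0:.3f}s, NumPy/BLAS: {t2 - t1:.5f}s")

The point stands either way: this kind of orders-of-magnitude win comes from replacing interpreter overhead with a tuned dense kernel, which is easy for GEMM and much harder for irregular, general-purpose code.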
prvc · over 5 years ago
A possibly stupid question from a neophyte: what was the driving force behind Moore's law when it was in operation? Did it become a self-fulfilling prophecy by turning into a performance goal after being enshrined in folklore, or is there an underlying physical reason?
sifar · over 5 years ago
Slide 36 compares the TPU with a CPU/GPU. This is an apples-to-oranges comparison: one uses an 8-bit integer multiply while the CPU/GPU use a 32-bit floating-point multiply, which inherently uses at least 4x more energy [1]. If you scale the TPU by 4, it is not an order of magnitude better. The proper comparison would be between the TPU and an equivalent DSP doing 8-bit computations. That would show whether eliminating the energy consumed by register-file accesses is significant. I suspect most of the energy saving comes from having a huge on-chip memory.

[1] From slide 21:

    Function                Energy (pJ)
    8-bit add               0.03
    32-bit add              0.1
    16-bit FP multiply      1.1
    32-bit FP multiply      3.7
    Register file access    6
    L1 cache access         10
    L2 cache access         20
    L3 cache access         100
    Off-chip DRAM access    1,300-2,600
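A quick back-of-the-envelope using only the slide-21 numbers above (my own sketch, not from the talk) supports the on-chip-memory reading:

    # Energy per operation from slide 21, in picojoules.
    energy_pj = {
        "16-bit FP multiply": 1.1,
        "32-bit FP multiply": 3.7,
        "register file access": 6,
        "L1 cache access": 10,
        "off-chip DRAM access": 1300,  # lower bound of the 1,300-2,600 range
    }

    # 32-bit FP multiply vs. 16-bit FP multiply: only ~3.4x.
    print(energy_pj["32-bit FP multiply"] / energy_pj["16-bit FP multiply"])

    # A single DRAM access costs as much as ~350 32-bit FP multiplies, so keeping
    # operands in a large on-chip memory dominates the achievable savings.
    print(energy_pj["off-chip DRAM access"] / energy_pj["32-bit FP multiply"])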
SemiTom · over 5 years ago
Big chipmakers are turning to architectural improvements such as chiplets, faster throughput both on-chip and off-chip, and concentrating more work per operation or cycle, in order to ramp up processing speed and efficiency: https://semiengineering.com/chiplets-faster-interconnects-and-more-efficiency/

Scaling certainly isn't dead. There will still be chips developed at 5nm and 3nm, primarily because you need to put more and different types of processors/accelerators and memories on a die. But this isn't just about scaling of logic and memory for power, performance and area reasons, as defined by Moore's Law. The big problem now is that some of the new AI/ML chips are larger than reticle size, which means you have to stitch multiple die together. Shrinking allows you to put all of this on a single die. These are basically massively parallel architectures on a chip. Scaling provides the means to make this happen, but by itself it is a small part of the total power/performance improvement. At 3nm, you'd be lucky to get 20% P/P improvement, and even that will require new materials like cobalt and a new transistor structure like gate-all-around FETs. A lot of these new chips promise orders-of-magnitude improvement, 100 to 1,000x, and you can't achieve that with scaling alone. That requires other chips, like HBM memory, with a high-speed interconnect like an interposer or a bridge, as well as more efficient/sparser algorithms. So scaling is still important, but not for the same reasons it used to be.
DSingularity · over 5 years ago
It is not that I disagree with Hennessy, but I think it is premature to conclude that general-purpose processors have reached the end of the road. There is a healthy middle between specialized and general-purpose design. Exploiting that middle is what I think will deliver the next generation of growth. That is exactly what naturally happened with SoC and mobile design.

The raw computational capabilities of the TPU don't really prove anything. Of course co-design wins. Whether it is vision or NLP, NN training has dominant characteristics. The arithmetic is known: GEMM. The control is known: SGD. Tailoring the control and memory hierarchy to this is a no-brainer, and of course the economic incentives at Google push them in this direction, and of course the expertise available at Google powered this success. For other applications it is not so clear.

Finding similar dominance in other applications is trickier. To accelerate an application with a specialized architecture you need dominating characteristics in the app's memory-access, computational, and control profiles.
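To make the "arithmetic is GEMM, control is SGD" point concrete, a minimal NumPy sketch of a single-layer training step (layer sizes and the squared-error loss are illustrative assumptions, not from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((64, 256))    # batch of inputs
    Y = rng.standard_normal((64, 10))     # targets
    W = rng.standard_normal((256, 10))    # weights of one linear layer
    lr = 1e-3

    for step in range(100):
        pred = X @ W                      # forward pass: a GEMM
        grad = X.T @ (pred - Y) / len(X)  # backward pass: another GEMM
        W -= lr * grad                    # control: a plain SGD update

    # The dominant arithmetic is dense matrix multiplication and the control flow
    # is a fixed, predictable loop -- exactly the structure a TPU-style systolic
    # array plus a simple scheduler is built around. Few other application
    # domains collapse this cleanly onto one kernel and one control pattern.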
yogthos · over 5 years ago
It's odd that the presentation doesn't discuss alternatives to using silicon. Ultimately, this is akin to saying that there are limits on how small a vacuum tube we can make. We already know of a number of other potential computing platforms, such as graphene, photonics, memristors, and so on. These things have already been discovered, and they have been shown to work in the lab. It's really just a matter of putting in the effort to produce these technologies at scale.

Another interesting aspect of moving to a more efficient substrate is that power requirements for devices would also drop, as per Koomey's law: https://en.wikipedia.org/wiki/Koomey%27s_law
dragontamer · over 5 years ago
"WASTED WORK ON THE INTEL CORE I7", slide 12 (page 13 in the pdf), is fascinating to me. But I want to know how the data was collected, and what the "% wasted work" actually means.

40% wasted work: does that mean they checked the branch predictor and found that 40% of the time was spent on (wrongly) speculated branches?

It also suggests that, for all the power-efficiency faults of branch predictors (i.e. running power-consuming computations when they were "unnecessary"), the best you could do is maybe a 40% reduction in power consumption (no task seems to be more than 40% inefficient).
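One plausible reading of the metric (my interpretation, not stated in the slides) is the fraction of executed micro-ops that are later squashed because they came from a mispredicted path. Under that reading the arithmetic looks like this:

    # Hypothetical reading of "40% wasted work".
    w = 0.40
    # Relative to an oracle predictor, the core executes 1/(1-w) times as many
    # micro-ops for the same useful output...
    overwork = 1.0 / (1.0 - w)   # ~1.67x
    # ...so perfect prediction could cut dynamic execution energy by at most w,
    # ignoring leakage and the cost of running the predictor itself.
    print(f"executes {overwork:.2f}x the useful micro-ops; max dynamic saving ~{w:.0%}")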
roenxi · over 5 years ago
Still, it may be too early to call the end of the march of microprocessors.

https://www.scienceabc.com/humans/the-human-brain-vs-supercomputers-which-one-wins.html

The limits they are running up against are indeed crises, but they'll probably find that they can copy whatever it is that biology is doing and squeeze out quite a bit more. The tradeoffs will get a lot weirder, though.
justicezyx · over 5 years ago
Amin's keynote is relevant here: https://onfconnect2019.sched.com/event/RzZl

The basic form of computing is becoming distributed. More is coming.
mikewarot · over 5 years ago
I'm amazed that it's less than a picojoule to do an 8-bit add.
singularity2001 · over 5 years ago
So what's the name of the metric flops/sec/USD? Because that one keeps growing exponentially thanks to GPUs/TPUs, a paradigm shift predicted by Ray Kurzweil.
yalogin · over 5 years ago
Is there a video of this talk available somewhere?

Also, can someone tell me what P4 is? It looks like almost every company and a bunch of universities are "contributors" there.
almost_usual · over 5 years ago
One of the more interesting things I've read on HN in a while. It seems like this will result in a large paradigm shift for the computing industry.
Accujack · over 5 years ago
There's an internet meme about "Imminent death of Moore's law predicted".

All Moore's law talks about is the density of transistors on a chip, and it's never been a linear progression of numbers. Recently I've seen news articles about research into 5nm processes and other methods for increasing the density of components on silicon, so it seems Moore's law (really Moore's rule of thumb, or Moore's casual observation) isn't done yet.